lingvo.core.cluster module

Specification of a training cluster.

class lingvo.core.cluster.InfeedContext(infeed_host_index, num_infeed_hosts)

Bases: tuple

_asdict()

Return a new dict which maps field names to their values.

_field_defaults = {}
_fields = ('infeed_host_index', 'num_infeed_hosts')
classmethod _make(iterable)

Make a new InfeedContext object from a sequence or iterable

_replace(**kwds)

Return a new InfeedContext object replacing specified fields with new values

infeed_host_index

Alias for field number 0

num_infeed_hosts

Alias for field number 1
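The members listed above are the standard collections.namedtuple interface. A minimal sketch of how they behave, using a locally defined stand-in with the same two fields (an assumption made so the example runs without lingvo installed):

```python
import collections

# Stand-in for lingvo.core.cluster.InfeedContext, declared locally for
# illustration. The field names match the _fields tuple documented above.
InfeedContext = collections.namedtuple(
    'InfeedContext', ['infeed_host_index', 'num_infeed_hosts'])

ctx = InfeedContext(infeed_host_index=2, num_infeed_hosts=8)

# Fields are reachable by position or by name.
first_field = ctx[0]
second_field = ctx.num_infeed_hosts

# _asdict() maps field names to their values.
as_dict = dict(ctx._asdict())

# _make() builds an instance from any iterable of the right length.
from_iterable = InfeedContext._make([0, 4])

# _replace() returns a new tuple; the original is left unchanged.
moved = ctx._replace(infeed_host_index=3)
```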

lingvo.core.cluster.InfeedContextScope(infeed_host_index, num_infeed_hosts)
lingvo.core.cluster.GetInfeedContext()
lingvo.core.cluster.MakeDeviceString(job_name, replica_id, task_id, device_name, device_id)
class lingvo.core.cluster._Cluster(params)

Bases: object

The whole training cluster from a single task’s point of view.

classmethod _JobSpec(replicas)

Construct a job spec param with the given number of replicas.

classmethod Params()

Default parameters for a cluster.

InitDevices(sess)
InitDevicesEager()
ListDevices(job_spec)

Lists devices in the job.

Parameters

job_spec – A param object specifying a job in a training cluster.

Returns

A 2D np string array. ret[i, j] is the i-th replica’s j-th device.

Raises

RuntimeError – the cluster configuration does not match actual devices.

static Top()
ExportMetrics(*args, **kwargs)

Export metrics externally.

_CheckInvariants()

A set of invariants about the setup of the cluster.

NOTE. Two job specs can be identical. E.g., if p.worker.name is the same as p.ps.name, that means ps is colocated with worker.

property params
property mode
property job
property logdir
property task
property is_executor_tpu
property job_spec

Returns the current job specs.

property asynchronous

Returns True if configured for asynchronous training.

property synchronous

Returns True if configured for synchronous training.

property num_replicas
property tpus_per_replica
property num_tpu_hosts
property num_devices_per_replica
property total_worker_devices

Return the total number of discrete worker devices in the cluster.

property num_devices_per_split

Return number of accelerators to use per split.

property num_splits_per_replica
property num_splits_per_client

The number of splits visible to one trainer client.
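The split-related properties are connected by simple arithmetic. The sketch below is an assumption about how they fit together, not a copy of the lingvo implementation: it assumes that a split never spans replicas, and that one client sees the splits of every replica only when it drives all workers. The function names and the drives_all_workers flag are hypothetical.

```python
def splits_per_replica(num_devices_per_replica, num_devices_per_split):
    """Number of model splits that fit inside one replica (assumed relation)."""
    if num_devices_per_replica % num_devices_per_split != 0:
        raise ValueError('a split must divide the devices of a replica evenly')
    return num_devices_per_replica // num_devices_per_split


def splits_per_client(num_replicas, num_devices_per_replica,
                      num_devices_per_split, drives_all_workers):
    """Splits visible to one client under the assumptions stated above."""
    per_replica = splits_per_replica(num_devices_per_replica,
                                     num_devices_per_split)
    # Hypothetical flag: True when a single client drives every replica.
    return per_replica * num_replicas if drives_all_workers else per_replica


# 4 replicas, 8 devices per replica, 2 devices per split.
local_view = splits_per_client(4, 8, 2, drives_all_workers=False)
global_view = splits_per_client(4, 8, 2, drives_all_workers=True)
```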

property available_devices

Returns all compute devices available in a 2D array.

Returns

A 2D array (python list of python lists) of strings. ret[i, j] is the j-th visible device on the i-th visible replica.
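To make the ret[i, j] layout concrete, here is a small sketch that builds such a grid as nested Python lists. The helper build_device_grid is hypothetical, and the device string layout (TensorFlow-style, with the task fixed at 0) is an assumption for illustration only.

```python
def build_device_grid(job_name, num_replicas, devices_per_replica,
                      device_name='GPU'):
    """Returns grid[i][j]: the j-th device of the i-th replica."""
    return [
        ['/job:%s/replica:%d/task:0/device:%s:%d' % (job_name, i, device_name, j)
         for j in range(devices_per_replica)]
        for i in range(num_replicas)
    ]


grid = build_device_grid('trainer', num_replicas=2, devices_per_replica=2)

# Rows are replicas, columns are devices within a replica.
second_replica_first_device = grid[1][0]
```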

property input_device

Returns the tensorflow device name to place input op on.

property all_worker_names
property input_targets

Returns a list of network addresses of the input job.

WorkerDeviceInModelSplit(device_index)

Returns the device to use for ‘device_index’ for the current model split.

Parameters

device_index – An int, the device index within ‘model_split’.

Returns

A string. The device to place ops onto.

Raises

ValueError – if split_id of cluster is incorrectly set.
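One way to picture this lookup is as indexing into the flattened list of available devices. The sketch below assumes that splits occupy consecutive runs of devices and that device_index wraps around within a split. The function and its arguments are hypothetical and do not reproduce the lingvo code.

```python
def worker_device_in_split(devices, split_id, devices_per_split, device_index):
    """Picks the device for device_index inside split split_id (a sketch)."""
    # Flatten the 2D replica-by-device grid into one list.
    flat = [device for replica in devices for device in replica]
    num_splits = len(flat) // devices_per_split
    if not 0 <= split_id < num_splits:
        raise ValueError(
            'split_id %d is out of range [0, %d)' % (split_id, num_splits))
    # Wrap device_index so the result always stays inside the chosen split.
    return flat[split_id * devices_per_split + device_index % devices_per_split]


devices = [['r0/d0', 'r0/d1', 'r0/d2', 'r0/d3']]
chosen = worker_device_in_split(
    devices, split_id=1, devices_per_split=2, device_index=0)
wrapped = worker_device_in_split(
    devices, split_id=1, devices_per_split=2, device_index=3)
```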

GetPlacer(strategy=None)

Returns a device function for placing ops within the cluster.

Parameters

strategy – A string. Identifier for a placement strategy. By default, we use a least loaded policy to place variables.

Returns

A device function that can be used in tf.device().

Raises

ValueError – when strategy is not supported.

property tf_data_service_address
property add_summary
property do_eval
property in_unit_test
property require_sequential_input_order
property worker_cluster_def

Returns a tf.train.ClusterDef representing the worker cluster.

property reporting_job
class lingvo.core.cluster.VarPlacer(cluster)

Bases: object

Placer which places variables across a set of devices.

VarPlacer places non-variable ops on the worker device.

_AssignVar(_)
DeviceFunction(op)

Choose a device for ‘op’.

Parameters

op – an Operation.

Returns

The device to use for the Operation.

class lingvo.core.cluster._LeastLoadedPlacer(cluster)

Bases: VarPlacer

Placer which places a variable on the least loaded var device.

We use total byte sizes of variables placed on a device to indicate the device’s load.

_AssignVar(var_op)
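The least loaded policy can be sketched in plain Python. The class below is illustrative only: it takes an op type and a byte size instead of a TensorFlow Operation, and the set of op types treated as variables is an assumption. Variables go to the device with the smallest running byte total, and every other op goes to the worker device.

```python
class LeastLoadedSketch:
    """Assigns each variable to the device holding the fewest bytes so far."""

    # Assumed set of op types that count as variables.
    VAR_OP_TYPES = ('Variable', 'VariableV2', 'VarHandleOp')

    def __init__(self, var_devices, worker_device):
        self._worker_device = worker_device
        # Running total of variable bytes placed on each device.
        self._load = {device: 0 for device in var_devices}

    def assign_var(self, num_bytes):
        # min() returns the first device with the smallest load, so ties go
        # to the device listed earliest.
        device = min(self._load, key=self._load.get)
        self._load[device] += num_bytes
        return device

    def device_function(self, op_type, num_bytes=0):
        # Variables follow the placement policy; all other ops go to the
        # worker device.
        if op_type in self.VAR_OP_TYPES:
            return self.assign_var(num_bytes)
        return self._worker_device


placer = LeastLoadedSketch(['/job:ps/task:0', '/job:ps/task:1'],
                           '/job:worker/task:0')
placements = [placer.device_function('VarHandleOp', size)
              for size in (400, 100, 100, 300)]
matmul_device = placer.device_function('MatMul')
```

Because the first variable is large, the next three all land on the second device before its total overtakes the first.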
lingvo.core.cluster.ParseDeviceString(device_str)

Parse a device string and return a NestedMap.

Parameters

device_str – a device string that may contain up to 4 parts: job, replica, task, and device.

Returns

parsed_device, a NestedMap that maps job, replica, task, and device to their corresponding values.
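A rough sketch of this kind of parsing, assuming TensorFlow-style device strings such as /job:trainer/replica:0/task:1/device:GPU:2. The function parse_device_string is hypothetical and returns a plain dict as a stand-in for NestedMap. Parts missing from the input are left out of the result.

```python
def parse_device_string(device_str):
    """Parses a '/'-separated device string into a dict (illustrative only)."""
    parsed = {}
    for part in device_str.split('/'):
        if not part:
            continue  # Skip the empty piece before the leading '/'.
        # Split on the first colon only, so 'device:GPU:2' keeps 'GPU:2' whole.
        key, _, value = part.partition(':')
        if key not in ('job', 'replica', 'task', 'device'):
            raise ValueError('unknown device part: %r' % part)
        # Replica and task ids are numeric; job and device stay strings.
        parsed[key] = int(value) if key in ('replica', 'task') else value
    return parsed


full = parse_device_string('/job:trainer/replica:0/task:1/device:GPU:2')
partial = parse_device_string('/job:ps/task:3')
```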