lingvo.core.cluster module
Specification of a training cluster.
- class lingvo.core.cluster.InfeedContext(infeed_host_index, num_infeed_hosts)
Bases:
tuple
- _asdict()
Return a new dict which maps field names to their values.
- _field_defaults = {}
- _fields = ('infeed_host_index', 'num_infeed_hosts')
- classmethod _make(iterable)
Make a new InfeedContext object from a sequence or iterable
- _replace(**kwds)
Return a new InfeedContext object replacing specified fields with new values
- infeed_host_index
Alias for field number 0
- num_infeed_hosts
Alias for field number 1
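The tuple helpers listed above behave like those of any Python named tuple. The following is a minimal sketch that uses a stand-in built with collections.namedtuple and the documented field names; it does not import lingvo, and the field values are made up for illustration.

```python
import collections

# Stand-in with the same field names as the documented class.
# This is an assumption for illustration, not the lingvo class itself.
InfeedContext = collections.namedtuple(
    'InfeedContext', ['infeed_host_index', 'num_infeed_hosts'])

ctx = InfeedContext(infeed_host_index=0, num_infeed_hosts=4)

# _asdict() maps field names to their values.
as_dict = ctx._asdict()

# _make() builds an instance from any sequence or iterable.
from_iter = InfeedContext._make([1, 4])

# _replace() returns a new object; the original is left unchanged.
replaced = ctx._replace(infeed_host_index=2)
```

Fields can be read by name (ctx.num_infeed_hosts) or by position (ctx[1]), which is what "Alias for field number 1" refers to.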
- lingvo.core.cluster.MakeDeviceString(job_name, replica_id, task_id, device_name, device_id)[source]
- class lingvo.core.cluster._Cluster(params)[source]
Bases:
object
The whole training cluster from a single task’s point of view.
- classmethod _JobSpec(replicas)[source]
Construct a job spec param with the given number of replicas.
- ListDevices(job_spec)[source]
Lists devices in the job.
- Parameters
job_spec – A param object specifying a job in a training cluster.
- Returns
A 2D NumPy string array. ret[i, j] is the i-th replica’s j-th device.
- Raises
RuntimeError – the cluster configuration does not match actual devices.
- _CheckInvariants()[source]
A set of invariants about the setup of the cluster.
NOTE. Two job specs can be identical. E.g., if p.worker.name is the same as p.ps.name, that means ps is colocated with worker.
- property params
- property mode
- property job
- property logdir
- property task
- property is_executor_tpu
- property job_spec
Returns the current job specs.
- property asynchronous
Returns True if configured for asynchronous training.
- property synchronous
Returns True if configured for synchronous training.
- property num_replicas
- property tpus_per_replica
- property num_tpu_hosts
- property num_devices_per_replica
- property total_worker_devices
Returns the total number of discrete worker devices in the cluster.
- property num_devices_per_split
Returns the number of accelerators to use per split.
- property num_splits_per_replica
- property num_splits_per_client
The number of splits visible to one trainer client.
- property available_devices
Returns all compute devices available in a 2D array.
- Returns
A 2D array (python list of python lists) of strings. ret[i, j] is the j-th visible device on i-th visible replica.
- property input_device
Returns the tensorflow device name to place input op on.
- property all_worker_names
- property input_targets
Returns a list of network addresses of the input job.
- WorkerDeviceInModelSplit(device_index)[source]
Returns the device to use for ‘device_index’ for the current model split.
- Parameters
device_index – An int, the device index within ‘model_split’.
- Returns
A string. The device to place ops onto.
- Raises
ValueError – if split_id of cluster is incorrectly set.
- GetPlacer(strategy=None)[source]
Returns a device function for placing ops within the cluster.
- Parameters
strategy – A string. Identifier for a placement strategy. By default, we use a least loaded policy to place variables.
- Returns
A device function that can be used in tf.device().
- Raises
ValueError – when strategy is not supported.
- property tf_data_service_address
- property add_summary
- property do_eval
- property in_unit_test
- property require_sequential_input_order
- property worker_cluster_def
Returns a tf.train.ClusterDef representing the worker cluster.
- property reporting_job
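The replica-by-device layout described for ListDevices and available_devices can be pictured with a small sketch. It assumes device names follow the standard TensorFlow form /job:NAME/replica:N/task:N/device:TYPE:N; the helper build_device_grid and its parameters are hypothetical, and the strings the cluster actually produces may differ.

```python
def build_device_grid(num_replicas, devices_per_replica,
                      job='/job:worker', device_type='GPU'):
    """Hypothetical helper: grid[i][j] is the j-th device of the i-th replica."""
    return [
        ['%s/replica:%d/task:0/device:%s:%d' % (job, i, device_type, j)
         for j in range(devices_per_replica)]
        for i in range(num_replicas)
    ]


# Two replicas with two accelerators each (made-up sizes).
grid = build_device_grid(num_replicas=2, devices_per_replica=2)
for row in grid:
    print(row)
```

The outer index selects the replica and the inner index selects a device within it, matching the ret[i, j] convention used in the descriptions above.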
- class lingvo.core.cluster.VarPlacer(cluster)[source]
Bases:
object
Placer which places variables across a set of devices.
VarPlacer places non-variable ops on the worker device.
- class lingvo.core.cluster._LeastLoadedPlacer(cluster)[source]
Bases:
VarPlacer
Placer which places a variable on the least loaded var device.
We use total byte sizes of variables placed on a device to indicate the device’s load.
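The least-loaded policy described above can be sketched as a greedy assignment. This is an illustration under stated assumptions, not the _LeastLoadedPlacer implementation: the function place_least_loaded, the variable names, the byte sizes, and the tie-breaking rule are all made up here.

```python
def place_least_loaded(var_sizes, devices):
    """Assigns each variable to the device holding the fewest bytes so far.

    var_sizes: list of (variable_name, size_in_bytes) pairs, in creation order.
    devices: list of device names that may hold variables.
    """
    load = {d: 0 for d in devices}
    placement = {}
    for name, nbytes in var_sizes:
        # min() returns the first device with the smallest load, so ties go
        # to the device that appears earliest in the list (an assumption).
        target = min(devices, key=lambda d: load[d])
        placement[name] = target
        load[target] += nbytes
    return placement, load


placement, load = place_least_loaded(
    [('emb', 400), ('w1', 100), ('w2', 100), ('b', 4)],
    ['/job:ps/task:0', '/job:ps/task:1'])
```

In this run the large variable emb fills the first device, so the three smaller variables all land on the second one, ending with loads of 400 and 204 bytes.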
- lingvo.core.cluster.ParseDeviceString(device_str)[source]
Parse a device string and return a NestedMap.
- Parameters
device_str – a device string that may contain up to 4 parts: job, replica, task, and device.
- Returns
parsed_device – a NestedMap that maps job, replica, task, and device to their corresponding values.
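To make the four parts concrete, here is a minimal parsing sketch. It is not the ParseDeviceString implementation: it returns a plain dict rather than a NestedMap, assumes the standard TensorFlow device string form /job:NAME/replica:N/task:N/device:TYPE:N, and converting replica and task to integers is a choice made here.

```python
def parse_device_string(device_str):
    """Hypothetical parser: splits a device string into its named parts."""
    parsed = {}
    for part in device_str.split('/'):
        if not part:
            continue  # Skip the empty piece before the leading '/'.
        # Split on the first ':' only, so 'device:GPU:0' keeps 'GPU:0' whole.
        key, _, value = part.partition(':')
        if key in ('replica', 'task'):
            parsed[key] = int(value)
        elif key in ('job', 'device'):
            parsed[key] = value
    return parsed


full = parse_device_string('/job:worker/replica:0/task:1/device:GPU:0')
partial = parse_device_string('/job:ps/task:3')
```

A string that omits some parts simply yields fewer keys, which mirrors the "up to 4 parts" wording in the parameter description.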