lingvo.tools.gke_launch module

Launch script for GKE jobs.

This script generates the GKE deployment configs for TPU training, GPU/CPU decoding, and TensorBoard jobs.

It assumes you have:

  1. Copied any input data to GCS.

  2. Docker and GKE/GCP tools installed locally.

  3. Created the TPU and GPU clusters using gcloud container clusters create.

This script launches jobs by:

  1. Building a lingvo docker image from --base_image, copying in the directory pointed to by --build, and writing that image to --image (with an automatically generated date-based tag for versioning).

  2. Identifying the full name of the GKE cluster that each accelerator job runs in, based on --trainer_cell and --decoder_cell.

  3. Writing out .yaml configuration files, based on --name, --model, and --logdir, that will launch the docker images for each job type.
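
The date-based tag mentioned in step 1 could be produced along the following lines. This is only a sketch: the helper name tagged_image and the timestamp format are assumptions for illustration, and the format the script actually uses is not stated here.

```python
import datetime


def tagged_image(image, now=None):
    """Appends a date-based tag to an untagged image name.

    The tag format below is hypothetical, chosen only so that tags sort
    chronologically.
    """
    now = now or datetime.datetime.now()
    return f"{image}:{now.strftime('%Y%m%d_%H%M%S')}"


# Example with a fixed timestamp so the result is reproducible.
example = tagged_image("gcr.io/foo/bar", datetime.datetime(2020, 1, 2, 3, 4, 5))
```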

Usage looks something like:

    python3 lingvo/tools/gke_launch.py \
      --model=$MODEL \
      --base_image=tensorflow:lingvo_lib_gpu \
      --image=$DOCKER_IMAGE \
      --logdir=$LOGDIR \
      --tpu_type=$TPU_TYPE \
      --trainer_cell=$TPU_CLUSTER_NAME \
      --decoder_cell=$GPU_CLUSTER_NAME \
      --decoder_gpus=1 \
      --gpu_type=$GPU_TYPE \
      --decoder=dev \
      --extra_envs=KITTI_DIR=$GCS_PATH \
      --name=$EXP_NAME \
      --build=$YOUR_CODE_DIR \
      $ACTION $TARGETS

ACTION specifies whether to start (up), stop (down) or reload the target jobs. One can also specify “print” to just print out the .yaml configuration files.

TARGETS specifies whether the action affects all jobs (“all”) or just an individual job (“trainer”, “decoder”, “tensorboard”).
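
One way to picture how ACTION and TARGETS combine is the sketch below. It is not the script's actual dispatch logic: the helper names are hypothetical, and both the use of kubectl apply / kubectl delete and the treatment of reload as "down then up" are assumptions.

```python
JOBS = ("trainer", "decoder", "tensorboard")


def resolve_targets(targets):
    """Expands 'all' into every job; otherwise checks the single job name."""
    if targets == "all":
        return list(JOBS)
    if targets not in JOBS:
        raise ValueError(f"Unknown target: {targets}")
    return [targets]


def kubectl_commands(action, yaml_path):
    """Returns the commands assumed to implement an action for one config."""
    up = ["kubectl", "apply", "-f", yaml_path]
    down = ["kubectl", "delete", "-f", yaml_path]
    if action == "up":
        return [up]
    if action == "down":
        return [down]
    if action == "reload":
        return [down, up]  # Assumption: reload means stop, then start.
    if action == "print":
        return []  # Printing the .yaml needs no cluster command.
    raise ValueError(f"Unknown action: {action}")
```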

See the flags definition below for details on the arguments.

lingvo.tools.gke_launch._get_or_add(cfg, name)

Gets cfg[name], or adds ‘name’ with an empty dict if not present.
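
The behavior described above matches what dict.setdefault does for a plain dictionary. A minimal sketch, using the hypothetical name get_or_add rather than the module's private helper:

```python
def get_or_add(cfg, name):
    """Returns cfg[name], first inserting an empty dict if name is missing."""
    return cfg.setdefault(name, {})


cfg = {}
# The returned dict is the one stored in cfg, so nested updates persist.
get_or_add(cfg, "resources")["limits"] = {"cpu": 1}
```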

lingvo.tools.gke_launch.add_gpu_to_pod(cfg, gpu_type, num_gpus)

Sets the appropriate GPU fields in cfg.

Parameters
  • cfg – The YAML-based dictionary to update.

  • gpu_type – The type of GPU to launch on GKE.

  • num_gpus – The number of GPUs to launch in the task.
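
As an illustration of what "GPU fields" typically means on GKE, the sketch below sets the cloud.google.com/gke-accelerator node selector and the nvidia.com/gpu resource limit. It assumes cfg has the shape of a Kubernetes workload with a pod template (spec.template.spec.containers); the function name and the early return for zero GPUs are assumptions, not a copy of the module's code.

```python
def add_gpu_sketch(cfg, gpu_type, num_gpus):
    """Adds a GPU node selector and GPU limit to the first container in cfg."""
    if num_gpus == 0:
        return  # Assumption: a CPU-only job needs no GPU fields.
    pod_spec = cfg["spec"]["template"]["spec"]
    # Schedule the pod onto nodes that carry the requested accelerator type.
    node_selector = pod_spec.setdefault("nodeSelector", {})
    node_selector["cloud.google.com/gke-accelerator"] = gpu_type
    # Request the GPUs for the first (assumed only) container.
    container = pod_spec["containers"][0]
    limits = container.setdefault("resources", {}).setdefault("limits", {})
    limits["nvidia.com/gpu"] = num_gpus


cfg = {"spec": {"template": {"spec": {"containers": [{"name": "decoder"}]}}}}
add_gpu_sketch(cfg, "nvidia-tesla-p100", 1)
```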

lingvo.tools.gke_launch.set_pod_cpu_memory(cfg, cpu_memory)

Sets the amount of CPU memory to request in the container.
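
In Kubernetes terms, a memory request lives under the container's resources.requests.memory field. The sketch below assumes that is the field being set and that cfg contains a pod template with at least one container; the function name and the example quantity are hypothetical.

```python
def set_memory_sketch(cfg, cpu_memory):
    """Sets a memory request (e.g. '16Gi') on the first container in cfg."""
    if not cpu_memory:
        return  # Assumption: an empty value leaves the config unchanged.
    container = cfg["spec"]["template"]["spec"]["containers"][0]
    requests = container.setdefault("resources", {}).setdefault("requests", {})
    requests["memory"] = cpu_memory


mem_cfg = {"spec": {"template": {"spec": {"containers": [{"name": "decoder"}]}}}}
set_memory_sketch(mem_cfg, "16Gi")
```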

lingvo.tools.gke_launch.decoder_template(job_name, model, image, logdir, decoder_type, decoder_gpus)

Constructs the base yaml config for the decoder.

lingvo.tools.gke_launch._tpu_resource(tpu_type)

lingvo.tools.gke_launch.tpu_training_template(job_name, model, image, logdir, tpu_type)

Constructs the base yaml config for the TPU trainer.

lingvo.tools.gke_launch.tensorboard_template(job_name, logdir, port)

Constructs the tensorboard YAML template.
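
To give a sense of what such a template might contain, here is a rough sketch of a Kubernetes Deployment dictionary that runs TensorBoard against a log directory. Everything beyond the three parameters is an assumption: the Deployment kind, the naming scheme, the label key, and the container image are illustrative and may differ from what the module generates.

```python
def tensorboard_sketch(job_name, logdir, port):
    """Builds a Deployment-shaped dict that serves TensorBoard on `port`."""
    name = f"{job_name}-tensorboard"  # Hypothetical naming scheme.
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": "tensorflow/tensorflow",  # Assumed image.
                        "command": [
                            "tensorboard",
                            f"--logdir={logdir}",
                            f"--port={port}",
                        ],
                        "ports": [{"containerPort": port}],
                    }],
                },
            },
        },
    }


tb_cfg = tensorboard_sketch("exp", "gs://bucket/logs", 6006)
```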

lingvo.tools.gke_launch.build_docker_image(image, base_image, code_directory, extra_envs)

Build a docker image and push it to the location specified by image.

Parameters
  • image – String name of tag to use, e.g., ‘gcr.io/foo/bar:version’

  • base_image – String name of base lingvo image to build from.

  • code_directory – Location of directory whose contents will be copied into the image.

  • extra_envs – A comma-separated list of key=value environment variables to be built into the docker image.
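
The parameters above map naturally onto a small generated Dockerfile. The sketch below shows one plausible way to turn base_image and extra_envs into Dockerfile text; the helper name and the copy destination /tmp/lingvo_code are hypothetical. Building and pushing would then typically be done with docker build -t and docker push, which this sketch does not run.

```python
def dockerfile_text(base_image, extra_envs):
    """Returns Dockerfile contents for the given base image and env vars."""
    lines = [f"FROM {base_image}"]
    # extra_envs is a comma-separated list such as "A=1,B=2"; skip empties.
    for pair in filter(None, extra_envs.split(",")):
        key, value = pair.split("=", 1)
        lines.append(f"ENV {key}={value}")
    # Copy the build context (the code directory) into the image.
    lines.append("COPY . /tmp/lingvo_code")  # Hypothetical destination.
    return "\n".join(lines) + "\n"


dockerfile = dockerfile_text("tensorflow:lingvo_lib_gpu", "KITTI_DIR=gs://bucket/kitti")
```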

lingvo.tools.gke_launch.get_gke_cluster(gke_cluster_spec)

Get the full name of the GKE cluster given shorthand gke_cluster_spec.

For example, gcloud container clusters list produces:

    NAME                      LOCATION        …
    p100-europe-west4-a-nh16  europe-west4-a  …
    test-df-europe            europe-west4-a  …

Then a gke_cluster_spec of ‘p100’ or ‘df’ will produce the fully-qualified cluster.

A gke_cluster_spec of ‘europe’ will raise a ValueError because there are two active clusters that have the string ‘europe’ in them.

Parameters

gke_cluster_spec – A string specifying a filter on active clusters.

Returns

The fully qualified GKE cluster name, or None if not found.

Raises

ValueError – If gke_cluster_spec does not uniquely identify an active cluster.
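
The matching rule described above (unique substring match, None when nothing matches, ValueError when ambiguous) can be sketched as follows. The helper name match_cluster is hypothetical, and the sketch takes the cluster names as a list instead of querying gcloud; how the real function builds the fully qualified name is not shown here.

```python
def match_cluster(cluster_names, gke_cluster_spec):
    """Returns the single name containing the spec, or None if none match."""
    matches = [name for name in cluster_names if gke_cluster_spec in name]
    if len(matches) > 1:
        raise ValueError(
            f"{gke_cluster_spec!r} matches multiple clusters: {matches}")
    return matches[0] if matches else None


names = ["p100-europe-west4-a-nh16", "test-df-europe"]
p100_match = match_cluster(names, "p100")
df_match = match_cluster(names, "df")
```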

lingvo.tools.gke_launch.create_tpu_cluster(cluster_name, zone)

lingvo.tools.gke_launch.validate_args(argv)

Validates the input arguments. Raises a UsageError if invalid.

lingvo.tools.gke_launch.main(argv)