lingvo.core.adagraft module

AdaGraft optimizer (https://arxiv.org/abs/2002.11803).

class lingvo.core.adagraft.AdaGraftOptimizer(learning_rate, magnitude_optimizer, direction_optimizer, diagnostic=False, use_global_norm=False, name='AdaGraft')[source]

Bases: Optimizer

Optimizer which combines per-layer direction and magnitude from two optimizers.

Disentangling Adaptive Gradient Methods from Learning Rates. Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang. https://arxiv.org/abs/2002.11803
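
A minimal usage sketch (not taken from the Lingvo sources), assuming TF1-style graph execution via tensorflow.compat.v1: the step magnitude is borrowed from plain SGD, the step direction from Adagrad, and the wrapper's learning_rate rescales the grafted step. The variable and loss here are placeholders for illustration only.

    import tensorflow.compat.v1 as tf
    from lingvo.core import adagraft

    tf.disable_eager_execution()

    w = tf.get_variable("w", shape=[10], initializer=tf.ones_initializer())
    loss = tf.reduce_sum(tf.square(w))

    # Magnitude (step size) from SGD, direction from Adagrad; learning_rate
    # scales the final grafted step.
    opt = adagraft.AdaGraftOptimizer(
        learning_rate=1.0,
        magnitude_optimizer=tf.train.GradientDescentOptimizer(0.1),
        direction_optimizer=tf.train.AdagradOptimizer(1.0))
    train_op = opt.minimize(loss)

    with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())
      sess.run(train_op)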

_create_slots(var_list)[source]

Create all slots needed by the variables.

Parameters

var_list – A list of Variable objects.

_prepare()[source]

Create all needed tensors before applying gradients.

This is called with the name_scope using the “name” that users have chosen for the application of gradients.

_apply_dense(grad, var)[source]

Add ops to apply dense gradients to var.

Parameters
  • grad – A Tensor.

  • var – A Variable object.

Returns

An Operation.

_resource_apply_dense(grad, var)[source]

Add ops to apply dense gradients to the variable handle.

Parameters
  • grad – A Tensor representing the gradient.

  • var – A Tensor of dtype resource which points to the variable to be updated.

Returns

An Operation which updates the value of the variable.

_internal_apply_dense(grad, var, magnitude_optimizer_apply_fn, direction_optimizer_apply_fn)[source]

Main optimization logic of AdaGraft, which calls the child optimizers.

Parameters
  • grad – Tensor containing gradients.

  • var – Tensor containing parameter values.

  • magnitude_optimizer_apply_fn – Apply magnitude optimizer.

  • direction_optimizer_apply_fn – Apply direction optimizer.

Returns

The final update op, which increments var by the grafted step.

Pseudocode:
  • Copy weights into scratch space ‘scratch_copy’.
  • Run magnitude_optimizer in-place.
  • Use the scratch copy to figure out how far we moved (‘magnitude_step’).
  • Copy weights back.
  • Run direction_optimizer in-place.
  • Move weights along the line segment with scratch_copy.
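
A short NumPy sketch (not the Lingvo implementation) of the per-layer arithmetic the steps above end up computing: the grafted update keeps the norm of the magnitude optimizer's step and the direction of the direction optimizer's step. graft_step is a hypothetical helper introduced for illustration.

    import numpy as np

    def graft_step(magnitude_step, direction_step, eps=1e-30):
      # ||magnitude_step|| * direction_step / ||direction_step||
      mag_norm = np.linalg.norm(magnitude_step)
      dir_norm = np.linalg.norm(direction_step)
      return (mag_norm / (dir_norm + eps)) * direction_step

    w = np.ones(4)
    m_step = np.full(4, 0.1)                  # step from the magnitude optimizer (norm 0.2)
    d_step = np.array([1.0, 0.0, 0.0, 0.0])   # step from the direction optimizer
    w_new = w + graft_step(m_step, d_step)    # moves 0.2 along d_step's direction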

_finish(update_ops, name_scope)[source]

Do what is needed to finish the update.

This is called with the name_scope using the “name” that users have chosen for the application of gradients.

Parameters
  • update_ops – List of Operation objects to update variables. This list contains the values returned by the _apply_dense() and _apply_sparse() calls.

  • name_scope – String. Name to use for the returned operation.

Returns

The operation to apply updates.

_resource_apply_sparse(grad_values, var, grad_indices)[source]

Add ops to apply sparse gradients to the variable handle.

Similar to _apply_sparse, the indices argument to this method has been de-duplicated. Optimizers which deal correctly with non-unique indices may instead override _resource_apply_sparse_duplicate_indices to avoid this overhead.

Parameters
  • grad_values – A Tensor representing the gradient for the affected indices.

  • var – A Tensor of dtype resource which points to the variable to be updated.

  • grad_indices – A Tensor of integral type representing the indices for which the gradient is nonzero. Indices are unique.

Returns

An Operation which updates the value of the variable.

_apply_sparse(grad, var)[source]

Add ops to apply sparse gradients to var.

The IndexedSlices object passed to grad in this function is by default pre-processed in _apply_sparse_duplicate_indices to remove duplicate indices (see its docstring for details). Optimizers which can tolerate or have correct special cases for duplicate sparse indices may override _apply_sparse_duplicate_indices instead of this function, avoiding that overhead.

Parameters
  • grad – IndexedSlices, with no repeated indices.

  • var – A Variable object.

Returns

An Operation.
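
Illustration only (not part of this module), assuming tensorflow.compat.v1: how an IndexedSlices gradient with duplicate indices arises from a gather, and why the indices are unique by the time _apply_sparse runs.

    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()

    emb = tf.get_variable("emb", shape=[100, 8])
    rows = tf.gather(emb, [3, 7, 7])        # index 7 appears twice
    loss = tf.reduce_sum(rows)
    (grad,) = tf.gradients(loss, [emb])     # grad is a tf.IndexedSlices with indices [3, 7, 7]
    # _apply_sparse_duplicate_indices sums the two rows for index 7 before
    # calling _apply_sparse, so the indices seen by _apply_sparse are unique.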