lingvo.core.adagraft module
AdaGraft optimizer: https://arxiv.org/abs/2002.11803.
- class lingvo.core.adagraft.AdaGraftOptimizer(learning_rate, magnitude_optimizer, direction_optimizer, diagnostic=False, use_global_norm=False, name='AdaGraft')[source]
Bases: Optimizer
Optimizer which combines per-layer direction and magnitude from two optimizers.
Disentangling Adaptive Gradient Methods from Learning Rates. Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang. https://arxiv.org/abs/2002.11803
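A minimal usage sketch: the child optimizers, the learning rates, and the `loss` tensor below are illustrative assumptions, and the comment on `use_global_norm` is an inference from the paper rather than a documented guarantee.

```python
import tensorflow.compat.v1 as tf
from lingvo.core import adagraft

# Illustrative only: graft the per-layer step size taken by SGD onto the
# update direction chosen by Adagrad.
magnitude_opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
direction_opt = tf.train.AdagradOptimizer(learning_rate=1.0)

opt = adagraft.AdaGraftOptimizer(
    learning_rate=1.0,                 # outer scale on the grafted step
    magnitude_optimizer=magnitude_opt,
    direction_optimizer=direction_opt,
    use_global_norm=False)             # False => per-layer norms; True presumably uses one global norm

train_op = opt.minimize(loss)          # `loss` assumed to be defined elsewhere
```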
- _create_slots(var_list)[source]
Create all slots needed by the variables.
- Parameters
var_list – A list of Variable objects.
- _prepare()[source]
Create all needed tensors before applying gradients.
This is called with the name_scope using the “name” that users have chosen for the application of gradients.
- _apply_dense(grad, var)[source]
Add ops to apply dense gradients to var.
- Parameters
grad – A Tensor.
var – A Variable object.
- Returns
An Operation.
- _resource_apply_dense(grad, var)[source]
Add ops to apply dense gradients to the variable var.
- Parameters
grad – A Tensor representing the gradient.
var – A Tensor of dtype resource which points to the variable to be updated.
- Returns
An Operation which updates the value of the variable.
- _internal_apply_dense(grad, var, magnitude_optimizer_apply_fn, direction_optimizer_apply_fn)[source]
Main optimization logic of AdaGraft, which calls the child optimizers.
- Parameters
grad – Tensor containing gradients.
var – Tensor containing parameter values.
magnitude_optimizer_apply_fn – Apply magnitude optimizer.
direction_optimizer_apply_fn – Apply direction optimizer.
- Returns
The final update op, which increments var by the grafted step.
Pseudocode:
- Copy weights into scratch space ‘scratch_copy’.
- Run magnitude_optimizer in-place.
- Use the scratch copy to measure how far we moved (‘magnitude_step’).
- Copy weights back.
- Run direction_optimizer in-place.
- Move weights along the line segment with scratch_copy.
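The grafted step follows the rule from the paper: take the direction optimizer's step, rescaled so its norm matches the magnitude optimizer's step. Below is a minimal sketch of that rescaling, assuming the two per-variable steps have already been recovered via the scratch copy; the helper name and plain-tensor arguments are hypothetical, whereas the real op works in place on the variable and its slots.

```python
import tensorflow.compat.v1 as tf

def grafted_assign(var, magnitude_step, direction_step,
                   learning_rate=1.0, eps=1e-30):
  """Hypothetical helper: apply the grafted step to `var`.

  `magnitude_step` and `direction_step` are the displacements the two child
  optimizers would each have applied to `var`.
  """
  m_norm = tf.norm(magnitude_step)   # how far the magnitude optimizer moved
  d_norm = tf.norm(direction_step)   # length of the direction optimizer's step
  # Keep the direction, borrow the length: ||M|| * D / ||D||.
  step = direction_step * (m_norm / (d_norm + eps))
  return tf.assign(var, var + learning_rate * step)
```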
- _finish(update_ops, name_scope)[source]
Do what is needed to finish the update.
This is called with the name_scope using the “name” that users have chosen for the application of gradients.
- Parameters
update_ops – List of Operation objects to update variables. This list contains the values returned by the _apply_dense() and _apply_sparse() calls.
name_scope – String. Name to use for the returned operation.
- Returns
The operation to apply updates.
- _resource_apply_sparse(grad_values, var, grad_indices)[source]
Add ops to apply sparse gradients to the variable var.
Similar to _apply_sparse, the grad_indices argument to this method has been de-duplicated. Optimizers which deal correctly with non-unique indices may instead override _resource_apply_sparse_duplicate_indices to avoid this overhead.
- Parameters
grad_values – A Tensor representing the gradient for the affected indices.
var – A Tensor of dtype resource which points to the variable to be updated.
grad_indices – A Tensor of integral type representing the indices for which the gradient is nonzero. Indices are unique.
- Returns
An Operation which updates the value of the variable.
- _apply_sparse(grad, var)[source]
Add ops to apply sparse gradients to var.
The IndexedSlices object passed to grad in this function is by default pre-processed in _apply_sparse_duplicate_indices to remove duplicate indices (see its docstring for details). Optimizers which can tolerate or have correct special cases for duplicate sparse indices may override _apply_sparse_duplicate_indices instead of this function, avoiding that overhead.
- Parameters
grad – An IndexedSlices, with no repeated indices.
var – A Variable object.
- Returns
An Operation.