lingvo.core.adagraft module
AdaGraft optimizer: https://arxiv.org/abs/2002.11803.
- class lingvo.core.adagraft.AdaGraftOptimizer(learning_rate, magnitude_optimizer, direction_optimizer, diagnostic=False, use_global_norm=False, name='AdaGraft')[source]
Bases: Optimizer
Optimizer which combines per-layer direction and magnitude from two optimizers.
Disentangling Adaptive Gradient Methods from Learning Rates Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, Cyril Zhang https://arxiv.org/abs/2002.11803
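A minimal usage sketch in TF1-style graph mode (the toy loss, child optimizers, and hyperparameters are illustrative assumptions; only the AdaGraftOptimizer signature comes from this module):

```python
import tensorflow.compat.v1 as tf
from lingvo.core import adagraft

tf.disable_eager_execution()

# Toy quadratic loss over a single weight vector.
w = tf.get_variable("w", shape=[3], initializer=tf.ones_initializer())
loss = tf.reduce_sum(tf.square(w))

# Graft the per-layer step size of Adagrad onto the update direction of SGD.
opt = adagraft.AdaGraftOptimizer(
    learning_rate=1.0,
    magnitude_optimizer=tf.train.AdagradOptimizer(0.1),          # sets step size
    direction_optimizer=tf.train.GradientDescentOptimizer(1.0))  # sets direction
train_op = opt.minimize(loss)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(5):
    sess.run(train_op)
  print(sess.run(w))
```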
- _create_slots(var_list)[source]
Create all slots needed by the variables.
- Parameters
var_list – A list of Variable objects.
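A plausible sketch of what _create_slots has to arrange, assuming the two child optimizers create their own accumulators and AdaGraft only adds the scratch slot used by _internal_apply_dense below (attribute and slot names are illustrative, not the verified implementation):

```python
def _create_slots(self, var_list):
  # Let each child optimizer create its own slots (e.g. Adagrad accumulators).
  self.magnitude_optimizer._create_slots(var_list)
  self.direction_optimizer._create_slots(var_list)
  # One extra per-variable scratch slot for stashing weights between the two
  # child updates ('scratch_copy' in the grafting pseudocode below).
  for v in var_list:
    self._zeros_slot(v, "scratch_copy", self._name)
```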
- _prepare()[source]
Create all needed tensors before applying gradients.
This is called with the name_scope using the “name” that users have chosen for the application of gradients.
- _apply_dense(grad, var)[source]
Add ops to apply dense gradients to var.
- Parameters
grad – A Tensor.
var – A Variable object.
- Returns
An Operation.
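Given the _internal_apply_dense signature documented below, the dense path presumably just forwards the child optimizers' own dense-apply methods to the shared grafting logic; a hedged sketch (attribute names are assumptions):

```python
def _apply_dense(self, grad, var):
  # Route the dense update through the shared grafting logic, passing the two
  # child optimizers' dense-apply functions as callbacks.
  return self._internal_apply_dense(
      grad, var,
      self.magnitude_optimizer._apply_dense,
      self.direction_optimizer._apply_dense)
```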
- _resource_apply_dense(grad, var)[source]
Add ops to apply dense gradients to the variable handle.
- Parameters
grad – a Tensor representing the gradient.
handle – a Tensor of dtype resource which points to the variable to be updated.
- Returns
An Operation which updates the value of the variable.
- _internal_apply_dense(grad, var, magnitude_optimizer_apply_fn, direction_optimizer_apply_fn)[source]
Main optimization logic of AdaGraft, which calls the child optimizers.
- Parameters
grad – Tensor containing gradients.
var – Tensor containing parameter values.
magnitude_optimizer_apply_fn – Function that applies the magnitude optimizer's update.
direction_optimizer_apply_fn – Function that applies the direction optimizer's update.
- Returns
The final update op, which increments var by the grafted step.
Pseudocode:
- Copy weights into scratch space ‘scratch_copy’.
- Run magnitude_optimizer in-place.
- Use the scratch copy to figure out how far we moved (‘magnitude_step’).
- Copy weights back.
- Run direction_optimizer in-place.
- Move weights along the line segment with scratch_copy.
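A self-contained NumPy sketch of the step this pseudocode produces for one layer: keep the direction optimizer's direction but rescale it to the norm of the magnitude optimizer's step (per-layer norms, i.e. the use_global_norm=False default; the toy steps and the epsilon guard are illustrative):

```python
import numpy as np

def graft_step(magnitude_step, direction_step, eps=1e-30):
  """Returns a step along direction_step with the L2 norm of magnitude_step."""
  m_norm = np.linalg.norm(magnitude_step)
  d_norm = np.linalg.norm(direction_step)
  return (m_norm / (d_norm + eps)) * direction_step

# The magnitude optimizer proposes a small step, the direction optimizer a
# larger, differently oriented one; the grafted step keeps the small norm.
w = np.array([1.0, 2.0, 3.0])
magnitude_step = np.array([0.01, 0.0, 0.01])
direction_step = np.array([0.5, -0.5, 1.0])
w_new = w + graft_step(magnitude_step, direction_step)
print(np.linalg.norm(w_new - w), np.linalg.norm(magnitude_step))  # equal norms
```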
- _finish(update_ops, name_scope)[source]
Do what is needed to finish the update.
This is called with the name_scope using the “name” that users have chosen for the application of gradients.
- Parameters
update_ops – List of Operation objects to update variables. This list contains the values returned by the _apply_dense() and _apply_sparse() calls.
name_scope – String. Name to use for the returned operation.
- Returns
The operation to apply updates.
- _resource_apply_sparse(grad_values, var, grad_indices)[source]
Add ops to apply sparse gradients to the variable handle.
Similar to _apply_sparse, the indices argument to this method has been de-duplicated. Optimizers which deal correctly with non-unique indices may instead override _resource_apply_sparse_duplicate_indices to avoid this overhead.
- Parameters
grad – a Tensor representing the gradient for the affected indices.
handle – a Tensor of dtype resource which points to the variable to be updated.
indices – a Tensor of integral type representing the indices for which the gradient is nonzero. Indices are unique.
- Returns
An Operation which updates the value of the variable.
- _apply_sparse(grad, var)[source]
Add ops to apply sparse gradients to var.
The IndexedSlices object passed to grad in this function is by default pre-processed in _apply_sparse_duplicate_indices to remove duplicate indices (see its docstring for details). Optimizers which can tolerate or have correct special cases for duplicate sparse indices may override _apply_sparse_duplicate_indices instead of this function, avoiding that overhead.
- Parameters
grad – IndexedSlices, with no repeated indices.
var – A Variable object.
- Returns
An Operation.