Optimizer¶
This module includes a set of optimizers for updating model parameters. It replaces the old optimizers from optimizer.py.
-
class
singa.opt.
Optimizer
(config)¶ Bases:
object
Base optimizer.
- Parameters
config (Dict) – specify the default values of configurable variables.
-
update
(param, grad)¶ Update the param values with given gradients.
-
step
()¶ Increments the step counter.
-
register
(param_group, config)¶
-
load
()¶
-
save
()¶
-
class
singa.opt.
SGD
(lr=0.1, momentum=0, dampening=0, weight_decay=0, nesterov=False)¶ Bases:
singa.opt.Optimizer
Implements stochastic gradient descent (optionally with momentum).
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
- Args:
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
- Typical usage example:
>>> from singa import opt
>>> optimizer = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer.update(param, grad)  # param: a parameter Tensor, grad: its gradient
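How the SGD arguments interact can be sketched on a single scalar parameter. The sketch assumes the common PyTorch-style convention (weight decay added to the gradient, dampening scaling the incoming gradient, Nesterov adding a look-ahead along the momentum buffer). The function `sgd_step` is hypothetical and is not part of singa.opt, so the library's own behaviour may differ in detail.

```python
def sgd_step(p, g, buf, lr=0.1, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """One scalar SGD step; returns (new_param, new_momentum_buffer).

    Assumed convention for illustration, not taken from the singa source.
    """
    if weight_decay != 0:
        g = g + weight_decay * p          # L2 penalty enters through the gradient
    if momentum != 0:
        buf = momentum * buf + (1 - dampening) * g
        # Nesterov looks ahead along the buffer; plain momentum uses the buffer itself.
        g = g + momentum * buf if nesterov else buf
    return p - lr * g, buf


plain_p, plain_buf = sgd_step(p=1.0, g=0.5, buf=0.0)
mom_p, mom_buf = sgd_step(p=1.0, g=0.5, buf=0.0, momentum=0.9)
nag_p, nag_buf = sgd_step(p=1.0, g=0.5, buf=0.0, momentum=0.9, nesterov=True)
```

With a zero initial buffer, plain SGD and momentum SGD take the same first step, while the Nesterov variant already moves further because of the look-ahead term.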
Note
The implementation of SGD with Momentum / Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[v = \rho * v + g\]
\[p = p - lr * v\]
where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.
This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form
\[v = \rho * v + lr * g\]
\[p = p - v\]
The Nesterov version is analogously modified.
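The difference between the two formulations in this note can be checked numerically. The following is a minimal sketch on a scalar toy objective f(p) = 0.5 * p**2; the function names are hypothetical and no singa code is involved.

```python
def momentum_step_this_module(p, v, g, lr, rho):
    """Form described in this note: the velocity accumulates raw gradients."""
    v = rho * v + g
    p = p - lr * v
    return p, v


def momentum_step_sutskever(p, v, g, lr, rho):
    """Sutskever et al. form: the learning rate is folded into the velocity."""
    v = rho * v + lr * g
    p = p - v
    return p, v


def grad(p):
    # Gradient of the toy objective f(p) = 0.5 * p**2
    return p


lr, rho = 0.1, 0.9
p_a = p_b = 1.0
v_a = v_b = 0.0
for _ in range(5):
    p_a, v_a = momentum_step_this_module(p_a, v_a, grad(p_a), lr, rho)
    p_b, v_b = momentum_step_sutskever(p_b, v_b, grad(p_b), lr, rho)
```

With a constant learning rate the two forms trace the same parameters, and the velocities differ only by the factor lr. They stop agreeing once lr changes during training, because the second form bakes earlier learning rates into the velocity.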
-
update
(param, grad)¶ Performs a single optimization step.
-
backward_and_update
(loss)¶ Performs backward propagation from the loss and parameter update.
Starting from the loss, it performs backward propagation to compute the gradients and then updates the parameters.
- Parameters
loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
-
class
singa.opt.
DistOpt
(opt=<singa.opt.SGD object>, nccl_id=None, gpu_num=None, gpu_per_node=None, buffSize=4194304)¶ Bases:
object
The class is designed to wrap an optimizer to do distributed training.
This class wraps an optimizer object to perform distributed training based on multiprocessing. Each process has an individual rank, which identifies the GPU that the process is using. The training data is partitioned, so that each process can evaluate the sub-gradient on its own partition. Once each process has calculated its sub-gradient, the overall stochastic gradient is obtained by all-reducing the sub-gradients evaluated by all processes. The all-reduce operation is supported by the NVIDIA Collective Communications Library (NCCL).
- Parameters
opt (Optimizer) – The optimizer to be wrapped.
nccl_id (NcclIdHolder) – an nccl id holder object for a unique communication id
gpu_num (int) – the GPU id in a single node
gpu_per_node (int) – the number of GPUs in a single node
buffSize (int) – the buffSize in terms of number of elements used in nccl communicator
-
world_size
¶ total number of processes
- Type
int
-
rank_in_local
¶ local rank of a process on the current node
- Type
int
-
rank_in_global
¶ global rank of a process
- Type
int
- Typical usage example:
>>> from singa import opt
>>> sgd = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer = opt.DistOpt(sgd)
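The data-parallel scheme described for this class can be simulated in a single process. The sketch below uses plain Python lists in place of GPUs and NCCL; `sub_gradient` and `all_reduce_mean` are hypothetical stand-ins, and averaging (rather than summing) the sub-gradients is an assumption made for the illustration.

```python
def sub_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)**2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every rank receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
world_size = 2
# Rank r takes every world_size-th sample, so the shards are equally sized.
shards = [data[rank::world_size] for rank in range(world_size)]

w = 0.0
local = [sub_gradient(w, shard) for shard in shards]   # one value per rank
reduced = all_reduce_mean(local)                       # identical on every rank
full_batch = sub_gradient(w, data)                     # single-process reference
```

Because the shards have equal size, the averaged sub-gradients match the gradient computed over the whole batch, which is why every rank can apply the same update.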
-
update
(param, grad)¶ Performs a single optimization step.
-
all_reduce
(tensor)¶ Performs all reduce of a tensor for distributed training.
- Parameters
tensor (Tensor) – a tensor to be all-reduced
-
fused_all_reduce
(tensor, send=True)¶ Performs all reduce of the tensors after fusing them in a buffer.
- Parameters
tensor (List of Tensors) – a list of tensors to be all-reduced
send (bool) – When send is False, the tensor won’t be sent to the target device immediately; it will be copied to the buffer first.
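The idea of fusing tensors in a buffer before a single collective call can be sketched as follows. Lists stand in for device tensors, and `fuse`, `unfuse` and `all_reduce_sum` are hypothetical helpers rather than singa functions.

```python
def fuse(tensors):
    """Concatenate flat tensors into one buffer and remember their lengths."""
    buffer = [x for t in tensors for x in t]
    return buffer, [len(t) for t in tensors]


def unfuse(buffer, lengths):
    """Split a fused buffer back into tensors of the recorded lengths."""
    out, start = [], 0
    for n in lengths:
        out.append(buffer[start:start + n])
        start += n
    return out


def all_reduce_sum(buffers):
    """Stand-in for one collective call: element-wise sum across ranks."""
    return [sum(column) for column in zip(*buffers)]


# Two ranks, each holding the same two small gradient tensors.
rank0 = [[1.0, 2.0], [3.0]]
rank1 = [[10.0, 20.0], [30.0]]

fused0, lengths = fuse(rank0)
fused1, _ = fuse(rank1)
reduced = unfuse(all_reduce_sum([fused0, fused1]), lengths)
```

Both tensors are reduced with one call instead of two, which is the latency saving that fusion aims for.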
-
all_reduce_half
(tensor)¶ Performs all reduce of a tensor after converting to FP16.
- Parameters
tensor (Tensor) – a tensor to be all-reduced
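The effect of converting values to FP16 before communication can be seen with the standard library alone. This sketch rounds Python floats to IEEE 754 half precision through `struct`; it only illustrates the precision trade-off and does not reproduce how the communicator performs the conversion.

```python
import struct


def to_half_and_back(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grads = [0.1, 1.0, 1000.5, 1e-5]
as_half = [to_half_and_back(g) for g in grads]
# Absolute rounding error introduced by the half-precision round trip.
errors = [abs(a - b) for a, b in zip(grads, as_half)]
```

Values such as 1.0 survive unchanged, while 0.1 and very small gradients pick up a rounding error. Halving the bytes sent is paid for with this loss of precision.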
-
fused_all_reduce_half
(tensor, send=True)¶ Performs all reduce of the tensors after fusing and converting them to FP16.
- Parameters
tensor (List of Tensors) – a list of tensors to be all-reduced
send (bool) – When send is False, the tensor won’t be send to the
device immediately (target) –
will be copied to the buffer first (it) –
-
sparsification
(tensor, accumulation, spars, topK)¶ Performs all reduce of a tensor after sparsification.
- Parameters
tensor (Tensor) – a tensor to be all-reduced
accumulation (Tensor) – local gradient accumulation
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient to the elements with absolute value >= spars. When topK is True, it sparsifies to a fraction of the total number of gradient elements equal to spars. E.g. when spars = 0.01, it sparsifies to 1 % of the total gradient elements.
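The two selection modes controlled by spars and topK can be sketched on a flat list. The helper `sparsify` is hypothetical; keeping at least one element in the topK case and breaking ties by sort order are assumptions of this sketch.

```python
def sparsify(grad, spars, topK):
    """Return the (index, value) pairs that would be communicated."""
    if topK:
        # Keep the largest-magnitude fraction `spars` of the elements.
        k = max(1, int(len(grad) * spars))
        order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
        keep = sorted(order[:k])
    else:
        # Keep every element whose absolute value reaches the threshold.
        keep = [i for i, g in enumerate(grad) if abs(g) >= spars]
    return [(i, grad[i]) for i in keep]


grad = [0.02, -0.5, 0.001, 0.3, -0.04, 0.0, 0.07, -0.9, 0.01, 0.2]
by_threshold = sparsify(grad, spars=0.1, topK=False)
by_fraction = sparsify(grad, spars=0.2, topK=True)
```

Here the threshold mode keeps four of the ten elements, while the fraction mode with spars = 0.2 keeps exactly the two largest in magnitude.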
-
fused_sparsification
(tensor, accumulation, spars, topK)¶ Performs all reduce of the tensors after fusing and sparsification.
- Parameters
tensor (List of Tensors) – a list of tensors to be all-reduced
accumulation (Tensor) – local gradient accumulation
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient to the elements with absolute value >= spars. When topK is True, it sparsifies to a fraction of the total number of gradient elements equal to spars. E.g. when spars = 0.01, it sparsifies to 1 % of the total gradient elements.
-
wait
()¶ Wait for the cuda streams used by the communicator to finish their operations.
-
backward_and_update
(loss, threshold=2097152)¶ Performs backward propagation from the loss and parameter update.
Starting from the loss, it performs backward propagation to compute the gradients and then updates the parameters. For gradient communication, it fuses all tensors smaller than the threshold value to reduce network latency.
- Parameters
loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
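The role of the threshold can be sketched as a planning step over tensor sizes. The sizes below are made up, `plan_communication` is a hypothetical helper, and two simplifications are assumed: a tensor whose size equals the threshold is treated as large, and all small tensors share a single fused buffer.

```python
def plan_communication(sizes, threshold):
    """Split tensor indices into a fused group and a direct group by size."""
    fused = [i for i, n in enumerate(sizes) if n < threshold]
    direct = [i for i, n in enumerate(sizes) if n >= threshold]
    return fused, direct


# Hypothetical gradient sizes in elements, e.g. biases and weight matrices.
sizes = [64, 4096, 3_000_000, 512, 2_359_296]
fused, direct = plan_communication(sizes, threshold=2097152)

calls_without_fusion = len(sizes)
calls_with_fusion = len(direct) + (1 if fused else 0)
```

Under these assumptions the five gradients need three collective calls instead of five. A larger threshold fuses more tensors at the cost of a larger buffer.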
-
backward_and_update_half
(loss, threshold=2097152, clipping=False, clip_Value=100)¶ Performs backward propagation and parameter update, with FP16 precision communication.
THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: Starting from the loss, it performs backward propagation to compute the gradients and then updates the parameters. For gradient communication, it fuses all tensors smaller than the threshold value to reduce network latency, and converts them to FP16 half-precision format before sending them out. To assist training, this function provides an option to perform gradient clipping.
- Parameters
loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
clipping (bool) – a boolean flag to choose whether to clip the gradient value
clip_Value (float) – the clip value to be used when clipping is True
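Gradient clipping before a half-precision conversion can be sketched as an element-wise clamp. Treating clipping as a clamp to [-clip_Value, clip_Value] is an assumption of this sketch, and `clip` is a hypothetical helper rather than the library routine.

```python
def clip(values, clip_value):
    """Clamp every element to the interval [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, v)) for v in values]


FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

grad = [0.5, -250.0, 99.0, 1e6]
clipped = clip(grad, clip_value=100)
# After clipping, every element fits comfortably in the FP16 range.
fits_in_half = all(abs(v) <= FP16_MAX for v in clipped)
```

The outlier 1e6 would not be representable in FP16, which is one reason a clipping option is useful alongside half-precision communication.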
-
backward_and_partial_update
(loss, threshold=2097152)¶ Performs backward propagation from the loss and parameter update using asynchronous training.
THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: Starting from the loss, it performs backward propagation to compute the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and performs asynchronous training in which one parameter partition is all-reduced per iteration. The size of the parameter partition depends on the threshold value.
- Parameters
loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
-
self.
partial
¶ A counter to determine which partition to perform all-reduce.
- Type
int
-
This counter resets to zero automatically after an update cycle of the full parameter set.
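The behaviour of a partition counter that wraps around after a full cycle can be sketched as a round-robin schedule. The class `PartialSchedule` and the fixed number of partitions are hypothetical; in the method described here the partitioning depends on the threshold value.

```python
class PartialSchedule:
    """Round-robin choice of which parameter partition to all-reduce."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partial = 0  # plays the role of the `partial` counter

    def next_partition(self):
        current = self.partial
        # Wrap around once every partition has been synchronised.
        self.partial = (self.partial + 1) % self.num_partitions
        return current


schedule = PartialSchedule(num_partitions=3)
visited = [schedule.next_partition() for _ in range(7)]
```

Each iteration synchronises one partition, so with three partitions every parameter is all-reduced once every three iterations.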
-
backward_and_spars_update
(loss, threshold=2097152, spars=0.05, topK=False, corr=True)¶ Performs backward propagation from the loss and parameter update with sparsification.
THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: Starting from the loss, it performs backward propagation to compute the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and uses sparsification schemes to transfer only the significant gradient elements.
- Parameters
loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient to the elements with absolute value >= spars. When topK is True, it sparsifies to a fraction of the total number of gradient elements equal to spars. E.g. when spars = 0.01, it sparsifies to 1 % of the total gradient elements.
corr (bool) – whether to use the locally accumulated gradient for correction
-
self.
sparsInit
¶ A counter to determine which partition to perform all-reduce.
-
self.
gradAccumulation
¶ Local gradient accumulation
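How a local gradient accumulation can correct for elements dropped by sparsification is sketched below with threshold-based selection. The helper `spars_step` is hypothetical, and modelling corr=False as simply not adding the accumulation back is an assumption, not a description of the library code.

```python
def spars_step(grad, accumulation, spars, corr=True):
    """Return (sent, new_accumulation) for one threshold-sparsified step."""
    if corr:
        # Add back what earlier steps left unsent.
        grad = [g + a for g, a in zip(grad, accumulation)]
    sent = [g if abs(g) >= spars else 0.0 for g in grad]
    # Whatever was not sent stays in the local accumulation.
    residual = [g - s for g, s in zip(grad, sent)]
    return sent, residual


accumulation = [0.0, 0.0]
history = []
for _ in range(3):
    sent, accumulation = spars_step([0.04, 0.2], accumulation, spars=0.1)
    history.append(sent)
```

The first element is below the threshold on its own, so it is withheld for two steps. On the third step the accumulated value crosses the threshold and is sent, so small gradients are delayed rather than lost.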