(More to be added later.)
Next, let's look at the tensor quantization functions. The code lives in tools\pytorch-quantization\pytorch_quantization\tensor_quant.py
ScaledQuantDescriptor
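The supported descriptor for quantization: it describes how a tensor should be quantized. A QuantDescriptor together with a tensor defines a quantized tensor.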
class ScaledQuantDescriptor():
    def __init__(self, num_bits=8, name=None, **kwargs):
        if not isinstance(num_bits, int):
            raise TypeError("num_bits must be an integer, not {}.".format(type(num_bits)))
        if num_bits < 0:
            raise ValueError("num_bits must be >= 0, not {}.".format(num_bits))
        if num_bits == 0:
            logging.error("num_bits is 0. This will result in the tensor being quantized to all zeros."
                          " This mode should only be used for debugging purposes.")
        self._num_bits = num_bits
        if not isinstance(name, str) and name is not None:
            raise TypeError("name must be a string or None, not {}.".format(type(name)))
        self._name = name
        self._fake_quant = kwargs.pop('fake_quant', True)
        self._axis = kwargs.pop('axis', None)
        if self._axis is not None:
            logging.debug("Meaning of axis has changed since v2.0. Make sure to update.")
        self._learn_amax = kwargs.pop('learn_amax', False)
        if self._learn_amax and self._axis is not None:
            raise TypeError(
                "axis is ignored and must be None when learn_amax is true, got {}.".format(type(self._axis)))
        amax = kwargs.pop('amax', None)
        if amax is not None:
            if not isinstance(amax, float) and not isinstance(
                    amax, list) and not isinstance(amax, np.ndarray):
                raise TypeError("amax must be float, list or ndarray, not {}".format(type(amax)))
            # Make it single precision array
            self._amax = np.array(amax, dtype=np.float32)
        else:
            self._amax = amax
        self._scale_amax = kwargs.pop('scale_amax', None)
        self._calib_method = kwargs.pop('calib_method', "max")
        self._unsigned = kwargs.pop('unsigned', False)
        self._narrow_range = kwargs.pop('narrow_range', False)
        if kwargs:
            raise TypeError("Unused keys: {}".format(kwargs.keys()))
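Args:
- num_bits: int. Number of quantization bits, used to compute the scaling factor. Default 8.
- name: str or None. An optional name for the descriptor. Default None.
Keyword args:
- fake_quant: bool. If True, use fake quantization mode. Default True.
- axis: None, int, or tuple of int. The axes that each keep their own max for computing the scaling factor. If None (the default), a per-tensor scale is used. Values must be in the range [-rank(input_tensor), rank(input_tensor)). For example, for a KCRS weight tensor, axis=(0) yields per-channel scaling. Default None.
- amax: float, list, or ndarray. A user-specified absolute max range. If supplied, axis is ignored and this value is used for quantization. If learn_amax is True, it is used to initialize the learnable amax. Default None.
- learn_amax: bool. If True, amax is learned. Default False.
- scale_amax: float. If supplied, amax is multiplied by scale_amax. Default None.
- calib_method: str. One of ["max", "histogram"], selecting the calibration method. Every method other than max calibration is histogram-based. Default "max".
- unsigned: bool. If True, use unsigned quantization. Default False.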
Raises:
- TypeError: if an unsupported type is passed in.
Read-only properties:
- fake_quant
- name
- learn_amax
- scale_amax
- axis
- calib_method
- num_bits
- amax
- unsigned
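The constructor pattern above (pop the keyword arguments you understand, reject whatever is left, expose the stored values as read-only properties) can be sketched in a few lines. `MiniDescriptor` is a hypothetical stand-in written for illustration, not the library class, and it keeps only three of the fields.

```python
class MiniDescriptor:
    """Hypothetical, stripped-down stand-in for ScaledQuantDescriptor."""

    def __init__(self, num_bits=8, **kwargs):
        if not isinstance(num_bits, int):
            raise TypeError("num_bits must be an integer, not {}.".format(type(num_bits)))
        if num_bits < 0:
            raise ValueError("num_bits must be >= 0, not {}.".format(num_bits))
        self._num_bits = num_bits
        # Pop the keys we understand; anything left over is a caller mistake.
        self._axis = kwargs.pop("axis", None)
        self._unsigned = kwargs.pop("unsigned", False)
        if kwargs:
            raise TypeError("Unused keys: {}".format(list(kwargs.keys())))

    # Properties without setters make the fields read-only.
    @property
    def num_bits(self):
        return self._num_bits

    @property
    def axis(self):
        return self._axis

    @property
    def unsigned(self):
        return self._unsigned


desc = MiniDescriptor(num_bits=4, axis=0, unsigned=True)
```

Assigning to `desc.num_bits` raises AttributeError, and a misspelled keyword raises TypeError instead of being silently dropped.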
A QuantDescriptor defines how a tensor should be quantized. The predefined QuantDescriptor instances are:
QuantDescriptor = ScaledQuantDescriptor
# Predefined descriptors
QUANT_DESC_8BIT_PER_TENSOR = QuantDescriptor(num_bits=8)
QUANT_DESC_UNSIGNED_8BIT_PER_TENSOR = QuantDescriptor(num_bits=8, unsigned=True)
QUANT_DESC_8BIT_CONV1D_WEIGHT_PER_CHANNEL = QuantDescriptor(num_bits=8, axis=(0))
QUANT_DESC_8BIT_CONV2D_WEIGHT_PER_CHANNEL = QuantDescriptor(num_bits=8, axis=(0))
QUANT_DESC_8BIT_CONV3D_WEIGHT_PER_CHANNEL = QuantDescriptor(num_bits=8, axis=(0))
QUANT_DESC_8BIT_LINEAR_WEIGHT_PER_ROW = QuantDescriptor(num_bits=8, axis=(0))
QUANT_DESC_8BIT_CONVTRANSPOSE1D_WEIGHT_PER_CHANNEL = QuantDescriptor(num_bits=8, axis=(0))
QUANT_DESC_8BIT_CONVTRANSPOSE2D_WEIGHT_PER_CHANNEL = QuantDescriptor(num_bits=8, axis=(0))
QUANT_DESC_8BIT_CONVTRANSPOSE3D_WEIGHT_PER_CHANNEL = QuantDescriptor(num_bits=8, axis=(0))
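If amax is given in the QuantDescriptor, TensorQuantizer uses it to quantize. Otherwise, TensorQuantizer computes amax first and then quantizes. amax is computed with respect to the specified axis. Note that the axis of a QuantDescriptor names the axes that remain, as opposed to the axis argument of max(), which names the axes that are reduced.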
Example:
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization.nn.modules.tensor_quantizer import TensorQuantizer
quant_desc = QuantDescriptor(num_bits=4, fake_quant=False, axis=(0), unsigned=True)
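To make the axis convention concrete, here is a plain-Python sketch (no PyTorch) of what per-row amax means for a 2-D weight with axis=0: axis 0 is kept and every other axis is reduced. `per_row_amax`, `per_tensor_amax` and the sample weights are made up for illustration; they are not part of pytorch_quantization.

```python
def per_row_amax(weight):
    """Return one absolute max per row, i.e. keep axis 0 and reduce axis 1."""
    return [max(abs(v) for v in row) for row in weight]


def per_tensor_amax(weight):
    """Return a single absolute max over the whole tensor (axis=None)."""
    return max(abs(v) for row in weight for v in row)


# Hypothetical 3x4 linear-layer weight (3 output rows).
weight = [
    [0.10, -0.50, 0.25, 0.00],
    [2.00, -1.00, 0.50, 0.75],
    [-0.02, 0.01, 0.03, -0.04],
]

row_amax = per_row_amax(weight)        # one amax per output row
tensor_amax = per_tensor_amax(weight)  # one amax for everything

# With 8-bit signed narrow-range quantization, scale = 127 / amax.
row_scales = [127.0 / a for a in row_amax]
```

The third row has small values, so it gets a large scale of its own instead of sharing the scale implied by the per-tensor amax of 2.0. That is the motivation for per-channel scaling.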
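Next, the quantization functions. pytorch_quantization provides three custom tensor quantization operators. They subclass torch.autograd.Function and implement the forward and backward passes of the function.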
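Quantization-aware training quantizes during training (quantization during training is covered later). The operation is really a quantize step immediately followed by a dequantize step. Concretely, the forward pass simulates quantization: weights and activations are first quantized to 8 bits and then dequantized back to 32-bit values that carry the quantization error. Training as a whole stays in floating point. In the backward pass, the gradient obtained is the gradient with respect to the fake-quantized weights, and that gradient is used to update the original, unquantized weights.
In the backward pass, the gradients of the weights are used to update the floating-point weights. The quantization function maps a continuous range of floats onto discrete points, so its gradient is zero almost everywhere and undefined at the jump points. A useful gradient therefore cannot be computed directly, and the backward pass needs an approximation of the quantizer. An approximation that works well in practice is the straight-through estimator (STE):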
$$x_{out} = \mathrm{clamp}(x_{min}, x_{max}, x)$$
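A minimal scalar sketch of this idea, in plain Python rather than as a torch.autograd.Function. The forward pass rounds (quantize, then dequantize), and the backward pass pretends the rounding was the identity inside [-amax, amax] and blocks the gradient outside it. The function names are illustrative only, and signed narrow-range quantization is assumed.

```python
def fake_quant_forward(x, amax, num_bits=8):
    """Quantize x to a signed integer grid and map it back to float."""
    max_bound = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    scale = max_bound / amax
    q = round(x * scale)
    q = max(-max_bound, min(max_bound, q))       # clamp to [-127, 127]
    return q / scale


def ste_backward(x, amax, grad_output):
    """Straight-through estimator: pass the gradient where |x| <= amax, else 0."""
    return grad_output if abs(x) <= amax else 0.0


amax = 1.0
inside = fake_quant_forward(0.3, amax)    # close to 0.3, snapped to the grid
outside = fake_quant_forward(1.7, amax)   # saturates at amax

grad_inside = ste_backward(0.3, amax, grad_output=2.5)    # passes through
grad_outside = ste_backward(1.7, amax, grad_output=2.5)   # blocked
```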
TensorQuantFunction
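- TensorQuantFunction: the general-purpose tensor quantization function
class TensorQuantFunction(Function):
    """
    Takes an input tensor and outputs a quantized tensor. The granularity of `scale` can be inferred from the shape of amax.
    """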
forward
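In the forward pass, the floating-point weights and activations are fake-quantized, and these fake-quantized weights and activations are used to perform the layer's operation.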
    @staticmethod
    def forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True):
        ctx.save_for_backward(inputs, amax)
        outputs, scale = _tensor_quant(inputs, amax, num_bits, unsigned, narrow_range)
        # Check if scale overflows FP16
        if outputs.dtype == torch.half and scale.max() > 65504:
            raise ValueError("scale is too large for FP16 with amax={}".format(amax))
        return outputs, scale.to(inputs.dtype)
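The FP16 check in forward guards against a scale that half precision cannot represent. Since the scale is max_bound / amax, a very small amax produces a very large scale. Below is a rough plain-Python estimate of where that threshold sits for 8-bit signed quantization. The helper name is made up for this sketch, and unlike the library code it does not look at the tensor dtype.

```python
FP16_MAX = 65504.0  # largest finite float16 value


def scale_overflows_fp16(amax, num_bits=8):
    """Return True if max_bound / amax exceeds the FP16 range."""
    max_bound = 2 ** (num_bits - 1) - 1
    return max_bound / amax > FP16_MAX


# Smallest amax whose scale still fits in FP16 for num_bits=8.
threshold = 127 / FP16_MAX          # roughly 1.94e-3

ok = scale_overflows_fp16(0.01)     # scale is about 12700, fits
bad = scale_overflows_fp16(0.001)   # scale is about 127000, too large
```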
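output_dtype indicates whether the quantized values are stored as integers or as floats. The reason to store them as floats is that the PyTorch function that consumes the quantized values, for example Conv2D, may not accept integer input.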
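It uses $2^{num\_bits}-1$ values rather than $2^{num\_bits}$. For example, with num_bits=8 it uses [-127, 127] instead of [-128, 127].
Following the TensorFlow convention, the max value is passed in and used to determine the scale, rather than passing the scale in directly, even though passing the scale directly might be more natural.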
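The integer range implied by num_bits, unsigned and narrow_range can be worked out by hand. The helper below mirrors the bound arithmetic of _tensor_quant as a plain-Python sketch; `int_bounds` is a name chosen here for illustration.

```python
def int_bounds(num_bits=8, unsigned=False, narrow_range=True):
    """Return (min_bound, max_bound) of the integer grid."""
    max_bound = 2 ** (num_bits - 1 + int(unsigned)) - 1
    if unsigned:
        min_bound = 0
    elif narrow_range:
        min_bound = -max_bound
    else:
        min_bound = -max_bound - 1
    return min_bound, max_bound


signed_narrow = int_bounds(8)                       # (-127, 127)
signed_full = int_bounds(8, narrow_range=False)     # (-128, 127)
unsigned_8bit = int_bounds(8, unsigned=True)        # (0, 255)

# Number of distinct integer levels in the narrow signed range: 2**8 - 1.
levels = signed_narrow[1] - signed_narrow[0] + 1    # 255

# The scale is derived from amax rather than passed in directly.
amax = 0.5
scale = signed_narrow[1] / amax                     # 254.0
```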
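Args:
- ctx: a Context object used to store tensors for the backward pass.
- inputs: a float32 tensor.
- amax: a float32 tensor. The inputs are quantized within the range [-amax, amax]; amax is broadcast to the shape of the inputs tensor.
- num_bits: an integer used to compute the scaling factor, $scale = (2^{num\_bits-1}-1)/max$. Default 8.
- output_dtype: a tensor dtype, torch.int32 or torch.float32. Storing as float is preferred because the PyTorch function that takes the quantized values may not accept integer input. (The forward signature shown above does not take this argument.)
- unsigned: bool. Use the unsigned integer range, for example [0, 255] for num_bits=8. Default False.
- narrow_range: bool. Use a symmetric integer range for signed quantization, for example [-127, 127] instead of [-128, 127] for num_bits=8. Default True.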
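Returns:
- outputs: a tensor of type output_dtype.
- scale: a float32 tensor. outputs / scale dequantizes the output tensor.
Raises:
- ValueError: if the outputs are FP16 and scale exceeds 65504, the largest finite FP16 value.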
backward
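Implements straight-through estimation with clipping. For -amax <= input <= amax the gradient passes straight through; otherwise the gradient is zero.
Args:
- ctx: a Context object holding the tensors saved in forward.
- grad_outputs: the gradient tensor of outputs.
- grad_scale: the gradient tensor of scale.
Returns:
- grad_inputs: the gradient tensor.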
    @staticmethod
    def backward(ctx, grad_outputs, grad_scale):
        """
        Implements straight through estimation with clipping. For -amax <= input <= amax
        the gradient passes straight through, otherwise the gradient is zero.
        Args:
            ctx: A Context object with saved tensors from forward.
            grad_outputs: A tensor of gradient of outputs.
            grad_scale: A tensor of gradient of scale.
        Returns:
            grad_inputs: A tensor of gradient.
        """
        inputs, amax = ctx.saved_tensors
        zero = grad_outputs.new_zeros(1)  # create a zero tensor with the same type and device
        grad_inputs = torch.where(inputs.abs() <= amax, grad_outputs, zero)
        return grad_inputs, None, None, None, None
tensor_quant = TensorQuantFunction.apply
TensorQuantFunction.apply is given the alias tensor_quant, so tensor_quant can be called directly to quantize, for example:
import torch
from pytorch_quantization import tensor_quant
# Generate random input. With fixed seed 12345, x should be
# tensor([0.9817, 0.8796, 0.9921, 0.4611, 0.0832, 0.1784, 0.3674, 0.5676, 0.3376, 0.2119])
torch.manual_seed(12345)
x = torch.rand(10)
# quantize tensor x. quant_x will be
# tensor([126., 113., 127., 59., 11., 23., 47., 73., 43., 27.])
# with scale=128.0057
quant_x, scale = tensor_quant.tensor_quant(x, x.abs().max())
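The numbers in the comments can be checked by hand. The sketch below redoes the arithmetic in plain Python on the printed 4-decimal values of x, so the scale comes out near 128.0057 rather than exactly at it. It does not call pytorch_quantization.

```python
x = [0.9817, 0.8796, 0.9921, 0.4611, 0.0832,
     0.1784, 0.3674, 0.5676, 0.3376, 0.2119]

amax = max(abs(v) for v in x)      # 0.9921
max_bound = 2 ** (8 - 1) - 1       # 127, signed narrow range
scale = max_bound / amax           # about 128.01

# Round to the nearest integer and clamp to [-127, 127].
quant_x = [max(-max_bound, min(max_bound, round(v * scale))) for v in x]
# quant_x == [126, 113, 127, 59, 11, 23, 47, 73, 43, 27]

# Dividing by scale maps the integers back to approximate floats.
dequant_x = [q / scale for q in quant_x]
```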
FakeTensorQuantFunction
class FakeTensorQuantFunction(Function):
    """Fake version of TensorQuantFunction
    See comments of TensorQuantFunction, arguments are the same.
    """
    @staticmethod
    def forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True):
        ctx.save_for_backward(inputs, amax)
        outputs, scale = _tensor_quant(inputs, amax, num_bits, unsigned, narrow_range)
        return outputs / scale.to(inputs.dtype)

    @staticmethod
    def backward(ctx, grad_outputs):
        inputs, amax = ctx.saved_tensors
        zero = grad_outputs.new_zeros(1)
        grad_inputs = torch.where(inputs.abs() <= amax, grad_outputs, zero)
        return grad_inputs, None, None, None, None
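In the backward pass, the gradients of the weights are used to update the floating-point weights. Because the gradient of the quantization op is zero almost everywhere and undefined at the jump points, the straight-through estimator (STE) is used: it passes the gradient through the fake-quantization operator.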
fake_tensor_quant = FakeTensorQuantFunction.apply
FakeTensorQuantFunction.apply is given the alias fake_tensor_quant, so fake_tensor_quant can be called directly to fake-quantize, for example:
import torch
from pytorch_quantization import tensor_quant
# Generate random input. With fixed seed 12345, x should be
# tensor([0.9817, 0.8796, 0.9921, 0.4611, 0.0832, 0.1784, 0.3674, 0.5676, 0.3376, 0.2119])
torch.manual_seed(12345)
x = torch.rand(10)
# fake quantize tensor x. fake_quant_x will be
# tensor([0.9843, 0.8828, 0.9921, 0.4609, 0.0859, 0.1797, 0.3672, 0.5703, 0.3359, 0.2109])
fake_quant_x = tensor_quant.fake_tensor_quant(x, x.abs().max())
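As a rough illustration of how num_bits affects fake quantization, the plain-Python sketch below quantizes and dequantizes the same printed values of x at 8 bits and at 4 bits and compares the worst-case rounding error. `fake_quant` here is a stand-in written for this example (signed narrow range assumed), not the library function.

```python
def fake_quant(values, amax, num_bits=8):
    """Quantize to the signed narrow-range grid, then dequantize."""
    max_bound = 2 ** (num_bits - 1) - 1
    scale = max_bound / amax
    out = []
    for v in values:
        q = max(-max_bound, min(max_bound, round(v * scale)))
        out.append(q / scale)
    return out


x = [0.9817, 0.8796, 0.9921, 0.4611, 0.0832,
     0.1784, 0.3674, 0.5676, 0.3376, 0.2119]
amax = max(abs(v) for v in x)

fake_quant_x8 = fake_quant(x, amax, num_bits=8)   # grid step amax / 127
fake_quant_x4 = fake_quant(x, amax, num_bits=4)   # grid step amax / 7

# Inside [-amax, amax] the rounding error is at most half a grid step.
max_error_8 = max(abs(a - b) for a, b in zip(x, fake_quant_x8))
max_error_4 = max(abs(a - b) for a, b in zip(x, fake_quant_x4))
```

The 8-bit result agrees with the values in the comment above to about three decimals; the small differences come from starting with the rounded printout of x.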
_tensor_quant
def _tensor_quant(inputs, amax, num_bits=8, unsigned=False, narrow_range=True):
    """Shared function body between TensorQuantFunction and FakeTensorQuantFunction"""
    # Fine scale, per channel scale will be handled by broadcasting, which could be tricky. Pop a warning.
    if isinstance(amax, torch.Tensor) and inputs.dim() != amax.dim():
        logging.debug("amax %s has different shape than inputs %s. Make sure broadcast works as expected!",
                      amax.size(), inputs.size())
    logging.debug("{} bits quantization on shape {} tensor.".format(num_bits, inputs.size()))
    if unsigned:
        if inputs.min() < 0.:
            raise TypeError("Negative values encountered in unsigned quantization.")
    # Computation must be in FP32 to prevent potential over flow.
    input_dtype = inputs.dtype
    if inputs.dtype == torch.half:
        inputs = inputs.float()
    if amax.dtype == torch.half:
        amax = amax.float()
    min_amax = amax.min()
    if min_amax < 0:
        raise ValueError("Negative values in amax")
    max_bound = torch.tensor((2.0**(num_bits - 1 + int(unsigned))) - 1.0, device=amax.device)
    if unsigned:
        min_bound = 0
    elif narrow_range:
        min_bound = -max_bound
    else:
        min_bound = -max_bound - 1
    scale = max_bound / amax
    epsilon = 1. / (1<<24)
    if min_amax <= epsilon:  # Treat amax smaller than minimum representable of fp16 0
        zero_amax_mask = (amax <= epsilon)
        scale[zero_amax_mask] = 0  # Value quantized with amax=0 should all be 0
    outputs = torch.clamp((inputs * scale).round_(), min_bound, max_bound)
    if min_amax <= epsilon:
        scale[zero_amax_mask] = 1.  # Return 1 makes more sense for values quantized to 0 with amax=0
    if input_dtype == torch.half:
        outputs = outputs.half()
    return outputs, scale
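The two-step treatment of a near-zero amax is easy to miss: the scale is forced to 0 while quantizing, so the outputs are all 0, and is then reported as 1, so a later outputs / scale stays well defined. Below is a scalar plain-Python sketch of just that branch, with a made-up function name, assuming signed narrow-range quantization.

```python
EPSILON = 1.0 / (1 << 24)  # same threshold value as in _tensor_quant


def quant_scalar(x, amax, num_bits=8):
    """Scalar sketch of _tensor_quant for the signed, narrow-range case."""
    if amax < 0:
        raise ValueError("Negative values in amax")
    max_bound = 2 ** (num_bits - 1) - 1
    if amax <= EPSILON:
        # amax is effectively zero: quantize everything to 0 ...
        scale_for_quant = 0.0
        # ... but report scale = 1 so that output / scale stays finite.
        scale_returned = 1.0
    else:
        scale_for_quant = scale_returned = max_bound / amax
    q = round(x * scale_for_quant)
    q = max(-max_bound, min(max_bound, q))
    return q, scale_returned


normal = quant_scalar(0.3, amax=1.0)      # (38, 127.0)
degenerate = quant_scalar(0.3, amax=0.0)  # (0, 1.0)
```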
(Still to be tidied up.)