- Unified quantization API: each quantization function now returns `Q, S` where `Q` is the quantized tensor and `S` the quantization state which may hold absolute max values, a quantization map or more. For dequantization all functions now accept the inputs `Q, S` so that `Q` is dequantized with the quantization state `S`.
- Fixed an issue where the CUDA 11.1 binary was not compiled with the right headers
API changes:
- Block-wise quantization for optimizers now enabled by default
Features:
- Block-wise quantization routines now support CPU Tensors.
- Introduced ModuleConfig overrides which can be seamlessly be used at initialization time of a module.
- Added `bnb.nn.Embedding` layer which runs at 32-bit but without the layernorm. This works well if you need to fine-tune pretrained models that do not have a embedding layer norm. #19
- Fixed a bug where the StableEmbedding layer 32-bit optimizer override would not work without registering the whole model first (`bnb.optim.GlobalOptimManager.get_instance().register_parameters(model.parameters())`). #13#15
- Added 8-bit matrix multiplication form cuBLAS, and cuBLASLt as well as multiple GEMM kernels (GEMM, GEMMEx, GEMMLt)
- Added 8-bit Linear layers with 8-bit Params that perform memory efficient inference with an option for 8-bit mixed precision matrix decomposition for inference without performance degradation
- Added quantization methods for "fake" quantization as well as optimized kernels vector-wise quantization and equalization as well as optimized cuBLASLt transformations
- CPU only build now available (Thank you, @mryab)
- Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov
- Int8 now supported on all GPUs. On devices with compute capability <7.5,theIntweightsarecastto16/32-bitforthematrixmultiplication.Contributedby@borzunov
Improvements:
- Improved logging for the CUDA detection mechanism.
- Support for 32 and 8-bit Lion has been added. Thank you @lucidrains
- Support for serialization of Linear8bitLt layers (LLM.int8()). This allows to store and load 8-bit weights directly from the HuggingFace Hub. Thank you @myrab
- Fixed a bug where 8-bit models consumed twice the memory as expected after serialization
Deprecated:
- Kepler binaries (GTX 700s and Tesla K40/K80) are not longer provided via pip and need to be compiled from source. Kepler support might be fully removed in the future.
- CUDA SETUP now no longer looks for libcuda and libcudart and relies PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk
- Fixed a bug where prefetchAsync lead to errors on GPUs that do not support unified memory but not prefetching (Maxwell, SM52). #470#451#453#477 Thank you @jllllll and @stoperro
- Added precompiled CUDA 11.8 binaries to support H100 GPUs without compilation #571
- CUDA SETUP now no longer looks for libcuda and libcudart and relies PyTorch CUDA libraries. To manually override this behavior see: how_to_use_nonpytorch_cuda.md. Thank you @rapsealk
Bug fixes:
- Fixed a bug where the default type of absmax was undefined which leads to errors if the default type is different than torch.float32. # 553
- Fixed a missing scipy dependency in requirements.txt. #544
- Fixed a bug, where a view operation could cause an error in 8-bit layers.
- Fixed a bug where CPU bitsandbytes would during the import. #593 Thank you @bilelomrani
- Fixed a but where a non-existent LD_LIBRARY_PATH variable led to a failure in python -m bitsandbytes #588
- Removed outdated get_cuda_lib_handle calls that lead to errors. #595 Thank you @ihsanturk
- Fixed bug where read-permission was assumed for a file. #497
- Fixed a bug where prefetchAsync lead to errors on GPUs that do not support unified memory but not prefetching (Maxwell, SM52). #470#451#453#477 Thank you @jllllll and @stoperro
Documentation:
- Improved documentation for GPUs that do not support 8-bit matmul. #529
- Added description and pointers for the NF4 data type. #543
User experience:
- Improved handling of default compute_dtype for Linear4bit Layers, so that compute_dtype = input_dtype if the input data type is stable enough (float32, bfloat16, but not float16).
Performance:
- improved 4-bit inference performance for A100 GPUs. This degraded performance for A40/RTX3090 and RTX 4090 GPUs slightly.