TorchScale is a PyTorch library that allows researchers and developers to scale up Transformers efficiently and effectively.
It has the implementation of fundamental research to improve modeling generality and capability, as well as training stability and efficiency of scaling Transformers.
It takes only several lines of code to create a model with the above fundamental research features enabled. Here is how to quickly obtain a BERT-like encoder:
```python
>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder
>>> config = EncoderConfig(vocab_size=64000)
>>> model = Encoder(config)
>>> print(model)
```
We also support the `Decoder` architecture and the `EncoderDecoder` architecture:
```python
# Creating a decoder model
>>> from torchscale.architecture.config import DecoderConfig
>>> from torchscale.architecture.decoder import Decoder
>>> config = DecoderConfig(vocab_size=64000)
>>> decoder = Decoder(config)
>>> print(decoder)
# Creating a encoder-decoder model
>>> from torchscale.architecture.config import EncoderDecoderConfig
>>> from torchscale.architecture.encoder_decoder import EncoderDecoder
We plan to provide more examples regarding different tasks (e.g. vision pretraining and speech recognition) and various deep learning toolkits (e.g. [DeepSpeed](https://github.com/microsoft/DeepSpeed) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)). Any comments or PRs are welcome!
TorchScale supports arbitrary depths and widths, successfully scaling-up the models without pain.
## Acknowledgments
Some implementations in TorchScale are either adapted from or inspired by the [FairSeq](https://github.com/facebookresearch/fairseq) repository and the [UniLM](https://github.com/microsoft/unilm) repository.
## Citations
If you find this repository useful, please consider citing our work:
```
@article{deepnet,
author = {Hongyu Wang and
Shuming Ma and
Li Dong and
Shaohan Huang and
Dongdong Zhang and
Furu Wei},
title = {{DeepNet}: Scaling Transformers to 1,000 Layers},
title={On the Representation Collapse of Sparse Mixture of Experts},
author={Zewen Chi and Li Dong and Shaohan Huang and Damai Dai and Shuming Ma and Barun Patra and Saksham Singhal and Payal Bajaj and Xia Song and Xian-Ling Mao and Heyan Huang and Furu Wei},
booktitle={Advances in Neural Information Processing Systems},