# Example: Integration with FairSeq
## Setup
```bash
# Install the repo as a package:
git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
pip install git+https://github.com/shumingma/fairseq.git@moe
pip install git+https://github.com/shumingma/infinibatch.git
pip install iopath
pip install numpy==1.23.0
```
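You can verify the installation by checking that the installed packages import cleanly:
```bash
# All four packages should be importable after the steps above.
python -c "import torchscale, fairseq, infinibatch, iopath; print('setup ok')"
```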
## Example: BERT Pretraining
### Data Format
We use a [streaming dataloader](https://github.com/microsoft/infinibatch) to read the data on the fly from disk. It requires the data to be sharded into multiple small files (e.g., 10K lines per file), as well as a JSON file containing some metadata and the paths to these files.
The overall data directory should be organized as follows:
```
Data/
├── json/
│   ├── train.json
│   └── valid.json
├── shard/
│   ├── train/
│   │   ├── 00000.txt
│   │   ├── 00001.txt
│   │   └── ...
│   └── valid/
│       ├── 00000.txt
│       ├── 00001.txt
│       └── ...
├── dict.txt
└── sentencepiece.bpe.model
```
We recommend that each sharded data file contain no more than 10K lines, with one sentence per line; two documents should be separated by an empty line:
```
Document 1 Line 1
Document 1 Line 2
Document 1 Line 3

Document 2 Line 1
Document 2 Line 2

...
```
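A minimal sketch for producing such shards from a single preprocessed corpus with GNU `split` (`corpus.train.txt` is a placeholder; also note that `split` can cut a document across two shards, so a document-aware splitter is preferable for real data):
```bash
# Shard a one-sentence-per-line corpus into 10K-line files named
# 00000.txt, 00001.txt, ... under Data/shard/train/.
mkdir -p Data/shard/train
split --lines=10000 --numeric-suffixes --suffix-length=5 \
    --additional-suffix=.txt corpus.train.txt Data/shard/train/
```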
Also, the JSON file should be in a format like this:
```
[
    {
        "source": [
            "shard/train/00000.txt",
            "shard/train/00001.txt",
            ...
        ],
        "source_lang": "en",
        "weight": 1.0
    }
]
```
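If `jq` is available, the JSON file can be generated directly from the shard listing; run this from inside `Data/` so the stored paths stay relative:
```bash
# Build json/train.json from the shard file names (run from Data/).
mkdir -p json
ls shard/train/*.txt \
    | jq -R . \
    | jq -s '[{source: ., source_lang: "en", weight: 1.0}]' > json/train.json
```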
You can quickly get started with our processed vocabulary files: [sentencepiece.bpe.model](https://publicmodel.blob.core.windows.net/torchscale/vocab/sentencepiece.bpe.model) and [dict.txt](https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt). Note that this vocabulary is English-only with 64K tokens. To train a new `sentencepiece.bpe.model` on your own data, please refer to the [SentencePiece](https://github.com/google/sentencepiece) repo. With the SentencePiece model and the `sentencepiece` library installed, you can extract the `dict.txt` file from it by running:
```bash
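# The first three entries (<unk>, <s>, </s>) are dropped because fairseq adds these special tokens itself.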
spm_export_vocab --model=sentencepiece.bpe.model | sed 's/\t/ /g' | tail -n +4 > dict.txt
```
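For reference, a comparable 64K BPE model could be trained with SentencePiece's `spm_train`; the flags below are illustrative rather than the exact settings used for the released model:
```bash
# Train a 64K BPE model on your own corpus (corpus.train.txt is a placeholder).
spm_train --input=corpus.train.txt \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=64000 \
    --model_type=bpe
```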
### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 train.py ${PATH_TO_DATA} \
--task pretraining \
--tokens-per-sample 512 \
--mask-prob 0.15 \
--span-length 3.0 \
--leave-unmasked-prob 0.0 \
--random-token-prob 0.0 \
--criterion masked_lm \
--arch mlm_base \
--share-encoder-input-output-embed \
--required-batch-size-multiple 8 \
--spm-model ${PATH_TO_DATA}/sentencepiece.bpe.model \
--dict-file ${PATH_TO_DATA}/dict.txt \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 2.0 \
--lr-scheduler polynomial_decay \
--lr 0.0005 \
--warmup-updates 10000 \
--total-num-update 125000 \
--max-update 125000 \
--max-sentences 32 \
--update-freq 1 \
--log-format simple \
--log-interval 100 \
--disable-validation \
--save-interval-updates 5000 \
--no-epoch-checkpoints \
--fp16 \
--fp16-init-scale 4 \
--fp16-scale-window 256 \
--min-loss-scale 0.0001 \
--seed 1 \
--save-dir ${PATH_TO_CKPT} \
--ddp-backend=no_c10d \
--distributed-no-spawn \
--reset-dataloader \
--batch-read-ahead 10000 \
--rel-pos-buckets 32 \
--max-rel-pos 128 \
--deepnorm
```
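The command above assumes 64 GPUs (8 nodes × 8 GPUs) and must be launched once per node. `torch.distributed.launch` rendezvouses through a master node, so each node additionally needs `--node_rank`, `--master_addr`, and `--master_port`. A sketch of the extra launcher settings, with placeholder values:
```bash
# Set on every node; NODE_RANK must be unique per node (0..7). Pass these to
# torch.distributed.launch as:
#   --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}
export MASTER_ADDR=10.0.0.1   # placeholder: IP of the rank-0 node
export MASTER_PORT=29500      # placeholder: a free TCP port on that node
export NODE_RANK=0            # 0 on the first node, 1 on the second, ...
```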
### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 train.py ${PATH_TO_DATA} \
--task pretraining \
--tokens-per-sample 512 \
--mask-prob 0.15 \
--span-length 3.0 \
--leave-unmasked-prob 0.0 \
--random-token-prob 0.0 \
--arch mlm_base \
--share-encoder-input-output-embed \
--required-batch-size-multiple 8 \
--spm-model ${PATH_TO_DATA}/sentencepiece.bpe.model \
--dict-file ${PATH_TO_DATA}/dict.txt \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 2.0 \
--lr-scheduler polynomial_decay \
--lr 0.0005 \
--warmup-updates 10000 \
--total-num-update 125000 \
--max-update 125000 \
--max-sentences 32 \
--update-freq 1 \
--log-format simple \
--log-interval 100 \
--disable-validation \
--save-interval-updates 5000 \
--no-epoch-checkpoints \
--fp16 \
--fp16-init-scale 4 \
--fp16-scale-window 256 \
--min-loss-scale 0.0001 \
--seed 1 \
--save-dir ${PATH_TO_CKPT} \
--ddp-backend=no_c10d \
--distributed-no-spawn \
--reset-dataloader \
--batch-read-ahead 10000 \
--rel-pos-buckets 32 \
--max-rel-pos 128 \
--deepnorm \
--moe-expert-count 64 --moe-freq 2 \
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion masked_lm_moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe --pad-to-max-length
```
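Note that `--moe-expert-count 64` matches the 64-GPU (8×8) setup above, so each GPU hosts one expert, and `--moe-freq 2` places an MoE feed-forward layer at every second block. If you change the number of GPUs, keep the expert count a multiple (or divisor) of the world size.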
## Example: GPT Pretraining
### Data Format
We use the same data format as FairSeq's [language modeling example](https://github.com/facebookresearch/fairseq/tree/main/examples/language_model#1-preprocess-the-data).
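Concretely, a WikiText-103-style corpus can be binarized with `fairseq-preprocess`; the paths below are placeholders for your own data:
```bash
# Binarize raw text splits into the ${PATH_TO_DATA} directory used below.
fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103/wiki.train.tokens \
    --validpref wikitext-103/wiki.valid.tokens \
    --testpref wikitext-103/wiki.test.tokens \
    --destdir ${PATH_TO_DATA} \
    --workers 20
```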
### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--num-workers 2 \
--activation-fn gelu \
--share-decoder-input-output-embed \
--validate-interval-updates 1000 \
--save-interval-updates 1000 \
--no-epoch-checkpoints \
--memory-efficient-fp16 \
--fp16-init-scale 4 \
--arch lm_base \
--task language_modeling \
--sample-break-mode none \
--tokens-per-sample 128 \
--optimizer adam --adam-betas "(0.9, 0.98)" \
--adam-eps 1e-08 \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler polynomial_decay \
--warmup-updates 750 \
--dropout 0.1 \
--attention-dropout 0.1 \
--weight-decay 0.01 \
--batch-size 4 \
--update-freq 1 \
--required-batch-size-multiple 1 \
--total-num-update 50000 \
--max-update 50000 \
--seed 1 \
--ddp-backend=c10d
```
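After training, validation perplexity can be measured with `fairseq-eval-lm`, as in the FairSeq example. Note that the torchscale architectures are defined in this example directory rather than in the installed `fairseq` package, so you may need to expose them (e.g., via fairseq's `--user-dir`) for the checkpoint to load; a sketch:
```bash
# Evaluate the best checkpoint on the validation set (paths are placeholders).
fairseq-eval-lm ${PATH_TO_DATA} \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --batch-size 2 \
    --tokens-per-sample 128
```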
### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--num-workers 2 \
--activation-fn gelu \
--share-decoder-input-output-embed \
--validate-interval-updates 1000 \
--save-interval-updates 1000 \
--no-epoch-checkpoints \
--memory-efficient-fp16 \
--fp16-init-scale 4 \
--arch lm_base \
--task language_modeling \
--sample-break-mode none \
--tokens-per-sample 128 \
--optimizer adam --adam-betas "(0.9, 0.98)" \
--adam-eps 1e-08 \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler polynomial_decay \
--warmup-updates 750 \
--dropout 0.1 \
--attention-dropout 0.1 \
--weight-decay 0.01 \
--batch-size 4 \
--update-freq 1 \
--required-batch-size-multiple 1 \
--total-num-update 50000 \
--max-update 50000 \
--seed 1 \
--ddp-backend=no_c10d \
--moe-expert-count 2 --moe-freq 2 \
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe
```
## Example: Machine Translation
### Data Format
We follow FairSeq's [neural machine translation example](https://github.com/facebookresearch/fairseq/tree/main/examples/translation#training-a-new-model) to preprocess the data.
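For instance, the IWSLT14 German-English setup from that example, assuming a local checkout of the FairSeq repo for its data-preparation script:
```bash
# Download and tokenize IWSLT14 De-En with FairSeq's helper script, then binarize.
cd fairseq/examples/translation
bash prepare-iwslt14.sh
cd ../..
TEXT=fairseq/examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir ${PATH_TO_DATA} \
    --workers 20
```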
### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--arch mt_base --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 4096 --fp16
```
### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--arch mt_base --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--moe-expert-count 2 --moe-freq 2 \
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe \
--max-tokens 4096 --fp16
```
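Once trained, translations can be generated and scored with `fairseq-generate`, mirroring the FairSeq example; the same `--user-dir` caveat as in the GPT section applies if the torchscale architectures are not importable:
```bash
# Decode the test set with beam search and report BLEU (checkpoint path is a placeholder).
fairseq-generate ${PATH_TO_DATA} \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe
```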