# Example: Integration with FairSeq
## Setup
```bash
# Install the repo as a package:
git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
pip install git+https://github.com/shumingma/fairseq.git@moe
pip install git+https://github.com/shumingma/infinibatch.git
pip install iopath
pip install numpy==1.23.0
```
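You can verify the installation by checking that the installed packages import cleanly:
```bash
# All four packages should be importable after the steps above.
python -c "import torchscale, fairseq, infinibatch, iopath; print('setup ok')"
```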
## Example: BERT Pretraining
### Data Format
We use a [streaming dataloader](https://github.com/microsoft/infinibatch) to read the data on the fly from disk. It requires the data to be sharded into multiple small files (e.g., 10K lines per file), as well as a JSON file containing some metadata and the paths to these files.
The overall data directory should be organized as follows:
```
Data/
├── json/
│   ├── train.json
│   └── valid.json
├── shard/
│   ├── train/
│   │   ├── 00000.txt
│   │   ├── 00001.txt
│   │   └── ...
│   └── valid/
│       ├── 00000.txt
│       ├── 00001.txt
│       └── ...
├── dict.txt
└── sentencepiece.bpe.model
```
We recommend that each sharded data file contain no more than 10K lines, with one sentence per line; two documents should be separated by an empty line:
```
Document 1 Line 1
Document 1 Line 2
Document 1 Line 3

Document 2 Line 1
Document 2 Line 2

...
```
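A minimal sketch for producing such shards from a single preprocessed corpus with GNU `split` (`corpus.train.txt` is a placeholder; also note that `split` can cut a document across two shards, so a document-aware splitter is preferable for real data):
```bash
# Shard a one-sentence-per-line corpus into 10K-line files named
# 00000.txt, 00001.txt, ... under Data/shard/train/.
mkdir -p Data/shard/train
split --lines=10000 --numeric-suffixes --suffix-length=5 \
    --additional-suffix=.txt corpus.train.txt Data/shard/train/
```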
Also, the JSON file should be in a format like this:
```
[
    {
        "source": [
            "shard/train/00000.txt",
            "shard/train/00001.txt",
            ...
        ],
        "source_lang": "en",
        "weight": 1.0
    }
]
```
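If `jq` is available, the JSON file can be generated directly from the shard listing; run this from inside `Data/` so the stored paths stay relative:
```bash
# Build json/train.json from the shard file names (run from Data/).
mkdir -p json
ls shard/train/*.txt \
    | jq -R . \
    | jq -s '[{source: ., source_lang: "en", weight: 1.0}]' > json/train.json
```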
You can quickly get started with our processed vocabulary files: [sentencepiece.bpe.model](https://publicmodel.blob.core.windows.net/torchscale/vocab/sentencepiece.bpe.model) and [dict.txt](https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt). Note that this vocabulary is English-only with 64K tokens. To train a new `sentencepiece.bpe.model` on your own data, please refer to the [SentencePiece](https://github.com/google/sentencepiece) repo. With the SentencePiece model and the `sentencepiece` library installed, you can extract the `dict.txt` file from it by running:
```bash
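# The first three entries (<unk>, <s>, </s>) are dropped because fairseq adds these special tokens itself.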
spm_export_vocab --model=sentencepiece.bpe.model | sed 's/\t/ /g' | tail -n +4 > dict.txt
```
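For reference, a comparable 64K BPE model could be trained with SentencePiece's `spm_train`; the flags below are illustrative rather than the exact settings used for the released model:
```bash
# Train a 64K BPE model on your own corpus (corpus.train.txt is a placeholder).
spm_train --input=corpus.train.txt \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=64000 \
    --model_type=bpe
```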
### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 train.py ${PATH_TO_DATA} \
--task pretraining \
--tokens-per-sample 512 \
--mask-prob 0.15 \
--span-length 3.0 \
--leave-unmasked-prob 0.0 \
--random-token-prob 0.0 \
--criterion masked_lm \
--arch mlm_base \
--share-encoder-input-output-embed \
--required-batch-size-multiple 8 \
--spm-model ${PATH_TO_DATA}/sentencepiece.bpe.model \
--dict-file ${PATH_TO_DATA}/dict.txt \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 2.0 \
--lr-scheduler polynomial_decay \
--lr 0.0005 \
--warmup-updates 10000 \
--total-num-update 125000 \
--max-update 125000 \
--max-sentences 32 \
--update-freq 1 \
--log-format simple \
--log-interval 100 \
--disable-validation \
--save-interval-updates 5000 \
--no-epoch-checkpoints \
--fp16 \
--fp16-init-scale 4 \
--fp16-scale-window 256 \
--min-loss-scale 0.0001 \
--seed 1 \
--save-dir ${PATH_TO_CKPT} \
--ddp-backend=no_c10d \
--distributed-no-spawn \
--reset-dataloader \
--batch-read-ahead 10000 \
--rel-pos-buckets 32 \
--max-rel-pos 128 \
--deepnorm
```
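The command above assumes 64 GPUs (8 nodes × 8 GPUs) and must be launched once per node. `torch.distributed.launch` rendezvouses through a master node, so each node additionally needs `--node_rank`, `--master_addr`, and `--master_port`. A sketch of the extra launcher settings, with placeholder values:
```bash
# Set on every node; NODE_RANK must be unique per node (0..7). Pass these to
# torch.distributed.launch as:
#   --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT}
export MASTER_ADDR=10.0.0.1   # placeholder: IP of the rank-0 node
export MASTER_PORT=29500      # placeholder: a free TCP port on that node
export NODE_RANK=0            # 0 on the first node, 1 on the second, ...
```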
### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 train.py ${PATH_TO_DATA} \
--task pretraining \
--tokens-per-sample 512 \
--mask-prob 0.15 \
--span-length 3.0 \
--leave-unmasked-prob 0.0 \
--random-token-prob 0.0 \
--arch mlm_base \
--share-encoder-input-output-embed \
--required-batch-size-multiple 8 \
--spm-model ${PATH_TO_DATA}/sentencepiece.bpe.model \
--dict-file ${PATH_TO_DATA}/dict.txt \
--optimizer adam \
--adam-betas '(0.9,0.98)' \
--adam-eps 1e-6 \
--clip-norm 2.0 \
--lr-scheduler polynomial_decay \
--lr 0.0005 \
--warmup-updates 10000 \
--total-num-update 125000 \
--max-update 125000 \
--max-sentences 32 \
--update-freq 1 \
--log-format simple \
--log-interval 100 \
--disable-validation \
--save-interval-updates 5000 \
--no-epoch-checkpoints \
--fp16 \
--fp16-init-scale 4 \
--fp16-scale-window 256 \
--min-loss-scale 0.0001 \
--seed 1 \
--save-dir ${PATH_TO_CKPT} \
--ddp-backend=no_c10d \
--distributed-no-spawn \
--reset-dataloader \
--batch-read-ahead 10000 \
--rel-pos-buckets 32 \
--max-rel-pos 128 \
--deepnorm \
--moe-expert-count 64 --moe-freq 2 \
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion masked_lm_moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe --pad-to-max-length
```
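Note that `--moe-expert-count 64` matches the 64-GPU (8×8) setup above, so each GPU hosts one expert, and `--moe-freq 2` places an MoE feed-forward layer at every second block. If you change the number of GPUs, keep the expert count a multiple (or divisor) of the world size.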
## Example: GPT Pretraining
### Data Format
We use the same data format as FairSeq's [language modeling example](https://github.com/facebookresearch/fairseq/tree/main/examples/language_model#1-preprocess-the-data).
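Concretely, a WikiText-103-style corpus can be binarized with `fairseq-preprocess`; the paths below are placeholders for your own data:
```bash
# Binarize raw text splits into the ${PATH_TO_DATA} directory used below.
fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103/wiki.train.tokens \
    --validpref wikitext-103/wiki.valid.tokens \
    --testpref wikitext-103/wiki.test.tokens \
    --destdir ${PATH_TO_DATA} \
    --workers 20
```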
### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--num-workers 2 \
--activation-fn gelu \
--share-decoder-input-output-embed \
--validate-interval-updates 1000 \
--save-interval-updates 1000 \
--no-epoch-checkpoints \
--memory-efficient-fp16 \
--fp16-init-scale 4 \
--arch lm_base \
--task language_modeling \
--sample-break-mode none \
--tokens-per-sample 128 \
--optimizer adam --adam-betas "(0.9, 0.98)" \
--adam-eps 1e-08 \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler polynomial_decay \
--warmup-updates 750 \
--dropout 0.1 \
--attention-dropout 0.1 \
--weight-decay 0.01 \
--batch-size 4 \
--update-freq 1 \
--required-batch-size-multiple 1 \
--total-num-update 50000 \
--max-update 50000 \
--seed 1 \
--ddp-backend=c10d
```
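After training, validation perplexity can be measured with `fairseq-eval-lm`, as in the FairSeq example. Note that the torchscale architectures are defined in this example directory rather than in the installed `fairseq` package, so you may need to expose them (e.g., via fairseq's `--user-dir`) for the checkpoint to load; a sketch:
```bash
# Evaluate the best checkpoint on the validation set (paths are placeholders).
fairseq-eval-lm ${PATH_TO_DATA} \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --batch-size 2 \
    --tokens-per-sample 128
```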
### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--num-workers 2 \
--activation-fn gelu \
--share-decoder-input-output-embed \
--validate-interval-updates 1000 \
--save-interval-updates 1000 \
--no-epoch-checkpoints \
--memory-efficient-fp16 \
--fp16-init-scale 4 \
--arch lm_base \
--task language_modeling \
--sample-break-mode none \
--tokens-per-sample 128 \
--optimizer adam --adam-betas "(0.9, 0.98)" \
--adam-eps 1e-08 \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler polynomial_decay \
--warmup-updates 750 \
--dropout 0.1 \
--attention-dropout 0.1 \
--weight-decay 0.01 \
--batch-size 4 \
--update-freq 1 \
--required-batch-size-multiple 1 \
--total-num-update 50000 \
--max-update 50000 \
--seed 1 \
--ddp-backend=no_c10d \
--moe-expert-count 2 --moe-freq 2 \
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe
```
## Example: Machine Translation
### Data Format
We follow FairSeq's [neural machine translation example](https://github.com/facebookresearch/fairseq/tree/main/examples/translation#training-a-new-model) to preprocess the data.
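For instance, the IWSLT14 German-English setup from that example, assuming a local checkout of the FairSeq repo for its data-preparation script:
```bash
# Download and tokenize IWSLT14 De-En with FairSeq's helper script, then binarize.
cd fairseq/examples/translation
bash prepare-iwslt14.sh
cd ../..
TEXT=fairseq/examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir ${PATH_TO_DATA} \
    --workers 20
```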
### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--arch mt_base --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 4096 --fp16
```
### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
${PATH_TO_DATA} \
--arch mt_base --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--moe-expert-count 2 --moe-freq 2 \
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe \
--max-tokens 4096 --fp16
```
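Once trained, translations can be generated and scored with `fairseq-generate`, mirroring the FairSeq example; the same `--user-dir` caveat as in the GPT section applies if the torchscale architectures are not importable:
```bash
# Decode the test set with beam search and report BLEU (checkpoint path is a placeholder).
fairseq-generate ${PATH_TO_DATA} \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe
```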