# Example: Integration with FairSeq

## Setup

```bash
# Install the repo as a package:
git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
pip install git+https://github.com/shumingma/fairseq.git@moe
pip install git+https://github.com/shumingma/infinibatch.git
pip install iopath
pip install numpy==1.23.0
```

## Example: BERT Pretraining

### Data Format

We use a [streaming dataloader](https://github.com/microsoft/infinibatch) to read the data on the fly from disk. It requires the data to be sharded into multiple small files (e.g. 10K lines per file), plus a JSON file that contains some metadata and the paths to these files.

The overall data directory should be organized as follows:
```
Data/
├── json/
│   ├── train.json
│   └── valid.json
├── shard/
│   ├── train/
│   │   ├── 00000.txt
│   │   ├── 00001.txt
│   │   └── ...
│   └── valid/
│       ├── 00000.txt
│       ├── 00001.txt
│       └── ...
├── dict.txt
└── sentencepiece.bpe.model
```

We recommend that each shard file contain no more than 10K lines, with one sentence per line, and that consecutive documents be separated by an empty line.
```
Document 1 Line 1
Document 1 Line 2
Document 1 Line 3

Document 2 Line 1
Document 2 Line 2

...
```
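
The sharding rule above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper names `pack_documents` and `write_shards` are hypothetical, and the sketch assumes the corpus is already split into documents, each a list of sentences. Whole documents are kept together, so a single document longer than the limit ends up in its own oversized shard.

```python
from pathlib import Path
from typing import Iterable, Iterator, List

MAX_LINES_PER_SHARD = 10_000  # recommended upper bound from the text above


def pack_documents(
    docs: Iterable[List[str]], max_lines: int = MAX_LINES_PER_SHARD
) -> Iterator[List[str]]:
    """Group whole documents into shards of at most `max_lines` lines.

    Documents inside a shard are separated by one empty line, and that
    separator counts toward the limit.
    """
    shard: List[str] = []
    for doc in docs:
        if not doc:
            continue  # skip empty documents
        needed = len(doc) + (1 if shard else 0)  # +1 for the separator line
        if shard and len(shard) + needed > max_lines:
            yield shard
            shard = []
        if shard:
            shard.append("")  # empty line between two documents
        shard.extend(doc)
    if shard:
        yield shard


def write_shards(
    docs: Iterable[List[str]], out_dir: Path, max_lines: int = MAX_LINES_PER_SHARD
) -> List[Path]:
    """Write shards as 00000.txt, 00001.txt, ... (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, shard in enumerate(pack_documents(docs, max_lines)):
        path = out_dir / f"{index:05d}.txt"
        path.write_text("\n".join(shard) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_shards(docs, Path("Data/shard/train"))` would produce the `shard/train/` layout shown in the directory tree.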

The JSON file should use the following format:
```
[
    {
        "source": [
            "shard/train/00000.txt",
            "shard/train/00001.txt",
            ...
        ],
        "source_lang": "en",
        "weight": 1.0
    }
]
```
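
A manifest in this shape could be generated with a short script like the one below. It is a sketch under assumptions: the function names are hypothetical, and it assumes the `source` paths are relative to the data directory, as the sample entries suggest.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_manifest(
    data_dir: Path, split: str, lang: str = "en", weight: float = 1.0
) -> List[Dict]:
    """Collect the shard files of one split into the JSON structure shown above."""
    shard_dir = data_dir / "shard" / split
    # Paths are stored relative to data_dir (assumption based on the sample).
    sources = sorted(p.relative_to(data_dir).as_posix() for p in shard_dir.glob("*.txt"))
    return [{"source": sources, "source_lang": lang, "weight": weight}]


def write_manifest(data_dir: Path, split: str) -> Path:
    """Write json/<split>.json next to the shard directory (hypothetical helper)."""
    out_path = data_dir / "json" / f"{split}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    manifest = build_manifest(data_dir, split)
    out_path.write_text(json.dumps(manifest, indent=4) + "\n", encoding="utf-8")
    return out_path
```

Calling `write_manifest(Path("Data"), "train")` and `write_manifest(Path("Data"), "valid")` would create the two files under `json/`.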

You can quickly get started with our processed vocabulary files: [sentencepiece.bpe.model](https://publicmodel.blob.core.windows.net/torchscale/vocab/sentencepiece.bpe.model?sv=2020-04-08&st=2023-08-11T03%3A09%3A09Z&se=2053-08-12T03%3A09%3A00Z&sr=c&sp=rl&sig=3b6nDda%2Fu0vD6E%2BhoTO%2BHfNSnSlUfgvXFV%2FCNKquWjE%3D) and [dict.txt](https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt?sv=2020-04-08&st=2023-08-11T03%3A09%3A09Z&se=2053-08-12T03%3A09%3A00Z&sr=c&sp=rl&sig=3b6nDda%2Fu0vD6E%2BhoTO%2BHfNSnSlUfgvXFV%2FCNKquWjE%3D). Note that this vocabulary is English-only with 64K tokens. To train a new `sentencepiece.bpe.model` on your own data, please refer to the [SentencePiece](https://github.com/google/sentencepiece) repo. With the SentencePiece model and the `sentencepiece` library installed, you can extract the `dict.txt` file from the model with:
```
spm_export_vocab --model=sentencepiece.bpe.model | sed 's/\t/ /g' | tail -n +4 > dict.txt
```
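
To make the text processing in that pipeline explicit, here is a pure-Python equivalent of the `sed` and `tail` steps. It is only a sketch: the function name is hypothetical, and the comment about which entries are dropped assumes a default SentencePiece configuration in which the first three exported lines are the `<unk>`, `<s>` and `</s>` symbols.

```python
from typing import Iterable, List


def export_to_dict_lines(vocab_lines: Iterable[str], skip: int = 3) -> List[str]:
    # Equivalent of `sed 's/\t/ /g'`: turn the tab between piece and score into a space.
    cleaned = [line.rstrip("\n").replace("\t", " ") for line in vocab_lines]
    # Equivalent of `tail -n +4`: output starts at line 4, so the first 3 lines are dropped
    # (assumed to be the special symbols of a default SentencePiece model).
    return cleaned[skip:]
```

Feeding it the lines printed by `spm_export_vocab` and writing the result to `dict.txt`, one entry per line, should give the same file as the shell command.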

### Dense Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 train.py ${PATH_TO_DATA} \
    --task pretraining \
    --tokens-per-sample 512 \
    --mask-prob 0.15 \
    --span-length 3.0 \
    --leave-unmasked-prob 0.0 \
    --random-token-prob 0.0 \
    --criterion masked_lm \
    --arch mlm_base \
    --share-encoder-input-output-embed \
    --required-batch-size-multiple 8 \
    --spm-model ${PATH_TO_DATA}/sentencepiece.bpe.model \
    --dict-file ${PATH_TO_DATA}/dict.txt \
    --optimizer adam \
    --adam-betas '(0.9,0.98)' \
    --adam-eps 1e-6 \
    --clip-norm 2.0 \
    --lr-scheduler polynomial_decay \
    --lr 0.0005 \
    --warmup-updates 10000 \
    --total-num-update 125000 \
    --max-update 125000 \
    --max-sentences 32 \
    --update-freq 1 \
    --log-format simple \
    --log-interval 100 \
    --disable-validation \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --fp16 \
    --fp16-init-scale 4 \
    --fp16-scale-window 256 \
    --min-loss-scale 0.0001 \
    --seed 1 \
    --save-dir ${PATH_TO_CKPT} \
    --ddp-backend=no_c10d \
    --distributed-no-spawn \
    --reset-dataloader \
    --batch-read-ahead 10000 \
    --rel-pos-buckets 32 \
    --max-rel-pos 128 \
    --deepnorm
```
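
When adapting this command to a different number of GPUs, it helps to know the global batch size it implies. The arithmetic below is a rough sketch that assumes `--max-sentences` is a per-GPU limit (as in standard FairSeq) and that every sequence is filled to `--tokens-per-sample`, so the token count is an upper bound.

```python
def sequences_per_update(
    gpus_per_node: int, nodes: int, max_sentences: int, update_freq: int
) -> int:
    """Global number of sequences contributing to one optimizer step."""
    return gpus_per_node * nodes * max_sentences * update_freq


# Values taken from the command above.
seqs = sequences_per_update(gpus_per_node=8, nodes=8, max_sentences=32, update_freq=1)
max_tokens = seqs * 512  # --tokens-per-sample 512

print(seqs)        # 2048
print(max_tokens)  # 1048576
```

To keep roughly the same global batch on fewer GPUs, one option is to raise `--update-freq` by the same factor that the GPU count shrinks.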

### Sparse (MoE) Model
```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=8 train.py ${PATH_TO_DATA} \
    --task pretraining \
    --tokens-per-sample 512 \
    --mask-prob 0.15 \
    --span-length 3.0 \
    --leave-unmasked-prob 0.0 \
    --random-token-prob 0.0 \
    --arch mlm_base \
    --share-encoder-input-output-embed \
    --required-batch-size-multiple 8 \
    --spm-model ${PATH_TO_DATA}/sentencepiece.bpe.model \
    --dict-file ${PATH_TO_DATA}/dict.txt \
    --optimizer adam \
    --adam-betas '(0.9,0.98)' \
    --adam-eps 1e-6 \
    --clip-norm 2.0 \
    --lr-scheduler polynomial_decay \
    --lr 0.0005 \
    --warmup-updates 10000 \
    --total-num-update 125000 \
    --max-update 125000 \
    --max-sentences 32 \
    --update-freq 1 \
    --log-format simple \
    --log-interval 100 \
    --disable-validation \
    --save-interval-updates 5000 \
    --no-epoch-checkpoints \
    --fp16 \
    --fp16-init-scale 4 \
    --fp16-scale-window 256 \
    --min-loss-scale 0.0001 \
    --seed 1 \
    --save-dir ${PATH_TO_CKPT} \
    --ddp-backend=no_c10d \
    --distributed-no-spawn \
    --reset-dataloader \
    --batch-read-ahead 10000 \
    --rel-pos-buckets 32 \
    --max-rel-pos 128 \
    --deepnorm \
    --moe-expert-count 64 --moe-freq 2 \
    --moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
    --moe-eval-capacity-token-fraction -1.0 \
    --criterion masked_lm_moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
    --use-xmoe --pad-to-max-length
```
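
The sketch below illustrates how `--moe-expert-count 64 --moe-freq 2` might map onto the 64 GPUs used above. Both rules are assumptions rather than confirmed behavior of the library: that every `moe_freq`-th layer (counting from 1) is an MoE layer, and that experts are divided evenly across GPUs. The 12-layer depth in the example is also hypothetical.

```python
from typing import List


def moe_layer_indices(num_layers: int, moe_freq: int) -> List[int]:
    """0-based indices of MoE layers, ASSUMING every moe_freq-th layer is sparse."""
    if moe_freq <= 0:
        return []  # no MoE layers at all
    return [i for i in range(num_layers) if (i + 1) % moe_freq == 0]


def experts_per_gpu(expert_count: int, world_size: int) -> int:
    """Local experts per GPU, ASSUMING an even split across all GPUs."""
    if expert_count % world_size != 0:
        raise ValueError("expert_count must be a multiple of world_size in this sketch")
    return expert_count // world_size


print(moe_layer_indices(num_layers=12, moe_freq=2))  # [1, 3, 5, 7, 9, 11]
print(experts_per_gpu(expert_count=64, world_size=8 * 8))  # 1
```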

## Example: GPT Pretraining

### Data Format

We use the same data format as FairSeq's [language modeling example](https://github.com/facebookresearch/fairseq/tree/main/examples/language_model#1-preprocess-the-data).

### Dense Model

```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
    ${PATH_TO_DATA} \
    --num-workers 2 \
    --activation-fn gelu \
    --share-decoder-input-output-embed \
    --validate-interval-updates 1000 \
    --save-interval-updates 1000 \
    --no-epoch-checkpoints \
    --memory-efficient-fp16 \
    --fp16-init-scale 4 \
    --arch lm_base \
    --task language_modeling \
    --sample-break-mode none \
    --tokens-per-sample 128 \
    --optimizer adam --adam-betas "(0.9, 0.98)" \
    --adam-eps 1e-08 \
    --clip-norm 0.0 \
    --lr 5e-4 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 750 \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 \
    --batch-size 4 \
    --update-freq 1 \
    --required-batch-size-multiple 1 \
    --total-num-update 50000 \
    --max-update 50000 \
    --seed 1 \
    --ddp-backend=c10d
```
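
The learning-rate flags in this command (`--lr 5e-4`, `--lr-scheduler polynomial_decay`, `--warmup-updates 750`, `--total-num-update 50000`) describe a linear warmup followed by a polynomial decay. The function below is a standalone sketch of that shape, not the FairSeq implementation; it assumes an end learning rate of 0 and a power of 1, which are believed to be the scheduler defaults since the command does not set them.

```python
def polynomial_decay_lr(
    step: int,
    base_lr: float = 5e-4,
    warmup_updates: int = 750,
    total_num_update: int = 50_000,
    end_lr: float = 0.0,  # assumed default
    power: float = 1.0,   # assumed default (linear decay)
) -> float:
    """Learning rate after `step` updates: linear warmup, then polynomial decay."""
    if warmup_updates > 0 and step <= warmup_updates:
        return base_lr * step / warmup_updates
    if step >= total_num_update:
        return end_lr
    remaining = 1 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (base_lr - end_lr) * remaining**power + end_lr


for step in (0, 375, 750, 25_375, 50_000):
    print(step, polynomial_decay_lr(step))
```

Under these assumptions the rate peaks at `5e-4` after 750 updates and reaches 0 at update 50000, which is why `--total-num-update` and `--max-update` are set to the same value.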

### Sparse (MoE) Model

```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
    ${PATH_TO_DATA} \
    --num-workers 2 \
    --activation-fn gelu \
    --share-decoder-input-output-embed \
    --validate-interval-updates 1000 \
    --save-interval-updates 1000 \
    --no-epoch-checkpoints \
    --memory-efficient-fp16 \
    --fp16-init-scale 4 \
    --arch lm_base \
    --task language_modeling \
    --sample-break-mode none \
    --tokens-per-sample 128 \
    --optimizer adam --adam-betas "(0.9, 0.98)" \
    --adam-eps 1e-08 \
    --clip-norm 0.0 \
    --lr 5e-4 \
    --lr-scheduler polynomial_decay \
    --warmup-updates 750 \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 \
    --batch-size 4 \
    --update-freq 1 \
    --required-batch-size-multiple 1 \
    --total-num-update 50000 \
    --max-update 50000 \
    --seed 1 \
    --ddp-backend=no_c10d \
    --moe-expert-count 2 --moe-freq 2 \
    --moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
    --moe-eval-capacity-token-fraction -1.0 \
    --criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
    --use-xmoe
```

## Example: Machine Translation

### Data Format

We follow FairSeq's [neural machine translation example](https://github.com/facebookresearch/fairseq/tree/main/examples/translation#training-a-new-model) to preprocess the data.

### Dense Model

```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
    ${PATH_TO_DATA} \
    --arch mt_base --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --max-tokens 4096 --fp16
```
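
The translation commands use `--lr-scheduler inverse_sqrt` with `--warmup-updates 4000`. As a rough guide to what that implies, the sketch below models a linear warmup followed by decay proportional to the inverse square root of the update number. It is an illustration rather than the FairSeq code, and it assumes the warmup starts from a learning rate of 0.

```python
def inverse_sqrt_lr(
    step: int,
    base_lr: float = 5e-4,
    warmup_updates: int = 4000,
    warmup_init_lr: float = 0.0,  # assumed starting point of the warmup
) -> float:
    """Learning rate after `step` updates: linear warmup, then 1/sqrt(step) decay."""
    if warmup_updates <= 0:
        raise ValueError("this sketch requires warmup_updates > 0")
    if step < warmup_updates:
        return warmup_init_lr + step * (base_lr - warmup_init_lr) / warmup_updates
    return base_lr * (warmup_updates / step) ** 0.5


for step in (0, 2000, 4000, 16_000):
    print(step, inverse_sqrt_lr(step))
```

Unlike the polynomial schedule, this one never reaches 0, so training length is controlled only by the stopping criteria passed to `train.py`.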

### Sparse (MoE) Model

```bash
cd examples/fairseq/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 train.py \
    ${PATH_TO_DATA} \
    --arch mt_base --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --moe-expert-count 2 --moe-freq 2 \
    --moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
    --moe-eval-capacity-token-fraction -1.0 \
    --criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
    --use-xmoe \
    --max-tokens 4096 --fp16
```