We use a [streaming dataloader](https://github.com/microsoft/infinibatch) to read the data on-the-fly from the disk. It requires the data sharded into multiple small files (e.g. 10K lines per file), as well as a JSON file to contain some meta data and the paths to these files.
The overall data directory should be organized as follows:
```
Data/
├── json/
│ ├── train.json
│ └── valid.json
├── shard/
│ ├── train/
│ │ ├── 00000.txt
│ │ ├── 00001.txt
│ │ └── ...
│ └── valid/
│ ├── 00000.txt
│ ├── 00001.txt
│ └── ...
├── dict.txt
└── sentencepiece.bpe.model
```
We recommend that each sharded data files contains no more than 10K lines with one sentence per line, and two documents should be separated with an empty line.
```
Document 1 Line 1
Document 1 Line 2
Document 1 Line 3
Document 2 Line 1
Document 2 Line 2
...
```
Also, the JSON file should be in the format like this:
You can quickly get started with our processed vocabulary files: [sentencepiece.bpe.model](https://publicmodel.blob.core.windows.net/torchscale/vocab/sentencepiece.bpe.model) and [dict.txt](https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt). Note that this vocabulary is English-only with 64K tokens. To train a new `sentencepiece.bpe.model` on your own data, please refer to the [SentencePiece](https://github.com/google/sentencepiece) repo. With the sentecepiece model and the installed `sentencepiece` library, you can extract the `dict.txt` file from it by
We use the format as in the FairSeq's [language modeling example](https://github.com/facebookresearch/fairseq/tree/main/examples/language_model#1-preprocess-the-data).
--moe-gating-use-fp32 --moe-second-expert-policy random --moe-normalize-gate-prob-before-dropping \
--moe-eval-capacity-token-fraction -1.0 \
--criterion moe_cross_entropy --moe-gate-loss-wt 0.01 --moe-gate-loss-combine-method sum \
--use-xmoe
```
## Example: Machine Translation
### Data Format
We follow the FairSeq's [neural machine translation example](https://github.com/facebookresearch/fairseq/tree/main/examples/translation#training-a-new-model) to preprocess the data.