Add an example for vocab
parent 7b29d32f03
commit be167b3dda
@@ -65,6 +65,11 @@ Also, the JSON file should be in the format like this:
]
```
You can quickly get started with our processed vocabulary files: [sentencepiece.bpe.model](https://publicmodel.blob.core.windows.net/torchscale/vocab/sentencepiece.bpe.model) and [dict.txt](https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt). Note that this vocabulary is English-only with 64K tokens. To train a new `sentencepiece.bpe.model` on your own data, please refer to the [SentencePiece](https://github.com/google/sentencepiece) repo. With the SentencePiece model and the `sentencepiece` library installed, you can extract the `dict.txt` file from it by running:
```
spm_export_vocab --model=sentencepiece.bpe.model | sed 's/\t/ /g' | tail -n +4 > dict.txt
```
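
For readers who want to see what the `sed` and `tail` steps in that pipeline do, here is a minimal Python sketch of the same text transformation. It assumes the exported vocabulary arrives as tab-separated `piece<TAB>score` lines and that the first three lines are special tokens to skip (mirroring `tail -n +4`). The function name `export_lines_to_dict` and the sample pieces are hypothetical and for illustration only; they are not part of this repository or of SentencePiece.

```python
def export_lines_to_dict(lines, num_special=3):
    """Turn exported "piece<TAB>score" lines into "piece score" lines.

    Hypothetical helper: skips the first `num_special` lines (like
    `tail -n +4` with the default of 3) and replaces tabs with spaces
    (like `sed 's/\\t/ /g'`).
    """
    out = []
    for line in lines[num_special:]:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        out.append(line.replace("\t", " "))
    return out


# Made-up sample input; real scores depend on the trained model.
sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "▁the\t-0", "s\t-1"]
dict_lines = export_lines_to_dict(sample)
print("\n".join(dict_lines))
```

The sketch only shows the line filtering and tab replacement; it does not load a model, so the shell command above remains the way to produce `dict.txt` from a real `sentencepiece.bpe.model`.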
### Training Command
```bash
cd examples/fairseq/