From be167b3dda09f6e4dc15fe2d44053b915a0b725b Mon Sep 17 00:00:00 2001
From: shumingma <shumma@microsoft.com>
Date: Thu, 1 Dec 2022 20:40:09 -0800
Subject: [PATCH] Add an example for vocab

---
 examples/fairseq/README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/examples/fairseq/README.md b/examples/fairseq/README.md
index b65434d..c663abd 100644
--- a/examples/fairseq/README.md
+++ b/examples/fairseq/README.md
@@ -65,6 +65,11 @@ Also, the JSON file should be in the format like this:
 ]
 ```
 
+You can quickly get started with our processed vocabulary files: [sentencepiece.bpe.model](https://publicmodel.blob.core.windows.net/torchscale/vocab/sentencepiece.bpe.model) and [dict.txt](https://publicmodel.blob.core.windows.net/torchscale/vocab/dict.txt). Note that this vocabulary is English-only with 64K tokens. To train a new `sentencepiece.bpe.model` on your own data, please refer to the [SentencePiece](https://github.com/google/sentencepiece) repo. With the SentencePiece model and the `sentencepiece` library installed, you can extract the `dict.txt` file from the model with:
+```bash
+spm_export_vocab --model=sentencepiece.bpe.model | sed 's/\t/ /g' | tail -n +4 > dict.txt
+```
+
 ### Training Command
 ```bash
 cd examples/fairseq/