Is it necessary to change from English_cleaners to Basic_cleaners when training non-english languages? #190

Open
opened 2023-04-04 02:43:36 +07:00 by pheonis · 1 comment

Hello,
I am training a Hindi model; for that I created a custom tokenizer and started training. Then I found this advice from a user at the 152334H/DL-Art-School repo.

He said to change the cleaner from english_cleaners to basic_cleaners when training new languages:

You need to remember to change the tokenizer in the repository for synthesis, otherwise it will speak as if in an incomprehensible language. You can change it in tortoise/utils/tokenizer.py, line 180:


DEFAULT_VOCAB_FILE = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), "../data/polish_tokenizer.json"
)
And also change line 191 to use basic_cleaners():

def preprocess_text(self, txt):
    txt = basic_cleaners(txt)
    return txt
And remember to do the same for training, replacing english_cleaners with basic_cleaners. I did it by commenting out the original english_cleaners call, but you could also delete it, or even rename basic_cleaners to english_cleaners, since the repository has constants set for english_cleaners and you would otherwise have to change it in several places (unless it can be done from the yml configuration file). I did it as described above.
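As I understand it, the reason english_cleaners garbles non-English text is that it starts by transliterating everything to ASCII with unidecode, so the characters a custom tokenizer was trained on never reach it. A quick illustration (assuming the standard unidecode package is installed):

from unidecode import unidecode

# english_cleaners first calls convert_to_ascii (i.e. unidecode), so any
# non-Latin input is transliterated before it ever reaches the tokenizer:
print(unidecode("नमस्ते दुनिया"))  # prints an ASCII-only approximation; the Devanagari is gone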

Now I'm confused. This is the code inside tokenizer.py that mrq modified for Japanese, and he also said to use japanese.json during synthesis of a Japanese voice.

But nowhere did he mention "basic_cleaners" or using it while training a new model in a non-English language.

def preprocess_text(self, txt):
        if self.language == 'ja':
          import pykakasi

          kks = pykakasi.kakasi()
          results = kks.convert(txt)
          words = []

          for result in results:
            words.append(result['kana'])

          txt = " ".join(words)
          txt = basic_cleaners(txt)
        else:
          txt = english_cleaners(txt)
        return txt
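
For context, here is a rough illustration of what that 'ja' branch produces (assuming the pykakasi 2.x API; exact segmentation may vary by version):

import pykakasi

kks = pykakasi.kakasi()
results = kks.convert("日本語のテキスト")

# Each result dict carries several readings; the branch above keeps the
# 'kana' (katakana) reading, joins the segments with spaces, and then
# applies basic_cleaners.
print(" ".join(result['kana'] for result in results))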

So I think I also need to modify tokenizer.py with something like this:

if self.language == 'hi':
  from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
  from indicnlp.tokenize import indic_tokenize

  # Get the normalizer for Hindi text (nukta removal is an option on the
  # factory, not a separate module-level function)
  factory = IndicNormalizerFactory()
  normalizer = factory.get_normalizer("hi", remove_nuktas=True)

  # Normalize the text
  txt = normalizer.normalize(txt)

  # Tokenize the text and re-join the tokens with spaces
  txt = ' '.join(indic_tokenize.trivial_tokenize(txt, lang='hi'))

  txt = basic_cleaners(txt)

else:
  txt = english_cleaners(txt)

return txt

I generated the above code block for the Hindi language with ChatGPT.
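
For what it's worth, a minimal way to sanity-check the custom tokenizer before training would be something like the sketch below, assuming upstream tortoise's VoiceBpeTokenizer API (encode/decode); the vocab path here is hypothetical:

from tortoise.utils.tokenizer import VoiceBpeTokenizer

# '../data/hindi_tokenizer.json' is a placeholder for your custom vocab file
tok = VoiceBpeTokenizer('../data/hindi_tokenizer.json')
ids = tok.encode("यह एक परीक्षण है")
print(ids)              # should be mostly known-token ids, not runs of unknowns
print(tok.decode(ids))  # should roughly reproduce the cleaned input

If english_cleaners were still active for Hindi, the decoded text would come back transliterated to ASCII instead of in Devanagari.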

So, is this code block OK to use, or can I proceed without any modifications?

Can we also use english_cleaners to train non-English languages, or is the modification necessary?


Can we also use english_cleaners to train non-English languages, or is the modification necessary?

See the proviso in modules/dlas/dlas/models/audio/tts/tacotron2/text/cleaners.py:

Cleaners are transformations that run over the input text at both training and eval time.

Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
  1. "english_cleaners" for English text
  2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using the Unidecode library (https://pypi.python.org/pypi/Unidecode)
  3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update the symbols in symbols.py to match your data).