Training with custom dataset #3

Open
opened 2023-08-14 13:08:34 +00:00 by arbianqx · 4 comments

Hello,

First of all, thank you for the great job so far!

We are trying to train a model on our custom dataset, in our own language, but so far we have not had much success!

Firstly, we changed the qnt.py file: at line 34 of qnt.py we added a new setting,
bandwidth_id = 6.0, because the quantising process was failing to start without it.
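
Roughly, the change we made looks like the following (a sketch from memory; the exact spot and variable name in qnt.py are ours, while set_target_bandwidth is the encodec package's API):

from encodec import EncodecModel

# load the 24 kHz EnCodec model and pin its target bandwidth, since quantising
# failed for us when no bandwidth was selected
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6.0 kbps -> 8 RVQ codebooks at 24 kHz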

Secondly, we edited the g2p file to use our language instead of English.

Thirdly, we tried to start training but ran into some errors:

  • We removed the distributed env, since we were unable to start the training.
  • Set use_hdf5 to False
  • Set training with full path to dataset dir
  • And then we get the following error:
    File "/home/Desktop/tts/src/TTS-VALL-E/.mrqvalle/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 328, in __init__ raise ValueError('sampler option is mutually exclusive with ' ValueError: sampler option is mutually exclusive with shuffle

Any ideas on what we are doing wrong, or how to proceed further?

Thanks in advance.

Owner

Firstly we changed the qnt.py file and at the line 34 of the qnt.py, we added a new env:
bandwidth_id - 6.0, because we were unable to start the quantising process, since it was failing.

Ah right, I don't think I got around to validating the dataset preparation post-"rewrite" (more so after overhauling the config class(es)). If you're using mrq/ai-voice-cloning, I think you also need to pass in yaml="./path/to/your/config.yaml"; likewise, if you're using the python3 -m vall_e.emb.qnt method, I imagine you need to pass yaml="./path/to/your/config.yaml" in the command as well, since it derives the EnCodec level to use from the training YAML.
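
For example, something along the lines of (other arguments omitted, and the path is a placeholder):

python3 -m vall_e.emb.qnt yaml="./path/to/your/config.yaml"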

Secondly we edited the g2p file to add our language instead of english.

Right, I keep forgetting to touch vall_e.emb.g2p to allow a user-specified language. If anything, I remember Japanese causing a bit of an issue, at least in the very early infancy of the implementation.
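
To illustrate the idea (not the actual vall_e.emb.g2p code; the phonemizer call and espeak backend here are just one way a user-specified language could be threaded through):

from phonemizer import phonemize

def encode(text: str, language: str = "en-us") -> list[str]:
    # espeak supports many languages via its voice codes, e.g. "de" for German
    phonemes = phonemize(text, language=language, backend="espeak", strip=True)
    return phonemes.split(" ")

print(encode("hello world"))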

Thirdly we tried to start training but we encountered in some errors:

We removed the distributed env, since we were unable to start the training.

Right, I immediately made that setting pointless by turning it into a property that derives whether distributed training is enabled from the WORLD_SIZE environment variable. It should be removed from the example YAML.
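
The gist of it (a sketch of the idea, not the actual config class):

import os

class Config:
    @property
    def distributed(self) -> bool:
        # torchrun / deepspeed export WORLD_SIZE for every worker, so more than
        # one process implies distributed training
        return int(os.environ.get("WORLD_SIZE", "1")) > 1

print(Config().distributed)  # False unless launched with WORLD_SIZE > 1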

And then we get the following error:

File "/home/Desktop/tts/src/TTS-VALL-E/.mrqvalle/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 328, in __init__
  raise ValueError('sampler option is mutually exclusive with '
ValueError: sampler option is mutually exclusive with shuffle

I suppose I didn't get a chance to actually check how it runs without distributed training after having it set up to use it. I think it should be fixed in commit 5fa86182b5.
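
For reference, the constraint PyTorch enforces is that shuffle=True and an explicit sampler can't be combined in the DataLoader constructor. A sketch of one way to satisfy that (not necessarily what the commit does) is to only build the sampler on the distributed path:

import os
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16).float())

if int(os.environ.get("WORLD_SIZE", "1")) > 1:
    # assumes torch.distributed.init_process_group() has already been called
    dl = DataLoader(dataset, batch_size=4, sampler=DistributedSampler(dataset))
else:
    dl = DataLoader(dataset, batch_size=4, shuffle=True)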

When I get a chance I'll validate training both with distributed and without again, since I might have fudged something up with un-hardcoding things.

Author

I applied the changes from your commit and got rid of that problem, but now I think I made a mistake in the dataset preparation part.

File "/home/vall_e/utils/trainer.py", line 213, in train for batch in _make_infinite_epochs(train_dl): File "/home/vall_e/utils/trainer.py", line 163, in _make_infinite_epochs yield from tqdm(dl, "Epoch progress", dynamic_ncols=True)

I followed the guide from enhuiz's original repo: inside data/custom I put the wav files and their transcript txt files, then ran the qnt and g2p steps. Am I missing something in the dataset preparation part?

Thanks a lot!

Owner

Can you post the whole console log, from start to end?

I suppose I should also document what exactly you should add under the YAML's dataset.training and dataset.validation arrays, but for now it should be something like:

dataset:
  training: [
    "./some/path/to/dataset/LibriTTS/43/",
    "./some/path/to/dataset/LibriTTS/1310/",
  ]
  validation: [
    "./some/path/to/dataset/LibriTTS/2443/",
    "./some/path/to/dataset/LibriTTS/573/",
  ]

where, for example, the contents of ./some/path/to/dataset/LibriTTS/1310/ looks like:

1310_1014_00000.phn.txt  1310_1014_00001.phn.txt  1310_1014_00002.phn.txt  1310_1014_00003.phn.txt 
1310_1014_00000.qnt.pt   1310_1014_00001.qnt.pt   1310_1014_00002.qnt.pt   1310_1014_00003.qnt.pt

If your speakers are organised one folder deep rather than two, make sure to also set dataset.speaker_name_getter in the YAML to "lambda p: f'{p.parts[-2]}'" instead.
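
To show what that getter actually receives and returns (paths are placeholders):

from pathlib import Path

p = Path("./some/path/to/dataset/LibriTTS/1310/1310_1014_00000.qnt.pt")

# one folder level per speaker: the speaker name is the immediate parent directory
speaker = (lambda p: f'{p.parts[-2]}')(p)
print(speaker)  # "1310"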

You should have a printout after the dataloader loads about the size of the training, evaluation (subtrain), and validation datasets.

And to double check you aren't loading a cached dataloader, set dataset.cache to False.

Outside of that, those are the only things I can think of validating without a full log.

Owner

Ah, I incidentally just so happened to run into the issue myself. It seems that, technically, I shouldn't be passing the sampler created in the dataset into the DataLoader constructor, as that causes problems:

2023-08-14 21:32:56 - vall_e.utils.trainer - INFO - GR=0;LR=0 -
New epoch starts.
Epoch progress: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/mrq/Programs/ai-voice-cloning/modules/vall-e/vall_e/train.py", line 169, in <module>
    main()
  File "/home/mrq/Programs/ai-voice-cloning/modules/vall-e/vall_e/train.py", line 162, in main
    trainer.train(
  File "/home/mrq/Programs/ai-voice-cloning/modules/vall-e/vall_e/utils/trainer.py", line 213, in train
    for batch in _make_infinite_epochs(train_dl):
  File "/home/mrq/Programs/ai-voice-cloning/modules/vall-e/vall_e/utils/trainer.py", line 163, in _make_infinite_epochs
    yield from tqdm(dl, "Epoch progress", dynamic_ncols=True)
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 433, in __iter__
    self._iterator = self._get_iterator()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 386, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1084, in __init__
    self._reset(loader, first_iter=True)
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1117, in _reset
    self._try_put_index()
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1351, in _try_put_index
    index = self._next_index()
            ^^^^^^^^^^^^^^^^^^
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 620, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mrq/Programs/ai-voice-cloning/venv-3.11-cuda/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 273, in __iter__
    sampler_iter = iter(self.sampler)
                   ^^^^^^^^^^^^^^^^^^
TypeError: 'Sampler' object is not iterable

I'm going to see if I can fix it and then push another commit. Whoops.


Seems to work fine. Fixed in commit 277c759ab1.
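
For anyone hitting the same 'Sampler' object is not iterable error with their own sampler: whatever gets passed as sampler= to the DataLoader has to implement __iter__ (and ideally __len__). A hypothetical minimal example:

import random
from torch.utils.data import Sampler

class ShuffledSampler(Sampler):
    def __init__(self, length: int):
        self.length = length

    def __iter__(self):
        # yield dataset indices in a new random order each epoch
        order = list(range(self.length))
        random.shuffle(order)
        return iter(order)

    def __len__(self):
        return self.length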
