Cross-Lingual TTS Based on the RTVC-7 Voice Cloning Model, in Three Steps: Step 2, Non-Cross-Lingual and Cross-Lingual Experiment Observations with Tuned-GE2E-SayEN-EarSpeech

0. Notes

  • The few DataBaker sentences
  • The open-source DataBaker (Biaobei) corpus
  • Aishell-3
  • M2VoC

Four categories in total, used as reference speech.

Based on the training loss, we use:

  • /ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/synthesizer/saved_models/Kiss/Kiss_195k.pt
  • the 195k-step ckpt; there is actually still room for further training, since it has only trained for two days and the corpus is quite large
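
For reference, loading that checkpoint should be a one-liner. A minimal sketch, assuming the fork keeps the upstream Real-Time-Voice-Cloning PyTorch Synthesizer API (class name, module layout, and lazy loading are assumptions about this fork):

from pathlib import Path
from synthesizer.inference import Synthesizer  # RTVC-style module layout (assumed)

ckpt = Path('/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/'
            'synthesizer/saved_models/Kiss/Kiss_195k.pt')
synthesizer = Synthesizer(ckpt)  # upstream loads the weights lazily on first synthesis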

1. Selecting the Reference Speech

1.1. Selecting the Audio

  1. The first 6 are DataBaker bilingual: 2 Chinese, 2 English, and 2 code-mixed
  2. The next two are from Aishell female speaker 1
  3. The next two are from Aishell female speaker 2
  4. Then one from Aishell male speaker 1
  5. Then one from Aishell male speaker 2
  6. Then two from the open-source DataBaker corpus
  7. Then two from LJSpeech
  8. The last four are from VCTK, female + male
  9. Note that the sample rates differ: some files are 16k and some are 48k (see the resampling sketch after this list)
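
Since the RTVC-style GE2E encoder normally operates on 16 kHz audio, the mixed 16k/48k sources should be brought to one rate before embedding extraction. A minimal sketch with librosa; the 16 kHz target is an assumption based on the usual RTVC encoder setting:

import librosa
import soundfile as sf

# librosa resamples on load when a target sr is given (e.g. 48 kHz VCTK -> 16 kHz)
wav, sr = librosa.load('p225_001.wav', sr=16000)
sf.write('p225_001_16k.wav', wav, sr)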

1.2. Extracting Speaker Embeddings with the GE2E Encoder

The code logic lives in: /ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/Kiss_GE2E_SayEN_Cross_synthesizer.py
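
A minimal sketch of the extraction itself, assuming the fork keeps the upstream Real-Time-Voice-Cloning encoder API (the encoder checkpoint path below is hypothetical):

from pathlib import Path
from encoder import inference as encoder  # RTVC-style module layout (assumed)

encoder.load_model(Path('encoder/saved_models/pretrained.pt'))  # hypothetical path

# preprocess_wav resamples to the encoder rate and trims long silences
wav = encoder.preprocess_wav(Path('samples_Cross/000001.wav'))
embed = encoder.embed_utterance(wav)  # L2-normalized 256-dim speaker embedding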

2. Selecting the Text

input_text = ['Please call Stella.',
'When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.',
'Their prescription is largely about changing attitudes.',
'He is not a Celtic man.',
'First blood double kill triple kill.',
'People who like to eat watermelon are very cute.']
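
These six sentences are batch-synthesized against one reference embedding at a time. A sketch assuming the upstream RTVC synthesize_spectrograms signature, reusing synthesizer and embed from the sketches above; Griffin-Lim is shown as the dependency-free fallback, though the fork may well use a neural vocoder:

from synthesizer.inference import Synthesizer  # for the static griffin_lim helper

# one spectrogram per sentence, all conditioned on the same speaker embedding
specs = synthesizer.synthesize_spectrograms(input_text, [embed] * len(input_text))
wavs = [Synthesizer.griffin_lim(spec) for spec in specs]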

3. Rough Subjective Observations

https://drive.google.com/file/d/1kVLU4LZnJ3uE0F_ryIcOPrhJdcSF8-dJ/view?usp=sharing

  • Overall it is very good, and it has the key advantage of the timbre-transfer approach: no accent
  • Stability, however, is indeed a weak point
  • The general impression is that a longer reference speech gives somewhat more stable synthesis
  • If a reference speech synthesizes two or three sentences stably, the odds are good that it will stay stable when reused for more sentences

4. Speaker Embedding Observations

4.1. Observing with PCA

  • PCA is fast and has few parameters to specify
  • PCA lets us first plot the training set, i.e., the speaker embedding distribution of the VCTK speakers, keep that projection fixed, and then transform the newly arrived speakers (DataBaker etc.); this is more convenient than t-SNE (see the sketch after this list)
  • Although the Git repo is named https://github.com/ruclion/ASV-T-SNE, for now we plot with PCA
  • Both the SayEN and SayCN results go into this project; the SayEN ones uniformly get the suffix SayEN
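
The "fit on VCTK, then project newcomers with the projection held fixed" trick, which t-SNE cannot do, is just fit versus transform in scikit-learn. A minimal sketch; the .npy file names are hypothetical stand-ins for the stored embeddings:

import numpy as np
from sklearn.decomposition import PCA

vctk_embeds = np.load('vctk_embeds.npy')   # (N_vctk, 256), hypothetical file
new_embeds = np.load('cross_embeds.npy')   # (N_new, 256), hypothetical file

pca = PCA(n_components=2)
vctk_2d = pca.fit_transform(vctk_embeds)  # fit on VCTK only; this fixes the axes
new_2d = pca.transform(new_embeds)        # project new speakers into the same 2-D space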

4.2. Data Sources for the Plots

4.2.1. samples_Cross_and_unCross

  • (2 + 2 + 2) + 2 + 2 + 2 speakers
  • Audio in: /ceph/home/hujk17/Tuned-GE2E-EarSpeech/log_FaPig_GE2E_syn_wavs_Cross
  • Speaker embeddings in: /ceph/home/hujk17/Tuned-GE2E-EarSpeech/samples_Cross
cross_samples_path = ['/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/000001.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/000016.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/100001.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/100024.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/200001.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/200005.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/BZNS-000003.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/BZNS-008701.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/LJ001-0001.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/LJ002-0014.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/aishell-SSB01120011.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/aishell-SSB01120079.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/aishell-SSB08870225.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/aishell-SSB11610005.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/aishell-SSB11610040.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/aishell-SSB16300375.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/p225_001.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/p225_003.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/p315_003.wav',
                      '/ceph/home/hujk17/Tuned-GE2E-SayEN-EarSpeech/samples_Cross/p315_023.wav',]

4.2.2. VCTK samples

  • In the folder /ceph/home/hujk17/npy-EarSpeech-HCSI-Data/dereverb_npy there are 174 speakers in total (originally 218; some were probably deleted by labmates during denoising); we take 100/16 utterances from each speaker folder
  • Can the gender of the VCTK speakers be obtained? Yes, from /ceph/dataset/VCTK-Corpus/speaker-info.txt, TODO (a parsing sketch follows)
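
Reading the gender out of speaker-info.txt is a small parsing job. A sketch assuming the standard VCTK file layout (a header line, then whitespace-separated ID / AGE / GENDER / ACCENTS / REGION columns):

def load_vctk_genders(path='/ceph/dataset/VCTK-Corpus/speaker-info.txt'):
    genders = {}
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                genders['p' + parts[0]] = parts[2]  # e.g. 'p225' -> 'F'
    return genders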

Reposted from blog.csdn.net/u013625492/article/details/115006454