Making Speech Synthesis More Expressive: StyleMoE's "Divide and Conquer" Strategy

In recent years, speech synthesis has made remarkable progress: synthesized speech is not only clear and intelligible, but also carries rich emotion and prosody, bringing it closer to human expression. However, extracting and encoding style information from diverse reference speech remains a challenge, especially when the system encounters speech styles it has never seen before.

StyleMoE: Dividing and Conquering the Style Embedding Space

To address this problem, researchers proposed StyleMoE, a model that partitions the style embedding space into tractable subspaces, each handled by a dedicated "style expert". StyleMoE replaces the style encoder in a TTS system with a Mixture-of-Experts (MoE) layer. A gating network routes each reference speech sample to different style experts, so that during optimization each expert specializes in a particular region of the style space.

How StyleMoE Works

The core idea of StyleMoE is to partition the style embedding space into subspaces, each handled by a dedicated style expert. This is like breaking a complex problem into smaller, more tractable ones, with each expert focusing on solving one of them.

Concretely, StyleMoE uses a gating network to decide which experts should process the current reference speech. Based on the characteristics of the reference speech, the gating network selects the most suitable experts and assigns each of them a weight. Each expert has its own parameters and, during optimization, is responsible only for the subspace assigned to it, which improves the model's efficiency and accuracy.
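To make this mechanism concrete, below is a minimal PyTorch sketch of a soft-gated mixture of style experts. The layer sizes, the expert architecture, and the names StyleExpert and MoEStyleEncoder are illustrative assumptions for this post, not the paper's exact design.

```python
# Minimal sketch of an MoE-based style encoder (illustrative assumptions,
# not the paper's exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleExpert(nn.Module):
    """One expert: a small MLP mapping a reference-speech embedding
    to a style embedding for its sub-region of the style space."""

    def __init__(self, ref_dim: int, style_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ref_dim, style_dim),
            nn.ReLU(),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEStyleEncoder(nn.Module):
    """Soft-gated mixture of style experts: a gating network assigns a weight
    to every expert, and the style embedding is the weighted sum of expert outputs."""

    def __init__(self, ref_dim: int = 256, style_dim: int = 128, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [StyleExpert(ref_dim, style_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(ref_dim, num_experts)  # gating network

    def forward(self, ref_emb: torch.Tensor) -> torch.Tensor:
        # ref_emb: (batch, ref_dim) embedding of the reference utterance
        weights = F.softmax(self.gate(ref_emb), dim=-1)                       # (batch, E)
        expert_out = torch.stack([e(ref_emb) for e in self.experts], dim=1)   # (batch, E, style_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)                # (batch, style_dim)


# Usage: encode a batch of reference-speech embeddings into style embeddings.
encoder = MoEStyleEncoder()
style = encoder(torch.randn(8, 256))
print(style.shape)  # torch.Size([8, 128])
```

In this soft-gated form every expert runs on every input; the sparse variant discussed below activates only a subset of experts per utterance.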

Advantages of StyleMoE

StyleMoE offers the following advantages:

  • Broader coverage of the style space: by partitioning the style embedding space into subspaces, StyleMoE handles a wider variety of styles, including styles it has never seen before.
  • Better generalization: each expert is responsible only for its assigned subspace, which improves generalization and reduces reliance on the training data.
  • Lower computational cost: StyleMoE uses a sparse MoE, so only a small number of experts take part in each computation, keeping the computational cost down (see the sketch after this list).
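The sparse routing mentioned above can be sketched as top-k gating in the spirit of sparsely-gated MoE: only the k highest-scoring experts are evaluated, so the cost scales with k rather than with the total number of experts. The function below is an illustrative assumption, not the paper's implementation.

```python
# Sketch of sparse top-k gating: all but the top-k expert weights are zeroed,
# and the kept weights are renormalized (k and the renormalization are assumptions).
import torch
import torch.nn.functional as F


def sparse_gate(gate_logits: torch.Tensor, k: int = 2):
    """Return per-expert weights where all but the top-k entries are zero."""
    topk_vals, topk_idx = gate_logits.topk(k, dim=-1)    # (batch, k)
    topk_weights = F.softmax(topk_vals, dim=-1)          # renormalize over the chosen experts
    weights = torch.zeros_like(gate_logits)
    weights.scatter_(-1, topk_idx, topk_weights)         # (batch, num_experts), mostly zeros
    return weights, topk_idx


# Example: with 8 experts and k=2, each utterance activates only 2 experts.
logits = torch.randn(4, 8)
w, idx = sparse_gate(logits, k=2)
print(w)    # each row has exactly two non-zero weights summing to 1
print(idx)  # indices of the selected experts
```

Only the experts whose indices appear in idx need to be run, which is where the saving in computation comes from.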

Experimental Results

The researchers evaluated StyleMoE on the ESD and VCTK datasets. The results show that StyleMoE outperforms the baseline models on a range of metrics, including:

  • Higher speech quality: speech synthesized by StyleMoE is more natural and intelligible.
  • Higher style similarity: the synthesized speech matches the style of the reference speech more closely (a simple objective proxy is sketched after this list).
  • Better generalization: StyleMoE performs well on styles it has never seen before.
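As an aside, one common objective proxy for style similarity is the cosine similarity between style embeddings extracted from the reference and the synthesized utterance. The snippet below is only an illustration under that assumption; it does not reproduce the paper's actual evaluation protocol (e.g., subjective listening tests).

```python
# Illustrative proxy only: cosine similarity between style embeddings of the
# reference and the synthesized speech (not the paper's evaluation protocol).
import torch
import torch.nn.functional as F


def style_cosine_similarity(ref_style: torch.Tensor, syn_style: torch.Tensor) -> torch.Tensor:
    """Per-utterance cosine similarity between reference and synthesized style embeddings."""
    return F.cosine_similarity(ref_style, syn_style, dim=-1)


ref = torch.randn(8, 128)   # style embeddings from reference speech (hypothetical)
syn = torch.randn(8, 128)   # style embeddings from synthesized speech (hypothetical)
print(style_cosine_similarity(ref, syn).mean())  # average similarity over the batch
```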

Outlook

StyleMoE opens a new direction for advancing speech synthesis. Going forward, the researchers plan to explore different gating network architectures and to apply StyleMoE to more complex speech synthesis systems.

Paper: https://arxiv.org/pdf/2406.03637 · https://arxiv.org/html/2406.03637v1
