Making Speech Synthesis More Expressive: StyleMoE's "Divide and Conquer" Strategy

In recent years, speech synthesis has made remarkable progress: synthesized speech is not only clear and intelligible, but also carries rich emotion and prosody, bringing it closer to human expression. However, extracting and encoding style information from diverse reference speech remains a challenge, especially when the system encounters speech styles it has never seen before.

StyleMoE: Dividing and Conquering the Style Embedding Space

To address this problem, researchers proposed StyleMoE, a model that partitions the style embedding space into tractable subspaces, each handled by a dedicated "style expert". StyleMoE replaces the style encoder in a TTS system with a Mixture-of-Experts (MoE) layer. A gating network routes each reference speech sample to different style experts, so that during optimization each expert specializes in a particular region of the style space.

How StyleMoE Works

The core idea of StyleMoE is to partition the style embedding space into subspaces, each handled by a dedicated style expert. This is like breaking a complex problem into smaller, more tractable ones, with each expert focusing on solving one of them.

Concretely, StyleMoE uses a gating network to decide which experts should process the current reference speech. Based on the characteristics of the reference speech, the gating network selects the most suitable experts and assigns each of them a weight. Each expert has its own parameters and, during optimization, is responsible only for the subspace assigned to it, which improves the model's efficiency and accuracy.
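To make this mechanism concrete, below is a minimal PyTorch sketch of a soft-gated mixture of style experts. The layer sizes, the expert architecture, and the names StyleExpert and MoEStyleEncoder are illustrative assumptions for this post, not the paper's exact design.

```python
# Minimal sketch of an MoE-based style encoder (illustrative assumptions,
# not the paper's exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleExpert(nn.Module):
    """One expert: a small MLP mapping a reference-speech embedding
    to a style embedding for its sub-region of the style space."""

    def __init__(self, ref_dim: int, style_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ref_dim, style_dim),
            nn.ReLU(),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEStyleEncoder(nn.Module):
    """Soft-gated mixture of style experts: a gating network assigns a weight
    to every expert, and the style embedding is the weighted sum of expert outputs."""

    def __init__(self, ref_dim: int = 256, style_dim: int = 128, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [StyleExpert(ref_dim, style_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(ref_dim, num_experts)  # gating network

    def forward(self, ref_emb: torch.Tensor) -> torch.Tensor:
        # ref_emb: (batch, ref_dim) embedding of the reference utterance
        weights = F.softmax(self.gate(ref_emb), dim=-1)                       # (batch, E)
        expert_out = torch.stack([e(ref_emb) for e in self.experts], dim=1)   # (batch, E, style_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)                # (batch, style_dim)


# Usage: encode a batch of reference-speech embeddings into style embeddings.
encoder = MoEStyleEncoder()
style = encoder(torch.randn(8, 256))
print(style.shape)  # torch.Size([8, 128])
```

In this soft-gated form every expert runs on every input; the sparse variant discussed below activates only a subset of experts per utterance.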

Advantages of StyleMoE

StyleMoE offers the following advantages:

  • Broader coverage of the style space: by partitioning the style embedding space into subspaces, StyleMoE handles a wider variety of styles, including styles it has never seen before.
  • Better generalization: each expert is responsible only for its assigned subspace, which improves generalization and reduces reliance on the training data.
  • Lower computational cost: StyleMoE uses a sparse MoE, so only a small number of experts take part in each computation, keeping the computational cost down (see the sketch after this list).
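The sparse routing mentioned above can be sketched as top-k gating in the spirit of sparsely-gated MoE: only the k highest-scoring experts are evaluated, so the cost scales with k rather than with the total number of experts. The function below is an illustrative assumption, not the paper's implementation.

```python
# Sketch of sparse top-k gating: all but the top-k expert weights are zeroed,
# and the kept weights are renormalized (k and the renormalization are assumptions).
import torch
import torch.nn.functional as F


def sparse_gate(gate_logits: torch.Tensor, k: int = 2):
    """Return per-expert weights where all but the top-k entries are zero."""
    topk_vals, topk_idx = gate_logits.topk(k, dim=-1)    # (batch, k)
    topk_weights = F.softmax(topk_vals, dim=-1)          # renormalize over the chosen experts
    weights = torch.zeros_like(gate_logits)
    weights.scatter_(-1, topk_idx, topk_weights)         # (batch, num_experts), mostly zeros
    return weights, topk_idx


# Example: with 8 experts and k=2, each utterance activates only 2 experts.
logits = torch.randn(4, 8)
w, idx = sparse_gate(logits, k=2)
print(w)    # each row has exactly two non-zero weights summing to 1
print(idx)  # indices of the selected experts
```

Only the experts whose indices appear in idx need to be run, which is where the saving in computation comes from.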

Experimental Results

The researchers evaluated StyleMoE on the ESD and VCTK datasets. The results show that StyleMoE outperforms the baseline models on a range of metrics, including:

  • Higher speech quality: speech synthesized by StyleMoE is more natural and intelligible.
  • Higher style similarity: the synthesized speech matches the style of the reference speech more closely (a simple objective proxy is sketched after this list).
  • Better generalization: StyleMoE performs well on styles it has never seen before.
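As an aside, one common objective proxy for style similarity is the cosine similarity between style embeddings extracted from the reference and the synthesized utterance. The snippet below is only an illustration under that assumption; it does not reproduce the paper's actual evaluation protocol (e.g., subjective listening tests).

```python
# Illustrative proxy only: cosine similarity between style embeddings of the
# reference and the synthesized speech (not the paper's evaluation protocol).
import torch
import torch.nn.functional as F


def style_cosine_similarity(ref_style: torch.Tensor, syn_style: torch.Tensor) -> torch.Tensor:
    """Per-utterance cosine similarity between reference and synthesized style embeddings."""
    return F.cosine_similarity(ref_style, syn_style, dim=-1)


ref = torch.randn(8, 128)   # style embeddings from reference speech (hypothetical)
syn = torch.randn(8, 128)   # style embeddings from synthesized speech (hypothetical)
print(style_cosine_similarity(ref, syn).mean())  # average similarity over the batch
```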

Outlook

StyleMoE opens a new direction for advancing speech synthesis. Going forward, the researchers plan to explore different gating network architectures and to apply StyleMoE to more complex speech synthesis systems.

Paper: https://arxiv.org/pdf/2406.03637 · https://arxiv.org/html/2406.03637v1
