How to generate Muong speech directly from Vietnamese text: Cross-lingual speech synthesis for close language pair
DOI: https://doi.org/10.54939/1859-1043.j.mst.81.2022.138-147

Keywords: Machine translation; Text to speech; Ethnic minority language; Vietnamese; Muong dialects; Unwritten languages; Cross-lingual speech synthesis

Abstract
The paper introduces a method for automatically translating Vietnamese text into Muong speech in two dialects, Muong Bi - Hoa Binh and Muong Tan Son - Phu Tho, both of which are unwritten dialects of the Muong language. Because Vietnamese and Muong are very closely related, the translation system was built as a cross-lingual speech synthesis system, in which the input is text in one language (Vietnamese) and the output is speech in another (the two Muong dialects). The system uses the modern sequence-to-sequence neural TTS models Tacotron 2 and WaveGlow. The evaluation showed high translation quality (fluency 4.61/5.0, adequacy 4.79/5.0) and high synthesized speech quality (naturalness 4.68/5.0 on the MOS scale, intelligibility 94.60%). These results suggest that the proposed system is also applicable to other minority languages, especially unwritten ones.
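To make the described pipeline concrete, the sketch below shows a Tacotron 2 + WaveGlow inference cascade in Python: the acoustic model maps an input character sequence to a mel spectrogram, and the neural vocoder converts that spectrogram into a waveform. It loads NVIDIA's publicly available pretrained checkpoints through torch.hub purely to illustrate the cascade; the hub entry names, the English text front-end, and the placeholder input are assumptions for this sketch, not the authors' models, which would be trained on paired Vietnamese text and Muong speech.

```python
# Minimal sketch of the Tacotron 2 -> WaveGlow cascade (text in, waveform out),
# following NVIDIA's published torch.hub example. For illustration only: the paper's
# system is trained on paired Vietnamese text and Muong speech, which these public
# English checkpoints are not. Assumes a CUDA device, as in NVIDIA's example.
import torch
from scipy.io.wavfile import write

# Acoustic model: character sequence -> mel spectrogram.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
tacotron2 = tacotron2.to('cuda').eval()

# Neural vocoder: mel spectrogram -> waveform.
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()

# Text front-end utilities shipped with the hub entry (character-to-id conversion, padding).
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

text = "hello world"  # placeholder; a Vietnamese-to-Muong system needs its own text front-end
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
    audio = waveglow.infer(mel)                      # mel spectrogram -> waveform

write("output.wav", 22050, audio[0].cpu().float().numpy())  # default 22.05 kHz sample rate
```

In the cross-lingual setting described in the abstract, the same cascade is trained so that the spectrograms predicted from Vietnamese text correspond to recorded Muong speech, so the vocoder output is in the target Muong dialect rather than in Vietnamese.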