Pitch Marking Study towards High Quality in Concatenative Based Speech Synthesis

Margarita Selene Salazar Ávila, José Antonio Trejo Carrillo, Fabián Navarrete Rocha, Naram Isaí Hernández Belmontes, Jesús Rodríguez Zamarrón, Javier Saldívar Pérez, Ana Luisa Hernández Gutiérrez, César Gamboa Rosales, Lorena Raquel Casanova Luna, Abubeker Gamboa Rosales, Claudia Sifuentes Gallardo


Concatenative speech synthesis systems generally provide considerable synthesis quality, since the criteria for unit selection methods have been optimized. However, the level of synthesis quality still depends on the adequate concatenation of speech units. A precondition for adequate concatenation is that concatenation mismatches, such as phase mismatch and discontinuity of the spectral envelope, must not appear in the synthesized speech signal. Avoiding phase mismatches therefore leads to high speech synthesis quality, and phase mismatches are avoided by means of an appropriate pitch marking algorithm. For this reason, a pitch marking study was carried out by evaluating the available pitch marking algorithms. A speech database was pitch marked repeatedly, using the different pitch marking algorithms. Several sentences were then synthesized using the different pitch markings of the speech database. A Mean Opinion Score (MOS) listening test was carried out to evaluate the synthesized sentences with regard to the human perception of mismatches. The best pitch marking algorithm was selected according to its observed effect on the quality of the synthesized speech.
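
The selection step described above can be illustrated with a minimal Python sketch. It assumes that each listener rates each synthesized sentence on the usual 1 to 5 opinion scale and that the MOS of an algorithm is the arithmetic mean of its ratings. The function names, the algorithm labels and the ratings are hypothetical placeholders chosen for illustration; they are not the algorithms or the results of this study.

```python
from statistics import mean


def mean_opinion_scores(ratings):
    """Return the MOS (arithmetic mean of the ratings) for each algorithm.

    `ratings` maps an algorithm label to the list of listener ratings
    collected for sentences synthesized with that algorithm's pitch marks.
    """
    scores = {}
    for algorithm, values in ratings.items():
        if not values:
            raise ValueError(f"no ratings for {algorithm!r}")
        if any(value < 1 or value > 5 for value in values):
            raise ValueError(f"rating outside the 1-5 scale for {algorithm!r}")
        scores[algorithm] = mean(values)
    return scores


def best_algorithm(ratings):
    """Return the label of the algorithm with the highest MOS."""
    scores = mean_opinion_scores(ratings)
    return max(scores, key=scores.get)


# Placeholder ratings, invented for illustration only (not study data).
example_ratings = {
    "algorithm_a": [3, 4, 3, 4],
    "algorithm_b": [4, 5, 4, 5],
    "algorithm_c": [2, 3, 3, 2],
}

if __name__ == "__main__":
    for name, score in mean_opinion_scores(example_ratings).items():
        print(f"{name}: MOS = {score:.2f}")
    print("selected:", best_algorithm(example_ratings))
```

In a real listening test, a plain mean would usually be reported together with a confidence interval, since small MOS differences between algorithms may not be perceptually meaningful.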

Copyright (c) 2018 Fabian Navarrete Rocha, Margarita Selene Salazar Ávila

This work is licensed under a Creative Commons Attribution 4.0 International License.