Modern speech recognition methods

IRSTI 28.23.37                                                                  No. 1 (2021)



Mamyrbayev O., Oralbekova D.


This article presents the main ideas, advantages, and disadvantages of models based on hidden Markov models (HMM) combined with Gaussian mixture models (GMM) and of end-to-end models, and notes that end-to-end modeling is a developing area of speech recognition. The authors give an analytical review of the varieties of end-to-end systems for automatic speech recognition, namely models based on connectionist temporal classification (CTC), attention-based mechanisms, and conditional random fields (CRF), and compare them theoretically. Finally, the respective advantages and disadvantages of these systems and their possible future development are indicated.
Keywords: automatic speech recognition, hidden Markov models, end-to-end, neural networks, CTC.



