Modern Speech Recognition Methods

IRSTI 28.23.37                                                                  No. 1 (2021)


Oralbekova D.O., Mamyrbayev O.Zh.

The article presents the main ideas, advantages, and disadvantages of two families of speech recognition models: those that combine hidden Markov models (HMM) with Gaussian mixture models (GMM), and integral (end-to-end) systems. It notes that the end-to-end approach is an actively developing direction in speech recognition. The article gives an analytical review of the main varieties of end-to-end automatic speech recognition systems, namely models based on connectionist temporal classification (CTC), on the attention mechanism, and on conditional random fields (CRF), and compares them theoretically. It concludes with their respective advantages and disadvantages and the possible future development of these systems.

Keywords: automatic speech recognition, hidden Markov models, end-to-end, neural networks, CTC.

