NATIVE ACCENT SENSITIVE VOICE CLONING USING PAIRWISE RANKING BASED DECODER MODELS, 122-129.

Chetan Madan, Harshita Diddee, Deepika Kumar, Shilpa Gupta, Shivani Jindal, Mansi Lal, and Chiranjeev

References

[1] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, Exploring the limits of language modeling, arXiv preprint arXiv:1602.02410, 2016.
[2] F. Nolan, Forensic phonetics, Journal of Linguistics, 27(2), 1991, 483–493.
[3] L.H. Kim, S.I. Kang, H.S. Ryu, W.I. Jang, S.B. Lee, and A. Pandya, Design and implementation of artificial intelligent motorized wheelchair system using speech recognition and joystick, Control and Intelligent Systems, 33, 2005, doi: 10.2316/Journal.201.2005.2.201-1512.
[4] Y. Shao, W. Dong, S. Ma, and X. Sun, Enriching classroom teaching means by utilizing E-learning resources in collaborative edge and core cloud, Mechatronic Systems and Control, 47(2), 2019, doi: 10.2316/J.2019.201-2977.
[5] Y. Stylianou, Voice transformation: A survey, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, 2009, 3585–3588.
[6] S. Arık, J. Chen, K. Peng, W. Ping, and Y. Zhou, Neural voice cloning with a few samples, Advances in Neural Information Processing Systems, 2018, 10019–10029.
[7] S.S. Mehri, et al., SampleRNN: An unconditional end-to-end neural audio generation model, in 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, 2017.
[8] J. Sotelo, S. Mehri, K. Kumar, J.F. Santos, K. Kastner, et al., Char2wav: End-to-end speech synthesis, ICLR, 2017.
[9] J.H. Kim and S.B. Lee, Speech recognition using multilayer recurrent neural prediction models and HMM, Control and Intelligent Systems, 35, 2007, 9–14.
[10] S.O. Arik, M. Chrzanowski, A. Coates, G. Diamos, et al., Deep Voice: Real-time neural text-to-speech, 2017.
[11] S.O. Arik, et al., Deep Voice 2: Multi-speaker neural text-to-speech, Advances in Neural Information Processing Systems, 2017-December, 2017, 2963–2971.
[12] W. Ping, et al., Deep Voice 3: Scaling text-to-speech with convolutional sequence learning (University of California, Berkeley, OpenAI), ICLR, 2018.
[13] Y. Wang, et al., Tacotron: Towards end-to-end speech synthesis, in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2017-August, 2017, 4006–4010.
[14] J. Shen, et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, 2018, 4779–4783.
[15] A. van den Oord, et al., WaveNet: A generative model for raw audio, 2016.
[16] T. Kaneko and H. Kameoka, CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks, 2018.
[17] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks, 2018.
[18] P. Parikh, K. Velhal, S. Potdar, A. Sikligar, and R. Karani, English language accent classification and conversion using machine learning, SSRN Electronic Journal, 2020.
[19] K. Chionh, M. Song, and Y. Yin, Application of convolutional neural networks in accent identification, 2019, 4–7.
[20] J.S. Garofolo, et al., TIMIT acoustic-phonetic continuous speech corpus, 1993.
[21] C. Veaux, J. Yamagishi, and K. MacDonald, CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit, The Centre for Speech Technology Research (CSTR), University of Edinburgh.
[22] J. Kominek and A.W. Black, The CMU Arctic databases for speech synthesis, Proceedings of the ISCA Workshop on Speech Synthesis, 2004, 223–224.
[23] L. Wan, Q. Wang, A. Papir, and I.L. Moreno, Generalized end-to-end loss for speaker verification, in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2018-April, 2018, 4879–4883.
