Segmental Evaluation of Accent and Pronunciation for Speech Generation – Alice Ross & Jinzuomu Zhong (University of Edinburgh)
Event details
Alice Ross (UKRI CDT in Natural Language Processing (NLP)) and Jinzuomu Zhong (UKRI CDT in Designing Responsible NLP) are PhD researchers at the Centre for Speech Technology Research in Edinburgh. Alice’s project focuses on listeners’ perceptions of synthesised speech and applications of speech technology in society. Jinzuomu works on modelling aspects like pronunciation and prosody to improve expressive and controllable speech synthesis.
We’ll briefly introduce some of our previous research and outline the project we’re currently working on, which explores how well (or not) text-to-speech (TTS) systems reproduce real speakers’ regional accents. These systems’ output is often claimed to have ‘human parity’, i.e. to be indistinguishable from natural human speech. But when we use data from speakers whose accents aren’t Mainstream American English or Standard Southern British English, we find that claim doesn’t hold, and that existing objective metrics (automated evaluations that compare generated speech samples against reference recordings) don’t align with human perception. We’ll be running a series of workshops to test an annotation task designed to capture fine-grained, segmental evaluation of accent issues in generated speech. We also aim to collect participants’ feedback on the task itself, on which features are most salient when judging accentedness, and on their experiences with speech and speech technology more generally.
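For readers unfamiliar with the "objective metrics" mentioned above: these typically score generated speech by measuring its spectral distance from a reference recording. The sketch below is purely illustrative (a simplified log-spectral distance in NumPy, not any specific metric from the project — real TTS evaluation more often uses measures such as mel-cepstral distortion), but it shows the basic compare-against-reference principle:

```python
import numpy as np

def log_spectral_distance(reference, generated, n_fft=512):
    """Toy objective metric: RMS distance between the log-magnitude
    spectra of a reference and a generated signal (equal-length 1-D
    arrays). Lower means the two signals are spectrally closer."""
    ref_spec = np.abs(np.fft.rfft(reference, n=n_fft))
    gen_spec = np.abs(np.fft.rfft(generated, n=n_fft))
    eps = 1e-10  # avoid log(0)
    diff = 20 * np.log10(ref_spec + eps) - 20 * np.log10(gen_spec + eps)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy check: identical signals score 0; a detuned signal scores higher.
t = np.linspace(0, 1, 16000, endpoint=False)
ref = np.sin(2 * np.pi * 220 * t)       # 220 Hz tone as "reference"
same = log_spectral_distance(ref, ref)
detuned = log_spectral_distance(ref, np.sin(2 * np.pi * 233 * t))
print(same, detuned)
```

A metric like this captures gross acoustic mismatch, but — as the talk argues — such global distances can miss exactly the fine-grained, segment-level accent deviations that human listeners notice.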