Interspeech Shared Task: Automatic Speech Recognition for Non-Native Children’s Speech

Organized by cleong

First phase

Closed Track: starts April 14, 2020, midnight UTC
Open Track: starts April 14, 2020, midnight UTC
Competition ends: April 25, 2020, noon UTC

This shared task will help advance the state-of-the-art in automatic speech recognition (ASR) by considering a challenging domain for ASR: non-native children's speech. A new data set containing English spoken responses produced by Italian students will be released for training and evaluation. The spoken responses in the data set were produced in the context of an English speaking proficiency examination. The following data will be released for this shared task: a training set of 49 hours of transcribed speech, a development set of 2 hours of transcribed speech, a test set of 2 hours of speech, and a baseline Kaldi ASR system with evaluation scripts. The shared task will consist of two tracks: a closed track and an open track. In the closed track, only the training data distributed as part of the shared task can be used to train the models; in the open track, any additional data can be used to train the models.

For questions about the shared task, please email

Important Dates

  • Release of training data (initial set), development data, and baseline system: February 7, 2020
  • Release of training data (additional 40 hours): February 14, 2020
  • Test data released and opening of submission site: April 17, 2020
  • Closing of submission site: April 24, 2020 (midnight anywhere in the world, i.e., 12pm UTC on April 25)
  • Announcement of results: April 27, 2020
  • Interspeech paper submission deadline: May 8, 2020



Organizers

Daniele Falavigna, Fondazione Bruno Kessler

Roberto Gretter, Fondazione Bruno Kessler

Marco Matassoni, Fondazione Bruno Kessler

Keelan Evanini, Educational Testing Service

Ben Leong, Educational Testing Service 


Further information about the shared task

The availability of large amounts of training data and large computational resources have made Automatic Speech Recognition (ASR) technology usable in many application domains, and recent research has demonstrated that ASR systems can achieve performance levels that match human transcribers for some tasks. However, ASR systems still present deficiencies when applied to speech produced by specific types of speakers, in particular, non-native speakers and children.

Several phenomena that regularly occur in non-native speech can greatly reduce ASR performance, including mispronounced words, ungrammatical utterances, disfluencies (including false starts, partial words, and filled pauses), and code-switched words.  ASR for children’s speech can be challenging due to linguistic differences from adult speech at many levels (acoustic, prosodic, lexical, morphosyntactic, and pragmatic) caused by physiological differences (e.g., shorter vocal tract lengths), cognitive differences (e.g., different stages of language acquisition), and behavioral differences (e.g., whispered speech). Developing ASR systems for both of these domains is made more challenging due to the lack of publicly available databases for both non-native speech and children’s speech.

Despite these difficulties, a significant portion of the speech transcribed by ASR systems in practical applications may come from both non-native speakers (e.g., newscasts, movies, internet videos, human-machine interactions, human-human conversations in telephone call centers) and children (e.g., educational applications, smart speakers, speech-enabled gaming devices). Therefore, it is necessary to continue to improve ASR systems to be able to accurately process speech from these populations.  An additional important application area is the automatic assessment of second language speaking proficiency, where the ASR difficulties can be increased by the low proficiency levels of the speakers, especially if they are children. The lack of training data is especially pronounced for this population (non-native children’s speech).

With this special session we aim to help address these gaps and stimulate research that can advance the present state-of-the-art in ASR for non-native children’s speech.  To achieve this aim we will distribute a new data set containing non-native children’s speech and organize a challenge that will be presented in the special session.  The data set consists of spoken responses collected in Italian schools from students between the ages of 9 and 16 in the context of English speaking proficiency assessments.  The data that will be released includes both a test set (ca. 4 hours) and an adaptation set (ca. 9 hours), both of which were carefully transcribed by human listeners. In addition, a set of around 90 hours of untranscribed spoken responses will be distributed. A Kaldi baseline system will also be released together with the data, and a challenge web site will be developed for collecting and scoring submissions.

The following points make this session special:

  • Distribution of a unique and challenging (from the ASR perspective) set of spoken language data acquired in schools from students of different ages.
  • Organization of a challenge addressing research topics in several ASR subfields, including:
    • Language models:  How to handle grammatically incorrect sentences, false starts and partial words, code-switched words, etc.
    • Lexicon:  Generation of multiple pronunciations for non-native accents, training of pronunciation models, etc.
    • Acoustic models: Multilingual model training, transfer learning approaches, model adaptation for non-native children (supervised, unsupervised, lightly supervised), modeling of spontaneous speech phenomena, acoustic models for non-native children, etc.
    • Evaluation:  Database acquisition and annotation of non-native speech, performance evaluation for non-native children’s speech
    • Handling low-resource training/adaptation data for less commonly studied populations (non-native speech, children’s speech)
  • Establishing benchmarks for future research.
  • Establishing a common data set for additional future annotations for applications beyond ASR (e.g., computer assisted language learning).
  • The special session will be supported by SIG-CHILD, the ISCA special interest group focusing on multimodal child-computer interaction, and will continue a series of productive events that have been hosted by SIG-CHILD in the area of child-computer interaction and analysis of children’s speech since 2008 (including the Interspeech 2019 special session entitled Spoken Language Processing for Children’s Speech).

Submissions to the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech will be evaluated according to the Word Error Rate (WER) between the ASR hypotheses in the submission and the reference human transcriptions for the evaluation set as calculated by the evaluation script that was distributed with the training data. 

Each participating team may submit one submission per day for each track during the evaluation period with a maximum of 7 submissions per team per track. The submissions will be ranked against other submissions based on WER, regardless of the order of the submissions (e.g., if a team's submission from the first day achieves the lowest WER out of a total of 7 submissions from that team, the submission from the first day will be the top-ranking submission for that team).

The performance of the baseline system developed by the organizers at FBK is a WER of 35.09% on the evaluation set.  This result is displayed as "Baseline System from Organizing Team" on the CodaLab leaderboard for the shared task.

A participating team can view detailed results for their submissions, including the number of substitutions, deletions, and insertions, by going to the "Participate" tab in CodaLab, selecting the "View / Submit Results" sub-page, clicking on the "+" at the right side of the entry for a submission in the table to expand the box, and then accessing the "View scoring output log" link.  The detailed results for that submission will then be displayed on a separate webpage in the following format:

WER= 35.09% (S= 971 I= 437 D= 711) / REFERENCE_WORDS= 6038 - UTTERANCES= 578 
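The reported WER is the standard ratio of substitutions, insertions, and deletions to the number of reference words, and the totals in the example line above are internally consistent. A minimal sketch that checks this (the parsing helper is illustrative and not part of the distributed evaluation scripts; only the line format comes from the example above):

```python
import re

def parse_scoring_line(line):
    """Parse a scoring-output line of the format shown above."""
    m = re.search(
        r"WER=\s*([\d.]+)%\s*\(S=\s*(\d+)\s*I=\s*(\d+)\s*D=\s*(\d+)\)\s*/"
        r"\s*REFERENCE_WORDS=\s*(\d+)\s*-\s*UTTERANCES=\s*(\d+)",
        line,
    )
    wer, s, i, d, n_ref, n_utt = m.groups()
    return float(wer), int(s), int(i), int(d), int(n_ref), int(n_utt)

line = "WER= 35.09% (S= 971 I= 437 D= 711) / REFERENCE_WORDS= 6038 - UTTERANCES= 578"
wer, s, i, d, n_ref, n_utt = parse_scoring_line(line)

# WER = (S + I + D) / REFERENCE_WORDS = (971 + 437 + 711) / 6038
computed = 100.0 * (s + i + d) / n_ref
print(round(computed, 2))  # 35.09, matching the reported WER
```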


  • Tracks
    • This shared task has two tracks: a closed track and an open track.  Submissions to the closed track should be generated by systems that were trained only using the data distributed for this shared task (including the training and development sets) and no other sources of data.  Submissions to the open track may use any additional sources of data.
    • When making a submission, please double-check to make sure that you are submitting it to the intended track.  To select the track, click on either the "Closed Track" or "Open Track" button at the top of the Submit / View Results page.
  • Number of submissions
    • Each participating team may submit one submission per day for each track during the evaluation period with a maximum of 7 submissions per team per track.
    • Only one single team member per participating team should register on CodaLab and make submissions.
  • Submission file format
    • The submission file should be a plain text file that contains 578 lines, one line for each audio file in the evaluation set.
    • Each line in the submission file should first contain the audio file ID and then the corresponding ASR output for that audio file.
    • Example:

1010106_en_22_20_100 this is a fake asr output

1010106_en_22_20_101 this is a fake asr output

1010106_en_22_20_102 this is a fake asr output

    • The lines in the submission file should be sorted by the audio file ID and should be in the same order as the entries in the sample submission file for the evaluation set distributed with the evaluation data (TLT2017eval.fake.asr).
    • Then, the submission file should be compressed as a .zip file in order to be uploaded to the CodaLab site, e.g., by running the command "zip results.zip results.txt" (the .zip file and the .txt submission file can have any arbitrary names)
    • The .zip file containing the submission should be uploaded in the Submit / View Results section of the Participate tab.  First enter a brief description of the system configuration that was used to generate the results in the submission in the text box for tracking purposes and then click on "Submit" to upload the .zip file.
  • Leaderboard
    • After a participating team makes a submission, the results will be posted on the leaderboard.
    • Each team will be able to view the results for all of their own previous submissions; however, the leaderboard will only display the best current result obtained by the other teams (not all results for other teams that made multiple submissions).
    • The user names will be anonymized on the leaderboard, so the identity of the team that made each submission will not be visible to other teams.
    • However, participating teams have the option of selecting a team name to identify themselves.  This team name will be displayed on the leaderboard (teams can also choose to not select a team name).  To select a team name, go to the Settings for your registered CodaLab profile and fill in the "Team name" field.
    • After the close of the competition, the identity of the winning team will be announced.  Then the identities of all participating teams will be announced at the Interspeech 2020 special session about the shared task (if your team would prefer to remain anonymous even after the competition has closed, please let the organizers know).
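Putting the submission rules above together, preparing an upload might look like the following sketch (the hypotheses dict and file names are hypothetical; only the one-line-per-file format, the ID-sorted order, and the .zip packaging come from the instructions):

```python
import zipfile

# Hypothetical ASR hypotheses keyed by audio file ID; a real submission
# would contain one entry for each of the 578 evaluation files.
hypotheses = {
    "1010106_en_22_20_101": "this is a fake asr output",
    "1010106_en_22_20_100": "this is a fake asr output",
}

# Write one line per audio file ("<file_id> <asr output>"), sorted by file ID.
with open("results.txt", "w", encoding="utf-8") as f:
    for file_id in sorted(hypotheses):
        f.write(f"{file_id} {hypotheses[file_id]}\n")

# Compress for upload (equivalent to running "zip results.zip results.txt").
with zipfile.ZipFile("results.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.txt")
```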

