Airbus, in collaboration with IRIT and Safety Data-CFH, held a challenge to
assess the current state-of-the-art in automatic speech recognition and call
sign detection in English Air Traffic Control (ATC) communications. The
challenge provided participants with annotated training data and a leaderboard
to assess the performance of their systems on a held-out set of development
data. ATC communications are challenging for today's technology as the audio
is contaminated with various types of noise, and the speech comes from a
wide range of speakers with different native and non-native accents. The
speech is generally in English, but may be in the language spoken in the
country (in the case of this challenge, French) and may contain code-switching
with English. ATC communications are generally spoken at a fast speech rate
and make use of domain-specific grammar and vocabulary. There are many
potential uses of speech technology in the domain of air traffic
communications to improve safety and training.
The joint submission by Vocapia Research and the Spoken Language Processing
Group at LIMSI-CNRS to the Airbus ATC challenge 2018 ranked first for both the
speech recognition and call sign detection tasks. The speech-to-text
transcription technology used for the challenge builds on systems that have
been under development over the last 20 years, and uses deep neural networks
for both the acoustic and linguistic models. Compared with more general
transcription tasks, the Air
Traffic Control communications are at the same time simpler and more
complicated: the language is nominally more constrained, with a more or less
controlled vocabulary and syntax, but the environmental conditions can be quite
challenging, with various noises and transmission dropouts. The call sign
detection task requires locating a flight identifier in the automatic
transcription. The call sign may be complete, adhering to the full structure
(airline code, followed by 3-5 numbers and optionally 1 or 2 letters) or
partial. The exchanges between the pilot and control occur in a known context,
which makes it easier for humans to understand partial call signs.
However, this contextual information was not available to the automatic
systems, which complicated the call sign detection task.
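The full call sign structure described above can be illustrated with a small pattern matcher. This is only a minimal sketch and not the method used in the challenge submission: it assumes the transcript spells digits and letters out as words, and the vocabularies (AIRLINES, DIGITS, LETTERS) and the function name find_full_call_signs are hypothetical and deliberately tiny. It finds complete call signs only; partial call signs, which need the conversational context, are out of its scope.

```python
import re

# Hypothetical, deliberately small vocabularies for illustration only.
AIRLINES = ["air france", "lufthansa", "speedbird"]
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LETTERS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def _alt(words):
    """Build a non-capturing regex alternation from a list of words."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"


# Airline code, then 3 to 5 digit words, then optionally 1 or 2 letter words.
CALL_SIGN = re.compile(
    r"\b(?P<airline>" + _alt(AIRLINES) + r")"
    r"(?P<digits>(?:\s+" + _alt(DIGITS) + r"){3,5})"
    r"(?P<letters>(?:\s+" + _alt(LETTERS) + r"){0,2})\b"
)


def find_full_call_signs(transcript):
    """Return every complete call sign found in a lowercased transcript."""
    return [m.group(0) for m in CALL_SIGN.finditer(transcript.lower())]


if __name__ == "__main__":
    text = "Air France one two three four alpha bravo descend flight level eight zero"
    print(find_full_call_signs(text))
```

A real system would need the complete airline and phonetic alphabet vocabularies, and would have to tolerate recognition errors in the transcript, which a strict pattern like this one does not.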
About Vocapia Research
Vocapia Research, founded in July 2000, is an R&D company and
software publisher developing and providing leading edge speech
technologies and solutions for many languages, including most major
European Union languages as well as Arabic, Mandarin, and Russian. The
Vocapia Research VoxSigma® software suite uses advanced
language technologies such as language identification, speech
recognition, and speaker identification to transform raw audio and
audiovisual data into structured and searchable XML documents. This
technology relies on over 25 years of research at LIMSI-CNRS, with
which Vocapia Research has a privileged partnership. Joint systems developed
with LIMSI have achieved top ranks in national and international
challenges of speech-to-text transcription. The most common
applications of the VoxSigma software suite are audio and audiovisual
data mining (broadcast data, podcasts, call center data), media
monitoring, and media asset management. Vocapia Research is located in
the scientific pole of the Saclay Plateau, France. For more information
about Vocapia Research, visit the Vocapia Research website or use the
contact information page at http://www.vocapia.com/contact.