An Edtech User’s Glossary to Speech Recognition and AI in the Classroom

By Thomas C. Murray     Sep 2, 2021

In a recent white paper, former Scholastic president of education Margery Mayer dubbed 2021 the “year of speech recognition” in education. And she may be right: A spike in adoption by edtech developers in the first half of this year reflects the recognition that the technology holds the potential not only to create more engaging learning experiences for students, but also to transform the practice of early literacy instruction.

In prior years, such a vision may have seemed far-fetched. But as EdSurge has previously noted, the science behind speech recognition for children has begun to come of age, enabling educational applications that have piqued the interest of edtech developers, educators and researchers alike.

Part of what has enabled the growing use of speech recognition in education is the availability today of technology built specifically to cater to kids’ voices and behaviors. Previous speech recognition systems were modeled on adult voices and lacked the accuracy required for an educational context. The kid-specific speech recognition that now powers oral reading fluency tools is much more accurate and effective, and has the potential to offer what I have described as an increased “return on instruction” for children and their teachers.

These new voice-enabled learning tools also have the potential to address equity and bias. The speech recognition that powers them has been built with diversity in mind so all accents and dialects can be understood equally, thereby democratizing access to educational resources and mitigating the risks of implicit biases in, for example, observational assessments. Perhaps most importantly, however, these solutions are “personal and authentic” because they tap into a student’s most natural tool for learning: their own voice.

While 2021 might be the year of speech recognition in education, the technology itself is relatively new to most educators, families and students, even if they have a voice assistant or smart speaker in the home. And, given the power of this technology, I expect more solutions like Amplify’s mClass Express to enter the market, making it important for educators and others to understand how they work and how best to use them.

Recently, I collaborated with Amelia Kelly, vice president of speech technology at SoapBox Labs, to create a glossary to help educators and edtech developers familiarize themselves with speech recognition and make informed decisions about its use in educational settings. Below are some of the key terms that are particularly important, along with an explanation of why each one matters.

Artificial intelligence (AI)

Systems designed to carry out tasks autonomously rather than being specifically programmed by humans.

Why it matters: AI is increasingly being used in education products, a trend that, no doubt, will continue in the coming years.

Machine learning

A subset of AI that trains computers on large amounts of data so they can carry out tasks automatically and at scale.

Why it matters: Machine learning algorithms “learn” and “improve” with each experience, which improves the speech recognition functions of voice-enabled educational tools.

Deep learning

A machine learning algorithm based on deep neural networks, which require large amounts of training data and have a multi-layered architecture that allows them to model complex behaviors like human speech and language usage.

Why it matters: Neural networks are used extensively for speech recognition, image recognition, and other pattern-recognition problems, which have applications for K-12 learning.

Voice technology

An umbrella term for technologies that allow users to interact with products, services and platforms using their voices. The underlying technologies that enable this are speech recognition (understanding human speech), speech synthesis (computers speaking aloud), natural language processing (reading and understanding human language) and machine translation (converting human speech from one language to another).

Why it matters: In the K-12 edtech context, voice technology—and speech recognition, in particular—can power a number of use cases, enabling independent reading practice, language learning, dyslexia screening, learning feedback, and summative and formative assessment.

Automatic speech recognition/Speech recognition/Speech-to-text

Allows digital devices to convert speech into text, making it easier for a device to understand the intent of the speaker. Words or concepts in the text can trigger actions (e.g., “turn off the lights,” “text my sister”).

Why it matters: Once a digital device has a transcript of the child’s reading, it can compare it against a rubric to determine reading fluency and comprehension. It can also provide time stamps for individual words, making it easy for a teacher to find a particular word or phrase read by the child and listen back to it. These systems can also return pronunciation “confidence scores” at the utterance, word and even phoneme level.
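
As a rough illustration of how time stamps and confidence scores might be used, here is a minimal Python sketch. It does not reflect any particular vendor's output: the `Word` record, its fields and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word (hypothetical shape, not a real engine's format)."""
    text: str
    start: float       # seconds from the start of the recording
    end: float         # seconds from the start of the recording
    confidence: float  # assumed pronunciation score between 0.0 and 1.0


def find_word(words, target):
    """Return (start, end) time stamps for every occurrence of a word."""
    target = target.lower()
    return [(w.start, w.end) for w in words if w.text.lower() == target]


def low_confidence(words, threshold=0.5):
    """Return the words whose score falls below the chosen threshold."""
    return [w.text for w in words if w.confidence < threshold]


# Made-up sample result for the sentence "the cat sat".
reading = [
    Word("the", 0.0, 0.3, 0.95),
    Word("cat", 0.4, 0.8, 0.42),
    Word("sat", 0.9, 1.3, 0.88),
]
```

With this sample, `find_word(reading, "cat")` tells a teacher where to listen back in the recording, and `low_confidence(reading)` points to the word that may need attention.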


Bias mitigation

Intentional processes used to reduce or remove unintended bias in speech recognition. Artificial intelligence systems can reflect the biases of their creators, resulting in inferior and often prejudicial experiences for underrepresented users. Machine learning algorithms, in particular, carry out decisions based on data sets on which they have been trained and can become biased if those data sets are not representative of diverse populations.

Why it matters: A biased system can amplify and propagate deep-seated prejudices held by the designers of that system, as well as the limitations of available data sets. The effects of such biases in practice, assessment and screening platforms, and in learning tools for kids, can be disastrous. If a biased system fails to understand a child’s accent or dialect while reading, for example, it can feed back to that child that they are a poor reader when, in fact, they’re reading correctly. An unbiased system, on the other hand, will offer fair and uncompromised feedback and data, helping education companies and platforms support children on their learning journey.
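
One simple way to look for this kind of bias is to compare how often a system gets words wrong for different groups of speakers. The Python sketch below assumes an evaluation has already counted word errors per group; the group labels and numbers are made up for illustration, and a real audit would involve far more than a single comparison.

```python
def error_rates_by_group(results):
    """Map each group label to its word error rate.

    `results` maps a group label to (word_errors, total_words).
    Groups with no words are skipped to avoid dividing by zero.
    """
    return {
        group: errors / total
        for group, (errors, total) in results.items()
        if total > 0
    }


def largest_gap(rates):
    """Difference between the worst and best error rates across groups."""
    return max(rates.values()) - min(rates.values())


# Made-up evaluation counts for two hypothetical accent groups.
results = {"group_a": (5, 100), "group_b": (20, 100)}
rates = error_rates_by_group(results)
gap = largest_gap(rates)
```

A large gap between groups suggests the system understands some children better than others and deserves a closer look.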

Voice-enabled assessment

Uses speech recognition technology to listen, identify and assess learning invisibly while the child is reading aloud.

Why it matters: Voice-enabled assessment tools, used in the classroom or remotely, can provide data on pronunciation and oral reading fluency. They can also be used to screen for learning challenges like dyslexia. When used to power assessments, speech recognition technology provides data that can support and improve educational outcomes for children, as well as help determine the type and level of support provided by teachers.

Keyword detection

A feature of speech recognition engines that identifies keywords and phrases in speech.

Why it matters: Keyword detection is particularly useful when analyzing children’s speech, where search terms in an audio file can be identified in isolation, in a sentence or through background noise. For example, a child might pick his or her favorite animal from a list. Keyword detection can score each of the possible responses, triggering a response within the game or lesson.
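
The favorite-animal example can be sketched in a few lines of Python. This assumes the speech engine has already returned a score for each possible answer; the score values, the 0.6 threshold and the function name are hypothetical.

```python
def pick_keyword(scores, threshold=0.6):
    """Return the highest-scoring keyword, or None if none is convincing.

    `scores` maps each candidate keyword to an assumed score from 0.0 to 1.0.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Made-up scores for a child answering "What is your favorite animal?"
scores = {"lion": 0.12, "elephant": 0.87, "zebra": 0.05}
choice = pick_keyword(scores)
```

Returning `None` when nothing clears the threshold lets the game ask the child to try again instead of guessing.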

Pronunciation assessment

Assesses the quality of the pronunciation of a word or phrase.

Why it matters: Pronunciation assessments are a tremendous time-saving tool for teachers, particularly when supporting in-person observational assessments. They provide teachers with scores that compare what the child actually said to a given target word, helping teachers better understand where a student may be struggling and need more support or attention.

Fluency assessment

Assesses children’s oral reading fluency.

Why it matters: Another time-saving tool for teachers. When a child reads a passage, the speech recognition system records and counts the number of word substitutions, omissions, insertions and correct words. That, in turn, becomes a measurement of fluency, sometimes expressed as “words correct per minute” or “WCPM.”
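
To make the counting concrete, here is a minimal Python sketch, not any product's actual method. It assumes the recognizer has already produced the child's words as a list, lines them up against the passage using a standard word-level edit-distance alignment, and then converts the number of correct words into WCPM. The function names and sample sentence are illustrative.

```python
def align_counts(reference, hypothesis):
    """Count correct words, substitutions, omissions and insertions."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = fewest edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + diff,  # match or substitution
                             cost[i - 1][j] + 1,         # omission
                             cost[i][j - 1] + 1)         # insertion
    # Walk back through the table to label each step of the alignment.
    counts = {"correct": 0, "substitutions": 0, "omissions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                counts["correct" if diff == 0 else "substitutions"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["omissions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


def wcpm(correct_words, seconds):
    """Words correct per minute for a reading that took `seconds`."""
    return correct_words * 60.0 / seconds


passage = "the quick brown fox jumps".split()
spoken = "the quick fox jumped".split()
counts = align_counts(passage, spoken)
```

In this made-up reading, the child skips “brown” and says “jumped” for “jumps,” so the tally is three correct words, one substitution and one omission. If the reading took 30 seconds, `wcpm(counts["correct"], 30)` gives 6.0.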

Speech-therapy assessments

Voice-enabled assessments that evaluate speech patterns and sentence structure.

Why it matters: Speech recognition-powered screening and practice tools can identify speaking patterns that may point to speech development pathologies. They also enable students to practice at home between speech therapy sessions while providing progress data to speech therapists.

Privacy by design

An approach to technology development, design and processes that ensures individual users’ data privacy rights are protected from the earliest stages through to the end-user experience. Privacy by design commits companies to transparency when it comes to handling data. One example is a commitment to use the data they collect only to improve their service and not for commercial purposes such as reselling, profiling or advertising.

Why it matters: When it comes to kids’ data rights, privacy cannot be an afterthought or designed in at a later stage. Privacy needs to be baked into every level of infrastructure, data, and process, and be part of the ethos and vision of the voice-enabled solution from the very beginning.

