Speech recognition is the process by which a computer converts spoken language into text. Its long-term goal is to allow people to speak naturally to a computer on any topic and to be understood accurately. Speech is a form of communication we learn early and practice often, so speech recognition software can simplify computer interfaces and make computers accessible to users who are unable to enter text with a standard keyboard. However, computer-based speech recognition is more difficult to achieve than one might at first assume.
The speech recognition process is statistical in nature and is based on Hidden Markov Models (HMMs). An HMM is a finite set of states, each of which is associated with a probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. The HMM is first trained using speech data for which the associated text is known. Subsequently, the trained HMM is used to "decode" new speech data into text.
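The decoding step can be illustrated with the Viterbi algorithm, a standard way of finding the most probable state sequence in an HMM. The following is only a minimal sketch for a toy discrete HMM: the state names, observation symbols, and probability values are invented for illustration, and real recognizers work on continuous acoustic features with far larger models.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability) for a discrete HMM.

    All probabilities must be non-zero because the sketch works in log space.
    """
    # Each column maps a state to (best log score ending there, previous state).
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the best predecessor state for s.
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)

    # Trace back from the best final state.
    last = max(states, key=lambda s: columns[-1][s][0])
    log_prob = columns[-1][last][0]
    path = [last]
    for t in range(len(columns) - 1, 0, -1):
        path.append(columns[t][path[-1]][1])
    path.reverse()
    return path, log_prob

# Hypothetical two-state model; all values are assumptions for illustration.
STATES = ("a", "b")
START_P = {"a": 0.6, "b": 0.4}
TRANS_P = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
EMIT_P = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}

path, log_prob = viterbi(["x", "x", "y"], STATES, START_P, TRANS_P, EMIT_P)
print(path, math.exp(log_prob))
```

In this sketch the transition and emission tables are written by hand; in a trained system they would be estimated from speech data for which the associated text is known.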
The recognition vocabulary and its size play a key role in determining the accuracy of a system. A vocabulary defines the set of words that can be recognized by a speech recognition system. In addition, a language model is used to estimate the probability of a sequence of words in a particular domain. The language model assists the speech engine in recognizing speech by biasing the output toward high-probability word sequences. The speech recognition engine uses the vocabulary and the language model together to select the best match for a word. As a result, speech systems can only "hear" words that are present in the vocabulary; a word that is not in the vocabulary will be misinterpreted as a similar-sounding word that is present in the vocabulary.
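One simple way to estimate the probability of a word sequence is a bigram model, which scores each word given the word before it. The sketch below is an assumption-laden illustration, not the method of any particular engine: the training sentences are invented, the function names are hypothetical, and add-one smoothing is used only so that unseen word pairs do not receive zero probability.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count word pairs in a small training corpus (illustrative only)."""
    history, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        history.update(words[:-1])              # how often each word starts a pair
        bigrams.update(zip(words, words[1:]))   # how often each pair occurs
        vocab.update(words[1:])
    return history, bigrams, vocab

def sequence_log_prob(sentence, model):
    """Log probability of a word sequence under the add-one smoothed bigram model."""
    history, bigrams, vocab = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (history[prev] + len(vocab)))
    return total

# Invented training text standing in for a domain corpus.
CORPUS = [
    "that may be quite hard",
    "it can be quite loud",
    "be quite sure",
    "the wall is white",
]
model = train_bigram(CORPUS)

for phrase in ("be quite", "beak white"):
    print(phrase, sequence_log_prob(phrase, model))
```

With this toy corpus, "be quite" scores higher than the similar-sounding "beak white", which is the kind of bias toward high-probability word sequences described above.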
Since speech recognition is probabilistic, the most probable decoding of the audio signal is output as the recognized text, but multiple hypotheses are considered during the process. Recognition systems generally have no means to distinguish between correctly and incorrectly recognized words. Therefore, during recognition, a "word lattice representation" is often used to consider all hypothesized word sequences. A word lattice representation is a directed acyclic graph that consists of nodes and arcs used to represent the multiple hypotheses considered during recognition. The nodes represent points in time, and the arcs represent the hypothesized words. The path with the highest probability is generally output as the final recognized text. Often, the multiple hypotheses (for example, phrases such as "be quite" and "beak white") sound nearly the same and may only be distinguished by higher-level semantic knowledge provided by the language model.
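Selecting the highest-probability path through a word lattice can be sketched as a best-path search over a directed acyclic graph. The example below assumes that nodes are integers numbered in time order, so every arc runs from a lower to a higher node number, and that each arc carries a single combined probability; the lattice contents and the function name are invented for illustration.

```python
import math

def best_lattice_path(arcs, start, end):
    """Return (log probability, word list) of the most probable path.

    arcs: list of (from_node, to_node, word, probability) with from_node < to_node.
    """
    best = {start: (0.0, [])}
    # Visiting arcs in order of source node guarantees that the best score
    # for a node is final before any arc leaving that node is processed.
    for src, dst, word, prob in sorted(arcs, key=lambda arc: arc[0]):
        if src not in best:
            continue  # node not reachable from the start
        score = best[src][0] + math.log(prob)
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    return best[end]

# Hypothetical lattice with two competing hypotheses between nodes 0 and 3.
LATTICE = [
    (0, 1, "be", 0.6),
    (1, 3, "quite", 0.7),
    (0, 2, "beak", 0.4),
    (2, 3, "white", 0.5),
]

score, words = best_lattice_path(LATTICE, start=0, end=3)
print(" ".join(words), math.exp(score))
```

The arc probabilities here are hand-picked; in practice they would combine acoustic and language model scores.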
Speech Recognition Applications
Speech recognition applications may be classified into three categories: dictation systems, navigational or transactional systems, and multimedia indexing systems. Each category of applications has a different tolerance for speech recognition errors. Technological advances are bringing closer the goal of any individual being able to speak naturally to a computer on any topic and to be understood accurately.
Dictation applications are those in which the words spoken by a user are transcribed directly into written text. Such applications are used to create text such as personal letters, business correspondence, or e-mail messages. Usually, the user has to be very explicit, specifying all punctuation and capitalization in the dictation. Dictation applications often combine mouse and keyboard input with spoken input. Using speech to create text can still be a challenging experience, since users have a hard time getting used to the process of dictating. Best results are achieved when the user speaks clearly, enunciates each syllable properly, and has organized the content mentally before starting. As the user speaks, the text appears on the screen and is available for correction. Correction can take place either with traditional methods, such as a mouse and keyboard, or with speech.
Speech is used in transactional applications to navigate around the application or to conduct a transaction. For example, speech can be used to purchase stock, reserve an airline itinerary, or transfer bank account balances. It can also be used to follow links on the web or move from application to application on one's desktop. Most often, but not exclusively, this category of speech applications involves the use of a telephone. The user speaks into a phone, the signal is interpreted by a computer (not the phone), and an appropriate response is produced. A custom, application-specific vocabulary is usually used; this means that the system can only "hear" the words in the vocabulary. This implies that the user can only speak what the system can "hear." These applications require careful attention to what the system says to the user since these prompts are the only way to cue the user as to which words can be used for a successful outcome.
Multimedia Indexing Applications
In multimedia indexing applications, speech recognition is used to transcribe words from an audio file into text. The audio may be part of a video. Subsequently, information retrieval techniques are applied to the transcript to create an index with time offsets into the audio. This enables a user to search a collection of audio/video documents using text keywords. Retrieval of unstructured multimedia documents is a challenge; retrieval using keyword search based on speech recognition is a big step toward addressing this challenge. It is important to have realistic expectations with respect to retrieval performance when speech recognition is used. The user interface design is typically guided by the "search the speech, browse the video" metaphor, where the primary search interface is through textual keywords, and browsing of the video is through video segmentation techniques. In general, it has been observed that the accuracy of the top-ranking search results is more important than finding every relevant match in the audio, so speech indexing systems often bias their ranking to reflect this. Since the user does not directly interact with the indexing system using speech input, standard search engine user interfaces are seamlessly applicable to speech indexing interfaces.
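An index with time offsets can be sketched as an inverted index that maps each transcribed word to the documents and times at which it was spoken. The example below is a minimal illustration under stated assumptions: the file names, words, and timings are invented, the function names are hypothetical, and ranking by the number of matching words is a simple stand-in for the ranking schemes that real systems use.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to a list of (document id, start time in seconds).

    transcripts: {doc_id: [(word, start_seconds), ...]} as produced by a recognizer.
    """
    index = defaultdict(list)
    for doc_id, words in transcripts.items():
        for word, start in words:
            index[word.lower()].append((doc_id, start))
    return index

def search(index, query):
    """Return [(doc_id, sorted offsets)], documents with the most hits first."""
    hits = defaultdict(list)
    for term in query.lower().split():
        for doc_id, start in index.get(term, []):
            hits[doc_id].append(start)
    ranked = ((doc_id, sorted(offsets)) for doc_id, offsets in hits.items())
    return sorted(ranked, key=lambda item: (-len(item[1]), item[0]))

# Invented recognizer output for two hypothetical video files.
TRANSCRIPTS = {
    "news.mp4": [("stock", 1.5), ("market", 2.0), ("rally", 2.6)],
    "talk.mp4": [("market", 10.0), ("garden", 12.3)],
}

index = build_index(TRANSCRIPTS)
for doc_id, offsets in search(index, "stock market"):
    print(doc_id, offsets)
```

The returned offsets let a player jump directly to the point in the audio or video where a keyword was recognized.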
Speech recognition technology has advanced to the point that it is practical to consider speech input in applications. Speech recognition is also gaining acceptance as a means of creating searchable text from audio streams. Dictation applications have the highest accuracy requirements and must be designed for efficient error correction. Transactional applications are more tolerant of speech errors but require careful design of the constrained vocabulary and cueing of the user. Multimedia indexing applications are also tolerant of speech errors, since the search algorithm can be adapted to meet the requirements of the application.
See also Input Devices; Neural Networks; Pattern Recognition.
Typically, speech recognition is a multistage process, starting with the digital sampling of the acoustic signal, followed by some form of spectral analysis, such as linear predictive coding (LPC) or cochlear modeling. The next stage is to recognize the elements of speech: phonemes, groups of phonemes, and words. Many systems employ hidden Markov model (HMM) algorithms, dynamic time warping (DTW), or neural networks (NN) for the recognition phase. In addition, most systems utilize some knowledge of the language.
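Of the recognition techniques named above, dynamic time warping is the simplest to sketch: it measures how well two sequences match when one may be spoken faster or slower than the other. The example below is a minimal illustration that assumes one-dimensional feature values and an absolute-difference cost; real systems compare vectors of spectral features, and the template values and names here are invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

def closest_template(features, templates):
    """Return the label of the stored template nearest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# Hypothetical word templates; the numbers are placeholders for acoustic features.
TEMPLATES = {"yes": [1, 2, 3], "no": [3, 1, 1]}

# A slower utterance of the "yes" pattern still aligns with its template.
print(closest_template([1, 2, 2, 3], TEMPLATES))
```

Because the alignment may repeat elements of either sequence, the stretched input matches the "yes" template with zero cost even though the two sequences differ in length.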
See also voice input device.