Glossary

 

A

ACM (Audio Compression Manager)

Code, typically used by an engine, that converts PCM audio data to a different format.

active voice menu

A set of voice commands that can be recognized.

archiving

Storing copies of programs and data to protect against loss.

asleep state

The state in which an application listens to each sound, but responds only to commands on the sleep menu. See also awake state.

audio destination

A device such as an audio speaker or the telephone over which text is played as speech. An audio-destination object is an OLE COM object that supports audio communication interfaces in common with a text-to-speech engine.

audio signal

An electrical signal with varying voltage that becomes sound when amplified and converted to vibrations played by an audio speaker.

audio source

A device such as a microphone or telephone that provides audio data for speech recognition. An audio-source object is an OLE COM object that supports audio communication interfaces in common with a speech recognition engine.

awake state

The state in which an application recognizes and executes commands on active voice menus. See also asleep state.

B

bookmark

A marker embedded in an audio recording that can be used to locate and play back an audio segment.

C

complete-phrase value

The number of milliseconds that the engine waits after the user has stopped speaking before regarding a phrase as complete.

component object

An object defined according to the OLE Component Object Model (COM). A component object has a set of interfaces that communicate with the object, data associated with an instance of the object at run-time, and the ability to support multiple instances of the object running at the same time.
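
A minimal C++ sketch of the pattern this definition describes, assuming only the standard IUnknown conventions; the IGreeter interface and CGreeter class are invented for this example and belong to no real API.

    #include <windows.h>

    // Hypothetical interface for illustration; every COM interface
    // derives from IUnknown.
    struct IGreeter : public IUnknown
    {
        virtual HRESULT STDMETHODCALLTYPE Greet(void) = 0;
    };

    // One possible implementation. Each instance carries its own run-time
    // data (here, the reference count), and any number of instances can
    // run at the same time.
    class CGreeter : public IGreeter
    {
        LONG m_cRef = 1;

    public:
        STDMETHODIMP QueryInterface(REFIID riid, void **ppv)
        {
            // A real implementation would also answer for its own IID.
            if (riid == IID_IUnknown) { *ppv = this; AddRef(); return S_OK; }
            *ppv = nullptr;
            return E_NOINTERFACE;
        }
        STDMETHODIMP_(ULONG) AddRef() { return InterlockedIncrement(&m_cRef); }
        STDMETHODIMP_(ULONG) Release()
        {
            ULONG c = InterlockedDecrement(&m_cRef);
            if (c == 0) delete this;
            return c;
        }
        STDMETHODIMP Greet(void) { return S_OK; } // instance-specific work
    };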

COM (Component Object Model)

See OLE Component Object Model.

context-free grammar

Uses rules that predict the words that might follow the word just spoken, reducing the number of candidates that need to be evaluated to recognize the next word.

continuous speech

A continuous utterance without pauses between words. Some speech recognition engines can recognize continuous speech.

D

degradation

A reduction in quality or performance of a communications channel.

deterioration

The gradual loss of data stored by a speech recognition results object. The information in a results object can occupy a significant amount of memory, so an engine developer may permit the object to discard data automatically as time passes.

dictation grammar

Defines a context for the speaker by identifying the subject of the dictation, the expected style of language, and what dictation has already been done.

digital-audio format

An audio format in which sound is represented as binary or numeric data.

digital-audio stream

Continuous audio data received from or sent to an audio device.

Digital Signal Processor (DSP)

A microprocessor tailored to a particular type of operation, typically the processing of digitized signals. Applications involving communications, compression and audio are performed more efficiently on a DSP than on the host computer.

diphone

A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. For example, the word "hello" consists of these diphones: silence-h, h-eh, eh-l, l-oe and oe-silence.

diphone concatenation

The text-to-speech engine concatenates short digital-audio segments and performs intersegment smoothing to produce a continuous sound.
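
A simplified sketch of the smoothing step in C++, assuming the segments are arrays of normalized samples; real engines use more sophisticated intersegment smoothing than this linear crossfade, and the function name is invented.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Append the next diphone segment to the voice track, crossfading
    // over `fade` samples so the seam between segments is smoothed.
    void appendWithCrossfade(std::vector<float> &voice,
                             const std::vector<float> &next,
                             std::size_t fade)
    {
        std::size_t overlap = std::min({fade, voice.size(), next.size()});
        std::size_t start = voice.size() - overlap;
        for (std::size_t i = 0; i < overlap; ++i)
        {
            // Ramp the old segment out and the new segment in.
            float w = static_cast<float>(i + 1) / (overlap + 1);
            voice[start + i] = (1.0f - w) * voice[start + i] + w * next[i];
        }
        voice.insert(voice.end(), next.begin() + overlap, next.end());
    }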

discrete speech

Every word must be isolated by a pause before and after it (usually about a quarter of a second) in order for the engine to recognize it.

DTMF (Dual Tone Multi-Frequency)

Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.
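
As an illustration, a short C++ sketch that synthesizes the DTMF tone for the digit "5", which combines the 770 Hz row tone with the 1336 Hz column tone; the sample rate and scaling are arbitrary choices for this example.

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    int main()
    {
        const double sampleRate = 8000.0; // a common telephony rate
        const double fLow = 770.0;        // row tone for digit "5"
        const double fHigh = 1336.0;      // column tone for digit "5"
        const double pi = 3.14159265358979323846;

        std::vector<int16_t> samples(8000); // one second of audio
        for (std::size_t n = 0; n < samples.size(); ++n)
        {
            double t = n / sampleRate;
            // The DTMF tone is the sum of the two sinusoids, scaled here
            // to half of the 16-bit PCM range.
            double s = 0.5 * (std::sin(2.0 * pi * fLow * t) +
                              std::sin(2.0 * pi * fHigh * t));
            samples[n] = static_cast<int16_t>(s * 16383.0);
        }
        return 0;
    }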

E

echo canceling

A method of controlling echoing on communication lines, in which the sender checks the inbound channel for a slightly delayed duplicate of its own transmission. In echo canceling, the sender adds an appropriately modified, reversed version of its transmission to the path on which it receives information. The result is to erase the echo electronically but leave incoming data intact.
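
A deliberately simplified C++ sketch of the idea; it assumes the echo is the sender's own transmission delayed by a fixed number of samples and attenuated by a fixed gain, whereas real echo cancelers estimate these quantities adaptively. The function name is invented.

    #include <cstddef>

    // Subtract an estimated echo of the sent signal from the received
    // signal, leaving the incoming data intact.
    void cancelEcho(const float *sent, const float *received, float *out,
                    std::size_t count, std::size_t delay, float gain)
    {
        for (std::size_t i = 0; i < count; ++i)
        {
            float echo = (i >= delay) ? gain * sent[i - delay] : 0.0f;
            out[i] = received[i] - echo;
        }
    }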

energy floor

See noise floor.

engine

A program that does the actual work of recognizing speech or translating text into speech. Most speech recognition engines convert incoming audio data to engine-specific phonemes, which are then translated into text for use by an application. A text-to-speech engine performs the same process, only in reverse. An engine object is an OLE COM object that represents a mode of a speech recognition or text-to-speech engine.

engine enumerator

Enumerates the speech recognition or text-to-speech modes supported by a particular engine.

engine-specific phoneme character set

A character set that describes phonemes, pauses, and so on, and that is specific to a text-to-speech engine.

F

frequency

The rate of vibration or oscillation, measured in hertz (Hz). The normal human ear can detect sounds ranging from 20 Hz to 20,000 Hz.

G

gain

The increase in signaling power, measured in decibels (dB), that occurs as the signal is boosted by an electronic device.
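
As a worked illustration (the formulas are standard, though not part of the original entry): gain in decibels is 10 times the base-10 logarithm of a power ratio, or 20 times the base-10 logarithm of a voltage ratio.

    #include <cmath>

    // Gain in dB from a power ratio; doubling the power is about +3 dB.
    double gainFromPowerRatio(double powerOut, double powerIn)
    {
        return 10.0 * std::log10(powerOut / powerIn);
    }

    // Gain in dB from a voltage ratio; doubling the voltage is about +6 dB.
    double gainFromVoltageRatio(double voltsOut, double voltsIn)
    {
        return 20.0 * std::log10(voltsOut / voltsIn);
    }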

global voice menu

A voice menu that is active all of the time regardless of which window is in the foreground.

grammar

A set of words and phrases that can be recognized by an engine. A grammar object is an OLE COM object that an application uses to control how an engine uses the grammar to recognize speech.

GUID

A globally unique identifier used to identify an interface or object.
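
For illustration, how a GUID is commonly declared in C++ on Windows with the DEFINE_GUID macro; the identifier name and value below are invented for this example, and real values are generated with a tool such as uuidgen.

    #include <windows.h>
    #include <initguid.h> // makes DEFINE_GUID define, not just declare

    // Hypothetical GUID for illustration:
    // {12345678-1234-1234-1234-123456789ABC}
    DEFINE_GUID(IID_IExample,
        0x12345678, 0x1234, 0x1234,
        0x12, 0x34, 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc);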

I

incomplete-phrase value

The number of milliseconds that the speech recognition engine waits after the user has stopped speaking before discarding an incomplete phrase.
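
A schematic C++ sketch (not SAPI code; all names are invented) of how an engine might apply the complete-phrase and incomplete-phrase values once the user stops speaking.

    // silenceMs is the number of milliseconds since the user stopped
    // speaking; the two timeout parameters correspond to the
    // complete-phrase and incomplete-phrase values.
    enum class PhraseAction { Wait, Accept, Discard };

    PhraseAction onSilence(unsigned silenceMs, bool phraseLooksComplete,
                           unsigned completePhraseMs,
                           unsigned incompletePhraseMs)
    {
        if (phraseLooksComplete && silenceMs >= completePhraseMs)
            return PhraseAction::Accept;  // regard the phrase as complete
        if (!phraseLooksComplete && silenceMs >= incompletePhraseMs)
            return PhraseAction::Discard; // give up on the partial phrase
        return PhraseAction::Wait;        // keep waiting
    }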

interface

A set of semantically related functions that an application can call to perform the actions defined for that interface.

interference

Noise or other external signals that affect the performance of a communications channel; also, the electromagnetic signals generated by electronic devices, such as computers, that can disturb radio or television reception.

IPA (International Phonetic Alphabet)

A standard system for indicating specific sounds, first introduced in 1886. The Unicode character set includes all single symbols and diacritics in the most recent revision of the IPA, which occurred in 1989, as well as a few IPA symbols no longer in use.

L

lexicon

See pronunciation lexicon.

limited-domain grammar

Provides a set of words to recognize without using strict syntax structures. A limited-domain grammar is a hybrid between a context-free grammar and a dictation grammar.

localization

Adaptation of a software package from English to the language and conventions of another country.

M

marshaling

The packaging of interface parameters so that they can cross process boundaries. If an instance uses a separate process space from that of the application that invokes it, its data must be marshaled across the process boundary; each interface contains marshaling code that allows its parameters to be transmitted in this way.

matching techniques

The methods by which the engine matches a detected word to known words in its vocabulary.

N

node

A word or phoneme on a recognition path in a recognition/alternative graph generated by an engine.

noise

Any interference that affects the operation of a device. In communications, noise consists of random electronic signals, produced either naturally or by the circuitry, that degrade the quality or performance of a communications channel.

noise floor

The noise value in the signal-to-noise ratio (SNR) for an environment. In general, the higher the noise floor, the more sensitive the engine will be to background noise.

notification sink

Similar to a callback function, except the sink is implemented as an interface with a set of functions rather than as a single function.
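
A minimal C++ sketch of the distinction, assuming standard COM conventions; the interface name and its notifications are invented for this example.

    #include <windows.h>

    // Where a callback is a single function, a notification sink groups a
    // set of related notifications behind one COM interface.
    struct IExampleNotifySink : public IUnknown
    {
        virtual HRESULT STDMETHODCALLTYPE PhraseStart(void) = 0;
        virtual HRESULT STDMETHODCALLTYPE PhraseFinish(void) = 0;
        virtual HRESULT STDMETHODCALLTYPE AudioLevel(DWORD dwLevel) = 0;
    };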

O

OLE Component Object Model (COM)

A specification that defines a binary standard for OLE object implementation independent of programming language.

P

PCM (pulse code modulation)

The most common method of encoding an analog voice signal into a digital bit stream. First, the amplitude of the voice signal is sampled. Then, each sample is coded into binary data, which can be switched, transmitted, and stored digitally.
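
A minimal C++ sketch of the coding step, assuming the sampled amplitude has been normalized to the range -1.0 to 1.0; note that telephony PCM often uses 8-bit companded (mu-law or A-law) codes rather than the linear 16-bit codes shown here.

    #include <cstdint>

    // Quantize one normalized analog amplitude into a 16-bit linear
    // PCM code, clipping out-of-range input.
    int16_t toLinearPcm16(double amplitude)
    {
        if (amplitude > 1.0)  amplitude = 1.0;
        if (amplitude < -1.0) amplitude = -1.0;
        return static_cast<int16_t>(amplitude * 32767.0);
    }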

perplexity

The number of choices at a given node in a recognition path.

phoneme

The smallest structural unit of sound in any language that can be used to distinguish one word from another.

phrase

An ordered list of words that are spoken in the same utterance.

pitch

The tone of a sound, which generally is determined by the sound's frequency. A high-pitched sound has a higher frequency; a low-pitched sound has a lower frequency.

pronunciation lexicon

A database of pronunciations maintained by a speech recognition or text-to-speech engine. An engine may allow an application to collect new or corrected pronunciations from the end-user.

pronunciation rule

A rule followed by a text-to-speech engine to convert text into phonemes.

prosody

The inflection, timing and accent of speech.

R

recognition mode

Each speech recognition engine supports one or more recognition modes, each of which conforms to a different code set or data set. For example, each language (or dialect) supported by the engine has a different mode.

recognition path

A sequence of words or phonemes that an engine analyzed while attempting to recognize an utterance.

recognition rule

A rule followed by a speech recognition engine using a context-free grammar to recognize speech.

recognition/alternative graph

A graph generated by a speech recognition engine that depicts the recognition paths explored by the engine in recognizing an utterance.

recursion

The number of levels of rules in a context-free grammar.

registry

The database in which configuration information is stored. The database takes the place of most configuration and initialization files for Microsoft® Windows® and new Windows-based programs.

results object

See speech recognition results object.

rules

See pronunciation rule and recognition rule.

S

SAPI

Microsoft Speech application programming interface. A set of routines, protocols, and tools that enable programmers to build speech-enabled applications for Microsoft Windows platforms.

SNR (signal-to-noise ratio)

The amount of power, measured in decibels (dB), by which a signal exceeds the amount of channel noise at the same point of transmission. It provides an indication of the clarity or accuracy with which communication can take place.
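
As a worked illustration (standard formula, not part of the original entry): the SNR in decibels is 10 times the base-10 logarithm of the ratio of signal power to noise power, so a signal carrying 100 times the power of the noise has an SNR of 20 dB.

    #include <cmath>

    // Signal-to-noise ratio in dB from signal and noise power.
    // Example: snrDb(100.0, 1.0) returns 20.0.
    double snrDb(double signalPower, double noisePower)
    {
        return 10.0 * std::log10(signalPower / noisePower);
    }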

speaker

The end-user who utters the speech to be recognized by an application. Training performed by a speaker may be stored in a speaker profile.

speaker-adaptive

The engine trains itself to recognize the user's voice while the user performs ordinary tasks.

speaker-dependent

The engine requires the user to train it to recognize his or her voice.

speaker-independent

The engine does not require training. Speaker-independent engines typically start with an accuracy above 95 percent for most users (those who speak without accents).

speaker profile

All of the information the engine has about the speaker, such as a data header, languages for which training has been done, known patterns of speech and the language model, how specific words are pronounced, phonetic training, speaker ID, and speaker preferences.

speech recognition

The ability of a computer to understand the spoken word for the purpose of receiving command and data input from the speaker.

speech-recognition engine

An OLE Component Object Model dynamic-link library (DLL) or executable file (.exe) that performs recognition from a digital-audio stream. Speech recognition engines are supplied by vendors who specialize in the software.

speech-recognition enumerator

Enumerates the engines that are available to an application.

speech-recognition mode

An engine typically provides an assortment of modes that can be used to recognize speech in different languages, dialects, and audio-sampling rates.

speech-recognition results object

Provides detailed information about a speech recognition event.

speech-recognition sharing object

Enumerates shared engine-audio source pairs, or creates new ones.

subword matching

The engine looks for subwords (usually phonemes) and then performs further pattern recognition on them.

synthesis

The text-to-speech engine synthesizes the glottal pulse from human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape and tongue position.

T

tags

See text-to-speech control tags.

TAPI

Microsoft Telephony application programming interface. A set of routines, protocols, and tools that enable programmers to build telephony applications for Microsoft Windows platforms.

Telephony

Refers to computer hardware and software that performs functions traditionally performed by telephone equipment (like voice mail or fax services).

text-to-speech

Technologies for converting textual (ASCII) information into synthetic speech output. Used in voice-processing applications requiring production of broad, unrelated, and unpredictable vocabularies, such as products in a catalog or names and addresses. This technology is appropriate when system design constraints prevent the more efficient use of speech concatenation alone.

text-to-speech control tags

Instructions that can be embedded in text sent to a text-to-speech engine to improve the prosody of the spoken text.

text-to-speech engine

An OLE Component Object Model dynamic-link library (DLL) or executable file (.exe) that provides functionality for converting text to digital-audio speech. Text-to-speech engines are supplied by vendors who specialize in the software.

text-to-speech enumerator

Enumerates the text-to-speech modes provided by all of the engines available to the application.

text-to-speech mode

Analogous to voice quality or personality. Every text-to-speech mode is different, and each allows for different properties such as timbre, accent, language and digital-audio sampling rate.

threshold

The point below which an utterance is rejected as unrecognized.

training

The process of speaking a series of pre-selected phrases for the engine. This provides the engine with more information about the voice of the speaker and can improve speech recognition.

U

Unicode

A 16-bit character set that replaces ASCII and allows any character from any language to be represented in a text string. The Unicode character set contains a subset for International Phonetic Alphabet (IPA) phonemes.

utterance

Anything heard by the engine as a finite series of sounds that the engine attempts to recognize as speech.

V

vocabulary

A set of words used in a grammar. A speech recognition engine typically supports several different sizes of vocabulary, which determine the words that the engine can recognize in a given state.

voice command

A word or phrase associated with a voice menu. When an engine recognizes a voice command, it notifies the application that owns the voice menu containing the command.

Voice Command site

A speech recognition mode and audio source that together serve as a source of Voice Command input.

voice menu

A list of voice commands to which an application can respond. A voice menu must be active before an engine can recognize its commands.

voice-text site

A text-to-speech mode and an audio destination that together serve as a destination for Voice Text output.

VU (Volume Units) Meter

An indicator that displays the volume of sound being received by the microphone or through the line-in port. Optimum reception is achieved when the meter registers in the middle area.

W

whole-word matching

The engine compares the incoming digital-audio signal against a prerecorded template of the word.

word

An atomic Unicode text string. A "word" can contain several vernacular words (such as "Los Angeles") when those vernacular words are always used together.

word separation

The degree of isolation between words required for the engine to recognize a word.

word spotting

A series of words may be spoken in a continuous utterance, but the engine recognizes only a particular word or phrase within it.