Speech-Technology Overview for Windows Vista Developers

Richard Davis

SharpLogic Software

October 2007

Windows Speech Recognition in the Windows Vista operating system enables users to interact with their computers by using their voice. It was designed for people who want to significantly limit their use of the mouse and keyboard while maintaining or increasing their overall productivity. Users can dictate documents and e-mail in mainstream applications, use voice commands to start and switch between applications, control the operating system, and even fill out forms on the Web. If you have access to a Windows Vista computer and a microphone, you can try out the new capabilities for yourself by selecting All Programs | Accessories | Ease of Access | Windows Speech Recognition from the Start menu.

With Windows Speech Recognition, users are empowered right from the start; a guided setup and an interactive training module provide familiarization with key concepts and commands. Windows Speech Recognition also features a user interface that assists users in controlling their computer by voice.

Figure 1. Text that was dictated by speaking into a microphone, shown in Microsoft WordPad

This article provides a high-level road map for developers who need to create speech-enabled applications that:

  • Use voice for application control and input (speech recognition).
  • Translate text into computerized speech (speech synthesis).

In the past, Microsoft offered Windows Desktop Speech technologies for developers through a separate Speech SDK that contained development files and redistributable binaries. The latter files had to be packaged with the application and installed on each user's computer to enable speech capabilities. Because speech technologies have matured and entered mainstream use, this state of affairs has changed for Windows Vista; the operating system now has integrated speech capabilities, and the speech application programming interfaces (APIs) are now included with the Windows SDK.

Windows Vista automatically provides some basic speech capabilities to any application that is designed to work with two Windows accessibility technologies: Microsoft Active Accessibility (MSAA) and Microsoft Windows UI Automation (WUIA). Many client applications already get this baseline functionality for free by using Windows Forms or the Windows Presentation Foundation (WPF), where most UI controls implement accessibility technologies by default. At run time, when speech is used to open or switch to an application, the speech engine queries that application to determine which accessibility features it supports, and then works through those.

Windows Speech Recognition utilizes a technology that is called the Text Services Framework (TSF) to get processed voice commands and text to applications. TSF provides an abstraction layer between applications and various types of input, including language support, handwriting recognition, and keyboard processors. Fortunately for most application developers, Windows Vista provides native TSF support for many of the common Win32 and WPF controls that allow textual input, which provides dictation capabilities for free. In cases in which additional TSF integration is needed, such as if your application or control is directly responsible for displaying and editing text, the Resources for Further Investigation section that appears at the end of this article provides more information.

For additional speech-related capabilities, both native and managed interfaces are provided: a COM-based Microsoft Speech API (SAPI), and the Microsoft .NET Framework 3.0 System.Speech.Recognition and System.Speech.Synthesis namespaces. Windows Vista provides SAPI version 5.3, although applications that were developed with the use of the SAPI 5.1 SDK should be forward-compatible with SAPI 5.3 on Windows Vista.

Figure 2. Speech APIs in Windows Vista

Figure 2 shows that both native and managed applications ultimately work with the synthesizer (text-to-speech) and recognizer (speech-to-text) functionality through the SAPI, although managed applications have an additional layer of abstraction (System.Speech). SAPI is middleware that provides an API for applications and a device driver interface (DDI) for speech engines to implement. Speech engines are either speech recognizers or synthesizers; although the word device is used, these engines are typically software implementations. Windows Vista supplies default recognition and synthesis speech engines, but this architecture enables additional engines to be plugged in without changes to the applications that use the SAPI.

Note  Most application developers will not need to concern themselves with the functionality that lies below the API of their choice (SAPI or System.Speech).
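Even so, it can be instructive to see which engines are installed on a given computer. The following C# sketch, which assumes a project reference to System.Speech.dll (part of the .NET Framework 3.0), simply enumerates the synthesis voices and recognition engines that the managed layer exposes; the class name EngineList is illustrative.

using System;
using System.Speech.Recognition;   // recognizer information
using System.Speech.Synthesis;     // synthesizer (voice) information

class EngineList
{
    static void Main()
    {
        // List the text-to-speech voices that are registered with SAPI.
        using (SpeechSynthesizer synth = new SpeechSynthesizer())
        {
            foreach (InstalledVoice voice in synth.GetInstalledVoices())
            {
                Console.WriteLine("Voice: " + voice.VoiceInfo.Name);
            }
        }

        // List the speech-recognition engines that are installed.
        foreach (RecognizerInfo info in SpeechRecognitionEngine.InstalledRecognizers())
        {
            Console.WriteLine("Recognizer: " + info.Name + " (" + info.Culture + ")");
        }
    }
}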

Although they share some common elements, the speech-recognition and speech-synthesis capabilities can be (and are) used separately. A speech-synthesis engine is instantiated locally in every application that uses it, whereas a speech-recognition engine can be instantiated privately, or the shared desktop instance can be used. The shared speech-recognition engine instance, which runs in the shared recognition-service (SAPISVR.EXE) process, provides two major benefits. First, recognizers generally require considerably more run-time resources than synthesizers, and sharing a recognizer is an effective way to reduce the overhead. Second, the shared recognizer is also used by the built-in speech functionality of Windows Vista, so applications that use the shared recognizer can benefit from the system's existing microphone and feedback UI. There's no additional code to write, and no new UI for the user to learn. After instantiating an engine, an application can adjust its characteristics, invoke operations on it, and register for speech-event notifications.
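As a rough illustration of the two instantiation choices, the following C# sketch (again assuming a reference to System.Speech.dll; class and handler names are illustrative) creates both the shared desktop recognizer, by way of the SpeechRecognizer class, and a private in-process engine, by way of SpeechRecognitionEngine, and registers for speech-event notifications on each. Note that only the in-process engine is configured and started by the application; the shared recognizer relies on the system's microphone setup, and the user controls when it is listening.

using System;
using System.Speech.Recognition;

class RecognizerChoices
{
    static void Main()
    {
        // Option 1: the shared desktop recognizer (hosted in SAPISVR.EXE).
        // It uses the system's existing microphone configuration and feedback UI,
        // and the user turns listening on and off through Windows Speech Recognition.
        SpeechRecognizer shared = new SpeechRecognizer();
        shared.LoadGrammar(new DictationGrammar());
        shared.SpeechRecognized += OnSpeechRecognized;

        // Option 2: a private, in-process engine that the application
        // configures and drives itself.
        SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
        engine.LoadGrammar(new DictationGrammar());
        engine.SpeechRecognized += OnSpeechRecognized;
        engine.SetInputToDefaultAudioDevice();
        engine.RecognizeAsync(RecognizeMode.Multiple);

        Console.WriteLine("Speak into the microphone; press Enter to exit.");
        Console.ReadLine();
    }

    static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        Console.WriteLine("Recognized: " + e.Result.Text);
    }
}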

Speech synthesis, which is commonly referred to as text-to-speech (TTS), is used to translate either plain text or XML into voice. SAPI 5.3 supports the W3C Speech Synthesis Markup Language (SSML) version 1.0. SSML provides the ability to mark up voice characteristics, rate, volume, pitch, emphasis, and pronunciation, so that developers can make TTS sound more natural in their applications. Using speech synthesis is relatively straightforward (see the ISpVoice interface for native development, or the SpeechSynthesizer class for managed development). Collections of hints, called lexicons, can also be supplied to the engine to provide it with pronunciation and part-of-speech information for specific words.
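The following minimal C# sketch shows both approaches with the managed SpeechSynthesizer class: speaking plain text, and speaking a small SSML 1.0 fragment that adjusts rate, volume, and emphasis. The class name SynthesisDemo and the sample sentences are illustrative.

using System;
using System.Speech.Synthesis;

class SynthesisDemo
{
    static void Main()
    {
        using (SpeechSynthesizer synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();

            // Plain text: the engine chooses the prosody on its own.
            synth.Speak("Welcome to speech synthesis on Windows Vista.");

            // SSML: mark up rate, volume, and emphasis explicitly.
            string ssml =
                "<speak version='1.0' " +
                "xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>" +
                "This sentence is spoken normally, but " +
                "<prosody rate='slow' volume='loud'>this part is slow and loud, " +
                "with <emphasis>extra emphasis</emphasis> here.</prosody>" +
                "</speak>";
            synth.SpeakSsml(ssml);
        }
    }
}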

Speech-recognition technology is significantly more complicated than synthesis, but the SAPI does a good job of hiding much of the complexity from the application developer. Speech recognition has two modes of operation: dictation mode and grammar mode. Dictation mode is an unconstrained, free-form speech-interpretation mode that uses a built-in grammar that the recognizer provides for a specific language. For applications that must expose a constrained set of available commands, speech-recognition accuracy can be greatly improved by using a context-free grammar (CFG). SAPI 5.3 now supports the W3C Speech Recognition Grammar Specification (SRGS), which defines a specific set of words and the ways in which they can be combined to form valid sentences.
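As a sketch of the difference between the two modes, the following C# example constrains an in-process engine to a tiny command grammar built with the managed GrammarBuilder and Choices classes; the commented-out line shows how the built-in dictation grammar would be loaded instead. (An SRGS XML grammar can also be loaded by constructing a Grammar from a grammar file or an SrgsDocument.) The class name, phrases, and colors are illustrative.

using System;
using System.Speech.Recognition;

class GrammarModes
{
    static void Main()
    {
        using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())
        {
            // Grammar mode: constrain recognition to a small set of commands,
            // which greatly improves accuracy for command-and-control scenarios.
            Choices colors = new Choices("red", "green", "blue");
            GrammarBuilder builder = new GrammarBuilder("set background to");
            builder.Append(colors);
            Grammar commandGrammar = new Grammar(builder);
            commandGrammar.Name = "background commands";
            engine.LoadGrammar(commandGrammar);

            // Dictation mode: unconstrained, free-form input could be enabled
            // instead by loading the built-in dictation grammar.
            // engine.LoadGrammar(new DictationGrammar());

            engine.SpeechRecognized += OnSpeechRecognized;
            engine.SetInputToDefaultAudioDevice();
            engine.RecognizeAsync(RecognizeMode.Multiple);

            Console.WriteLine("Say, for example, \"set background to blue\". Press Enter to exit.");
            Console.ReadLine();
        }
    }

    static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        Console.WriteLine("Command: " + e.Result.Text);
    }
}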

ASP.NET Web applications should not use the SAPI. Instead, ASP.NET applications can be speech-enabled when they are powered by the Microsoft Speech Server (MSS) and developed with the Microsoft Speech Application SDK (SASDK). These speech-enabled Web applications can be designed for devices that range from telephones to Windows Mobile–based devices to desktop computers.

Resources for Further Investigation

About the author

Richard Davis is a software-design engineer at SharpLogic Software, a Microsoft technology–centric organization that focuses primarily on developing software for the .NET and Win32 platforms, as well as the integration that is required to interface with systems that use Java, Linux, UNIX, and other platforms. In his time at SharpLogic, Richard has played a key role as the primary developer on some of the company's most visible projects, including the development of Microsoft .NET class libraries for programming Skype and LEGO Mindstorms. Richard earned a Bachelor of Science degree in computer science from Washington State University, where he also minored in mathematics.