Audio API Overview for Windows Vista Developers

Richard Davis

SharpLogic Software

October 2007

The Windows Vista operating system provides numerous options for developers who are looking to incorporate audio into their applications. These options include application programming interfaces (APIs) that existed prior to Windows Vista, such as Microsoft Windows Multimedia Audio (historically, Windows Multimedia Extensions or "MME"), Microsoft DirectSound, and Microsoft Windows Media, as well as the new core APIs and Microsoft Media Foundation that were introduced with Windows Vista. Given the multitude of options, it is important to understand both the relationship between the APIs and their relative strengths and weaknesses, in order to decide which API meets your application requirements.

Numerous architectural improvements were made to the audio system in Windows Vista, in order to provide applications and end users with better audio experiences—including low-latency and glitch-resilient audio streaming, improved reliability and security, and even software abstraction of endpoint audio devices such as speakers and microphones. To a certain degree, these benefits are provided "for free" to existing applications—eliminating the need for code updates or API migrations for many application scenarios.

First and foremost, the improved audio experience was enabled by migrating a large portion of the audio-system functionality from kernel mode to user mode. Instead of having audio streams handled and mixed system-wide by kernel-mode drivers (as was the case before Windows Vista), most of this work is now performed in a user-mode audio engine that runs as a service. This improvement represents the first major architectural change in the core audio system since Microsoft Windows 98 arrived with the ability to mix multiple audio streams at once. The benefits from these architectural changes include the following:

  • Reliability of audio engine
  • Better audio-device abstraction via endpoints
  • Per-application volume control

Reliability of the audio engine, as well as the entire audio stack, will continue to improve over time. Prior to Windows Vista (and even to some extent today, as sound drivers are being updated by their respective vendors), a lot of value-added functionality was built into proprietary drivers in order to interface with the audio engine. Because these drivers were loaded in kernel mode, any bugs had the potential to bring down the entire system. With the audio engine in user mode, however, this value-added functionality (often digital signal processing, or DSP) can instead be implemented as user-mode plug-ins. If the audio engine crashes, the system continues to operate, and the services can be restarted automatically and quickly.

For most end users, the most noticeable improvement is the per-application volume control. If you are running Windows Vista, click the sound icon in the system tray to load the volume control for your speaker device. If you then click the Mixer link at the bottom of the volume control, you will see all of the applications that currently have an audio session open with the audio engine.

Figure 1. Volume Mixer in Windows Vista

The other major change that was made to the audio system was the addition of a new layer of core audio APIs, which makes interfacing with the user-mode audio engine possible. These include the Windows Multimedia Device (MMDevice) API, DeviceTopology API, EndpointVolume API, and the Windows Audio Session API (WASAPI). Here is a quick description of their functionality, straight from the MSDN Web site:

  • Windows Multimedia Device (MMDevice) API—Clients use this API to enumerate the audio endpoint devices in the system.
  • Windows Audio Session API (WASAPI)—Clients use this API to create and manage audio streams to and from audio endpoint devices.
  • DeviceTopology API—Clients use this API to access directly the topological features (for example, volume controls and multiplexers) that lie along the data paths that are inside hardware devices in audio adapters.
  • EndpointVolume API—Clients use this API to access directly the volume controls on audio endpoint devices. This API is primarily used by applications that manage exclusive-mode audio streams.

Figure 2. Architecture context for Windows Vista audio pipeline (Shared mode)

As Figure 2 shows, the new core audio APIs affect all of the other audio APIs and existing applications that are part of Windows Vista. For example, the MME and DirectSound APIs are now routed through the core audio APIs, although most legacy applications are unaffected by the change and continue to function as expected. It is also important to note that Figure 2 describes what is called Shared audio mode, which is the default behavior for the high-level APIs and, therefore, for most existing applications. Shared mode is what enables the per-application volume control in Windows Vista.
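To make the Shared-mode behavior concrete, the following Python sketch models how an engine of this kind could combine several application streams into one endpoint stream while honoring a separate volume setting for each application. It is a simplified illustration, not the Windows Vista audio engine or any Windows API: the Session class, the mix_shared function, and the clamping step are hypothetical names and assumptions introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    """One application's audio stream (hypothetical model, not a Windows type)."""
    name: str
    samples: List[float]  # mono samples, each in the range [-1.0, 1.0]
    volume: float = 1.0   # per-application volume, 0.0 (mute) to 1.0 (full)


def mix_shared(sessions: List[Session], endpoint_volume: float = 1.0) -> List[float]:
    """Mix every session into a single stream for one endpoint.

    Each sample is scaled by its own session volume, the scaled samples
    are summed, the sum is scaled by the endpoint (master) volume, and
    the result is clamped to [-1.0, 1.0]. Clamping is an assumption of
    this sketch; a real engine may limit peaks differently.
    """
    if not sessions:
        return []
    length = max(len(s.samples) for s in sessions)
    mixed = []
    for i in range(length):
        # Sessions that have run out of samples contribute silence.
        total = sum(s.samples[i] * s.volume
                    for s in sessions if i < len(s.samples))
        total *= endpoint_volume
        mixed.append(max(-1.0, min(1.0, total)))
    return mixed


# Two applications share one endpoint; the player is turned down to half.
player = Session("player", [0.5, 0.5, 0.5], volume=0.5)
mail = Session("mail", [0.25, 0.0], volume=1.0)
output = mix_shared([player, mail])
```

In this model, lowering one session's volume changes only that application's contribution to the mix, which is the behavior that the Volume Mixer exposes to end users.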

For professional audio developers who need to reduce latency as much as possible and avoid DSP altogether, it is possible to use an Exclusive audio mode, which bypasses the user-mode audio engine entirely. In Exclusive mode, the direct memory access (DMA) buffer that is mapped to the sound card is made available in the application's address space. This is a very powerful feature of the new architecture, because it does not require low-level coding or kernel-streaming programming as before.
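The latency that a single buffer contributes follows from simple arithmetic: the number of frames in the buffer divided by the sample rate. The Python sketch below works through that calculation and converts the result to the 100-nanosecond REFERENCE_TIME units in which the core audio APIs express buffer durations. The function names and the 48 kHz, 480-frame figures are illustrative assumptions, not values taken from this article or measured on any device.

```python
# REFERENCE_TIME values count 100-nanosecond units, so one second is 10,000,000.
REFTIMES_PER_SEC = 10_000_000


def buffer_latency_ms(frames: int, sample_rate_hz: int) -> float:
    """Time, in milliseconds, needed to play one buffer of `frames` frames."""
    return 1000.0 * frames / sample_rate_hz


def frames_to_reftime(frames: int, sample_rate_hz: int) -> int:
    """Express the same buffer duration in 100-nanosecond units."""
    return round(REFTIMES_PER_SEC * frames / sample_rate_hz)


def buffer_bytes(frames: int, channels: int, bits_per_sample: int) -> int:
    """Size of the buffer in bytes for uncompressed PCM audio."""
    return frames * channels * (bits_per_sample // 8)


# Illustrative figures: a 480-frame buffer of 16-bit stereo audio at 48 kHz.
latency = buffer_latency_ms(480, 48_000)
duration = frames_to_reftime(480, 48_000)
size = buffer_bytes(480, 2, 16)
```

Halving the frame count halves the buffer's contribution to latency, but it also halves the time the application has to refill the buffer before a glitch occurs, so the smallest workable buffer depends on the hardware and on how reliably the application is scheduled.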

Although the new core audio APIs are now available and documented on MSDN, it is important to understand when to use them directly and when the use of an existing higher-level API would be more appropriate. If an application uses audio in a simple way, such as to play notification sounds or music, using an existing API is the way to go. That way, if the core audio APIs change in the future, your application will be insulated from those changes. The only situation in which you should use the core audio APIs directly is when you cannot achieve your functionality otherwise, which often occurs during the development of professional audio and real-time communication applications.

The choice between the higher-level APIs can be more difficult to make. A number of different factors will weigh into this choice:

  • Do you have an existing code base to take into consideration?
  • Is your application being designed for Windows Vista, or does it need to operate also on previous versions?
  • What programming languages are you able or willing to use?
  • Do you have a need to work with high-definition premium content and integrate with Windows Vista digital rights management (DRM)?

In general, if you are working on an application that has a high degree of multimedia integration or needs, and are targeting Windows Vista and later, a transition to Media Foundation or continued usage of the DirectX APIs might be the way to go. A great article is available on MSDN that deals with the migration story from Microsoft DirectShow to Media Foundation, and it is listed in the Resources for Further Investigation section that follows.

Resources for Further Investigation

About the author

Richard Davis is a software-design engineer at SharpLogic Software, a Microsoft technology–centric organization that focuses primarily on developing software for the .NET and Win32 platforms, as well as the integration that is required to interface with systems that use Java, Linux, UNIX, and other platforms. In his time at SharpLogic, Richard has played a key role as the primary developer on some of the company's most visible projects, including the development of Microsoft .NET class libraries for programming Skype and LEGO Mindstorms. Richard earned a Bachelor of Science degree in computer science from Washington State University, where he also minored in mathematics.