Microsoft Corporation
Published November 2001
Updated June 2005
Summary: The Real-Time Communications (RTC) platform is a set of core components that provide rich real-time communications features. This platform is used by many in the industry, as well as by various Microsoft products. This paper outlines the media-related features and enhancements provided by these components. Application developers can use the RTC SDK to integrate real-time features that add video and voice to new or existing applications. (7 printed pages)
Introduction
Audio Codecs
Video Bandwidth and Frame Rate
Acoustic Echo Cancellation (AEC)
Redundant Audio Coding
Dynamic Jitter Buffer and Adjustment
Automatic Gain Control (AGC)
Bandwidth Estimation
Quality Control Algorithms
Available Bandwidth
Conclusion
Related Links
Introduction
With the introduction of Microsoft Windows XP, rich communications features have been combined and enhanced to provide the infrastructure for a real-time communications (RTC) experience. These features are leveraged by various Microsoft applications, including Microsoft Office Communicator, MSN Messenger, and Windows Messenger, to expose user-to-user communications by using real-time voice and video, instant messaging, and other collaboration features. In addition, an application programming interface (API) exposes functions that make this rich communications infrastructure available to any application.
This paper details the media features that were added to the Real-Time Communications platform, which provides a rich experience for both end-users and application developers. When applications are built on the Real-Time Communications platform, the end user receives a vivid audio and video experience, and the developer gets a broad set of functionality. Applications built using this API also have access to the instant messaging and presence functionality that the Real-Time Communications platform provides. Information about the RTC API can be found in the Windows Platform SDK documentation.
This paper discusses the following features and improvements:
Audio Codecs
The Real-Time Communications platform supports the audio codecs listed in the table below, along with their sampling rates, bit rates, and RTP packet durations. The codec selection is based on both the capability of the parties involved in the session and the bandwidth between them. For example, if one party is on a dial-up link with a speed of 56 kilobits per second (Kbps), G.711 is disabled because it exceeds the available bandwidth. Similarly, if one party supports SIREN but the other does not, SIREN is disabled. If both parties support SIREN and the bandwidth is sufficient, SIREN is chosen over other audio codecs. Except for G.729, the platform does not support plugging in third-party audio codecs.
Codec | Sampling Rate | Bit Rate | RTP Packet Duration
---|---|---|---
G.711 | 8 kHz | 64 Kbps | 20 msec
G.722.1 | 16 kHz | 24 Kbps | 20 msec
G.723 | 8 kHz | 6.4 Kbps | 30, 60, or 90 msec
GSM | 8 kHz | 13 Kbps | 20 msec
DVI4 | 8 kHz | 32 Kbps | 20 msec
SIREN | 16 kHz | 16 Kbps | 20 or 40 msec
The codec and frame size for an outgoing audio stream are selected and configured according to the capabilities of both parties and the network conditions between them, and the selection changes dynamically during the call. Endpoints that need to interoperate with Microsoft real-time clients should therefore be prepared to support dynamic payload type and frame size changes whenever multiple payload types are published for the same media in the SDP (Session Description Protocol). A minimal sketch of this selection logic follows.
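To make the negotiation rules concrete, here is a minimal sketch, in C++, of a selection that requires mutual support and sufficient bandwidth and prefers SIREN, using the bit rates from the table above. The type and function names are illustrative; the platform's actual negotiation logic is internal.

```cpp
#include <string>
#include <vector>

// Simple descriptor for an audio codec (bit rates from the table above).
struct Codec {
    std::string name;
    int bitRateKbps;
};

// Pick a codec that both parties support and that fits the available
// bandwidth; SIREN is preferred whenever it is usable.
const Codec* SelectAudioCodec(const std::vector<Codec>& local,
                              const std::vector<Codec>& remote,
                              int availableKbps)
{
    const Codec* best = nullptr;
    for (const Codec& c : local) {
        bool mutual = false;
        for (const Codec& r : remote)
            if (r.name == c.name) { mutual = true; break; }
        if (!mutual || c.bitRateKbps > availableKbps)
            continue;                    // must be supported by both and fit
        if (c.name == "SIREN")
            return &c;                   // SIREN wins when it is usable
        if (best == nullptr || c.bitRateKbps > best->bitRateKbps)
            best = &c;                   // otherwise prefer higher bit rate
    }
    return best;                         // nullptr when nothing fits
}
```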
Video Bandwidth and Frame Rate
The H.263 and H.261 codecs are supported for video, and H.263 is always preferred. The bit rate for this codec can vary from 6 to 125 KBps, depending on network conditions. The platform supports both Quarter Common Intermediate Format (QCIF, 176 x 144) and Common Intermediate Format (CIF, 352 x 288). The platform also supports a variety of capture modes; in priority order, the modes currently enabled are MSH263, MSH261, YVU9, I420, IYUV, YUY2, UYVY, RGB16, RGB24, RGB4, and RGB8. Plugging in third-party video codecs is not supported.
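The capture-mode priority can be pictured as a simple first-match search. This is an illustrative sketch only; the format identifiers come from the list above, and the function name is an assumption.

```cpp
#include <string>
#include <vector>

// The platform's capture-mode priority order, from the list above.
static const char* const kCaptureModePriority[] = {
    "MSH263", "MSH261", "YVU9", "I420", "IYUV", "YUY2",
    "UYVY", "RGB16", "RGB24", "RGB4", "RGB8"
};

// Return the highest-priority mode the capture device reports, or an
// empty string when none of the enabled modes is available.
std::string SelectCaptureMode(const std::vector<std::string>& deviceModes)
{
    for (const char* mode : kCaptureModePriority)
        for (const std::string& supported : deviceModes)
            if (supported == mode)
                return supported;      // first match in priority order wins
    return std::string();
}
```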
The bit rate and frame rate for the outgoing video stream are computed from the conditions of the session so that video does not interrupt the audio traffic. As with audio, all changes happen dynamically. The application can use the MaxBitrate and TemporalSpatialTradeOff properties on the IRTCClient interface to influence the algorithm, but it cannot dictate the final settings.
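A hedged sketch of how an application might supply these hints follows. The property accessors shown (put_MaxBitrate, put_TemporalSpatialTradeOff) follow the RTC Client API naming in the Platform SDK, but the specific values are illustrative and the exact signatures should be verified against rtccore.h.

```cpp
#include <rtccore.h>   // RTC Client API declarations

// Suggest limits to the quality-control algorithm. The 100000 bits per
// second cap and the midpoint trade-off value are illustrative only.
HRESULT SuggestVideoSettings(IRTCClient* pClient)
{
    // Hint for the maximum combined bit rate of outgoing streams.
    HRESULT hr = pClient->put_MaxBitrate(100000);
    if (FAILED(hr))
        return hr;

    // Bias the codec toward frame rate or image quality; consult the
    // Platform SDK for the documented range of this property.
    return pClient->put_TemporalSpatialTradeOff(15);
}
```

Note that these are hints, not commands: as described above, quality control retains final authority over the outgoing stream settings.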
Acoustic Echo Cancellation (AEC)
AEC (acoustic echo cancellation) works by modeling the output from the speakers and removing it from the signal captured by the microphone. AEC helps ensure that no echo is heard at the other end.
AEC can be enabled or disabled through the Audio and Video Tuning wizard. In Microsoft products, this wizard is commonly found on the Tools or Options menus. As shown in Figure 1 below, selecting the I am using headphones check box in the Audio and Video Tuning wizard disables AEC; AEC is on by default if this check box is clear. Many cameras and microphones ship with hardware-specific AEC, which often disables this check box so that the user sees it as dimmed. For information about configuring hardware-specific AEC, see the OEM's software documentation.
Figure 1. Audio and Video Tuning Wizard dialog box
AEC can be programmatically enabled or disabled by using the PreferredAEC property on the IRTCClient interface.
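A minimal sketch of toggling AEC through this property follows; the accessor name follows the COM convention for the PreferredAEC property, and the signature should be checked against rtccore.h.

```cpp
#include <rtccore.h>   // RTC Client API declarations

// Request software AEC for the session. The platform may still defer to
// hardware-specific AEC when the device provides its own cancellation.
HRESULT SetSoftwareAec(IRTCClient* pClient, bool enable)
{
    return pClient->put_PreferredAEC(enable ? VARIANT_TRUE : VARIANT_FALSE);
}
```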
For more information about the RTC Client API and its interfaces, see the Windows Platform SDK documentation.
The AEC module that the real-time media platform uses is part of the Microsoft DirectSound application programming interface, and it inherits that module's features and limitations; see the DirectSound documentation in the Platform SDK for details.
Redundant Audio Coding
Redundant audio coding is a technique used to compensate for packet loss. The real-time media platform implements a one-packet redundancy algorithm: when it is enabled, each packet carries both the current audio frame and one earlier audio frame, so if a packet is lost, the receiver has a second chance to get its audio frame from a later packet. This process is documented in IETF RFC 2198. The maximum number of consecutive lost packets that can be recovered is three. The algorithm adapts to information provided by the Real-Time Control Protocol (RTCP).
The algorithm starts with zero redundancy and introduces redundancy when packet loss is detected. The distance between the original packet and the packet that carries a copy of its data determines how many consecutive lost packets can be recovered; this distance can vary from one to three packets. For example, if the distance is two and the receiver loses packet n, it gets the same information in packet n+2. If it loses both packet n and packet n+1, it can still recover all the information from packets n+2 and n+3. If it loses packets n, n+1, and n+2, however, the information in packet n cannot be recovered, because its copy was carried in the lost packet n+2. The table below shows the distance used for different low and high packet loss rates.
Distance | Loss Rate Low (%) | Loss Rate High (%)
---|---|---
0 | 0 | 5
1 | 4 | 10
2 | 9 | 15
3 | 14 | 20
The Real-Time Communications platform performs redundant audio coding automatically.
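One plausible reading of the table is a hysteresis scheme: the distance steps up when the measured loss rate exceeds the current row's high threshold and steps down when it falls below the low threshold. The sketch below implements that reading; the platform's actual adaptation logic is internal.

```cpp
#include <algorithm>

// Low and high loss-rate thresholds (percent), one row per distance,
// taken from the table above.
struct Band { int low; int high; };
static const Band kBands[] = { {0, 5}, {4, 10}, {9, 15}, {14, 20} };

// Adjust the redundancy distance (0 = no redundancy, 3 = maximum) from
// the loss rate reported through RTCP.
int AdjustRedundancyDistance(int distance, int lossRatePercent)
{
    if (lossRatePercent > kBands[distance].high)
        distance = std::min(distance + 1, 3);   // add redundancy
    else if (lossRatePercent < kBands[distance].low)
        distance = std::max(distance - 1, 0);   // back off
    return distance;
}
```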
Dynamic Jitter Buffer and Adjustment
Jitter buffers smooth delay variations in received audio by buffering the packets and adjusting their rendering. The result is a smoother delivery of audio to the user. The client has a jitter buffer that can grow to 500 msec. In other words, the buffer can absorb up to 500 msec of delay variations in the received packets without causing choppy sound.
The total render buffer is a two-second circular buffer. If a packet storm gives the program more than two seconds' worth of data in a very short time, new packets are discarded.
The jitter buffer is readjusted at the beginning of a new audio spurt. By default, the real-time media platform uses silence suppression.
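The buffering policy described above can be sketched as follows. All names, and the idea of tracking a target delay explicitly, are assumptions for illustration; only the 500 msec and two-second limits come from the text.

```cpp
#include <deque>
#include <vector>

// Limits from the text above: the jitter buffer absorbs up to 500 msec of
// delay variation, and the circular render buffer holds at most 2 seconds.
constexpr int kMaxJitterMs = 500;
constexpr int kRenderBufMs = 2000;

struct AudioPacket {
    int durationMs;
    std::vector<unsigned char> samples;
};

class JitterBuffer {
    std::deque<AudioPacket> queue_;
    int bufferedMs_ = 0;
    int targetDelayMs_ = 0;   // readjusted at the start of each audio spurt
public:
    // Returns false (the packet is discarded) when a packet storm would
    // push the buffered audio past the two-second render buffer.
    bool Push(AudioPacket pkt) {
        if (bufferedMs_ + pkt.durationMs > kRenderBufMs)
            return false;
        bufferedMs_ += pkt.durationMs;
        queue_.push_back(std::move(pkt));
        return true;
    }

    // Called at the start of a new audio spurt: pick a target delay from
    // the observed variation, clamped to what the buffer can absorb.
    void Readjust(int observedJitterMs) {
        targetDelayMs_ = std::min(observedJitterMs, kMaxJitterMs);
    }

    // The renderer would delay playback by this amount.
    int TargetDelayMs() const { return targetDelayMs_; }
};
```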
Automatic Gain Control (AGC)
AGC (automatic gain control) is a mechanism by which gain is adjusted automatically as the input signal level changes. The real-time media platform implements AGC by adjusting the microphone gain depending on the level of the captured audio.
When the capture or render device's audio output no longer varies according to the input gain, so that the output is essentially a flat line at maximum level, the audio breaks up. This condition is called clipping. When the real-time media platform detects that the running average peak Pulse Code Modulation (PCM) value (the audio gain) of each packet exceeds a ceiling threshold, it automatically reduces the gain so that clipping does not occur.
On the other hand, if the captured audio is too low (for example, if the running average peak PCM value of each packet is below a floor threshold), the real-time media platform boosts the gain. However, the gain is adjusted so that the level does not exceed the levels set by the user in the tuning wizard.
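The AGC rule can be sketched as a simple feedback loop. The smoothing factor, thresholds, and step sizes below are assumptions; only the general mechanism (a running average of per-packet peak PCM values compared against ceiling and floor thresholds, capped at the user-tuned level) comes from the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustrative constants: the platform's real thresholds are internal.
constexpr double kCeiling = 30000.0;   // near clipping for 16-bit PCM
constexpr double kFloor   = 2000.0;    // signal considered too quiet

// Update the microphone gain from one packet of 16-bit PCM samples:
// track a running average of the per-packet peak and keep it between
// the floor and ceiling thresholds.
double UpdateGain(double gain, double& runningPeak,
                  const int16_t* samples, std::size_t count)
{
    int peak = 0;
    for (std::size_t i = 0; i < count; ++i)
        peak = std::max(peak, std::abs(static_cast<int>(samples[i])));

    runningPeak = 0.9 * runningPeak + 0.1 * peak;   // smooth the peak level

    if (runningPeak > kCeiling)
        gain *= 0.9;              // back off before clipping occurs
    else if (runningPeak < kFloor)
        gain *= 1.1;              // boost a weak signal
    return std::min(gain, 1.0);   // never exceed the user-tuned maximum
}
```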
Bandwidth Estimation
The actual available bandwidth may be less than the local connection speed reported by Windows Sockets. Several factors can cause this discrepancy, including a low-speed link elsewhere in the path or bandwidth consumed by other applications.
To estimate the actual available bandwidth, the real-time media platform sends back-to-back RTCP packets (a technique commonly referred to as packet-pair bandwidth estimation). The other endpoint calculates the delay between the packets to estimate the actual bandwidth. The estimation is initially done for every RTCP report, which is sent approximately once every 5 seconds, and the frequency is then gradually reduced to once for every three RTCP reports.
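The arithmetic behind packet-pair estimation is compact enough to show directly. This is a sketch of the general technique, not the platform's implementation: two packets sent back to back spread out over the bottleneck link, so the size of the second packet divided by the inter-arrival gap approximates the link capacity.

```cpp
// Estimate link capacity from a packet pair: the receiver measures the
// gap between the arrivals of two back-to-back packets and divides the
// second packet's size by that gap.
double EstimateBandwidthBps(double secondPacketBytes,
                            double arrivalGapSeconds)
{
    if (arrivalGapSeconds <= 0.0)
        return 0.0;   // timestamps too close to measure; no estimate
    return secondPacketBytes * 8.0 / arrivalGapSeconds;   // bits per second
}
```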
Quality Control Algorithms
The aim of quality control (QC) in the real-time media platform is to provide a good audio and video experience under different network conditions. QC constantly monitors network conditions, computes the available bandwidth for outgoing streams, and dynamically alters the settings of outgoing audio and video streams to keep the streams smooth and to minimize jitter and delay. Between audio and video outgoing streams, QC puts a higher priority on audio.
QC applies its adjustments to outgoing streams when it receives commands or events from the application, the remote party, or the Real-Time Transport Protocol (RTP) module. The application triggers an adjustment by adding or removing streams or by changing the maximum bit rate setting. An adjustment is also triggered when the remote party sends a new SDP (Session Description Protocol) description, which in turn changes the streams and bit rate settings. The RTP module periodically raises real-time communications media events that report the estimated bandwidth and packet loss rate; upon receiving these events, QC adjusts the outgoing audio and video streams.
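The event-driven shape of QC might be sketched as follows. The trigger names mirror the three sources described above; the report fields and the simple audio-first budget split are assumptions for illustration, not the platform's algorithm.

```cpp
#include <algorithm>

// The three trigger sources named above.
enum class QcTrigger { ApplicationChange, RemoteSdpChange, RtpMediaEvent };

struct MediaReport {
    int estimatedKbps;     // from packet-pair bandwidth estimation
    int lossRatePercent;   // from RTCP receiver reports
};

// On any trigger, recompute the outgoing stream budget, giving audio
// priority over video as the text above describes.
void OnQcTrigger(QcTrigger /*source*/, const MediaReport& report,
                 int& audioKbpsOut, int& videoKbpsOut)
{
    const int kAudioMaxKbps = 64;   // enough for the richest audio codec
    audioKbpsOut = std::min(report.estimatedKbps, kAudioMaxKbps);
    videoKbpsOut = std::max(report.estimatedKbps - audioKbpsOut, 0);
}
```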
The QC algorithm consists of three main parts; one of them, the computation of available bandwidth for outgoing streams, is described next.
Available Bandwidth
QC computes the available bandwidth for outgoing streams from the information it receives, including the packet-pair bandwidth estimate and the packet loss rate reported through RTCP.
Conclusion
The media features included in the Real-Time Communications platform enable developers to create rich VoIP and video-over-IP client experiences in their applications. The platform is used by both internal and external developers, and the applications built on it make millions of calls each month.
Related Links
For more information about the RTC Client API and its interfaces, see the Windows Platform SDK documentation.