Text-to-Speech and the Microsoft Speech Technologies Platform

06/30/2006

SpeechWorks International

July 2003

Applies to:
Microsoft Speech Technologies Platform
Microsoft® Visual Studio® .NET

Summary: Learn how Text-to-Speech has reached new heights of quality, which enables deployment of customer-facing speech applications that rely on it to support a pleasant user experience. (5 printed pages)

Introduction
Why Use TTS?
How Does TTS Work?
Speech Technology Developers and TTS
Conclusion

Introduction

Text-to-Speech (TTS) has been available for decades (since 1939). Unfortunately, the quality of the output, especially its naturalness, has historically been poor. Terms such as "robotic" have been used to describe synthetic speech.

Recently, the overall quality of TTS from some vendors has dramatically improved. Quality is now evident not only in the remarkable naturalness of inflection and intonation, but also in the ability to process text such as numbers, abbreviations and addresses in the appropriate context.

The goal of this paper is to raise awareness that TTS has reached new heights of quality, enabling deployment of customer-facing speech applications that rely on TTS to support a pleasant user experience. For Microsoft® .NET developers this means, for example, the ability to easily convert data already used in other parts of the organization to reach a new audience with audio.

Why Use TTS?

TTS allows applications to stream text from virtually any source for conversion to an audio format. There are many reasons why this is valuable. One of the most obvious is that you can save time and money by not having to pre-record information for storage as sound files. Also, presenting dynamic data using TTS is the only way some speech applications can be realistically deployed. For example, dynamic data for account information, e-mail reading, and so on would be impossible to deliver without TTS. Some vendors provide TTS with quality so human-like that speech applications can substitute TTS for voice talents where prompts are used to direct callers. This level of quality presents an opportunity for a richer user interface for speech applications in general.

From a SALT developer's perspective, TTS provides a means to seamlessly interweave existing corporate data or dynamic information into any speech application with minimal effort. Developers coding with SALT simply use the <prompt> tag to identify when to send text to the TTS engine. Microsoft Visual Studio .NET also supports TTS as a default mode for prompts.

How Does TTS Work?

TTS engines are generally based on either a formant or a concatenative approach. Formant systems generate audio from text using an entirely algorithmic design to create the speech. Concatenative systems draw from a database of speech recorded from a voice talent. Sound segments are then joined (concatenated) to form words. Systems such as Speechify™, included within the Microsoft Speech Technologies server, are concatenative and, because they use actual recorded speech, offer a far more human-like tone. They can even be customized for corporate branding or seamless integration with pre-recorded audio from the same voice talent used for the TTS engine.

However, it is not just improved sound that now makes some TTS ideal for customer-facing applications. TTS systems must also process text correctly. Text contains myriad ambiguities that TTS systems must deal with. Some (such as spelling mistakes) are not handled. Others (such as abbreviations) are expanded for more natural sounding output. TTS engines also provide dictionaries so that developers can customize output of abbreviations, acronyms, symbols, or words. Functions TTS engines perform in preparing streaming text for output include:

Text Normalization: TTS software identifies the words or symbols appropriate to a phrase or a sentence (for example, $1M = "1 million dollars", NOT "dollar sign 1 M").
Linguistic Analysis: Determines the overall flow of control using phrase breaks.
Prosody Generation: Identifies the relative word stress levels and pitch accents within each phrase.
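
The text normalization step can be illustrated with a minimal Python sketch. It is not how Speechify or any other engine implements normalization; the names MAGNITUDES, CURRENCY, expand_currency, and normalize are hypothetical, and only simple dollar amounts are covered.

```python
import re

# Hypothetical magnitude suffixes; a real engine covers far more cases.
MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}

# Matches amounts such as $1M, $2.5B, or $300.
CURRENCY = re.compile(r"\$(\d+(?:\.\d+)?)([KMB])?\b")


def expand_currency(match):
    """Rewrite one matched amount as the words a listener expects to hear."""
    amount, suffix = match.group(1), match.group(2)
    words = [amount]
    if suffix:
        words.append(MAGNITUDES[suffix])
    # Use the singular only for exactly "$1" with no magnitude suffix.
    words.append("dollar" if amount == "1" and not suffix else "dollars")
    return " ".join(words)


def normalize(text):
    """Expand every dollar amount in the text before it reaches the engine."""
    return CURRENCY.sub(expand_currency, text)


print(normalize("The deal was worth $1M."))
# prints: The deal was worth 1 million dollars.
```

The sketch leaves the digits themselves for the engine to read. A production normalizer would also handle dates, times, phone numbers, and other currencies.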

Once the system determines what to read and how a phrase should be spoken, concatenative TTS then identifies the most appropriate "units" of speech stored as segments within its database and completes the process of assembling segments into audio elements as words within a sentence or phrase.

Speech Technology Developers and TTS

The Microsoft Speech Technologies Server uses SAPI 5.1 to interface to TTS engines such as Speechify. For developers, this means there is seamless integration on the .NET platform. Developers must thus concern themselves with three primary areas in building speech applications that include TTS output:

User Interface Design
Code to invoke TTS
Tips to enhance TTS output

User Interface Design is critical to the success of any speech application. From a TTS perspective there are a number of issues developers must consider. One of the most obvious is matching the TTS voice to the application. Is a male or female voice most appropriate? Which dialect is needed? Can you use the same speaker for prompts as the TTS voice so as to offer a more unified voice presentation? You also want to be as consistent as possible when inserting TTS output into a pre-recorded audio string to avoid jarring a caller. Another important issue is whether to allow barge-in while synthesized speech is being read.

Code to invoke TTS using SALT or Microsoft Visual Studio® .NET is quite straightforward. The <prompt> tag in SALT is provided for TTS output. Simple prompts need specify only the text required for output, for example:

<prompt> 33 Gray St., Hutchinson, MN </prompt>

<prompt> Red Flannel Jacket, Size XL </prompt>

The prompt text may also include SSML (Speech Synthesis Markup Language) tags. SSML is proposed as a standard formatting protocol within the W3C, and makes it possible to define how a specific word should be pronounced and where emphasis should be added. SSML also makes it possible to speed up or slow down text playback. The TTS engine must support SSML to enable these capabilities.
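
As an illustration of the kind of markup involved, the following Python sketch builds a small SSML fragment with the standard library. The speak, prosody, and emphasis elements and their rate and level attributes come from the W3C SSML 1.0 specification. The build_ssml function and the sample sentence are hypothetical, and whether a given engine honors each element is something to verify against its documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(before, emphasized, after, rate="slow"):
    """Return an SSML document that slows playback and stresses one word."""
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    # <prosody rate="..."> changes the speaking rate of everything inside it.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = before
    # <emphasis> marks the word that should be stressed.
    emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
    emphasis.text = emphasized
    emphasis.tail = after
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Your order ships on ", "Friday", ", not Thursday."))
```

Building the markup with an XML library rather than string concatenation means that characters such as & in the source text are escaped automatically.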

The <value> element can be used inside the prompt tag to refer to text or markup held in elements of the document. For example:

<prompt>
   So your full name is
   <value targetelement="txtBoxFirstName" targetattribute="value"/> 
   <value targetelement="txtBoxLastName" targetattribute="value"/>.
   Is that right?
</prompt>

The <content> element can be used inside the prompt tag to specify a link to a URI, either local or remote, from which the text to be played can be retrieved. The URI can be a pointer to a static text file, or to a CGI script or similar Web service which will dynamically generate the text that will be sent to the TTS engine (which could include XML or SSML markup).
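
To make the dynamic case concrete, here is a minimal sketch of a Web service that generates prompt text on request, written as a Python WSGI application. Everything in it is an assumption for illustration: the BALANCES data, the account query parameter, and the prompt_text_app name are hypothetical, and a real deployment would read from a database and authenticate the caller.

```python
from urllib.parse import parse_qs

# Hypothetical account data; a real service would query a database.
BALANCES = {"1001": "250.75"}


def prompt_text_app(environ, start_response):
    """WSGI application that returns plain text for a TTS engine to read."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    account = query.get("account", [""])[0]
    balance = BALANCES.get(account)
    if balance is None:
        body = "Sorry, that account could not be found."
    else:
        body = "Your current balance is {} dollars.".format(balance)
    data = body.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```

For local experiments, the application could be served with wsgiref.simple_server.make_server from the Python standard library. The URI of the running service is what the prompt would then point to.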

The <param> element can be used to set configuration parameters that are specific to the TTS engine on the Microsoft Speech Technologies platform.

Visual Studio .NET is Microsoft's graphical development environment for application development. VS.NET provides a graphical palette for building a call flow. The palette includes elements associated with the SALT prompt element, and provides a graphical interface for specifying static TTS prompts and client-side script functions that will dynamically generate text that will be sent to the TTS engine (with or without SSML mark-up). Actual SALT code is created behind the scenes to make it easier for all levels of developers to build speech-enabled applications.

Enhancement Tips include a variety of text-oriented alternatives that can make a big difference in overall quality for any speech application. A major difference for Web developers is that content must be presented in terms of how it is heard, not how it is seen. Remember, TTS systems don't understand the text they read. Because of this, TTS systems rely heavily on correctly spelled, well-punctuated, unambiguous input texts for clues as to how a text should be read.

When designing a TTS application, there are varying degrees of control a developer may have over the input text. If you have control over the input text, some simple guidelines can be applied to maximize the quality of the output speech. Predictably formatted text input (such as database entries) may benefit from the implementation of a custom text pre-processor that reformats the entries on the fly. If you have no control over the text input (for instance, instant messaging), you must rely on the system's standard pre-processing capability. Some examples include:

Context

I read the newspaper every day.

Is "read" the past tense version (rhymes with red), or the present tense version (rhymes with reed)? It's hard to say which is correct. However, it becomes obvious when preceded by:

I had nothing to do last week—it was wonderful.

Now the past tense is more appropriate. TTS systems almost always read sentences as single, context-independent items. This may require that developers reword some sentences to meet caller expectations.

Words that are spelled the same but have different pronunciations (like lives and read) are called homographs. Homographs are not always full words; abbreviations can behave as homographs too. For example, 'Corp' may be pronounced as 'Corporal' or 'Corporation'. Developers should always test textual strings where possible and reword if necessary.
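
One way to remove this kind of ambiguity before the text reaches the engine is an application-specific expansion table. The Python sketch below assumes the application knows its own domain. The two dictionaries and the expand_abbreviations function are hypothetical and are not part of any TTS product.

```python
import re

# Hypothetical, application-specific expansions. A business directory and a
# military roster resolve the same abbreviation differently.
BUSINESS_EXPANSIONS = {"Corp": "Corporation", "Inc": "Incorporated"}
MILITARY_EXPANSIONS = {"Corp": "Corporal"}


def expand_abbreviations(text, expansions):
    """Replace each known abbreviation with its domain-specific expansion."""
    names = "|".join(map(re.escape, expansions))
    # The abbreviation's own period is dropped only when more text follows,
    # so a period that ends the whole text is kept. Limitation: a sentence
    # break in the middle of the text, right after an abbreviation, is lost.
    pattern = re.compile(r"\b(" + names + r")\b(?:\.(?=\s*\S))?")
    return pattern.sub(lambda m: expansions[m.group(1)], text)


print(expand_abbreviations("Acme Corp. reported earnings.", BUSINESS_EXPANSIONS))
# prints: Acme Corporation reported earnings.
print(expand_abbreviations("Corp. Smith reported for duty.", MILITARY_EXPANSIONS))
# prints: Corporal Smith reported for duty.
```

Where the engine offers its own custom dictionary, that is usually the better place for such entries. A pre-processor like this is mainly useful when the correct reading depends on where the text came from.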

Text Formatting

Punctuation, spelling and use of case all have an impact on the quality of TTS output. Commas to indicate pauses are one of the most noticeable textual elements when synthesizing speech. Without commas, TTS sounds too fast and unnatural. The lack of periods or question marks also negatively impacts quality, creating run-on sentences that become increasingly painful. People like to hear words grouped together into meaningful phrases, rather than in long unbroken strings.

Also, the clarity of visually formatted text is not preserved in TTS. Avoid using fonts, bulleted lists, and colors with the intention of affecting the output speech, or implement a parser to automatically reformat the text for synthesis.
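
A parser of this kind can be quite small. The following Python sketch turns the items of a visually formatted list into punctuated sentences so that the engine pauses between them. The BULLET pattern and the list_to_speech function are hypothetical, and only a few common bullet styles are recognized.

```python
import re

# Leading bullet characters or list numbering such as "1." or "2)".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def list_to_speech(lines):
    """Turn visually formatted list items into punctuated sentences."""
    sentences = []
    for line in lines:
        item = BULLET.sub("", line).strip()
        if not item:
            continue  # skip blank lines
        if item[-1] not in ".?!":
            item += "."  # a final period gives the engine a phrase break
        sentences.append(item)
    return " ".join(sentences)


menu = ["• Check your balance", "• Transfer funds", "  - Pay a bill."]
print(list_to_speech(menu))
# prints: Check your balance. Transfer funds. Pay a bill.
```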

Spelling

Proper spelling may seem like an obvious point, but TTS systems don't possess spell-checkers; they read precisely what you enter. If "weather" is misspelled in the sentence "The weather is expected to be cloudy this morning", a human reader can still tell which word is meant, but a TTS system will read the misspelling as something like "wet her" or "wheat her".
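
Because the engine will not catch such errors, it can help to screen text before synthesis. The Python sketch below flags words missing from a known vocabulary and suggests the closest known word using difflib from the standard library. The VOCABULARY set, the find_suspect_words function, and the misspelling "wheather" are hypothetical examples; a real application would use a full word list or a proper spell-checker.

```python
import difflib
import re

# Hypothetical vocabulary; a real application would load a full word list.
VOCABULARY = {"the", "weather", "is", "expected", "to", "be",
              "cloudy", "this", "morning"}


def find_suspect_words(text, vocabulary):
    """Map each unknown word to a list holding its closest known word."""
    candidates = sorted(vocabulary)  # sorted for a deterministic search order
    suspects = {}
    for word in re.findall(r"[A-Za-z']+", text):
        lower = word.lower()
        if lower not in vocabulary:
            suspects[word] = difflib.get_close_matches(lower, candidates, n=1)
    return suspects


print(find_suspect_words("The wheather is expected to be cloudy.", VOCABULARY))
# prints: {'wheather': ['weather']}
```

Whether to correct a flagged word automatically or route it to a person for review is a design decision that depends on how costly a wrong correction would be.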

Use of Case

Speechify TTS uses case to correctly determine output. For example:

Acronyms—He works for NASA.

Initialisms—She studies at MIT.

Spellings—It's spelled S-C-Y-T-H-E.

TTS systems, however, do not apply:

Emphasis—Don't do it THAT way.

Emotions—DROP EVERYTHING AND READ THIS MESSAGE!!!!!!
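
Since capital letters carry meaning for the engine but not emphasis or emotion, an application that accepts free-form text may want to tone down "shouted" input first. The Python sketch below lowercases runs of capitals unless they appear in a list of known acronyms and initialisms. It assumes that unknown all-caps words might otherwise be misread, which depends on the engine. KNOWN_UPPERCASE and soften_shouting are hypothetical names.

```python
import re

# Hypothetical list of acronyms and initialisms that must keep their case.
KNOWN_UPPERCASE = frozenset({"NASA", "MIT"})


def soften_shouting(text, keep=KNOWN_UPPERCASE):
    """Lowercase shouted words and collapse repeated ! or ? marks."""
    def fix_word(match):
        word = match.group(0)
        return word if word in keep else word.lower()

    # Only runs of two or more capitals, so "I" and S-C-Y-T-H-E are untouched.
    lowered = re.sub(r"\b[A-Z]{2,}\b", fix_word, text)
    collapsed = re.sub(r"([!?])\1+", r"\1", lowered)
    # Restore the capital at the start of the text only.
    return collapsed[:1].upper() + collapsed[1:]


print(soften_shouting("DROP EVERYTHING AND READ THIS MESSAGE!!!!!!"))
# prints: Drop everything and read this message!
```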

Conclusion

Recent leaps in the quality of TTS output, as embodied by the Speechify TTS engine included on the Microsoft Speech Technologies platform, open exciting new possibilities for developers considering the deployment of speech applications. TTS has become an essential element of virtually every speech application that requires any degree of dynamic data. TTS can be cleverly employed to repurpose data that's already in use to reach a new audience using speech. Tools available within SALT code and Visual Studio .NET make it easy for developers to identify a text resource, stream text at the appropriate point in an application, and seamlessly interweave TTS with pre-recorded audio.

Use of extremely natural-sounding TTS combined with effective coding and text preparation will deliver an exceptional caller experience that can include elements of personalization while at the same time supporting a highly efficient deployment model. The Microsoft Speech Technologies model gives developers an easy-to-use and familiar platform for introducing speech to their organization and customer base.