Talk Back

Voice Response Workflows with Speech Server 2007

Michael Dunn

This article discusses:
  • Voice response application basics
  • Creating a voice response workflow
  • Prompts, keywords, and grammars
  • Handling user responses
This article uses the following technologies:
Speech Server 2007, .NET Framework

Code download available at: SpeechServerAndWF2008_04.exe (188 KB)

Contents

Voice App Basics
Voice Response Workflow
Building Prompts
Prerecorded Prompts
Keyword and Conversational Grammars
Handling Responses
Debugging and Testing

Embedded presence, instant messaging (IM), audio and video conferencing, and telephony are among the unified communications features offered by Microsoft® Office Communications Server (OCS) 2007. Developers can build on an array of OCS APIs to include these and other features in their own applications. But OCS 2007 adds one new developer-centric feature you may not have heard about yet—interactive voice response (IVR) workflows based on the Microsoft Speech Server platform.

Not sure what an IVR is? If you've ever called a company and heard a message like "press 1 for sales, press 2 for customer service," then you've encountered an IVR system.

Voice App Basics

Microsoft Speech Server 2004 only supported IVR development via the Speech Application Language Tags (SALT) API. The new version, OCS 2007 Speech Server (Speech Server 2007 for short), not only supports both SALT and VoiceXML but introduces a Microsoft .NET Framework API for creating IVR applications. Speech Server 2007 also includes a visual IVR application designer based on Windows® Workflow Foundation (Windows WF) called the Voice Response Workflow Designer.

Current IVR development standards such as SALT and VoiceXML rely on a Web-based development model built around XML tags and JavaScript. Both standards focus on delivering the user interface of an IVR application. However, any non-trivial IVR application still needs to perform tasks such as accessing data from a database. For those tasks, server-side code, typically written in a .NET-compliant language or Java, is still required. And that somewhat diminishes the main benefit SALT and VoiceXML are intended to provide: cross-platform compatibility for IVR applications.

From a developer perspective, using the .NET Framework as the basis for your IVR application development has significant benefits over SALT or VoiceXML, the main advantage being the ability to use an object-oriented programming language instead of XML tags. From a business perspective, a .NET Framework-based IVR lets a business capitalize on investments it has already made in Microsoft technologies by reusing existing business and data-access logic.

Another benefit is finding key resources. Try asking a hiring manager or recruiter to find you a developer using VoiceXML and a developer using .NET. Which one do you think they'll be able to find first? There are just more developers around who use .NET.

Voice Response Workflow

Much like an ASP.NET page, the logic of a Voice Response Workflow consists of a partial class in two files, one containing the Visual Studio®-generated code and the other containing your code (called the code-beside). Voice Response Workflows use Speech Dialog activities as well as standard Windows WF activities. The Visual Studio toolbox contains both activity types.
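
To make that structure concrete, here is a minimal sketch of the code-beside half of the partial class. The workflow and activity names are hypothetical, and the base class, activity declarations, and InitializeComponent call all live in the Visual Studio-generated half, so treat this as an illustration of the split rather than template output.

// Code-beside half of the workflow's partial class. The designer-generated
// half declares greetingStatement and the rest of the activities.
public partial class MainIVRWorkflow
{
    private void SetGreetingPrompt()
    {
        // Prompt APIs such as MainPrompt are covered in the next section.
        this.greetingStatement.MainPrompt.SetText("Thank you for calling.");
    }
}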

The first step in developing an IVR application is to lay out the call flow. In Visual Studio you do that by dragging the Speech Workflow activities onto the Speech Workflow canvas and arranging them in the order you want the activities to execute, as shown in Figure 1.

Figure 1 Voice Response Workflow Designer


For an IVR application that responds to inbound calls, the first activity is typically AnswerCall and the last activity is typically DisconnectCall, so these activities are added to the Speech Workflow automatically upon creating a new Voice Response Workflow. Speech activities execute in order, from top to bottom, by default. Although Voice Response Workflow applications are based on Windows WF, they only support sequential workflows, not state machine workflows.

The two most commonly used activities are Statement and QuestionAnswer. Using just these two activities you can develop a fairly simple IVR application with minimal effort. Both activities speak to the user using either Text-to-Speech (TTS) or prerecorded prompts. The QuestionAnswer activity differs from the Statement activity in that, while the Statement activity only speaks to the user, the QuestionAnswer activity also lets the user give a response, either through speech recognition or touch-tone (Dual Tone Multi-Frequency, or DTMF) key presses.

Building Prompts

Each activity that provides information to the user through either TTS or prerecorded prompts has a property named MainPrompt, which is of type PromptBuilder. MainPrompt defines what is spoken to the user when the activity is first executed. The PromptBuilder type provides multiple methods for fine-tuning how a prompt will sound. The three most commonly used methods are SetText, ClearContent, and AppendText. Here is an example of using SetText to create the message "Thank you for calling.":

this.statementActivity.MainPrompt.SetText("Thank you for calling.");

These two lines use ClearContent and AppendText to send the same "Thank you for calling." message:

this.statementActivity.MainPrompt.ClearContent();
this.statementActivity.MainPrompt.AppendText("Thank you for calling.");

Both of these snippets accomplish the same objective, so you can use either one. My general rule is, if I have a line containing ClearContent immediately followed by an AppendText method, I'll replace it with the SetText method for the simple reason that one line of code is easier to maintain than two.
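
The ClearContent and AppendText pair earns its keep when a prompt is assembled from several pieces. The following sketch (the activity and variable names are hypothetical) builds one prompt from multiple fragments, something a single SetText call can't do as cleanly:

// Build a prompt from several fragments; the data would normally come from
// your business or data-access layer.
string callerName = "Chris";
int openRequests = 2;

this.statusStatement.MainPrompt.ClearContent();
this.statusStatement.MainPrompt.AppendText("Hello " + callerName + ". ");
this.statusStatement.MainPrompt.AppendText(
    "You currently have " + openRequests + " open service requests.");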

In a Statement activity (see Figure 2), a standard set of events fires each time the prompt is rendered. Setting a static prompt in any of these events causes your app to set the same prompt value every time the prompt is rendered.

Figure 2 Statement Activity Order of Event Execution


It's more efficient to set static prompts before the activity is initialized, such as in the Initialize event of the workflow instead of the activity. If the prompt is dependent on a dynamic value of some sort, it's still more efficient to set the value before the activity is initialized. In fact, the most efficient method for setting static prompts is to use the Visual Studio tools, which put the prompt text in a resource file that is loaded only once when the constructor of the workflow calls the InitializeComponent method.

As for dynamic prompts, the story starts to change when you look at activities that not only speak to the user but also listen for user input. A good example is the QuestionAnswer activity. While QuestionAnswer follows the same order of execution, it also accepts user input. This means that if the user says something the system doesn't recognize, the TurnStarting event fires again. Setting a dynamic prompt there is inefficient: you set the same prompt again and repeat whatever work produced it, such as a database lookup. If the dynamic prompt must be set at the time of activity execution, set it in the Executing event rather than the TurnStarting event. The Executing event fires only once per session, while the TurnStarting event fires every time the QuestionAnswer activity is rendered.
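
As a rough sketch of that guidance (the activity name, handler names, and simplified event-args signature are hypothetical; the designer generates the exact delegate types when you wire the events), the expensive lookup belongs in an Executing handler:

private void balanceQuestion_Executing(object sender, EventArgs e)
{
    // Executing fires once, so the database lookup happens only once per call.
    decimal balance = LookUpBalance();   // placeholder for your data-access logic
    this.balanceQuestion.MainPrompt.SetText(
        "Your current balance is " + balance.ToString("C") +
        ". Is there anything else I can help you with?");
}

private decimal LookUpBalance()
{
    // Hypothetical stub; a real application would query its database here.
    return 1234.56m;
}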

The text that is actually spoken depends on how many times the activity has been played to the user and what kind of response the user gave. You can define what is spoken at a given point by assigning the text you want spoken to the corresponding prompt type.

The QuestionAnswer activity has four more prompt types than a Statement activity: Silence, Escalated Silence, No Recognition, and Escalated No Recognition. While MainPrompt must be set, these prompts are optional. The Silence prompt is spoken when the user doesn't give any response within the time specified in the InitialTimeout property. If the user still doesn't respond, the Escalated Silence prompt plays. The No Recognition and Escalated No Recognition prompts follow the same rules as the Silence and Escalated Silence prompts, but are used when the user gives a response the application doesn't recognize.
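
As an illustration, setting these prompts in code might look like the following sketch. Only MainPrompt is confirmed above; the other property names (SilencePrompt, NoRecognitionPrompt, and so on) are assumed by analogy with the prompt-type names, so verify them against the Speech Server SDK before relying on them.

// Assumed property names, mirroring the prompt types described above.
this.mainMenuQuestion.MainPrompt.SetText(
    "Do you want Account Information, Loan Inquiry, or to speak with a representative?");
this.mainMenuQuestion.SilencePrompt.SetText(
    "I didn't hear anything. You can say Account Information, Loan Inquiry, or Representative.");
this.mainMenuQuestion.NoRecognitionPrompt.SetText(
    "Sorry, I didn't understand. You can say Account Information, Loan Inquiry, or Representative.");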

Dynamic prompts create some challenges when converting text into spoken words, and two of the most common involve dates and times. Typical data sources store dates in a readable format such as 01/01/2008. When spoken, however, you want to hear "January first two thousand eight." The solution for translating this commonly understood written format into a commonly understood spoken format is the AppendTextWithHint method. This method accepts the same string parameter as AppendText does, but it also takes a SayAs parameter. SayAs is a simple enum containing the different options for rendering readable text as spoken text. To convert a date, the code looks like the following:

this.statementActivity.MainPrompt.AppendTextWithHint(
    "01/01/2008", Microsoft.SpeechServer.Synthesis.SayAs.Date);

Prerecorded Prompts

By default, your Voice Response Workflow will use the TTS engine's voice to speak the text defined in your prompts. Using the TTS engine, however, requires a lot of processing power, and your application can end up sounding like every other system that uses TTS.

Your other option is to use prerecorded prompts. While you can code your prompts to play a WAV file instead of assigning text, it's better to simply add a prompt database project to the solution. A prompt database enables you to easily create, edit, and manage the WAV files used in your application, as well as store them in a compressed format that allows the prompts to be loaded faster. If you use a prompt database, you won't have to make any coding changes to your application's prompts.

When you import or create WAV files in the prompt database, you can assign a transcription to each WAV file. When the application encounters text to be spoken, it looks in the prompt database for a match on the transcription field. For example, if the application needs to speak "Thank you for calling." and there is a transcription of "Thank you for calling." in the prompt database, it will play the associated WAV file instead of rendering the text using TTS.

The process of creating and importing these WAV files can be tedious if your application has numerous prompts. Let's consider the following prompts:

  • "I'll transfer you to a representative."
  • "I'll transfer you to a loan officer."
  • "A representative will be with your shortly."
  • "A loan officer will be with you shortly."

Instead of creating four different prerecorded prompts you could use an extraction technique. Extraction allows you to combine all or part of a phrase from one WAV file with all or part of a phrase from another. Extraction doesn't happen automatically; you must specify in the transcription field which part of the transcription can use extraction. So instead of having four different WAV files in the prompt database you can simply have two WAV files with extraction brackets, as shown in Figure 3.

Figure 3 Prompts with Extraction Annotated


When the application encounters the phrase "I'll transfer you to a," it will automatically combine that specific audio portion with another portion of audio. It will even mix prerecorded prompts with TTS if part of the encountered prompt isn't in the prompt database. So if you later add another prompt that says, "I'll transfer you to a mortgage specialist," the phrase "I'll transfer you to a" will use the WAV file, while the "mortgage specialist" phrase can be rendered using the TTS engine.

Keyword and Conversational Grammars

Now that your application can talk, you'll want it to be able to accept responses from the user either through speech recognition or DTMF. Speech Server 2007 does not automatically recognize everything the user says, so you must specify acceptable user responses. These are defined in your application's grammar. In a speech recognition app, there are two types of grammars: keyword and conversational. Keyword grammars are based on the use of specific words in the response. Keyword grammars are good for asking the user a very direct question, such as, "Do you want Account Information, Loan Inquiry, or to speak with a representative?"

If the user says anything other than the list of acceptable answers, even if the user's response is very similar to what the application is looking for, the application would probably respond with something like, "I didn't understand what you said. Please say ..."

Conversational grammars, on the other hand, try to address this problem by taking some of the responsibility off the user and putting it back on the application to figure out what the user wants. Instead of asking a direct question, you ask the user an open-ended question: "How can I help you?"

This is a more natural way of asking a question, and you will probably get a more natural and complex answer back from the user. The Conversational Grammar Builder allows you to build a conversational grammar easily, taking much of the effort out of working with a statistical language model yourself.

The Conversational Grammar Builder, shown in Figure 4, is divided into two sections: Keywords and Answers. In the Answers section, you have three options: Concept Answer, Keyword Answer, and Command. Concept Answers differ from Keyword Answers in that Keyword Answers require the caller to say the exact phrase to trigger recognition, while Concept Answers allow the user to give a response similar to one of the Answer phrases you have predefined.

Figure 4 Conversational Grammar Builder


You'll want to add one Concept Answer for the main menu prompt for this example (see Figure 4). The Concept Answer typically represents one question you are asking the user to give a response to, in this case the main menu question, "How can I help you?"

Next, you'll need to add one Concept to the Concept Answer for each of the possible choices the application will allow for this prompt. For the Main Menu QuestionAnswer activity created earlier, I would have one Concept Answer named MainMenu and three concepts named AccountInformation, LoanInquiry, and Representative.

Now consider this phrase: "I'd like to pay my loan." If this phrase was evaluated simply on keywords, it might be recognized as an inquiry for a new loan, when really the user is trying to make a payment on their account, which probably resides as a secondary menu choice under AccountInformation.

The Answer Examples pane is where you can add phrases that a user is likely to say. You'll need an example phrase for each of the defined concepts. In the Representative node, you might have a list of the following phrases:

  • "I need to talk to a person."
  • "Can I speak with someone?"
  • "I'd like to talk with a representative."

Next, you need to consider the use of keywords, not for recognition, but to expand the Answer Examples phrases that you've entered. Notice that the words person, someone, and representative all refer to the same thing: a representative. What if the user says "I need to talk to a representative"? Instead of creating three different sentences for each Answer Example, you can create keywords so the sentences are more dynamic. The use of keywords is not necessary when using a Concept node in your conversational grammar, but it will save you time when developing the answer phrases. Figure 4 shows how three phrases have been consolidated into one by using keywords.

In the Keywords area of the Conversational Grammar Builder, you'll need a keyword container that corresponds to the Representative node under Answers. In this case, I called it RepresentativeKeys. Under this node, you might want to have a keyword for Person, and in the Keyword Phrase area you add the words or phrases that fall under that keyword. For example, the Person keyword might list the words a user might use to describe a representative, including person, representative, and someone.

The goal of the keyword is not to distinguish between these words, but to collect the words a user might use to refer to a representative. If the application needed to distinguish between the words person and representative, each would get its own keyword container node under Keywords.

Now on the Representative concept node you can reference the RepresentativeKeys keyword node and then click the Parse button. This finds the keywords in the concept phrases and replaces them with bracketed placeholders, indicating that the word isn't static but a variable belonging to the referenced keyword. A green checkmark indicates that a keyword is being used in a phrase, while a red "x" indicates that no keywords are being used. For concept-type grammars, keywords are not required, so it's acceptable, and quite likely, that some phrases won't reference any keywords.

You can reference the completed grammar in a QuestionAnswer activity or any other activity that accepts user input. This can be done through the Visual Studio environment or from your code-beside. As with prompts, if you set the grammar in code rather than through Visual Studio, keep the order of event execution in mind; you don't need to set the grammar repeatedly.
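
If you do attach a grammar from the code-beside, a minimal sketch might look like the code below. The Grammar type and constructor shown are assumptions about the Microsoft.SpeechServer.Recognition API, and the grammar file and rule names are hypothetical, so check the SDK reference for the exact signatures; the point is simply that the grammar should be attached once (for example, during initialization), not on every turn.

// Assumed API shape: a Grammars collection on the activity and a Grammar
// class that loads a grammar file by URI and rule name.
this.mainMenuQuestion.Grammars.Add(
    new Microsoft.SpeechServer.Recognition.Grammar(
        new Uri("MainMenu.grxml", UriKind.Relative), "MainMenuRule"));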

Handling Responses

Now that your application can talk and listen, you need to get the results of what your user said. To get these results via your code, you employ the RecognitionResult property of the QuestionAnswer activity. You can retrieve the exact phrase the user said via the Text property:

string resultText = this.questionAnswerActivity.RecognitionResult.Text;

If the user were to respond, "I need to talk to a person," the Text property of the recognition results would contain "I need to talk to a person." How does your app know that the phrase means that they want to speak with a representative? This is where Semantic values come into play. Your app doesn't care whether the user said, "I need to talk to a person" or "Can I speak with someone?" It only needs to know that they want to speak with a representative.

Instead of the Text property, you can use the Semantics collection of the recognition result and provide a key value. In this case, the Concept Answer name, MainMenu, is the key:

string mainMenuChoice = this.questionAnswerActivity.RecognitionResult
    .Semantics["MainMenu"].Value.ToString();

This code would return the name of the Concept. In the case of the user saying, "Can I speak with someone?," it would return "Representative."

You have retrieved the results, but now you need to do something with them. Typically in voice response applications you'll want to create branches representing each of the possible choices. If you referenced the grammar via Visual Studio, you can right-click the QuestionAnswer activity and choose Generate Branching. This automatically creates each of the branches along with the criteria for each one. Alternatively, you can create each branch yourself and set the criteria that determine when the user enters the branch based on the recognition results. For example, from the MainMenu you might have a Representative branch, which would transfer the user to a representative upon successful recognition.
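
If you write the branch criteria yourself, one option in Windows WF is a code condition on the branch. Here is a minimal sketch; the handler and activity names are hypothetical, and it assumes the RecognitionResult and Semantics access shown above, while ConditionalEventArgs is the standard Windows WF type from System.Workflow.Activities.

// Code condition for a hypothetical Representative branch: it evaluates to true
// when the MainMenu semantic value indicates the caller wants to reach a person.
private void RepresentativeBranch_Condition(object sender, ConditionalEventArgs e)
{
    string choice = this.mainMenuQuestion.RecognitionResult
        .Semantics["MainMenu"].Value.ToString();

    e.Result = (choice == "Representative");
}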

Debugging and Testing

Voice Response Workflow applications give you the same debugging experience you are already accustomed to. Unlike most other applications, however, you need to be able to hear and speak to an IVR application. Speech Server adds a Voice Response Debugging Window into Visual Studio, as shown in Figure 5.

Figure 5 Debugging Window


This debugging phone lets you test your app without actually deploying it or even having any telephone lines connected. You can test your grammars using a microphone or simply enter the text you want the app to test its grammars against. While it seems like a simple feature, most other IVR platforms overlook the ability to test and debug without deploying. To learn more about Speech Server 2007 and download the tools, see the Microsoft Speech Technologies site at microsoft.com/uc/products/speechserver.mspx.

Michael Dunn is a Senior Consultant at Microsoft. He is also the author of the book Pro: Microsoft Speech Server 2007. You can contact him via his blog at blogs.msdn.com/midunn.