Speech Enabled ASP.NET Commerce Starter Kit: Design and Implementation

 

Vertigo Software, Inc.

Updated August 2004

Applies to:
   Microsoft Visual Studio .NET
   Microsoft ASP.NET
   Microsoft .NET Speech SDK

Summary: This article describes the design and architectural decisions for the Speech-Enabled Commerce sample application. The document also includes a detailed review and explanation of the code. (34 printed pages)

Download the Speech-enabled ASP.NET Commerce Starter Kit.

Contents

Purpose
Using the Web-Based Application as a Development Blueprint
Designing the Application
How It Works
Prompt Databases
Running the Application
Lessons Learned
For More Information

Purpose

What is the Speech Enabled ASP.NET Commerce Starter Kit?

The Speech Enabled ASP.NET Commerce Starter Kit application (referred to as "CommerceVoice" for the rest of this article) is a voice-only version of the "IBuySpy" Commerce Sample in the ASP.NET Starter Kit. The application is built using the Microsoft Speech Application SDK.

Goals

  • Show how to create a voice-only service from an existing Web application: CommerceVoice leverages the existing business and data layers of the ASP.NET Starter Kit Commerce Sample it is based on, making only minor modifications. To this end, the Web-based version is included with this sample to illustrate how the two presentation-layers work together simultaneously on the same data. It is possible to place an order in the voice-only application and see that order immediately reflected in the Web version.
  • Demonstrate best-practice programming and design techniques for using the Speech Application SDK: The Speech SDK provides a rich base of tools for developing speech applications. These tools also allow the programmer a great deal of flexibility in making design decisions. The developers of this sample have made it a priority to show a consistent set of best practices for developing voice-only applications.

Key Features

  • Ordering products quickly and easily by product number.
  • Browsing and learning about products through a voice-based catalog.
  • Shopping cart functionality, including the ability to add, review, modify, and remove products.
  • Reviewing Previous Orders, including totals, dates, and products ordered.
  • Account login security using Microsoft Windows Authentication.
  • Leveraging pre-existing business layer and data layer code.

This article discusses the CommerceVoice application in depth, and provides insight from the perspective of the creators on the process of building voice-only applications in general. It includes lessons learned from the testing, design, and development stages, as well as thoughts about the differences between building visual applications for the Web and speech applications for telephony.

Using the Web-Based Application as a Development Blueprint

CommerceVoice shares its business and data layers with the Web-based ASP.NET Starter Kit sample. More information on that application can be found here.

The sample is a fictional commerce Website for selling spy-related products called "IBuySpy." The purpose of the Web version is to showcase best-practice techniques for building a Web-based application in ASP.NET.

Code-Reuse

In essence, a voice-only version of an existing application is a new presentation layer: the user interface becomes auditory rather than graphical. This means that the business logic and the data layer should remain essentially unchanged.

With a few exceptions, we have followed this concept as a development guideline. In the sample Microsoft Visual Studio .NET solution, note that the CommerceVoice project file includes a reference to the Components folder in CommerceWeb (the included Web-based version). Since the two applications share this same code, orders that occur in one interface are immediately reflected in the other.

Figure 1. Code reuse

During the course of application development, we found it necessary to make minor additions to the business and data layers. These changes are outlined in the Lessons Learned section at the end of this document.

Suggested Enhancements

While the CommerceVoice sample application provides the implementation for the core features originally implemented in the Commerce Starter Kit, the following are ideas for extending the functionality of the CommerceVoice application:

  • Add a Search Feature: Use the product names from the database to construct a grammar that allows users to find a product quickly. The grammar might be a long list of items, or more of a broad tree of items.
  • Add the 'Most Popular Item List' Feature: Use the built-in capabilities of the Commerce Web components to prompt the user with the most popular products that week. Determine where and how to prompt the user without interfering with the overall usability of the application.
  • Support New User Registration: In the current CommerceVoice implementation the user must already have an account with IBuySpy.com in order to use the voice-only application. Extend the CommerceVoice application to allow the user to create an account. The challenge here would be how to handle the entry of the user's personal information (name, username, password, PIN, and so on). One solution would be to limit the amount of information required to setup an account.
  • Add Product Reviews: Add a way for users to review products using CommerceVoice. One way to implement this is to allow the user to record a WAV file that would be attached to a product.

Designing the Application

We designed the system for a group of target users ranging from novices with little or no experience using voice-only systems to technophiles with a lot of experience. With this in mind, we tried to include advanced features to enable an experienced user to navigate the system quickly, while keeping the system simple and well-explained enough so that a novice would not feel lost.

Target User and Voice Personality

For the personality of the speaking voice, we had two goals:

  • Speed: The recorded sentences should be spoken at a pace that a new user can easily understand, with sufficient time to commit several commands to memory. An appropriate speaking pace helps usability by striking a balance between speaking so fast that users miss options and speaking so slowly that they begin to lose attention.
  • Mood: The system's voice should be friendly and patient, and may use a bit of accentuation. Any voice-based system should make users feel good about using it, both for the sake of usability and to give them a good experience with the company.

Designing a voice-only system is much different from designing traditional GUI-based applications. Whereas a Web page is a two-dimensional interface, the voice medium is one-dimensional. For example, a table of data on a Web page needs to be read item by item over the phone. As one designer put it, the challenge becomes, "How do you chop things up to establish a coherent flow of information? How do you express content in a way that the user can digest, understand, and then act upon?"

Start with a User-Centered Design Approach

We started our design process by following our standard methodology of user-centered design. The 80/20 rule is a good guide: 80% of the users use 20% of the application. We focused on ideal scenarios and common user paths rather than considering exceptional cases in the preliminary stages. We acted out sample dialogues that helped us get a better sense of how a typical conversation might go.

From these sample dialogues, we began creating flow charts for each major component of the system. The following diagram illustrates the high level flow of the application:

Figure 2. Navigation flow diagram

In addition to the flow diagram above, several global commands are available to the user throughout the application:

  • Main Menu: Returns the user to the main menu.
  • Help: Provides the user with context-sensitive help text at any prompt.
  • Instructions: Provides instructions on the basic usage of the system and global commands available to them at any point.
  • Repeat: Repeats the most relevant last prompt. If the last prompt informed the user that his/her input was invalid, the repeat text will provide users with the previous question prompt instead of repeating the error message.
  • Representative: Transfers the user to a customer service representative.
  • Goodbye: Ends the call.

Special Case: Implicit Confirmation

One of the more interesting navigational scenarios in the Commerce Application occurs when the user enters a product ID after saying "Start Shopping" from the main menu. We wanted to take advantage of the Speech Application SDK's "implicit confirmation" feature here: if the product ID is recognized with high confidence and the recognized ID exists in the system, we want to bypass the explicit confirmation of that prompt. A typical scenario might look like this (the complete flow diagram is represented on the right):

  • System: If you know the three-digit product number of the item you want, say it now. If not, say browse.
  • User: 3 5 5 (Mumbled, recognized with low confidence).
  • System: I understood, 3, 5, 5, Rain Racer 2000. Is this correct?
  • User: No, I said 3 5 9 (Clearer, recognized with high confidence).
  • System: You selected product 3 5 9, Escape Vehicle (Water). How many would you like?

Figure 3. Implicit confirmation flow diagram

This scenario combines three of the Speech Application SDK's user input types: answers, extra answers, and confirmations. Together they make complicated flow control situations like this one possible.
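The decision at the heart of this dialogue can be sketched in a few lines of plain script. This is only an illustration of the logic, not the sample's code: in the real application the Speech Application SDK drives confirmation through the QA control's settings. The names nextPromptForProductId, CONFIRM_THRESHOLD, and catalog, the 0.8 value, and the "could not find" wording are all assumptions made for the example.

```javascript
// Hypothetical confidence level at or above which explicit confirmation is skipped.
var CONFIRM_THRESHOLD = 0.8;

// Hypothetical stand-in for the product lookup in the shared data layer.
var catalog = {
   "355": "Rain Racer 2000",
   "359": "Escape Vehicle (Water)"
};

// Returns the next prompt for a recognized product ID and its confidence.
function nextPromptForProductId(productId, confidence)
{
   var digits = productId.split("");
   var known = Object.prototype.hasOwnProperty.call(catalog, productId);

   if (!known)
   {
      // Placeholder wording; the sample's actual prompt is not shown in this article.
      return "Sorry, I could not find product " + digits.join(" ") + ".";
   }

   var name = catalog[productId];
   if (confidence >= CONFIRM_THRESHOLD)
   {
      // Implicit confirmation: echo the selection and move straight on.
      return "You selected product " + digits.join(" ") + ", " + name +
         ". How many would you like?";
   }

   // Low confidence: ask the user to confirm explicitly.
   return "I understood, " + digits.join(", ") + ", " + name + ". Is this correct?";
}
```

Calling nextPromptForProductId("355", 0.4) yields the explicit confirmation question from the dialogue above, while nextPromptForProductId("359", 0.9) goes directly to the quantity prompt.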

Prompt Design

The design team found creation of a prompt specification document to be a challenge in itself. The number of paths available to the user at any one prompt leads to a complicated flowchart diagram that, while technically accurate, loses a sense of the conversation flow that the designers had worked to achieve. The design team arrived at a compromise specification that allowed them to illustrate an ideal scenario while also handling exceptions. The following example illustrates the beginning of the "Start Shopping" scenario from the main menu:

Table 1. Prompt: Main Menu

Expected User Input: "Start Shopping"

Recognition | System Response
Recognized Expected Input | Remember, you can start over by saying main menu. If you know the three-digit product number of the item you want, say it now. If not, say browse.
Recognized Alternate Input: "Help" | You have reached the IBuySpy store. Our store is pretty simple. If you want to shop, say start shopping. To review your previous orders say review previous orders.

Table 2. Prompt: Start Shopping

Expected User Input: "3 5 5"

Recognition | System Response
Recognized Expected Input | You selected product 3 5 5, Rain Racer 2000. How many would you like?
Recognized Alternate Input: "Help" | You can place orders quickly by saying the three-digit product number. Say each digit with a clear pause between each number or enter it on your touch-tone phone. If you don't know the Product Number, say Browse.

This format of specifying functionality makes it very easy to conduct "Wizard-of-Oz" style testing. In this scenario, the test subject calls a tester who has the functional documents in front of him/her. The tester acts as the system, prompting the test subject as the system would and responding to their input likewise. Trouble spots are easily identified and fixed using this style of testing.

How It Works

The following section is devoted to the architecture of the system. We start with an explanation of common user controls and common script files. Then we will go into detail on the browse feature, which provides a good encapsulation of many of the programming techniques used throughout the application. Finally, we'll review some of the coding conventions and practices we used as best-practice techniques for development.

Common Files: User Controls

Two ASP.NET user controls are included on almost every page in our application. Together they encapsulate much of the functionality of the site, and each deserves discussion. As in Web applications, user controls in the Speech Application SDK can provide a consistent user experience while saving a great deal of code.

GlobalSpeechElements.ascx

The GlobalSpeechElements user control is required on every page of the application (except for Goodbye.aspx and RepresentativeXfer.aspx, which do little more than read a prompt and transfer the user away). It contains the main SpeechControlSettings control that defines common properties of the controls used throughout the application, as well as global command controls and common script files that provide client-side functional components.

  • MainSpeechSettings: The Speech Application SDK's SpeechControlSettings control is a powerful way of defining global application settings and assigning globally scoped functionality. In the Commerce sample we define the following styles:

    • BaseCommandSettings: This style is applied to all command controls. Its one attribute sets the AcceptCommandThreshold at .6, meaning that any command must be recognized with at least a 60% confidence rating to be accepted.

      <speech:SpeechControlSettingsItem ID="BaseCommandSettings">
         <Command AcceptCommandThreshold="0.6">
            <Grammar Lang="en-us"></Grammar>
         </Command>
      </speech:SpeechControlSettingsItem>
      
    • GlobalCommandSettings: This style is applied only to the six global commands contained in GlobalSpeechElements. It inherits the attributes of BaseCommandSettings and adds a dynamically set scope attribute. We want global commands to apply to all controls on any page they are included in, so we set the scope to be the parent page's ID at runtime.

      <speech:SpeechControlSettingsItem Settings="BaseCommandSettings"
                                        ID="GlobalCommandSettings">
         <Command Scope='<%# GetParentPageID() %>'></Command>
      </speech:SpeechControlSettingsItem>
      
    • BaseQASettings: This style is applied to all QA controls that accept user input (QA controls which do not accept user input are called "Statements" and use the StatementQASettings style below). In addition to setting timeout and confidence thresholds, this style also defines the OnClientActive event handler for all QA controls. HandleNoRecoAndSilence is a JScript event handler that monitors a user's unsuccessful attempts to say a valid response and transfers the user to customer service after enough unsuccessful attempts. It is described in the section on common script files below.

      <speech:SpeechControlSettingsItem ID="BaseQASettings">
         <QA OnClientActive="HandleNoRecoAndSilence">
            <Reco InitialTimeout="5000"></Reco>
            <Answers Reject="0.2"></Answers>
         </QA>
      </speech:SpeechControlSettingsItem>
      
    • StatementQASettings: For QA controls that do not accept user input, we want to disable BargeIn (the act of interrupting a prompt with a response before it ends) and turn on PlayOnce, which ensures the prompt is not repeated. Normal QA controls are activated when their semantic item is empty; since Statement QA controls have no semantic item, the prompt would be played over and over again if PlayOnce were turned off.

      <speech:SpeechControlSettingsItem ID="StatementQASettings">
         <QA PlayOnce="True">
            <Prompt BargeIn="False"></Prompt>
         </QA>
      </speech:SpeechControlSettingsItem>
      
    • NavStatementQASettings: DataTableNavigator controls within CommerceVoice are preceded by an initial statement QA. This QA gives a brief introduction to the DataTableNavigator content. Since the initial statement accepts no input, we immediately activate the DataTableNavigator after it completes. To do this, we set two timeouts: first, EndSilence indicates that the QA should wait only 100 milliseconds for a response. Second, BabbleTimeout limits how long recognition continues on any user input (the listing below sets it to 5000 milliseconds).

      <speech:SpeechControlSettingsItem ID="NavStatementQASettings">
         <QA XpathDenyConfirms=""
             XpathAcceptConfirms=""
             AllowCommands="False">
            <Reco EndSilence="100" BabbleTimeout="5000"></Reco>
         </QA>
      </speech:SpeechControlSettingsItem>
    
  • Global Commands: The global commands in GlobalSpeechElements (described earlier under Designing the Application) each have an associated command grammar file that defines how the command is activated.

    Figure 4. Global commands

    Commands fall into two categories: those that affect the current prompt (HelpCmd, InstructionsCmd, RepeatCmd), and those that trigger an event (RepresentativeCmd, GoodbyeCmd, and navigation commands like MainMenuCmd). For the former, the prompt function looks for a particular Type value in the History array parameter and generates an appropriate prompt. For the latter, the command's associated OnClientCommand event handler is executed.

    <speech:command id="RepresentativeCmd" 
                    xpathtrigger="/SML/RepresentativeCmd" 
                    Settings="GlobalCommandSettings"
                    type="Representative" 
                    runat="server" 
                    onclientcommand="OnRepresentativeCmd">
          <Prompt ID="RepresentativeCmd_Prompt"></Prompt>
          <Grammar Src="Grammars/GlobalCommands.grxml"
                   ID="RepresentativeCmd_Grammar1">
          </Grammar>
          <DtmfGrammar ID="RepresentativeCmd_DtmfGrammar"></DtmfGrammar>
    </speech:command>
    
  • Common Script File Includes: GlobalSpeechElements is an ideal place to include references to all global script files. These files constitute all global client-side event handlers and prompt generation/formatting routines for the application. Since they are included in the control, individual pages can rely on their availability without explicitly including them.

    <script type="text/jscript" src="debug.js"></script>
    <script type="text/jscript" src="speech.js"></script>
    <script type="text/jscript" src="PromptGenerator.js"></script>
    <script type="text/jscript" src="CommerceV.js"></script>
    

Use of DataTableNavigator Controls

We use the DataTableNavigator control to provide a dynamically generated list of items from which the user may browse or select. The DataTableNavigator application control provides most of the functionality we need automatically, including data binding, preset grammars, and item selection.

  • Initial Statements: An initial statement QA usually precedes the DataTableNavigator in the call flow. We use it to introduce the list. Our original intent was to include this statement as part of the DataTableNavigator first item prompt. We found that if a user mumbled during the initial statement, barge-in would stop playback and the first item was skipped. We separated the initial statement from the DataTableNavigator control to ensure that, even if users mumble during the initial statement, they will still hear the first item after the system recovers.

  • DataTableNavigator: The DataTableNavigator control takes care of the tasks associated with reading and navigating through the list of items associated with the control. It also handles item selection, either by saying "Select," or in some cases by saying the name of the item itself.

    Selection is enabled by setting the Access Mode property of the DataTableNavigator to Fetch. This mode allows the user to select an item and, when selected, the DataTableNavigator semantic item is filled with information about the selected item. In contrast, the Select mode is used to jump directly to an item in the list without leaving the DataTableNavigator control.

    In CommerceVoice, data binding occurs in the codebehind file for the page. We read data in from the data layer and set the following DataTableNavigator properties:

    • DataSource: Stores a DataTable object that is bound to the control when we call DataBind().
    • DataBindField: When the user selects an item, the DataTableNavigator semantic item is populated with the value of this field for the selected item.
    • DataTextField: If the user can say the name of an item to select it (rather than saying, "Select") DataTextField specifies which field provides this name.
    • DataHeaderFields: Specifies which fields will be used to read the item in the list. Since we use prompt functions, this field is not used explicitly.
    • DataContentFields: Specifies which fields will be used to read details of an item in the list. Since we use prompt functions, this field is not used explicitly.

After the DataTableNavigator completes, we determine the reason for completion in the OnClientComplete event handler. Typically, the event handler looks like this:

   if(siCategory.value == "Exit")
   {
      // The user said "Cancel." 
   }
   else if(siCategory.value != "")
   {
      // Get the value of DataBindField for the selected semantic item
      var selectedItemValue = siCategory.value;

      // Access other fields for the selected item by retrieving 
      // the selected item index (set upon item selection by the control)
      var selectedItemIndex = parseInt(siCategory.attributes["index"]);
   }

Common Files: Client-Side Scripting

The globally scoped client-side script files for the application are:

  • Speech.js: NoReco/Silence event handler and object accessors, and string manipulation routines
  • Debug.js: Client-side debugging utilities
  • CommerceV.js: Global Navigation Event Handlers
  • PromptGenerator.js: Prompt Generation Utility

A few of the more interesting functions of these scripts are outlined below:

HandleNoRecoAndSilence (Speech.js)

HandleNoRecoAndSilence takes care of handling cases where the user repeatedly responds to a prompt with silence or with an unrecognizable input. To avoid frustration, we don't want to repeat the same prompt over and over again. This function, executed each time a QA is made active, checks the command history for consecutive invalid inputs. If the number of invalid inputs exceeds a maximum (in this application, 3), we redirect the user to a Customer Service Representative.

This function is defined as the OnClientActive event handler for BaseQASettings in the MainSpeechSettings control of GlobalSpeechElements. Each QA that accepts user input must use this style in order for the function to be called correctly.

function HandleNoRecoAndSilence()
{
   var History = RunSpeech.ActiveQA.History;
   if (History.length >= representativeXferCount)
   {
      var command;
      for (var i=1; i <= representativeXferCount; i++)
      {
         command = GetHistoryItem(History,i);
         if (command != "Silence" && command != "NoReco")
            break;
      }
      if (i == representativeXferCount+1)
         Goto(representativeXferPage,"");
   }
}
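The listing above depends on GetHistoryItem, which is not shown in this article. The sketch below gives one plausible reading of it, together with a side-effect-free version of the same check that can be exercised outside the speech runtime. It assumes that the History array stores the most recent entry last and that GetHistoryItem(History, 1) returns the latest entry (consistent with how PromptGenerator.Generate uses it); ShouldTransfer is a hypothetical name introduced for the example.

```javascript
// Assumed helper: returns the n-th most recent history entry (1 = latest),
// or "" when the history holds fewer than n entries.
// Assumes the newest entry is stored at the end of the array.
function GetHistoryItem(History, n)
{
   var index = History.length - n;
   return (index >= 0) ? History[index] : "";
}

// Hypothetical pure version of the check in HandleNoRecoAndSilence:
// true when the last `limit` entries are all "Silence" or "NoReco".
function ShouldTransfer(History, limit)
{
   if (History.length < limit)
      return false;

   for (var i = 1; i <= limit; i++)
   {
      var command = GetHistoryItem(History, i);
      if (command != "Silence" && command != "NoReco")
         return false;   // a valid input interrupts the streak
   }
   return true;
}
```

With a limit of 3, a history ending in "NoReco", "Silence", "NoReco" triggers the transfer, while a "Help" anywhere in the last three entries does not.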

DataTableNavigator Functions (Speech.js)

Speech.js contains the following functions to make working with the DataTableNavigator application control easier:

  • GetNavigator(navigatorName): Returns a DataTableNavigator object reference given its name as a string.
  • GetNavigatorCount(navigatorName): Returns the count of items in the given DataTableNavigator.
  • GetNavigatorData(navigatorName, columnName, index): Returns the data contained in the DataTableNavigator named navigatorName, the row specified by index, and the column specified by columnName.
  • GetNavigatorDataAtIndex(navigatorName, columnName): Returns the data contained in the DataTableNavigator named navigatorName, the currently selected row, and the column specified by columnName.

Prompt Generation (PromptGenerator.js)

Prompt Generation is perhaps the most central element when creating a successful voice-only application. Providing a consistent voice interface is essential to creating a successful user experience. PromptGenerator.js does just this by encapsulating all common prompt-generation functionality in one place.

A prompt function in a typical page will always return the result of a call to PromptGenerator.Generate() as its prompt:

return PromptGenerator.Generate(
   RunSpeech.ActiveQA.History, 
   "Prompt Text Here", 
   "Help Text Here"
);

Notice that the prompt function passes both its main prompt and its help prompt into the function every time. PromptGenerator.Generate() decides the appropriate prompt to play given the current command history:

function PromptGenerator.Generate(History, text, help)
{
   help += " You can always say Instructions for more options.";
   
   var prevCommand = GetHistoryItem(History,2);

   switch( GetHistoryItem(History,1) )
   {
      case "NoReco":
         if (prevCommand == "Silence" || prevCommand == "NoReco")
            return "Sorry, I still don't understand you.  " + help;
         else
            return "Sorry, I am having trouble understanding you. " +
               "If you need help, say help. " + text;
      case "Silence":
         if (prevCommand == "Silence" || prevCommand == "NoReco")
            return "Sorry, I still don't hear you.  " + help;
         else
            return "Sorry, I am having trouble hearing you. " +
               "If you need help, say help. " + text;
      case "Help":
         PromptGenerator.RepeatPrompt = help;
         return help;
      case "Instructions":
         var instructionsPrompt = "Okay, here are a few instructions...";
         PromptGenerator.RepeatPrompt = instructionsPrompt + text; 
         return instructionsPrompt;
      case "Repeat":
         return "I repeat: " + PromptGenerator.RepeatPrompt;
      default:
         PromptGenerator.RepeatPrompt = text;
         return text;
   }
}

Note   Some of the longer strings have been shortened in the above code sample to save space.

A note on "Repeat": The PromptGenerator.RepeatPrompt variable stores the current text that will be read if the user says "Repeat." The first time the function is executed for any prompt, the RepeatPrompt will be set to the standard text. The RepeatPrompt is then only reset when the user says "Help" or "Instructions."
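The effect of RepeatPrompt is easiest to see by stepping through a short conversation. The following is a reduced stand-in for the function above, not the sample's code: it keeps only the cases needed to show the behavior, takes the latest command directly instead of reading the History array, and uses the hypothetical name PromptSketch.

```javascript
// Reduced stand-in for PromptGenerator: only the cases that matter for "Repeat".
var PromptSketch = { RepeatPrompt: "" };

PromptSketch.Generate = function (lastCommand, text, help)
{
   switch (lastCommand)
   {
      case "NoReco":
      case "Silence":
         // Error prompts do not overwrite RepeatPrompt.
         return "Sorry, I am having trouble understanding you. " +
            "If you need help, say help. " + text;
      case "Help":
         PromptSketch.RepeatPrompt = help;
         return help;
      case "Repeat":
         return "I repeat: " + PromptSketch.RepeatPrompt;
      default:
         PromptSketch.RepeatPrompt = text;
         return text;
   }
};
```

Because the error cases leave RepeatPrompt untouched, a user who says "Repeat" after an unrecognized input hears the original question again rather than the error message, which is the behavior described for the Repeat global command.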

Other PromptGenerator functions: PromptGenerator also includes a number of other functions for generating prompts in the application. They include:

  • GenerateNavigator(History, text, help): This function adds to the functionality of Generate() by including standard prompts commonly needed while in a DataTableNavigator control. These prompts include additional help text and messages for when the user tries to navigate beyond the boundaries of the DataTableNavigator.
  • ConvertNumberToWords(number, isMoney): In order to generate recorded prompts for all possible number values, we must convert numbers (for example, 123,456) to a readable string (for example, "one hundred twenty three thousand four hundred fifty six"). This reduces the number of unique words that must be recorded to a manageable amount.
  • ConvertDateToWords(dateString): Like ConvertNumberToWords, this function converts dates to a prompt-ready format (for example, "12/12/02" becomes "December Twelfth Two Thousand Two").
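To make the number conversion concrete, here is a minimal sketch of the idea. It is not the sample's ConvertNumberToWords: it ignores the isMoney parameter, handles only non-negative integers below one billion, and the names NumberToWordsSketch and SmallNumberToWords are assumptions introduced for the example.

```javascript
var ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
   "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
   "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"];
var TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
   "seventy", "eighty", "ninety"];

// Converts an integer from 1 to 999 to words.
function SmallNumberToWords(n)
{
   var words = [];
   if (n >= 100)
   {
      words.push(ONES[Math.floor(n / 100)], "hundred");
      n = n % 100;
   }
   if (n >= 20)
   {
      words.push(TENS[Math.floor(n / 10)]);
      n = n % 10;
   }
   if (n > 0)
      words.push(ONES[n]);
   return words.join(" ");
}

// Converts a non-negative integer below one billion to words.
function NumberToWordsSketch(number)
{
   if (number === 0)
      return ONES[0];

   var scales = ["", "thousand", "million"];
   var groups = [];
   // Work through the number three digits at a time, lowest group first.
   for (var i = 0; number > 0 && i < scales.length; i++)
   {
      var chunk = number % 1000;
      if (chunk > 0)
      {
         var suffix = scales[i] ? " " + scales[i] : "";
         groups.unshift(SmallNumberToWords(chunk) + suffix);
      }
      number = Math.floor(number / 1000);
   }
   return groups.join(" ");
}
```

Every output is built from the same small vocabulary (the entries in ONES and TENS plus "hundred", "thousand", and "million"), which is what keeps the set of recordings manageable.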

Designing Your Grammar

Items in your grammar files define what words and phrases are recognized. When the speech engine matches an item from the grammar file, it returns SML (Semantic Markup Language), which your application uses to extract definitive values from the text that the user spoke. Too strict a grammar leaves users no flexibility in what they can say; however, too many unnecessary grammar items can lower speech recognition accuracy.

Preambles and Postambles

Very often, you will want to allow a generic "preamble" text said before the main item, and "postamble" text said after the main item. For instance, if the main command is "Buy Stock," you would want to allow the user to say, "May I Buy Stock please?"

Typically, you can use one grammar (.grxml) file for your preambles and one for your postambles. Within your other grammar rules, you can then reference the preambles and postambles by using RuleRef elements.

Tip   Make the preambles and postambles generic and robust enough that you don't limit your users' experience, but keep them reasonable in size so that you don't risk lowering recognition accuracy for your main elements.

Use the Grammar Editor tool to graphically set up grammar files. You begin by setting up a phrase or a list of phrases. Then you add a semantic tag following key elements to indicate when each phrase is recognized.

Figure 5. Detail of the Grammar Editor tool

We found that the following strategies helped us in grammar development:

  • Typically, if we only need to recognize that a text phrase has been matched, especially in the case of commands, we create a semantic tag that adds a sub-property with the empty string value. For example, if you want to capture when the user says "Help," you can simply return the following SML:

    <SML confidence="1.000" text="help" utteranceConfidence="1.000">
          <HelpCmd></HelpCmd>
    </SML>
    

    The control associated with this grammar file recognizes the phrase, and returns the SML element HelpCmd; the code-behind or client-side script makes a decision based on the SML element being returned, rather than the value.

  • Never match a grammar based on the root node /SML. Since every matched grammar returns this node as its root, your semantic item will be matched in every case.

  • Use rule references within grammar files to avoid duplicating the same rule across different speech controls.

    Tip   You must make sure that a rule to be referenced is a public rule, which you can set through the properties pane.

    Figure 6. Using rule references within a grammar

    A common grammars file is included with the Speech Application SDK, both in an XML file version (cmnrules.grxml) and in a smaller, faster compiled version (cmnrules.cfg). We copied the compiled version into our project and used it for commonly used grammar elements, such as digits and letters in the alphabet.

Coding Conventions

Server-Side Programming

Unlike traditional ASP.NET programming, the Speech Application SDK is primarily a client-side programming platform. Although its controls are instantiated and their properties manipulated on the server-side, controlling flow from one control to another is primarily a client-side task.

The controls offer opportunities to post back to the server automatically, including the SemanticItem's AutoPostBack property and an automatic postback when all QAs on a page are satisfied. As a convention, though, we chose to avoid postbacks except when we needed to access data or business layer functions. Most of our code is written in client-side event handlers, using SpeechCommon.Submit() to post back explicitly when data is needed from the server.

Client-Side Scripts

Because JScript lacks many of the scoping restrictions found in C# or Visual Basic .NET, it is possible when programming on the client side to perform a certain task in many different places. The SpeechCommon object is accessible from any client-side script, and its Submit() method can be executed from event handlers, prompt functions, or any helper routines as well. For this and other reasons, we have followed a set of guidelines for the usage of these various components:

  • Prompt Functions Are Only For Generating Prompts: Never perform an action inside a prompt function that is not directly related to the generation and formatting of a prompt: no navigation flow, semantic item manipulation, and so on. Besides being good practice, the other key reason for reserving prompt functions only for generating prompts is validation. If prompt functions contain calls to SpeechCommon or other in-memory objects, those objects must be declared and their references included in the "Validation References" for the prompt function. If these references are not included, validation will fail for the function. As a rule, the only functions referenced by prompt functions are in PromptGenerator.js.

    One exception to this rule was necessary. DataTableNavigator application controls do not expose events that are equivalent to OnClientActive, or which fire each time a prompt function is about to be executed. For QA controls, we use OnClientActive to call HandleNoRecoAndSilence, which monitors consecutive invalid input for a QA. We expect future versions of the SDK to expose this type of event in the DataTableNavigator control, but until then, we call HandleNoRecoAndSilence from PromptGenerator.GenerateNavigator.

  • No Inline Prompts: Inline prompt functions are simple to configure, but they should only be used when the prompt associated with the control is static and will never change. Since most prompts in CommerceVoice use PromptGenerator for error handling, we avoid the use of inline prompts except where this functionality isn't needed (Goodbye.aspx is one example).

  • Control of Flow Handled in Event Handlers: Flow control is the most important function of event handlers and client activation functions. Most applications that have any complexity require a more complicated flow control than the standard question-and-answer format afforded by laying QA controls down in sequence on a page. For the most part, we achieved this control by manipulating the semantic state within event handlers.
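As a rough illustration of the guidelines above (not code from the sample), the sketch below keeps a prompt function free of side effects and leaves flow control to an event handler. The names siColor, ColorPrompt, and OnColorComplete are hypothetical, and siColor is a plain stand-in object rather than the SDK's semantic item.

```javascript
// Hypothetical stand-in for a semantic item; the real SDK object has more members.
var siColor = {
    value: "",
    Clear: function () { this.value = ""; }
};

// Prompt function: builds and returns text only. It reads its arguments
// and touches no semantic items, navigation, or SpeechCommon.
function ColorPrompt(lastCommand, colorName) {
    if (lastCommand == "Help")
        return "Please say the name of a color.";
    if (colorName != "")
        return "You chose " + colorName + ".";
    return "Which color would you like?";
}

// Event handler: the place where flow control belongs. Here it rejects
// an unsupported answer by clearing the semantic item, so that in the
// real application the question would be asked again.
function OnColorComplete() {
    if (siColor.value == "plaid")
        siColor.Clear();
}
```

Because ColorPrompt depends only on its arguments, it can be exercised with stand-in values alone, which is the property the validation guideline relies on.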

Naming Conventions

We used the following naming conventions throughout our application for consistency:

  • QA Controls: The QA Control can be used for a variety of purposes. We distinguish these purposes by their functions: traditional question-and-answer controls fill a semantic item with the result of user input, confirmations confirm a pre-filled semantic item, and statements are output-only; they do not accept user input.
    • Question-And-Answer: <Name>QA (AddToCartQA)
    • Confirm: <Name>Confirm (NumberOfItemsConfirm)
    • Statement: <Name>Statement (RestartBrowseStatement)
  • DataTableNavigator Controls: <Name>Nav (CategoryNav)
  • Commands: <Name>Command (BrowseCommand)
  • Semantic Items: si<Name> (siProductID)

JScript and C# server-side code use naming conventions standard in those environments.

In-Depth: Browse Feature

Next, we'll show how all of these common elements are used to build the Browse feature. In the CommerceVoice application the user can shop for products by browsing the product catalog. First, the user selects a category from the list of categories and then selects a product from the list of products in that category. Once the product is selected, users can find out more about that product and add it to their shopping cart. The interaction is shown in the diagram below.

Figure 7. Catalog browsing flow diagram

In the CommerceVoice application, there are seven categories and an average of six products per category. Keeping the list of categories and products relatively short seemed to help the usability of the application.

Page Setup

Like almost all pages in the CommerceVoice application, we began the Browse page by adding a new C# Web Form to our project. We then placed a GlobalSpeechElements user control onto the page that provides global commands and the style sheet used for the speech controls. Grouping common elements like this into a user control accelerated development and provided consistency across the application. Nothing else was required to use the GlobalSpeechElements user control on the Browse page.

Semantic Items

A semantic map control is added next that contains the semantic items we use on the page. Semantic items are generally associated with a particular speech control and contain the user's answer to a question. Text extracted from the SML document returned by the grammar file is placed into the semantic item. For the Browse page, the following semantic items are used:

Table 3. Semantic items for the Browse page

Semantic Item Name   Control           Description
siCategory           CategoryNav       Holds the category selected by the user.
siProduct            ProductNav        Holds the product selected by the user.
siAddToCart          AddToCartQA       Used to determine if user said 'Add to Cart.'
siNumberOfItems      NumberOfItemsQA   Number of items of selected product to add to shopping cart.
siIgnore...          Various           siIgnore... semantic items are used whenever we want a QA that accepts only commands or accepts no input at all. The semantic item is matched to /SML/Dummy, a non-existent node that cannot be matched.

Semantic Item States

Semantic items play an important role in controlling the flow of execution on a speech enabled Web form. Semantic items can have three states:

  • Empty: Value is not filled (this is the default state)
  • Needs Confirmation: Value is filled-in but confidence is below threshold
  • Confirmed: The value is filled in and is confirmed

When the page executes, the RunSpeech engine controls the flow of execution for the controls on the page (for example, it determines which control to execute next). If the state of all semantic items associated with a QA control is Empty, the RunSpeech engine will activate that QA control. Otherwise, that control will be skipped. In this way, programmatically setting the state of semantic items on a page allows us to customize the flow of execution.
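The activation rule described above can be pictured with a small model. This is only a sketch of the rule as stated in this section, not the SDK's RunSpeech implementation; selectNextQA and the plain objects standing in for QA controls and semantic items are hypothetical.

```javascript
// Simplified model of the rule stated above; not the SDK's RunSpeech code.
var EMPTY = "Empty";
var NEEDS_CONFIRMATION = "NeedsConfirmation";
var CONFIRMED = "Confirmed";

// Return the first QA (in page order) whose semantic items are all Empty,
// or null when every QA would be skipped.
function selectNextQA(qas) {
    for (var i = 0; i < qas.length; i++) {
        var items = qas[i].semanticItems;
        var allEmpty = true;
        for (var j = 0; j < items.length; j++) {
            if (items[j].state != EMPTY) {
                allEmpty = false;
                break;
            }
        }
        if (allEmpty)
            return qas[i];
    }
    return null;
}

// Stand-in page: the category is already confirmed, the product is not.
var siCategory = { state: CONFIRMED };
var siProduct = { state: EMPTY };
var page = [
    { name: "CategoryNav", semanticItems: [siCategory] },
    { name: "ProductNav", semanticItems: [siProduct] }
];
```

In this model, selectNextQA(page) picks ProductNav. Resetting siCategory.state to EMPTY makes CategoryNav the next control again, which is the kind of flow customization the event handlers in this article perform by clearing semantic items.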

Page Execution

When the Browse page first loads, the list of categories is retrieved from the database and loaded into the CategoryNav DataTableNavigator control.

private void LoadCategories()
{
   DataTable dt = new DataTable();
   dt.Columns.Add("CategoryID", typeof(int));
   dt.Columns.Add("CategoryName", typeof(string));
   dt.Columns.Add("CategoryDescription", typeof(string));
   dt.Columns.Add("Products", typeof(int));

   ProductsDB products = new ProductsDB();
   SqlDataReader drCategories = products.GetProductCategories();

   using(drCategories) // Dispose drCategories when done
   {
      // Fill a DataTable with the results contained 
      // in the SqlDataReader object.  The SqlDataReader 
      // cannot be used as a DataSource for the 
      // DataTableNavigator application control.
      while (drCategories.Read())
         dt.Rows.Add(new object[4] { 
            drCategories[0], 
            drCategories[1], 
            drCategories[2], 
            drCategories[3] });
   }

   CategoryNav.DataSource = dt;
   CategoryNav.DataBindField = "CategoryID";
   CategoryNav.DataTextField = "CategoryName";
   CategoryNav.DataContentFields= 
      new StringArrayList("CategoryID,CategoryDescription,Products");
   CategoryNav.DataHeaderFields = new StringArrayList("CategoryName");
            
   CategoryNav.Visible = true;
   CategoryNav.DataBind();
}

The ProductsDB component that is being reused from the CommerceWeb application returns a DataReader, which the DataTableNavigator control does not support as a data source. As a result, we insert the categories into a DataTable and assign that as the data source of the DataTableNavigator.

Selecting Categories and Products

Now that we've loaded the CategoryNav DataTableNavigator with categories, we prompt the user to select a category from the list. The CategoryNav control allows for an initial prompt to be read to the user and then proceeds to read each category in the list. The CategoryNav_prompt_question function is indicative of DataTableNavigator prompt functions used throughout the CommerceVoice application.

{
    var text = "";

    switch(GetHistoryItem(History, 1))   
    {
        case "NVG_contents":
            text = "";
            SetHistoryItem(History, 1, "NoReco");   
            break;

        case "NVG_headers":
            SetHistoryItem(History, 1, "");      
            // Fall through.

        default:
            if (itemIndex == 0)
                text = "First category: " + categoryName;
            else if (itemIndex == itemCount-1)
                text = "Last category: " + categoryName;
            else
                text = categoryName;
    }

    return PromptGenerator.GenerateNavigator(
        History,
        text,
        categoryName + ". <withtag tag='CategoryCount'>" + 
        PromptGenerator.ConvertNumberToWords(products) + "</withtag> " + 
        description + " ");
}

The prompt function determines what to read back to the user based on the command history passed in the History parameter. In this function, when the user requests the contents of the category (for example, by saying "Read"), we treat the response as a no-recognition condition. Also, because DataTableNavigator application controls have their own unique error handling, which does not apply to normal QA controls, the PromptGenerator.GenerateNavigator function is called instead of PromptGenerator.Generate.

Parameters are used to retrieve the category name that is used to read the categories to the user, and for more detailed information when the user asks for help.

Figure 8. Parameter names and values

The DataTableNavigator control allows us to store multiple columns of information for each category, such as the number of products in a category and the category description.

The DataTableNavigator treats silence as the "Next" command when reading categories to the user if shortInitialTimeout is non-zero: if the user is silent, the next category is read. This is the default behavior of the DataTableNavigator application control.

When the user selects a category from the list, the OnClientComplete client-side event handler for the DataTableNavigator is executed.

function CategoryComplete()
{
   if(siCategory.value == "Exit")
      window.location.href = "MainMenu.aspx";
   else if(siCategory.value != "")
   {
      var si = siCategory.value;
      var ind;  // index of selected item 

      // retrieve the index from siCategory
      // (the index attribute should always be set, 
      // so we can count on it existing).
      ind = parseInt(siCategory.attributes["index"], 10);

      // Set attribute to be used in later prompt functions
      siCategory.attributes["CategoryName"] = 
         CategoryNav.Data["CategoryName"][ind];
      
      // This sets the text and calls WritePostBackData, which saves the 
      // information to ClientViewState.
      siCategory.SetText(si, true);
   }
}

We use the attribute collection associated with the semantic item to store related information. In this case, the category ID is stored along with the category name in the semantic item. The true parameter in the SetText function call changes the state of the semantic item to Confirmed.

Retrieving the List of Products

The AutoPostBack property for the siCategory semantic item is set to True. This means when the state changes to NeedsConfirmation or Confirmed, the page is automatically posted back to the server. We've also configured a server-side semantic item changed event handler. This event handler loads the list of products for that category.

Adding Items to the Cart

The AddToCartQA reads the selected product and price to the user and determines whether the user wants to add the product to their shopping cart. First, we assign the QA the BaseQAStyle defined in GlobalSpeechElements1. As described earlier, this provides common threshold settings and adds handling for the case of three consecutive mumbles or silences.

IsProductSelected is a Client Activation function for this QA. RunSpeech calls this function to determine if the QA is available for activation. Returning true allows RunSpeech to activate the control. We only want this QA to be active if the user has selected a product.

function IsProductSelected (testedQA)
{
   return siProduct.IsConfirmed();
}

The value of the siAddToCart semantic item is only used to control program flow on the page. If the item is empty and the user has selected a product, RunSpeech will activate the AddToCartQA. Once the user tells the system they wish to add the product to their cart, the siAddToCart semantic item is no longer empty and RunSpeech moves on to the next QA on the page.

There are two additional commands scoped only to the AddToCart QA: the Description command and the First command. The Description command is used to play back a description of the product. Because the product descriptions are very long, prompts are constructed in a special way described in the Recording Long Prompts section of this document.

The other command scoped to this QA is the First command. When the user is in the DataTableNavigator control, hearing a list of products, they can say "First" to navigate to the first item in the list. We wanted that same functionality for this QA. When the user says "First," the OnFirstCmd client handler function is called.

function OnFirstCmd (smlNode)
{
   StartOverProduct();
}

function StartOverProduct()
{
   // Reset Product Nav and initial statement.
   siIgnoreProductStatement.Clear();
   siProduct.Clear();   

   var ProductNav = GetNavigator("ProductNav");
   ProductNav.Index = 0;
   if(siProduct.attributes["index"] != null)
      delete siProduct.attributes["index"];
}

Notice that the category semantic item is not cleared and the CategoryNav DataTableNavigator is not activated since we want to use the same category. Also, since the CategoryNav and ProductNav controls are already loaded with category and product data, we do not need to post back to the server to retrieve this data from the database again.

Specifying the Number of Items

Now that the user wants to add the product to their shopping cart, we ask them how many items of the selected product they wish to add. We use the cardinal_999 rule from the Grammar Library (the cmnrules.cfg file) that ships with the Speech Application SDK. Refer to the SDK documentation for more information on the Grammar Library. The siNumberOfItems semantic item is filled with the number of items for this product the user wishes to add to their shopping cart.

At this point, we allow the user to say "Cancel" to return to the list of products. Instead of returning them to the first item in the product list, we return them to the item they previously selected. The Cancel command is scoped only to the NumberOfItemsQA. The ReturnToProductList client handler function is called when the user says Cancel.

function ReturnToProductList (smlNode)
{
   StartOverProduct();
   siAddToCart.Clear();
}

When the product is selected we save the index of the product in the list in the attributes collection of the siProduct semantic item. ReturnToProductList sets the index of the ProductNavigator to this saved value so that when RunSpeech activates the ProductNav control again, the previously selected product will be read back to the user. The siAddToCart semantic item is also cleared so that when the user selects another product they are prompted to add the item to their cart.
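A minimal sketch of restoring a saved index, as described above, might look like the following. It uses plain stand-in objects in place of the siProduct semantic item and the ProductNav control, and the helper name RestoreProductIndex is hypothetical; it is not taken from the sample.

```javascript
// Stand-ins for the semantic item and the navigator; not the SDK objects.
var siProduct = { attributes: { index: "3" } };
var ProductNav = { Index: 0 };

// Hypothetical helper: point the navigator at the previously selected row,
// falling back to the first row when no index was saved.
function RestoreProductIndex(semanticItem, navigator) {
    var saved = parseInt(semanticItem.attributes["index"], 10);
    navigator.Index = isNaN(saved) ? 0 : saved;
    return navigator.Index;
}
```

The fallback to 0 is a design choice for the sketch: if the attribute is missing, the list simply starts from the first product.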

Confirming the Number of Items

The NumberOfItemsConfirm QA confirms the siNumberOfItems semantic item set in the previous QA. Note that in the Answers tab of the NumberOfItemsQA, the Confirm Threshold for the siNumberOfItems semantic item is set to 1.

Figure 9. The Confirm Threshold

By setting the Confirm Threshold to 1, we require the semantic item to always be confirmed. Setting the Confirm Threshold to a lower value would require confirmation of the item based on the confidence level. For instance, setting the Confirm Threshold to .5 would only require a confidence level of .5 or greater to automatically confirm the item. In that case, the NumberOfItemsConfirm QA would be skipped by RunSpeech since the item was already confirmed. Setting the Confirm Threshold to 1 ensures that this QA will never be skipped by RunSpeech.
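The effect of the Confirm Threshold can be summarized in a few lines. This is a sketch of the behavior as described in this section only; stateAfterAnswer is a hypothetical function, and the SDK's actual comparison rules may differ in detail.

```javascript
// Model of the behavior described above; not the SDK's implementation.
function stateAfterAnswer(confidence, confirmThreshold) {
    // Per the text, a threshold of 1 means the item must always be confirmed.
    if (confirmThreshold >= 1)
        return "NeedsConfirmation";

    // Otherwise, a confidence at or above the threshold confirms the item
    // automatically, and the confirmation QA would be skipped.
    return confidence >= confirmThreshold ? "Confirmed" : "NeedsConfirmation";
}
```

With a threshold of .5, an answer recognized at a confidence of .7 is confirmed immediately, while one recognized at .3 still goes through the confirmation QA.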

NumberOfItemsConfirm QA

The grammar for the NumberOfItemsConfirm QA allows the user to say "Yes," "No" or "No, I said three" (or any valid cardinal number). You can see this by looking at the QuantityConfirm grammar.

Figure 10. The QuantityConfirm grammar

Saying "Yes" confirms the item. RunSpeech will automatically set the state of the siNumberOfItems semantic item to Confirmed. Saying "No" will set the semantic item's state to Empty. In this case, since the semantic item's state is Empty, the previous QA (NumberOfItemsQA) would be activated again and the user would be prompted for the quantity.

Saying "No" and a different number will set the state of the semantic item to Empty, but would also fill its value with the new quantity without activating the previous QA. This lets users who are familiar with the CommerceVoice application correct the quantity quickly. This is easy to set up using the property builder for the NumberOfItemsConfirm QA.

Figure 11. Settings for siNumberOfItems semantic item

First, notice that the Confirms tab is used instead of the Answers tab. This tells RunSpeech to confirm the semantic items in the list. We use XPath to tell RunSpeech what to extract from the SML document that is returned from the grammar file. For a simple Yes/No confirm, all that is required is a grammar that returns yes or no. We also allow the user to specify a new quantity by providing an XPath trigger for siNumberOfItems.

Prompt Databases

The standard Text-To-Speech (TTS) engine may work well for development and debugging, but recorded prompts make a voice-only application truly user-friendly. Although recording can be tedious, Microsoft's prompt validation utilities and recording engine make the process easier.

Validation

Thorough validation is important to make sure that no prompts are being missed. A few general strategies enabled us to make sure that our prompt generation functions were being validated completely and accurately:

  • No object-references within prompt functions: Except for calls to PromptGenerator.js, we never make calls to script objects within the body of our prompt functions. Instead, our prompt function arguments are defined so that all function calls are made before the inner prompt function is executed. This avoids errors on validation that prevent prompts from appearing. Example: In the snapshot below, note the call to insertSpaces() in the productID variable declaration. A product ID ("355") must be separated into its component digits to be read correctly by recorded prompts. We make the call to the helper function that does this in the variable declaration and provide an already-formatted version of the productID ("3, 5, 5") as the validation value.

    Figure 12. Using the insertSpaces() function to avoid validation errors

  • Stand-In Validation Values: When it comes to validation values with a large number of potential values (for example, numbers, dates, product names, and so on) we always provide a stand-in validation value that represents the entire set for the validator. We then make sure that the entire set is recorded in the prompt database. For instance, by using "Rain Racer 2000" for the product name whenever it is passed into a prompt function, we need only record this one product name for debugging purposes. When the product is ready for testing or professional voice-talent, we then go through and add the rest of the product names.
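To make the first guideline concrete, here is a sketch of a digit-separating helper in the spirit of insertSpaces(). The sample's own implementation is not shown in this article, so the name separateDigits and its separator argument are assumptions made for illustration.

```javascript
// Hypothetical helper in the spirit of insertSpaces(); the sample's own
// implementation is not shown in this article.
// Splits an ID into its characters and joins them with the separator,
// so that each digit can be matched to its own recorded prompt.
function separateDigits(productID, separator) {
    return String(productID).split("").join(separator);
}
```

Calling the helper in the prompt function's variable declaration, rather than in its body, means the body only ever sees an already-formatted string such as "3, 5, 5", which can also serve as the validation value.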

Achieving Realistic Inflection

The following techniques allow us to make our prompts play as smoothly as possible when reading strings that involve combining many different recordings (for example, "[This product costs] [two] [dollars and] [fifteen] [cents].")

Note   Throughout this section, individual prompt extractions are identified with brackets, just as they are in the prompt editor.

  • Record Extractions in Context: Prompt extractions usually sound more realistic when spoken in context. While it may be tempting to record common single words like "items," "dollars," and "products," as individual recordings, they will sound much better when recorded along with the text that will accompany them when they are used in a prompt: "one [item]," "two [items]," and so on. In one highly effective example, we recorded all of our large number terms in one recording: "one [million] three [thousand] five [hundred] twenty five dollars."

  • Recognize and Group Common Word Pairings: When recording singular words like "item," "dollar," and "product," we almost always group them with "one" as they will always be used this way. Our extractions become, "[one item]," "[one dollar]," and "[one product]."

  • Use Prompt Tags: Although we may have recorded "two" and "thousand" in other extractions, these two words together constitute part of any prompt that includes a current date. So, it makes sense to include "two thousand" as an individual extraction itself. Further, by using the tags "YearComplete" and "YearIncomplete," we can record it twice and distinguish between its use in "January First, Two Thousand," where its inflection should drop at the end, and, "January First, Two Thousand Two," where its inflection should rise at the end. We then insert a tag reference in the prompt generation routine. (The following snippet is taken from ConvertYearToWords() in PromptGenerator.js):

    if (year >= 2000)
    {
       year -= 2000;
       yearString = "two thousand";
       if (year == 0)
          return "<withtag tag='YearComplete'>" + yearString + 
            "</withtag>";
       else
          yearString = "<withtag tag='YearIncomplete'>" + yearString + 
            "</withtag>";
    }
    
  • Use Display Text to Your Advantage: To achieve high-quality extractions when recording sentences, we modified the display text column of our transcriptions to indicate where the extractions were. As an example, the transcription "[Order number] 5 8 3 [This order has] one item" has the display text, "Order number, 5 8 3, This order has, one item." Commas are inserted between extractions. During recording, the voice talent can pause at the appropriate places so that the extractions are recorded clearly.

Recording Long Prompts

Although the prompt editor's automatic alignment feature is a powerful tool, it becomes unwieldy when recording lengthy prompts. In CommerceVoice, we needed to record the descriptions of all the products in the product catalog. Each description averaged over sixty words. In addition, using the alignment tool would require us to update our prompt database each time a description changed in the database, a costly and time-consuming prospect.

Instead of using the alignment engine, we bypassed the issue by aligning the entire description with one alignment: PRODUCT_DESCRIPTION_<PRODUCTID>:

Figure 13. The alignment tool

The prompt editor requires that each individual word of the transcription matches an individual alignment within the waveform, so the transcription text must match this alignment:

Figure 14. Checking waveform alignments in the prompt editor

Finally, in our prompt functions, we refer to descriptions by outputting this product description keyword, simply the concatenation of the PRODUCT_DESCRIPTION_ prefix and the product ID:

 
if (GetHistoryItem(History,1) == "Description")
   text = "Here is the description: PRODUCT_DESCRIPTION_" 
              + productIDNoSpaces;

Besides the obvious advantage of avoiding large numbers of alignments, we also avoid having to retrieve product descriptions from the database in order to play them. We also separate the voice-only version of the description from the text version, allowing us to make changes to the Website without having to re-record our prompts every time.

Running the Application

Our user tests were designed with two main goals in mind:

  • Verify that the system performed well in real-life scenarios: The main goal is simply to verify that testers can manage the basic tasks that real customers would want to perform.
  • Exercise the full feature-set of the application: In addition to testing standard goals, it was important to make sure that the complete feature set of the application was tested as well. Testers were guided to parts of the system that might not necessarily be on a most-likely-path scenario, in order to make sure that the entirety of the system worked as expected.

To accomplish these goals, we gave our testers scenarios that included both common tasks and special-case scenarios designed to guide the user toward special situations. A sample script might look like this:

TASK ONE (Product-Number-Driven Ordering)

  1. You noticed a product in a magazine that is sold at the IBuySpy store. The product number was listed as 3-6-0. Purchase this product.

TASK TWO (Catalog-Based Ordering and Shopping Cart Review)

  1. Needing more power to persuade, you want to buy the Persuasive Pencil, a product found in the Communications category. Please purchase one unit.
  2. Sustaining damage to your car on your last mission, you want to purchase the Universal Repair System, but only if it is safe for cars. Check the product description and if it is safe, purchase two.
  3. Deciding that it may also make sense to have a Persuasive Pencil for your vacation home, you decide it makes more sense to buy two pencils instead of one. Update your order.
  4. Finish the transaction and make a note of the order ID. End the call.

TASK THREE (Review Previous Orders)

  1. With the order ID from the previous task, check to see if your order shipped.
  2. You've become confused about how many Persuasive Pencils you ordered. Check the order details to verify the quantity shipped.

Test subjects were given account numbers and PINs to log into their account, but otherwise were left alone to complete the tasks. Tests were repeated with a number of different test subjects and over a number of successive product revisions.

Lessons Learned

We learned a great deal about building voice-only applications through the process of building these samples. Here we note some of the major points in the areas of user testing, design, and development.

Testing

The testing and tuning phase is important in any application, but it is especially important to the design of voice applications. We found that tuning our prompts, accept thresholds, and timeouts was key to making the application useful. Here are a few suggestions on how to conduct effective testing and tuning for voice-only systems.

Properly Configure Testing Equipment First

Many of our early user tests generated numerous usability problems that were due to improper configuration of the microphone. The microphone was too sensitive, picking up background noise, feedback from the speaker output, and slight utterances as user input. Users became increasingly frustrated as they found it difficult to hear a prompt in its entirety. This affected test results significantly.

Select Testers Carefully

We found that testing subjects brought a variety of expectations to the testing process. Developers whom we used as subjects often made assumptions about the way the system was working and became confused with ambiguous prompts like, "Would you like to start shopping or review your previous orders?" They preferred more explicit choices: "Say start shopping to start shopping or review orders to review your account history." Testers with a less technical background preferred less structured prompting; they felt they were speaking with a more friendly system.

To conduct effective tests, make sure the user group you are testing matches the target user group for your application.

Design

The most important lesson we learned while designing the application was the need to tune the prompt design throughout development. From the first stages of implementation through user testing of the completed system, we made changes to prompts to achieve a more fluid program flow. In speaking with other teams who have attempted similar projects, we found that this is a fundamental part of voice-only application development.

With that in mind, here are a few points that will make the tuning process much more efficient:

  • Long Prompts Don't Equal Helpful Prompts: At the outset, our design team approached the goal of a friendly interface by writing friendly text. Testing quickly revealed that verbose prompts were a serious impediment to usability. By keeping prompts short, users understood better what to do.
  • Express Sentiment with Tone/Inflection: We found that helpfulness is best expressed through intonation and inflection, rather than extra words. A prompt like, "I'm sorry. I still didn't understand you. My fault again," expresses an apologetic sentiment on paper quite well, but spoken, it becomes excessive. This prompt became, "I'm sorry. I still didn't understand you," and we let the inflection of the speaker express the emotion. A good rule of thumb: speak prompts first before writing them down.
  • Build Cases For Invalid (but likely) Responses: Our tests surprised us when a majority of users answered, "Yes," to the question, "Would you like to start shopping or review your previous orders?" We realized that part of the problem was the way in which the question was asked, but still, we built in a command to accept that response and provide a helpful response.
  • Keep the Number of Options Small: We found that listing more than three or four choices in a prompt dramatically reduced usability. Users would get confused and would not remember their choices. We made every effort to reduce the number of options offered to a user in any given prompt.
  • Maintain a Prompt Style Guide: Design teams are used to maintaining style guides for their designs, and voice-only applications should be no exception. Having a consistent set of prompt styles and standard phrasings is paramount to creating a sense of familiarity for the user. Our team recommends an iterative process: modify the guide liberally in the early stages of a project as new cases arise. Then, toward the later stages, tweak new cases to fit the existing rules. This process should lead to a consistent user experience throughout your system.

Development

We needed to make several changes to our development strategy worth noting here.

Necessary Modifications to the Business and Data Layers

The concept of building a voice-only presentation layer as a replacement for a GUI necessitates a few changes to the database and business logic layers we didn't foresee. These changes both relate to the types of data required by the particular constraints of the voice medium:

  • Pluralized Names: In a GUI context, quantities of items are usually expressed in some sort of table format that very closely resembles a table in a database:

    Product Name                   Quantity
    Counterfeit Creation Wallet    2
    Contact Lenses                 4

    In a voice-only context, while it is possible to read this information as, "Product Name: Counterfeit Creation Wallet, Quantity: 2, Product Name: Contact Lenses, Quantity: 4" it is preferable to read it as, "Two Counterfeit Creation Wallets, and Four Contact Lenses." We added a productNamePlural field to our Products table to enable this change.

  • Different Login Information: The Web version of the store accepts an e-mail address and password as its login information. Neither is easily expressed in a voice context. We replaced these fields with Account Number and PIN fields, which also necessitated database changes.

    It becomes evident that these changes are changes to user-interface elements stored in the data layer. In essence, a product name is really a GUI identifier for the product row. In the voice-only context, these identifiers may not always apply and so may require changes to the database layer.
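The pluralized-name change can be illustrated with a short sketch that turns cart rows into a spoken sentence. The row shape (name, namePlural, quantity), the describeCart function, and the small numberToWords stand-in for PromptGenerator.ConvertNumberToWords are all assumptions made for this example.

```javascript
// Tiny stand-in for PromptGenerator.ConvertNumberToWords; covers 0-9 only
// and falls back to digits for anything larger.
var SMALL_NUMBERS = ["zero", "one", "two", "three", "four",
                     "five", "six", "seven", "eight", "nine"];

function numberToWords(n) {
    return (n >= 0 && n < SMALL_NUMBERS.length) ? SMALL_NUMBERS[n] : String(n);
}

// Build a spoken summary from rows of the (hypothetical) form
// { name, namePlural, quantity }, choosing the plural name when needed.
function describeCart(rows) {
    var parts = [];
    for (var i = 0; i < rows.length; i++) {
        var row = rows[i];
        var label = (row.quantity == 1) ? row.name : row.namePlural;
        parts.push(numberToWords(row.quantity) + " " + label);
    }
    if (parts.length > 1)
        parts[parts.length - 1] = "and " + parts[parts.length - 1];
    return parts.join(", ") + ".";
}

var cart = [
    { name: "Counterfeit Creation Wallet",
      namePlural: "Counterfeit Creation Wallets", quantity: 2 },
    { name: "Contact Lenses", namePlural: "Contact Lenses", quantity: 4 }
];
```

Storing the plural form alongside the name avoids guessing at pluralization rules in script, which matters for names such as "Contact Lenses" that do not change.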

For More Information

The complete documentation and source code can be obtained here.