Microsoft Speech Server Deployment: Speech-Enabled Sample Applications

06/30/2006

Vertigo Software, Inc.

August 2004

Applies to:
   Microsoft Speech Application Software Development Kit (SDK)
   Microsoft Speech Server
   Microsoft ASP.NET Commerce Starter Kit
Summary: This article describes the process of deploying a speech-enabled application built using the Microsoft Speech Application Software Development Kit (SDK) to Microsoft Speech Server. Its content is based on the first-hand experience of the developers of the Speech-enabled sample applications.

Purpose
Brief Introduction: The Microsoft Speech Server
Deploying the Applications to the Speech Server
Preparing for Deployment: Telephony Application Simulator
Adding Call Controls to Existing Applications
Best Practices for Design and Performance Tuning
Best Practices for Caching Resources
Suggestions for Further Reading

Purpose

After completing the latest version of the Speech-enabled ASP.NET Commerce Starter Kit and Speech-enabled Fitch & Mather Stocks (referred to hereafter as "CommerceVoice" and "FMStocksVoice," respectively), our team deployed the applications to the Microsoft Speech Server. This deployment involved both learning how to use the Server effectively and making necessary modifications to both sample applications. For this paper we have the following goals:

Show how to deploy an application on the Microsoft Speech Server. We aim to demonstrate some of the changes that need to be made to an application when deploying to the server. We also review the steps for deploying an application.
Explain best-practice techniques for working with the Speech Server. From development to production, we review techniques for ensuring that server deployment is a straightforward process.
Serve as an addendum to the Speech-enabled Sample Applications whitepapers. This document is written for developers who need to deploy their speech-enabled applications to the Speech Server. From that perspective, its content picks up where those whitepapers left off. Please refer to the respective whitepapers for more information on these applications.

Brief Introduction: The Microsoft Speech Server

Note Detailed information on the Microsoft Speech Server can be found here.

The Microsoft Speech Server consists of two services that interact with the Web application to provide functionality in a telephony environment:

Telephony Application Services (TAS): This service manages the application flow by communicating with the Web browser for content, with the Telephony Interface Manager (TIM) for user interaction, and with Speech Engine Services (SES) for prompt playback and speech recognition. It acts as the coordinator for all the other components.
Speech Engine Services (SES): SES manages only the tasks of prompt generation and speech recognition. It communicates with TAS to determine active grammars and prompt resources, and with the Web server to access those resources.

These two services can run on a single server, or they can be distributed across multiple servers for scalability. Multiple instances of each service can run on separate machines and be load-balanced in either hardware or software.

Our Server Setup

For the purposes of our deployment, we used a single Windows Server 2003 box to hold both TAS and SES. This box also contained an Intel 4-Port Analog telephony board (model D41JCTLS). We used Intel's Dialogic Call Manager software as our Telephony Interface Manager (TIM).

Our Web server was a separate box running Windows XP Professional.

Figure 1. Web server setup

Since the purpose of our deployment was simply to ensure our sample applications ran correctly on the Speech Server, we did not focus on scalability across multiple servers.

Deploying the Applications to the Speech Server

Deployment of a Web application to the Speech Server involves the following steps:

1. Install the Telephony Board and connect to a phone line.

2. Install the Telephony Interface Manager (TIM).

   The TIM is provided by a third-party vendor (current providers include Intel and Intervoice). You should configure and test the TIM before installing the Speech Server software.

3. Install the Speech Server Software (TAS and SES).

   Depending on your server setup, TAS and SES may be installed to the same or different computers, though TAS must be located on the same computer(s) that the TIM and the telephony board are installed on. Install any prerequisite software and run the Speech Server MSI, selecting the components you wish to install.

4. Deploy the ASP.NET Speech Web Application.

   Install your Web application to a server accessible by both TAS and SES. This can be the same box or a different server (or multiple Web servers configured with load balancing).

5. Create a management console and add the Microsoft Speech Server Snap-In.

   Add your speech server to both the Speech Engine Services and Telephony Application Services subfolders. If TAS and SES exist on separate boxes or if you are load balancing either system, you should add each individual server.

6. Configure Trusted Sites and set Outbound Start Page.

   In the management console for Telephony Application Services, right click on your server and select All Tasks -> Trusted Sites... Add a reference to your Web server.

   Then set your Outbound Start Page in the properties for your Telephony Application Services server.

   Note If you have multiple servers, you can replicate the settings on one server to all servers by right clicking either the Speech Engine Services or Telephony Application Services sub-folders and selecting Replicate...

   Figure 2. Configure trusted sites

   Figure 3. Set Outbound Start Page

7. Restart your TAS and SES services from the management console.

Preparing for Deployment: Telephony Application Simulator

The Speech Application Software Development Kit (SDK) includes the Telephony Application Simulator (TASim), a tool that eases the process of deploying to the Speech Server. TASim is meant to provide an accurate simulation of how the application will run on the Speech Server. It simulates the functionality of Telephony Application Services and also works with the Speech Debugging Tool.

TASim also provides the only desktop support for SMEX-based controls and other functionality, including the following that are used in the sample applications:

AnswerCall control
TransferCall control
DisconnectCall control
DTMF input

We used TASim exclusively to develop and test our applications before deploying them to the Speech Server.

Benefits of Debugging Your Application with the Telephony Application Simulator (TASim)

TASim is the recommended way to debug your application on the desktop. Some key benefits of using TASim instead of Internet Explorer are:

Accurate SALT interpreter: TASim provides the benefit of an accurate SALT interpreter environment. Since speech pages can contain the same DOM-based, HTML-based, and ASP.NET-based elements as Web pages, we found it very useful to know whether the SALT interpreter supported particular page elements. For instance, we did not know whether DOM-based commands (for example, redirecting by assigning to location.href) would work in the SALT environment (they do).
Support for DTMF: DTMF input is a standard interface component for a voice-only application. IE does not support this type of input, so TASim is the exclusive environment for debugging applications with DTMF components on the desktop.
Support for SMEX-based call control information: TASim allows you to specify the number being dialed, as well as caller-ID information, at the beginning of a debugging session. This information can be used to drive your application logic. TASim also allows you to test controls like TransferCall, which, though they don't actually transfer the call, are simulated by the environment.

While TASim is very helpful, there are a few things you need to know about it:

Slower application flow: Timeouts, load times, and execution lag in TASim do not accurately reflect the Speech Server. On slower machines, execution can be quite slow. We recommend that you develop on a fast machine and make sure to test your application on the Speech Server itself.
Speech Engine Services Emulation: TASim is a simulation tool for Telephony Application Services but not for Speech Engine Services. This means that bugs with grammar recognition and prompt playback may appear when deploying to the Speech Server that do not appear when debugging in TASim. More information on these follows.
TASim-debugging must be configured locally: When you create a new Speech Web Application in Visual Studio, the environment automatically configures the debugger settings to execute in TASim. These settings are stored in the Visual Studio Web cache, not in SourceSafe. If multiple developers are working on the application simultaneously through SourceSafe, each developer must configure these settings individually on his or her own box.

Adding Call Controls to Existing Applications

As we deployed our sample applications to the Speech Server, we found that we needed to pass SMEX-based call controls from page to page.

Our sample applications place each major feature (Buy Stock, Sell Stock, Check Portfolio, and Research Quotes, for example) on its own ASP.NET page. Moving from one feature to another therefore requires a redirection to a new ASP.NET page.

When making this redirection, SMEX "call controls" are passed by means of query string parameters to the new page in order to maintain the call context. Failing to pass this context will cause SMEX-based controls, such as TransferCall and DisconnectCall, to fail.

We created the following function to automatically add necessary call control parameters to a URL and perform the redirection:

// Handles all client-side page redirects.
// Any arguments after url are assumed to be of the form "name=value"
// and are appended to the query string.
function Goto(url)
{
   // Start a new query string unless the URL already contains one.
   var href = url + (url.indexOf("?") == -1 ? "?" : "&");

   // Pass SMEX call control information to new page.
   href += "CallID=" + RunSpeech.CurrentCall().Get("CallID");
   href += "&DeviceID=" + RunSpeech.CurrentCall().Get("DeviceID");
   href += "&CallingDevice=" +
              RunSpeech.CurrentCall().Get("CallingDevice");
   href += "&CalledDevice=" +
              RunSpeech.CurrentCall().Get("CalledDevice");
   href += "&MonitorCrossRefID=" +
              RunSpeech.CurrentCall().Get("MonitorCrossRefID");

   // Append any additional "name=value" parameters to the URL.
   for (var i = 1; i < arguments.length; i++)
   {
      href += "&" + arguments[i];
   }

   location.href = href;
}

Fortunately, this aspect of the Speech Server is represented in TASim and will show up during desktop debugging. Unless you are using SMEX-based controls, though, it's easy to miss.
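The URL-building part of such a redirect helper can be separated from the redirect itself, which makes it possible to exercise the logic outside the SALT environment. The following is only a sketch under stated assumptions: buildCallUrl, CALL_CONTROL_KEYS, and FakeCall are hypothetical names that are not part of the sample applications or the SDK, and escaping the values with encodeURIComponent assumes the receiving page decodes its query string in the usual way.

```javascript
// Names of the SMEX call control values forwarded from page to page.
var CALL_CONTROL_KEYS = [
   "CallID", "DeviceID", "CallingDevice", "CalledDevice", "MonitorCrossRefID"
];

// Builds the redirect URL without performing the redirect.
// call:        any object with a Get(name) method; on the Speech Server
//              this would be RunSpeech.CurrentCall().
// extraParams: optional array of "name=value" strings.
function buildCallUrl(url, call, extraParams)
{
   var parts = [];

   for (var i = 0; i < CALL_CONTROL_KEYS.length; i++)
   {
      var key = CALL_CONTROL_KEYS[i];
      // Assumption: the receiving page accepts escaped values.
      parts.push(key + "=" + encodeURIComponent(call.Get(key)));
   }

   if (extraParams)
   {
      for (var j = 0; j < extraParams.length; j++)
      {
         parts.push(extraParams[j]);
      }
   }

   // Use "&" when the URL already carries a query string.
   var separator = (url.indexOf("?") == -1) ? "?" : "&";
   return url + separator + parts.join("&");
}

// Hypothetical stand-in for RunSpeech.CurrentCall(), for desktop tests only.
function FakeCall(values)
{
   this.Get = function (name) { return values[name]; };
}
```

In a speech page, the result would be assigned to location.href, for example: location.href = buildCallUrl("SignIn.aspx", RunSpeech.CurrentCall());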

Best Practices for Design and Performance Tuning

Because the differences between TASim and the Speech Server environment prevent developers from tuning performance accurately on the desktop, we suggest the following techniques to make tuning as easy as possible:

Use SpeechControlSettings for Easy Performance Tuning

The SpeechControlSettings control allows you to assign property settings for the controls in your application in one control. In the sample applications, we created styles for a number of different purposes, including global commands, standard QAs, statement QAs (QAs that don't accept user input), and so forth.

Since the Speech Server behaves differently from TASim with respect to timing, you will need to adjust timeout settings after deployment. By making liberal use of the SpeechControlSettings control, you can modify timeouts across your application from one control.

Logging

The Speech Server allows you to log a variety of data, including events and recordings of user input. From a development perspective, you can use these logging tools in much the same way you use the output window in the Speech Debugger. See the Speech Server documentation for more information on how to enable logging.

Logging may be useful if you wish to record information specific to your application that the Speech Server does not automatically log. For instance, if your company has a specific format for logged data, you can use explicit logging to record relevant information in this format.

To record your own custom events, use the LogMessage function in your client-side code. In the sample applications, we log a message each time a call is connected with the following code (associated with the OnClientConnected event for the AnswerCall control):

function myClientConnected( obj, 
                            callid, 
                            networkCallingDevice, 
                            networkCalledDevice) 
{ 
   LogMessage("", 
              "Call has been Connected: " + 
               RunSpeech.CurrentCall().Get("MonitorCrossRefID") + " - " + 
               RunSpeech.CurrentCall().Get("DeviceID") + " - " + 
               callid + " - " + 
               networkCalledDevice + " - " + 
               networkCallingDevice ); 
   Goto("SignIn.aspx"); 
}
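If logged data must follow a fixed format, the message text can be assembled by a small helper before it is handed to LogMessage. The sketch below is illustrative only: formatCallLog and the "(none)" placeholder are hypothetical and not part of the sample applications; the " - " separator simply mirrors the handler above.

```javascript
// Joins the supplied values into one log line, separated by " - ".
// Missing or empty values are replaced by a placeholder so that every
// log line has the same number of fields.
function formatCallLog(eventName, values)
{
   var fields = [];

   for (var i = 0; i < values.length; i++)
   {
      var v = values[i];
      var missing = (v === undefined || v === null || v === "");
      fields.push(missing ? "(none)" : String(v));
   }

   return eventName + ": " + fields.join(" - ");
}
```

Inside the connected handler, the call would then read LogMessage("", formatCallLog("Call has been Connected", [callid, networkCalledDevice, networkCallingDevice]));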

Testing on the Speech Server: Clearing Cached Resources

As you test your application on the Speech Server, you will inevitably need to make changes to your application and retest. After making your changes, make sure to clear cached resources (All Tasks -> Clear Cached Resources) for the following services if you have made any of the associated changes:

TAS: Any changes to .aspx or .ascx pages or associated JScript files.
SES: Changes to grammar files or prompt databases.
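The rule above can be written down as a small lookup, which may help when a batch of files changes between test runs. This is a sketch, not a Speech Server tool: servicesToClear and CACHE_OWNER are hypothetical names, and the extension list is an assumption (.js for JScript files, .grxml and .cfg for grammars, .prompts for prompt databases).

```javascript
// Assumed mapping from file extension to the service whose cache
// should be cleared when a file of that type changes.
var CACHE_OWNER = {
   ".aspx": "TAS", ".ascx": "TAS", ".js": "TAS",
   ".grxml": "SES", ".cfg": "SES", ".prompts": "SES"
};

// Returns the services ("TAS", "SES") whose cached resources need
// clearing for the given list of changed file paths.
function servicesToClear(changedFiles)
{
   var needed = {};

   for (var i = 0; i < changedFiles.length; i++)
   {
      var file = changedFiles[i].toLowerCase();
      var dot = file.lastIndexOf(".");
      var owner = (dot == -1) ? undefined : CACHE_OWNER[file.substring(dot)];
      if (owner)
      {
         needed[owner] = true;
      }
   }

   var result = [];
   if (needed.TAS) result.push("TAS");
   if (needed.SES) result.push("SES");
   return result;
}
```

The caches themselves are still cleared by hand from the management console (All Tasks -> Clear Cached Resources); the function only indicates which service is affected.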

Best Practices for Caching Resources

The following techniques can be used to increase the performance of your speech application:

Use Manifest.xml

Included in a new Speech Web Application in Visual Studio is a file named manifest.xml. This file specifies the resources that SES will access as it processes your application. Include references to both your grammar files and prompt databases in this file:

<?xml version="1.0" encoding="utf-8"?>

<!-- To improve performance, the Microsoft Speech Server can pre-load and 
cache application resources, such as grammar files and prompt databases. 
Use this file to specify resources to pre-load. -->

<manifest>
  <application name="FMStocksVoice" frontpage="Default.aspx">
    <resourceset type="TelephonyRecognizer">
      <resource src="Grammars/BuyStockPage.grxml"/>
      <resource src="Grammars/CheckPortfolioPage.grxml"/>
      <resource src="Grammars/Common.grxml"/>
      <resource src="Grammars/Companies.grxml"/>
      <resource src="Grammars/Library.grxml"/>
      <resource src="Grammars/MainMenuPage.grxml"/>
      <resource src="Grammars/ResearchQuotes.grxml"/>
      <resource src="Grammars/SellStockPage.grxml"/>
      <resource src="Grammars/SignInPage.grxml"/>
    </resourceset>
    
    <resourceset type="Voice">
      <resource src="FMStocksPrompts/Debug/FMStocksPrompts.prompts"/>
    </resourceset>
  </application>
</manifest>

In the properties for your SES server, specify the location of manifest.xml under Preloaded Resource Manifest so SES can preload these resources.
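When the list of grammars and prompt databases changes often, the manifest can be generated instead of being edited by hand. The following sketch reproduces the structure of the manifest shown above; buildManifest, resourceSet, and escapeXmlAttribute are hypothetical helper names, and the element and attribute names are taken only from that example.

```javascript
// Escapes a value for use inside a double-quoted XML attribute.
function escapeXmlAttribute(value)
{
   return String(value)
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/"/g, "&quot;");
}

// Builds one <resourceset> element with a <resource> per source path.
function resourceSet(type, sources)
{
   var lines = ['    <resourceset type="' + type + '">'];

   for (var i = 0; i < sources.length; i++)
   {
      lines.push('      <resource src="' +
                 escapeXmlAttribute(sources[i]) + '"/>');
   }

   lines.push('    </resourceset>');
   return lines.join("\n");
}

// Builds the full manifest text for one application.
function buildManifest(appName, frontPage, grammars, promptDatabases)
{
   return [
      '<?xml version="1.0" encoding="utf-8"?>',
      '<manifest>',
      '  <application name="' + escapeXmlAttribute(appName) +
         '" frontpage="' + escapeXmlAttribute(frontPage) + '">',
      resourceSet("TelephonyRecognizer", grammars),
      resourceSet("Voice", promptDatabases),
      '  </application>',
      '</manifest>'
   ].join("\n");
}
```

For example, buildManifest("FMStocksVoice", "Default.aspx", ["Grammars/Common.grxml"], ["FMStocksPrompts/Debug/FMStocksPrompts.prompts"]) returns a manifest with one grammar and one prompt database.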

Create Small Prompt Databases

In general, try to reduce the size of individual prompt databases by creating multiple smaller databases. While doing so will not speed up your application initially, if you need to make changes to prompts after your application has been deployed, it will reduce the amount of time SES needs to reload the affected prompt database(s).

In our sample applications, we separated the prompt databases into application prompts that do not change, and company or product prompts that may change quite often.

Use Pre-Compiled Grammars

You can speed up the time it takes SES to process your grammar files by pre-compiling them using the command line tool SrGSGc.exe. This tool creates a file with the .cfg extension; replace references to .grxml files with their .cfg counterparts to enable the use of the pre-compiled versions.

We chose not to pre-compile the grammars in our sample applications so that they remain discoverable for developers. Depending on the needs of your application, you may or may not choose to pre-compile yours.
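Switching to pre-compiled grammars means changing every reference from a .grxml file to its .cfg counterpart. As a minimal sketch, the substitution for a manifest or other markup that uses src attributes could look like the following. useCompiledGrammars is a hypothetical helper, and it assumes each compiled file keeps the base name and folder of its source; references made elsewhere in the application would need the same change.

```javascript
// Rewrites every src="....grxml" attribute so that it points at the
// pre-compiled .cfg file with the same base name. Other resources,
// such as prompt databases, are left untouched.
function useCompiledGrammars(markup)
{
   return markup.replace(/(src="[^"]*)\.grxml"/gi, '$1.cfg"');
}
```

For example, a line containing src="Grammars/Common.grxml" becomes src="Grammars/Common.cfg".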

Suggestions for Further Reading

This whitepaper is by no means an exhaustive exploration of the Microsoft Speech Server. Specifically, we have not explored the area of scalability and tuning your application for use across a farm of servers. This information and more can be obtained from a variety of references available here.
