SALT Programmer's Reference


Speech-enabled Web pages can be created using two different approaches. The first approach uses Web Forms server controls, commonly called Web server controls. The second approach uses textual markup much like conventional HTML authoring: Speech Application Language Tags (SALT), an extension of HTML that introduces a small number of new high-level tags. This topic describes the first approach briefly, and then provides a detailed overview of SALT and why this standard is important to speech application development.

Web server controls and Speech Controls are self-contained elements of a Web page. A Web server control is an element, such as a button or text box, that encapsulates properties, events, and methods. Changing a characteristic of an element no longer requires editing the HTML directly; the developer simply edits the relevant property of the element. Web server controls are similar to the controls in Microsoft Visual Basic and Visual Studio .NET 2003. For example, to change the name of a button, developers need only open the item's property box and type the new name. Web server controls, however, are available only through an ASP.NET page. The trade-off for this simplicity is greater involvement by the server, because the HTML is generated each time the page is rendered. The advantage is that the server's work is transparent to the user.
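As a minimal sketch of the idea (the page directive, control ID, and Text value here are illustrative only), a Web server control is declared with a runat="server" attribute, and its characteristics are then edited as properties rather than as hand-written HTML:

<%@ Page Language="C#" %>
<html>
  <body>
    <form runat="server">
      <!-- A Web server control: changing its Text is a property edit,
           not a hand-written HTML change. -->
      <asp:Button id="btnSubmit" Text="Submit" runat="server" />
    </form>
  </body>
</html>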

ASP.NET Speech Controls are a special form of Web server control that adds speech capability to existing Web server controls. The basic behavior of the Web server control is unchanged; new properties are simply added that enable speech. Using Web server controls and Speech Controls offers several advantages. First, the application developer can use graphical tools such as the Visual Studio .NET 2003 Integrated Development Environment (IDE): pages can be designed graphically, and Visual Studio .NET 2003 generates the associated HTML. Second, and more importantly, the resulting page is an ASP.NET page, so the full power of the ASP.NET server is available to it. This means a standard control can be used after it has been speech-enabled, and any browser can be used to access the page, because the ASP.NET server generates the correct HTML for the specific browser accessing the page. Users and customers do not need to worry about whether they have the correct browser version, and developers need to design only a single page rather than different pages for different browser capabilities.

SALT: Speech Application Language Tags

The approach that Microsoft takes to speech-enabling the Web is built around an emerging standard: Speech Application Language Tags (SALT). The SALT Forum (http://www.saltforum.org) has produced the SALT version 1.0 specification and contributed it to the World Wide Web Consortium (W3C) standards body. The speech markup used in the Microsoft Speech Application SDK is Microsoft's implementation of the SALT 1.0 specification. These tags extend HTML and XHTML with a small number of elements and objects that add speech recognition input, audio and text-to-speech playback, and dual-tone multi-frequency (DTMF) input to a Web application. (This first version does not implement all parts of the SALT 1.0 specification, and it implements some parts slightly differently.)

The Microsoft SASDK allows developers to build SALT applications using ASP.NET server-side controls. Using these controls means that the developer does not need to know the details of SALT to build a simple speech-enabled Web application. However, most developers will find it helpful to understand some of the background of SALT markup.

The following sections describe an architecture for implementing SALT applications, demonstrate how speech-enabled Web applications are built, and then provide an overview of the proposed tags.

SALT Architecture

Implementing a speech-enabled Web application using SALT can involve up to four components.

  1. A Web server. The Web server generates Web pages containing HTML, SALT, and embedded script. The script controls the dialogue flow for voice-only interactions. For example, the script defines the order for playing audio prompts to a caller, assuming there are several prompts on a page.
  2. A telephony server. Telephony Application Services connects to the telephone network. The server incorporates a voice-only SALT interpreter to interpret the HTML, SALT markup, and script. The browser can run in a separate process or thread for each caller. The voice-only SALT interpreter interprets only a subset of HTML, because much of HTML concerns the GUI and is not relevant to a voice-only interpreter.
  3. A speech server. Speech Engine Services recognizes speech and plays audio prompts and responses back to the user.
  4. The client device. Clients include, for example, a Windows Mobile-based Pocket PC or desktop PC running a version of Microsoft Internet Explorer that is capable of interpreting HTML and SALT.

(Diagram: the Web server, telephony server, speech server, and client device, and the interactions among them.)

What Is SALT?

SALT is an extension to HTML that enables developers to add a spoken dialogue interface to Web applications. Using SALT, speech applications can be written for two types of browsers. The first is a voice-only browser, in which speech is the only form of communication for the user. This browser is commonly connected to one or more telephone lines. The second type is the multimodal browser. Multimodal browsers are devices that use a graphical user interface (GUI) in combination with speech commands. Multimodal browsers can run on devices such as Pocket PCs, mobile phones, portable computers, and desktop computers.

SALT is a small set of Extensible Markup Language (XML) elements that apply a speech interface to a document using HTML. Web application developers can use SALT effectively with HTML, XHTML, cHTML, Wireless Markup Language (WML), or pages derived from any other Standard Generalized Markup Language (SGML). SALT markup also provides DTMF support for telephony browsers running voice-only applications.
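As a sketch of the DTMF side (the dtmf element and its grammar and bind children are defined by SALT 1.0, but the element IDs, the digits.grxml grammar file, and the //choice result path here are assumptions for illustration), a telephony page can collect a touch-tone selection declaratively:

<!-- Collect a touch-tone selection; digits.grxml is an assumed
     DTMF-mode grammar, and //choice an assumed result path. -->
<salt:dtmf id="dtmfMenu">
  <salt:grammar src="digits.grxml" />
  <salt:bind targetelement="txtBoxChoice" value="//choice" />
</salt:dtmf>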

There are four main top-level elements of the Microsoft SALT markup.

Tag name   Description
prompt     Configures the text-to-speech engine and plays speech output.
listen     Configures the speech recognizer, executes recognition, and handles recognition events.
dtmf       Configures and controls DTMF in telephony applications.
smex       Conducts general-purpose communication between Speech Platform components.

In addition, several other elements are child components of the four top-level elements: the grammar element, the content element, the param element, the record element, and the value element. Microsoft also provides a proprietary extension to SALT, the audiometer element, which provides a visual cue that recognition is taking place.
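For example, the value element can inject page data into spoken output. The following fragment is a minimal sketch (the confirmCity ID is illustrative, and txtBoxCity is the input element used in the examples later in this topic) of a prompt that reads back the contents of a text box:

<!-- Speak back the current contents of the txtBoxCity input element. -->
<salt:prompt id="confirmCity">
  You said
  <salt:value targetelement="txtBoxCity" targetattribute="value" />
</salt:prompt>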

Why use SALT?

Any Web developer wanting to speech-enable an application can use SALT. SALT markup is a great solution for adding speech because it can leverage the scripting and event model inherent in HTML to implement the interactive flow with the user. These are some of the benefits of using SALT markup:

  • Reuse of application logic. Because the speech interface is a thin markup layer that applies purely presentational logic, the code used for the business logic of the application can be reused across different modalities and devices.
  • Rapid development. The developer needs to learn very little extra, so mastering SALT is a rapid process. Developers can use existing Web development tools easily for the development of SALT applications.
  • Speech + GUI. The simple addition of speech capabilities to the visual page provides a way of instantly creating new multimodal (speech + GUI) applications or using existing visual applications.

How To Use SALT

Web application developers can use the Microsoft implementation of SALT in two different modes, according to the capabilities of the target browser: declarative mode and object mode.

  • Declarative mode is for browsers that do not support scripting or event capabilities. This mode employs exclusively declarative syntax, such as the bind element.
  • Object mode is for browsers that support scripting and events, such as any recent version of Microsoft Internet Explorer. Developers use SALT markup with client-side event handling and script functions for a finer level of control over the speech interactions. Each SALT element exposes a set of Document Object Model (DOM) properties, events, and methods for use in this mode. The sketch after this list contrasts the two modes.
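The following fragment is a minimal sketch of the contrast, reusing the illustrative recoCity, txtBoxCity, and city.grxml names from the examples later in this topic; the handleCity function is an assumption, not part of SALT:

<!-- Declarative mode: bind copies the result with no script. -->
<salt:listen id="recoCity">
  <salt:grammar src="city.grxml" />
  <salt:bind targetelement="txtBoxCity" value="//city" />
</salt:listen>

<!-- Object mode: a script event handler processes the result. -->
<salt:listen id="recoCity2" onreco="handleCity()">
  <salt:grammar src="city.grxml" />
</salt:listen>
<script language="JScript">
  function handleCity() {
    txtBoxCity.value = recoCity2.text;
  }
</script>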

The following two scenarios outline the use of SALT with some very simple code samples. For a more extensive description of the elements used in these examples, see the reference documentation supplied with the SASDK.

Multimodal Speech and GUI

For multimodal applications, developers can add SALT markup to a visual page, resulting in speech support for input and output. This ability speech-enables individual HTML controls for tap-and-talk form completion. Users can tap the Name control and speak their name; the control then displays the name. Developers can add more complex mixed-initiative capabilities. For example, a user can complete several fields with one utterance: saying "I live at 123 Main Street in Springfield" can place the correct information in the address and city controls simultaneously.
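Declaratively, such mixed initiative can be sketched as one listen element carrying several bind elements, one per field; the address.grxml grammar and the //address and //city result paths below are assumptions for illustration:

<!-- One utterance fills two controls; the grammar and XPath values
     are illustrative assumptions. -->
<salt:listen id="recoAddress">
  <salt:grammar src="address.grxml" />
  <salt:bind targetelement="txtBoxAddress" value="//address" />
  <salt:bind targetelement="txtBoxCity" value="//city" />
</salt:listen>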

In the following example, the application needs to obtain a city name from the user and store its value in an input element on the page. A listen element, here named recoCity, is added to an HTML page containing the input element, named txtBoxCity. The recoCity listen element holds a grammar element that refers to the list of possible cities from which the user can choose, and a bind element that copies the recognized value into the txtBoxCity element. The actual recognition starts with a browser event (in this example, a click in the text box) that activates the input audio with the relevant grammar.

<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <head>
    <object id="SpeechTags" CLASSID="clsid:DCF68E5B-84A1-4047-98A4-0A72276D19CC" VIEWASTEXT></object>
  </head>
  <body>
    <?import namespace="salt" implementation="#SpeechTags" />
    <!-- Clicking the text box starts recognition. -->
    <input name="txtBoxCity" type="text" onclick="recoCity.Start()" />

    <!-- Speech Application Language Tags -->
    <salt:listen id="recoCity">
      <salt:grammar src="city.grxml" />
      <!-- Copy the recognized city into the text box. -->
      <salt:bind targetelement="txtBoxCity" value="//city" />
    </salt:listen>
  </body>
</html>

The SALT elements are not intended to have a default visual representation on the browser. For multimodal applications, SALT authors can designate which controls are speech-enabled, perhaps with a graphic.

Telephony

In applications without a visual display, the application drives the interaction by prompting the user for required information. The HTML scripting and event model performs this function: through scripting and events, application developers have full programmatic control of client-side (or server-side) code for managing prompt playback, grammar activation, and the processing of recognition results.

The following code is an example of how a simple system-initiative dialogue (a sequence of specific questions or prompts) guides the user. The system needs two input values and asks for each one until it obtains both. The RunAsk() function activates prompts and recognitions until the values of the input fields are filled. The binding of recognition results into the relevant input fields is accomplished programmatically by the script functions procOriginCity() and procDestCity(), which are triggered by the onreco events of the relevant listen elements.

<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <head>
    <object id="SpeechTags" CLASSID="clsid:DCF68E5B-84A1-4047-98A4-0A72276D19CC" VIEWASTEXT></object>
  </head>
  <body onload="RunAsk()">
    <?import namespace="salt" implementation="#SpeechTags" />
    <input name="txtBoxOriginCity" type="text" />
    <input name="txtBoxDestCity" type="text" />

    <!-- Speech Application Language Tags -->
    <salt:prompt id="askOriginCity"> Where from? </salt:prompt>
    <salt:prompt id="askDestCity"> Where to? </salt:prompt>

    <salt:listen id="recoOriginCity" onreco="procOriginCity()">
      <salt:grammar src="city.grxml" />
    </salt:listen>

    <salt:listen id="recoDestCity" onreco="procDestCity()">
      <salt:grammar src="city.grxml" />
    </salt:listen>

    <script language="JScript">
    <!--
      // Ask for whichever input field is still empty.
      function RunAsk() {
        if (txtBoxOriginCity.value == "") {
          askOriginCity.Start();
          recoOriginCity.Start();
        } else if (txtBoxDestCity.value == "") {
          askDestCity.Start();
          recoDestCity.Start();
        }
      }
      // Bind the origin-city result, then continue the dialogue.
      function procOriginCity() {
        txtBoxOriginCity.value = recoOriginCity.text;
        RunAsk();
      }
      // Bind the destination-city result; both fields are now filled.
      function procDestCity() {
        txtBoxDestCity.value = recoDestCity.text;
      }
    -->
    </script>
  </body>
</html>

Other event handlers are available in the listen and prompt elements to manage false recognitions, user silences, and other situations requiring some form of recovery. For telephony dialogues, there is also a messaging interface for managing telephony call control.
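For example, the listen element also exposes onsilence and onnoreco events. The following fragment is a minimal sketch of a recovery strategy for the origin-city question from the example above; the retryOriginCity handler and its simple reprompt-and-listen-again behavior are assumptions for illustration:

<!-- Replay the prompt and listen again after silence or a failed
     recognition; the handler body is an illustrative sketch. -->
<salt:listen id="recoOriginCity" onreco="procOriginCity()"
             onsilence="retryOriginCity()" onnoreco="retryOriginCity()">
  <salt:grammar src="city.grxml" />
</salt:listen>
<script language="JScript">
  function retryOriginCity() {
    askOriginCity.Start();   // replay the prompt
    recoOriginCity.Start();  // listen again
  }
</script>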

See Also

SALT Applicability Cross Reference | SALT Document Type Definition (DTD)