Speech Application Language Tags (SALT)

 

Microsoft Corporation

July 2003

Applies to:

   Microsoft® Speech Application SDK

Summary: This article provides a brief overview of the SALT (Speech Application Language Tags) standard. This lightweight set of extensions to existing markup languages plays an important role in the Microsoft Speech Technologies platform, including the Microsoft Speech Application SDK and the Microsoft Speech Server. (3 printed pages)

Introduction

SALT (Speech Application Language Tags) is an extension of HTML and other markup languages (cHTML, XHTML, WML) that adds a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model. These tags are designed to be used for both voice-only browsers (for example, a browser accessed over the telephone) and multimodal browsers.

Multimodal access will enable users to interact with an application in a variety of ways: they will be able to input data using speech, a keyboard, keypad, mouse, and/or stylus, and receive output as synthesized speech, audio, plain text, motion video, and/or graphics. Each of these modes can be used independently or concurrently.

The full specification for SALT is currently being developed by the SALT Forum, an open industry initiative committed to developing a royalty-free, platform-independent standard that will make possible multimodal and telephony-enabled access to information, applications, and Web services from PCs, telephones, tablet PCs, and wireless personal digital assistants (PDAs).

For full details on SALT, please visit the SALT Forum site and download the current version of the SALT specification.

What SALT Is

The following is an excerpt from the SALT 1.0 specification, a working draft. To get a full copy of the draft, please visit the SALT Forum site.

SALT (Speech Application Language Tags) is a small set of XML elements, with associated attributes and DOM object properties, events, and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document, so SALT can be used equally effectively within HTML and all its flavors, or with WML, or with any other SGML-derived markup.

Table 1: The top-level elements of SALT

Element         Purpose
<prompt ...>    for speech synthesis configuration and prompt playing
<listen ...>    for speech recognizer configuration, recognition execution and post-processing, and recording
<dtmf ...>      for configuration and control of DTMF collection
<smex ...>      for general-purpose communication with platform components

The input elements <listen> and <dtmf> also contain grammars and binding controls:

Table 2: Grammars and binding controls contained within <listen> and <dtmf>

Element         Purpose
<grammar ...>   for specifying input grammar resources
<bind ...>      for processing of recognition results

The <listen> element also contains the facility to record audio input:

Table 3: Recording element contained within <listen>

Element         Purpose
<record ...>    for recording audio input

A call control object is also provided for control of telephony functionality.

There are several advantages to using SALT with a mature display language such as HTML. Most notably:

  1. The event and scripting models supported by visual browsers can be used by SALT applications to implement dialog flow and other forms of interaction processing without the need for extra markup.
  2. The addition of speech capabilities to the visual page provides a simple and intuitive means of creating multimodal applications.

In this way, SALT is a lightweight specification which adds a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model.

SALT also provides DTMF and call control capabilities for telephony browsers running voice-only applications through a set of DOM object properties, methods, and events.
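
As a rough illustration of the DTMF support, the following fragment sketches how a <dtmf> element might collect key presses and copy the result into a form field. It is a sketch under stated assumptions, not an excerpt from the specification: the salt namespace prefix and URI follow the form commonly seen in SALT examples, and the element ids, the grammar file pin.grxml, and the //pin result path are hypothetical names chosen for illustration.

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body onload="dtmfPin.Start()">
  <form id="loginForm">
    <!-- Field that receives the collected digits (hypothetical name) -->
    <input name="txtPin" type="text" />
  </form>

  <!-- DTMF collection: the grammar constrains the accepted key sequence,
       and the bind copies the matching result node into txtPin.
       The ids, grammar file, and result path are assumptions. -->
  <salt:dtmf id="dtmfPin">
    <salt:grammar src="./pin.grxml" />
    <salt:bind targetelement="txtPin" value="//pin" />
  </salt:dtmf>
</body>
</html>
```

The shape mirrors Table 2: the <dtmf> element contains a <grammar> describing the expected input and a <bind> that processes the result.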

How SALT Works

Two major scenarios for the use of SALT are outlined below. These two scenarios are excerpted from the SALT 1.0 specification, a working draft. To get a full copy of the draft, please visit the SALT Forum site.

Multimodal

For multimodal applications, SALT can be added to a visual page to support speech input and/or output. This is a way to speech-enable individual controls for "push-to-talk" form-filling scenarios, or to add more complex mixed-initiative capabilities if necessary.

A SALT recognition may be started by a browser event, such as pen-down on a textbox, which activates a grammar relevant to the textbox and binds the recognition result into the textbox.
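
The pen-down scenario just described can be sketched in markup roughly as follows. This is a minimal illustration rather than a definitive page: the element names txtBoxCity and listenCity, the grammar file city.grxml, and the //city result path are hypothetical, and the onpendown event is assumed to be exposed by a pen-enabled browser.

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body>
  <!-- Tapping the textbox starts recognition for that field.
       onpendown is assumed to be available in the host browser. -->
  <input name="txtBoxCity" type="text" onpendown="listenCity.Start()" />

  <!-- The listen element activates a grammar relevant to the textbox
       and binds the recognized city into it. Names are hypothetical. -->
  <salt:listen id="listenCity">
    <salt:grammar src="./city.grxml" />
    <salt:bind targetelement="txtBoxCity" value="//city" />
  </salt:listen>
</body>
</html>
```

Because the recognition is attached to an ordinary browser event, the rest of the page needs no changes; the same pattern could be repeated for each control that should accept speech.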

Telephony

For applications without a visual display, SALT manages the interactional flow of the dialog and the extent of user initiative by using the HTML eventing and scripting model. In this way, the full programmatic control of client-side (or server-side) code is available to application authors for the management of prompt playing and grammar activation.
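
A voice-only dialog of this kind might look like the sketch below, in which script decides which prompt to play and which grammar to activate next. It is modeled on the general style of examples in the SALT specification, but the form, element ids, function names, and grammar file are hypothetical, and the script assumes a browser that exposes elements as script objects by id.

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body onload="RunAsk()">
  <form id="travelForm">
    <input name="txtBoxOriginCity" type="text" />
    <input name="txtBoxDestCity" type="text" />
  </form>

  <!-- Prompts played to the caller -->
  <salt:prompt id="askOriginCity">Where would you like to leave from?</salt:prompt>
  <salt:prompt id="askDestCity">Where would you like to go to?</salt:prompt>

  <!-- One recognizer per question; each calls a handler on success -->
  <salt:listen id="recoOriginCity" onreco="procOriginCity()">
    <salt:grammar src="./city.grxml" />
  </salt:listen>
  <salt:listen id="recoDestCity" onreco="procDestCity()">
    <salt:grammar src="./city.grxml" />
  </salt:listen>

  <script type="text/javascript">
    // Dialog flow: ask about the first field that is still empty.
    // When both fields are filled, no further question is asked.
    function RunAsk() {
      if (travelForm.txtBoxOriginCity.value == "") {
        askOriginCity.Start();
        recoOriginCity.Start();
      } else if (travelForm.txtBoxDestCity.value == "") {
        askDestCity.Start();
        recoDestCity.Start();
      }
    }

    // Store the recognized text, then continue the dialog.
    function procOriginCity() {
      travelForm.txtBoxOriginCity.value = recoOriginCity.text;
      RunAsk();
    }

    function procDestCity() {
      travelForm.txtBoxDestCity.value = recoDestCity.text;
      RunAsk();
    }
  </script>
</body>
</html>
```

Here the dialog flow lives entirely in ordinary page script: the onload event starts the first question, and each onreco handler fills a field and calls RunAsk() again, so no extra dialog markup is needed.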

For More Information

For full details, please visit the SALT Forum site and download the current version of the SALT specification.