Building VoiceXML Applications

Now that you have a solid understanding of the VoiceXML architecture, it is time to look more closely at the VoiceXML language itself. If you have experience with other XML-based markup languages, such as WML or XHTML, or even other tag-based languages, such as HTML or HDML, creating VoiceXML applications will not pose much difficulty for you. The main adjustment will be to the development of the user interface. Rather than being sent to a screen, content is spoken over a telephone. For this reason, it is very important to create clean, intuitive applications so users do not become frustrated and hang up. VoiceXML includes elements to help with this, as we will discuss later in this section.


We are not going to go into depth on every aspect of the VoiceXML language. This type of information can be readily found at many of the sites outlined in the Helpful Links section at the end of this chapter. Instead, we will look at the general language concepts of VoiceXML, then examine code in an example application.

Language Concepts

Before examining some VoiceXML code, let's take a quick look at the basic concepts behind a VoiceXML application.


Once a user connects to the VoiceXML gateway, a session is started. This session is maintained as new VoiceXML documents are loaded and unloaded. The session ends only when requested by the user, the VoiceXML document, or the voice gateway. Each platform will have default session characteristics, many of which can be controlled by VoiceXML logic.


VoiceXML applications are constructed of one or more dialogs. Each dialog represents some form of conversational state with the user. After completing one dialog, you move on to another dialog, until the application is complete. There are two types of dialogs: forms and menus. A form collects user input, and a menu gives the user options to choose from. If at any point no next dialog is specified, the VoiceXML application terminates automatically.

VoiceXML also has subdialogs. These are very similar to function calls, allowing the application to call out to a new dialog, then return to the original form. All of the variables, grammars, and state information are available again upon returning to the calling document. Subdialogs can be used to create a set of components that may be used from several applications.
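
The function-call analogy can be sketched as follows. This is a minimal illustration rather than part of the chapter's sample application: the dialog names main and confirm, the parameter message, and the returned variable answer are all hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Calling form: invokes the reusable dialog much like a function call -->
  <form id="main">
    <subdialog name="result" src="#confirm">
      <!-- Argument passed to the subdialog; it must be declared there with <var> -->
      <param name="message" expr="'Do you want to continue?'"/>
      <filled>
        <!-- Returned variables are exposed as properties of the subdialog's name -->
        <if cond="result.answer">
          <prompt>Continuing.</prompt>
        <else/>
          <prompt>Stopping.</prompt>
        </if>
      </filled>
    </subdialog>
  </form>

  <!-- Reusable dialog (hypothetical): asks a yes/no question and returns the answer -->
  <form id="confirm">
    <var name="message"/>
    <field name="answer" type="boolean">
      <prompt><value expr="message"/></prompt>
      <filled>
        <return namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
```

Because the subdialog runs in its own context, the caller sees only what is listed in the namelist of the return element.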


A VoiceXML application is a set of documents that share the same root document. This root document is automatically loaded any time a user interacts with any document in the application, and remains loaded until the user transitions to a document outside of the application. While it is loaded, the root document's variables are available to the other documents. It is also possible to specify grammars to be active for the duration of the application.
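
As a rough sketch of how a root document is shared, the fragment below shows two separate files together. The file names app-root.vxml and entry.vxml and the variable company are assumptions made for illustration only.

```xml
<!-- app-root.vxml (hypothetical file name): the application root document -->
<vxml version="2.0">
  <!-- Declared at document scope here; other documents in the
       application see it as application.company -->
  <var name="company" expr="'Example Consulting'"/>
</vxml>

<!-- entry.vxml (hypothetical file name): names its root document
     through the application attribute -->
<vxml version="2.0" application="app-root.vxml">
  <form>
    <block>
      <prompt>Welcome to <value expr="application.company"/>.</prompt>
    </block>
  </form>
</vxml>
```

Every document that names the same root in its application attribute belongs to the same application and shares these variables.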


Grammars make it possible to specify valid inputs from the user. Each dialog will contain at least one speech and/or DTMF grammar. In simple applications, only the dialog's grammars are active for that dialog. In the more complex, mixed-initiative applications, it is possible to have active grammars outside of the dialog being executed. Mixed-initiative refers to applications in which both the user and the gateway determine what will happen next.

Grammar creation is a very important aspect of designing intuitive, robust VoiceXML applications. It is essential to create grammars that accurately reflect the typical speech inputs from a user. If this is not achieved, either the prompts or the grammar will have to be changed. In some cases, the built-in grammars may be sufficient. VoiceXML 2.0 has the following built-in grammars: boolean, date, digits, currency, number, phone, and time. In addition, many of the voice gateway vendors have proprietary grammars that you can use.
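
For input that the built-in grammars do not cover, a grammar can be written inline. The sketch below assumes the XML form of the W3C speech recognition grammar format and uses a hypothetical field and rule named jobtype; how the recognized phrase is mapped to the field value can vary by platform.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="jobtype">
      <prompt>What is the job type for your entry?</prompt>
      <!-- Inline speech grammar: only these phrases are accepted -->
      <grammar mode="voice" xml:lang="en-US" version="1.0" root="jobtype">
        <rule id="jobtype" scope="public">
          <one-of>
            <item>design</item>
            <item>development</item>
            <item>meeting</item>
          </one-of>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```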


When the normal execution of an application is interrupted, an event is thrown. In most cases, events are used when the user fails to respond to a prompt, or when the response is not suitable. They are also used when a user requests help or wants to exit the application. When an event is triggered, the <catch> element allows you to specify what the reaction should be. If there is no handler at the dialog level, the event can be caught at a higher level, since events follow an inheritance model.
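
The inheritance of handlers can be illustrated with a short sketch. The prompts and the field name hours are illustrative; the point is that the field handles noinput and nomatch itself, while a help request falls through to the handler declared at the document level.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Document-level handler: used by any dialog that does not catch help itself -->
  <catch event="help">
    Use the keypad or your voice to enter a number of hours.
    <reprompt/>
  </catch>

  <form>
    <field name="hours" type="digits">
      <prompt>How many hours did you work?</prompt>
      <!-- Field-level handlers are found first for these events -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch count="1">Sorry, I did not understand. <reprompt/></nomatch>
      <nomatch count="2">Please say or key in the digits only.</nomatch>
    </field>
  </form>
</vxml>
```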


Links enable you to create mixed-initiative applications. They specify a grammar that is active when the user is within scope of the link. When the input matches the grammar, the user is redirected to the specified destination URI.
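
A minimal sketch of a link follows, assuming hypothetical phrases ("main menu" and "start over") and dialog names. Because the link is a child of the vxml element, its grammar is active in every dialog of the document.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- Document-scope link: saying either phrase jumps to the options menu -->
  <link next="#options">
    <grammar mode="voice" xml:lang="en-US" version="1.0" root="mainmenu">
      <rule id="mainmenu" scope="public">
        <one-of>
          <item>main menu</item>
          <item>start over</item>
        </one-of>
      </rule>
    </grammar>
  </link>

  <menu id="options">
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="#entry">add entry</choice>
  </menu>

  <form id="entry">
    <field name="hours" type="digits">
      <prompt>How many hours?</prompt>
    </field>
  </form>
</vxml>
```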


If you require additional control over an application, which is not provided by standard VoiceXML elements, you can use scripting in the form of ECMAScript. This allows you to do such things as collect values of several fields in a single response.
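
As a simple illustration of scripting, the sketch below defines a hypothetical ECMAScript helper, hoursToMinutes, and calls it from an expression once a field has been filled. The function and variable names are assumptions, not part of any standard library.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <script>
    <![CDATA[
      // Hypothetical helper: convert a digit string such as "17" to minutes
      function hoursToMinutes(hours) {
        return Number(hours) * 60;
      }
    ]]>
  </script>

  <form>
    <field name="hours" type="digits">
      <prompt>How many hours did you work?</prompt>
      <filled>
        <!-- Any ECMAScript expression can appear in an expr attribute -->
        <var name="minutes" expr="hoursToMinutes(hours)"/>
        <prompt>That is <value expr="minutes"/> minutes.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```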

Application Example

Listing 15.1 contains a relatively straightforward VoiceXML application. It demonstrates several of the concepts discussed in the previous section. This application allows a user to input data to a time-tracking system. Consultants who need to keep track of their hours as they work on customer projects would use such an application.

Listing 15.1: Sample VoiceXML code for a time entry system.
1.  <?xml version="1.0"?>
2.  <vxml version="2.0">
3.  <meta name="author" content="Martyn Mallick"/>
4.  <form>
5.    <block>
6.      Welcome to the voice time entry system.
7.      <goto next="#options"/>
8.    </block>
9.  </form>
10. <!-- allow user to choose one of three options -->
11. <menu id="options" dtmf="true">
12.   <prompt> What would you like to do? Say one of: <enumerate/> </prompt>
13.   <choice next="#entry">add entry</choice>
14.   <choice next="">delete entry</choice>
15.   <choice next="">list entries</choice>
16.   <noinput count="1"> <reprompt/> </noinput>
17.   <noinput count="2"> Please state what action you would like: <enumerate/> </noinput>
18. </menu>
19. <!-- collect data for new time entry -->
20. <form id="entry">
21.   <catch event="nomatch noinput" count="3">
22.     <prompt> Sorry, too many attempts. Please try again later. Goodbye. </prompt>
23.     <throw event="telephone.disconnect.hangup"/>
24.   </catch>
25.   <field name="jobtype">
26.     <prompt>What is the job type for your entry? </prompt>
27.       <option>design</option>
28.       <option>development</option>
29.       <option>meeting</option>
30.       <option>travel</option>
31.       <option>vacation</option>
32.     <help>You must enter a valid job type to continue. Your options
        are design, development, meeting, travel, and vacation. </help>
33.   </field>
34.   <field name="hours" type="digits">
35.     <prompt> How many hours for job <value expr="jobtype"/>? </prompt>
36.     <help> Use the keypad to enter the number of hours worked. </help>
37.   </field>
38.   <field name="proceed" type="boolean">
39.     <prompt>Do you want to proceed with the entry for <value
        expr="hours"/> hours for job type <value expr="jobtype"/>? </prompt>
40.     <filled>
41.       <if cond="proceed">
42.         <prompt bargein="false">
43.            Your entry is being entered into the time system.
44.         </prompt>
45.         <!-- submit time entry to servlet for entry into database -->
46.         <submit next="/servlet/entry" namelist="jobtype hours"/>
47.       </if>
48.       <clear namelist="jobtype hours proceed"/>
49.       <goto next="#options"/>
50.     </filled>
51.   </field>
52. </form>
53. </vxml>

This code is for demonstration purposes only; it is by no means a complete application.

The first line of the document is the XML declaration, which gives the XML version number:

<?xml version="1.0"?>

This line is followed by the opening tag of the VoiceXML document, <vxml>, which includes the VoiceXML version number. The current version of VoiceXML is 2.0. The rest of the document is enclosed between the <vxml> and </vxml> tags. In our application, we have two form dialogs and one menu dialog. The first dialog in our application (lines 4 to 9) simply outputs a welcome message to the user, then transfers control to the dialog with id="options". The greeting is contained within a <block> element. In this case, the text is spoken using text-to-speech, although it is also possible to specify an audio file to be played.
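
A prerecorded greeting could be sketched as follows; welcome.wav is a hypothetical file name. The text inside the <audio> element serves as a fallback that is spoken with text-to-speech if the file cannot be played.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      <prompt>
        <!-- welcome.wav is an assumed recording; the enclosed text is the fallback -->
        <audio src="welcome.wav">Welcome to the voice time entry system.</audio>
      </prompt>
    </block>
  </form>
</vxml>
```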

Line 10 is a comment. As with any application, properly commenting your code will make it much easier for other developers to understand. The second type of dialog, a menu, starts on line 11. The menu gives the user a choice of actions, which are listed on lines 13, 14, and 15. Each choice has a next attribute that specifies the location of the dialog or document that will be executed if that choice is selected. Between the <choice> and </choice> tags is the text to which the user response will be compared. For example, if the user says "add entry," the entry form within this document will be executed. If the user says "delete entry," the delete.vxml document at the specified URI will be executed. Because the menu sets dtmf="true", the choices are also assigned the keypad digits 1, 2, and 3, in order. One helpful tag is <enumerate/>. Here it is contained within the <prompt> element on line 12. When it is reached, each of the options within the menu will be spoken to the user.

Lines 16 and 17 contain <noinput> tags. These are executed if the user does not provide any input when prompted by the menu. The first time this happens, the choice selection will be repeated, as specified on line 16 using the <reprompt/> tag. The second time there is no input, another prompt will be spoken to the user. Being able to change the prompts can prove helpful for rewording the request in case the user was unclear about what was expected.

Lines 19 to 52 contain the entry form. If "add entry" was selected in the menu, this is the location that is executed. The <form> element has an id of entry, which is used to move between forms. On lines 21 to 24 is a <catch> element that will be executed when a nomatch or noinput occurs for the third time. If executed, the <prompt> on line 22 will be spoken, followed by line 23 containing a throw of the predefined event that will hang up the telephone, thereby ending the session. Since events use an inheritance model, the <catch> element can be executed for any field within this form.

On lines 25 to 33 is the first <field></field> element. The purpose of these tags is to obtain user input. The name specified on line 25 is the variable that will contain the selected option. So, for this example, the variable jobtype will contain design, development, meeting, travel, or vacation, as specified in the <option> tags on lines 27 to 31. In this example, rather than defining a grammar with five entries, we have used the <option> tag. It is a suitable replacement for a grammar in cases where there are only a few choices. When there is a larger set of choices, a grammar may be more suitable. When using a grammar, you can choose to use an inline grammar or point to an external file that contains the grammar. Again, the decision often comes down to the size of the grammar itself.
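
Pointing to an external grammar file might look like the sketch below. The file name jobtypes.grxml is hypothetical, and the media type shown assumes the grammar is written in the XML form of the W3C grammar format.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="entry">
    <field name="jobtype">
      <prompt>What is the job type for your entry?</prompt>
      <!-- External grammar (assumed file): replaces the <option> list -->
      <grammar src="jobtypes.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```

Keeping a large grammar in its own file also lets several documents share it.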

Another tag that we see for the first time is the <help> tag. The contents within the <help> and </help> tags are executed at any time during the field when the user says the word "help." The <help> tag is shorthand for <catch event="help">.
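
The equivalence can be seen by writing the same handler both ways. The fragment below is meant to sit inside a <field>, using one form or the other, not both; the message text is shortened from the listing.

```xml
<!-- Shorthand form -->
<help>Your options are design, development, meeting, travel, and vacation.</help>

<!-- Equivalent long form -->
<catch event="help">Your options are design, development, meeting, travel, and vacation.</catch>
```

The <noinput> and <nomatch> tags are shorthand for <catch event="noinput"> and <catch event="nomatch"> in the same way.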

Since the jobtype field did not specify an action, the next element in the dialog is executed. In this case, that is the <field> on lines 34 to 37. In addition to setting the name attribute, we also specify that we want to use the built-in grammar of type="digits". This allows the user to enter any sequence of digits, either using the keypad or speaking in response to the prompt. The <prompt> on line 35 asks the user to enter the number of hours for the jobtype that was specified in the previous field. We are able to access the jobtype variable using the <value expr="jobtype"/> tag.

The final field on this form asks the user to confirm his or her entry before it is sent to a servlet for input into a database. On line 38 the <field> again uses a built-in grammar, this time of type="boolean". This specifies that the input can either be yes or no (or a similar variant such as yeah or nah). The variable used to store the user input is named proceed.

We again use the <value> tag to access variable data in the <prompt>, allowing the user to confirm whether the speech recognition engine heard the inputs correctly. Unlike the previous fields, on line 40 we now use the <filled> tag to define the actions to take once the data is correctly entered. The <filled> element is commonly used to specify an action to perform when some combination of fields is filled by user input. In our case, it is when the user inputs either yes or no in response to the prompt.

If the response was yes, proceed is set to true, making the <if> statement from lines 41 to 47 execute. When the time entry is confirmed, a <prompt> tells the user that his or her entry is being entered into the time system. We specified the bargein="false" attribute in the <prompt> element. This ensures that the entire <prompt> is read before the user can interrupt it. This is useful when it is important that the user hears certain information, such as a licensing agreement or confirmation number. On line 46, the data entries are sent to a servlet for processing.

If the response to the <prompt> was no, then the <if> condition would not be met, so the next line of code executed would be line 48. On this line we clear all of the variables before going back to the options menu using the <goto> tag on line 49. From lines 50 to 53 the open tags are closed, ending with the </vxml> tag to complete our VoiceXML document.


The best resource for detailed information on VoiceXML 2.0 is the VoiceXML 2.0 specification.

Now that we have examined the VoiceXML code, let's "listen" to a typical conversation using the code in Listing 15.1. In the following dialog, "Gateway" represents the output from the voice gateway and "User" represents the input by the user of the application.

  • Gateway: Welcome to the voice time entry system.

  • Gateway: What would you like to do? Say one of "add entry, delete entry, list entries."

  • User: (no input)

  • Gateway: What would you like to do? Say one of "add entry, delete entry, list entries."

  • User: (no input)

  • Gateway: Please state what action you would like: add entry, delete entry, list entries.

  • User: Add entry (or presses 1 on keypad).

  • Gateway: What is the job type for your entry?

  • User: Help.

  • Gateway: You must enter a valid job type to continue. Your options are design, development, meeting, travel, and vacation.

  • Gateway: What is the job type for your entry?

  • User: Sleep.

  • Gateway: I'm sorry, I didn't get that. (this is a platform-specific response)

  • Gateway: What is the job type for your entry?

  • User: Design.

  • Gateway: How many hours for job design?

  • User: One seven (or enters 17 on keypad).

  • Gateway: Do you want to proceed with the entry for 17 hours for job type design?

  • User: Yes.

  • Gateway: Your entry is being entered into the time system.

  • Gateway: What would you like to do? Say one of "add entry, delete entry, list entries."

  • User: Goodbye (on most voice platforms exits the application).