17.1 PC Audio Types

Sound cards support two categories of audio, which are detailed in the following sections:

Waveform audio: Waveform audio files, also called simply sound files, store actual audio data. When you record waveform audio, the sound card encodes the analog audio data in digital format and stores it as a file. When you play waveform audio, the sound card reads the digital audio data contained in the file and converts it to analog audio, which is then reproduced on speakers or headphones. Waveform audio files can store any type of audio, including speech, singing, instrumental music, and sound effects. The playback quality of waveform audio depends primarily on how much detail was captured in the original recording and how much of that data, if any, was lost when compressing the data before storing it on disk. Uncompressed waveform audio files (such as .WAV files) are large, requiring as much as 10 MB per minute of audio stored. Compressed audio files may be 1/20 that size or smaller, although high compression generally results in lower sound quality.
MIDI audio: Rather than storing actual audio data, Musical Instrument Digital Interface (MIDI) files store instructions that a sound card can use to create audio on the fly. MIDI audio files store only instrumental music and sound effects, not speech or singing. Originally used almost solely by professional musicians, MIDI is now commonly used by games and other applications for background music and sound effects, so MIDI support has become an important sound card issue. Because MIDI sound is created synthetically by the sound card, playback quality of MIDI files depends both on the quality of the MIDI file itself and on the features and quality of the MIDI support in the sound card. A MIDI file of an orchestral concert, for example, may sound like a child's toy when played by a cheap sound card, but may closely resemble a CD recording when played by a high-end sound card. MIDI audio files are small, requiring only a few KB per minute of audio stored.

17.1.1 Waveform Audio

Waveform audio files are created using a process called sampling or digitizing to convert analog sound to digital format. Sampling takes periodic snapshots, or samples, of the instantaneous state of the analog signal, encodes the data, and stores the audio in digital form. Just as digital images can be stored at different resolutions according to their intended use, audio data can be stored at different resolutions to trade off sound quality against file size. Five parameters determine the quality of digital sound files and how much space they occupy:

Sample size

Sample size specifies how much data is stored for each sample. A larger sample size stores more information about each sample, contributing to higher sound quality. Sample size is specified as the number of bits stored for each sample. CD audio, for example, uses 16-bit samples, which allow the waveform amplitude to be specified as one of 65,536 discrete values. All sound cards support at least 16-bit samples.
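
The relationship between sample size and the number of representable amplitude values is simple powers-of-two arithmetic. The following minimal sketch illustrates it; the function name is ours, chosen only for illustration.

```python
def quantization_levels(bits_per_sample: int) -> int:
    """Return how many discrete amplitude values a sample of this size can encode."""
    return 2 ** bits_per_sample


for bits in (8, 16):
    # 16-bit samples give 65,536 levels, matching the CD audio figure above
    print(f"{bits}-bit samples: {quantization_levels(bits):,} levels")
```

Each additional bit doubles the number of levels, which is why moving from 8-bit to 16-bit samples improves amplitude resolution by a factor of 256.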

Sampling rate

Sampling rate specifies how often samples are taken. Sampling rate is specified in Hz (Hertz, or cycles/second) or kHz (kilohertz, 1000 Hertz). Higher-frequency data inherently changes more often. Changes that occur between samples are lost, so the sampling rate determines the highest-frequency sounds that can be sampled. Two samples are required to capture a change, so the highest frequency that can be sampled, called the Nyquist frequency, is half the sampling rate. For example, a 10,000 Hz sampling rate captures sounds no higher than 5,000 Hz. In practice, the danger is that higher frequencies will be improperly sampled, leading to distortion, so real-world implementations filter the analog signal to cut off audio frequencies higher than some arbitrary fraction of the Nyquist frequency (for example, by filtering all frequencies higher than 4,500 Hz when using a 10,000 Hz sampling rate). CD audio uses a 44,100 Hz sampling rate, which provides a Nyquist frequency of 22,050 Hz, allowing full-bandwidth response up to ~20,000 Hz after filtering. All sound cards support at least 44,100 Hz sampling, and many support the Digital Audio Tape (DAT) standard of 48,000 Hz.
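
The Nyquist arithmetic described above can be sketched in a few lines. The function names and the 0.9 cutoff fraction are assumptions made for this illustration (0.9 was chosen because it reproduces the 4,500 Hz example); real hardware uses whatever fraction its filter design dictates.

```python
def nyquist_hz(sampling_rate_hz: float) -> float:
    """Highest frequency that can be sampled: half the sampling rate."""
    return sampling_rate_hz / 2


def filter_cutoff_hz(sampling_rate_hz: float, fraction: float = 0.9) -> float:
    """Cutoff for the input filter, as a fraction of the Nyquist frequency.

    The default fraction is a hypothetical value, not a standard.
    """
    return nyquist_hz(sampling_rate_hz) * fraction


for rate in (10_000, 44_100, 48_000):
    print(f"{rate} Hz sampling: Nyquist {nyquist_hz(rate):.0f} Hz, "
          f"filter cutoff about {filter_cutoff_hz(rate):.0f} Hz")
```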

Sampling method

Sampling method specifies how samples are taken and encoded. For example, Windows WAV files use either Pulse Code Modulation (PCM), a linear method that encodes the absolute value of each sample as an 8-bit or 16-bit value, or Adaptive Differential PCM (ADPCM), which encodes 4-bit samples based on the differences (delta) between one sample and the preceding sample. ADPCM generates smaller files, but at the expense of reduced audio quality and the increased processor overhead needed to encode and decode the data.
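
To make the idea of difference-based encoding concrete, here is a minimal sketch of plain delta encoding. It is not ADPCM: a real ADPCM codec also adapts its step size and squeezes each difference into 4 bits, which is where the quality loss comes from. This simplified, lossless version only shows why differences between neighboring samples tend to be small numbers. The function names and sample values are invented for the example.

```python
def delta_encode(samples: list[int]) -> list[int]:
    """Keep the first sample, then store each sample's difference from its predecessor."""
    if not samples:
        return []
    deltas = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas: list[int]) -> list[int]:
    """Rebuild the original samples by accumulating the differences."""
    samples = []
    running_value = 0
    for delta in deltas:
        running_value += delta
        samples.append(running_value)
    return samples


pcm = [1000, 1003, 1007, 1006, 1002]   # hypothetical 16-bit PCM sample values
encoded = delta_encode(pcm)
print(encoded)                         # [1000, 3, 4, -1, -4]
print(delta_decode(encoded) == pcm)    # True
```

After the first value, every entry fits in far fewer bits than the original 16-bit samples, which is the property ADPCM exploits.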

Recording format

Recording format specifies how data is structured and encoded within the file and what means of compression, if any, is used. Common formats, indicated by filename extensions, include WAV (Windows audio); AU (Sun audio format, commonly used by Unix systems and on the Internet); AIFF or AIF (Audio Interchange File Format, used by Apple and SGI); RA (RealAudio, a proprietary streaming audio format); MP3 (MPEG-1 Layer 3); and OGG (Ogg Vorbis). Some formats use lossless compression, which provides lower compression ratios, but allows all the original data to be recovered. Others use lossy compression, which sacrifices some less-important data in order to produce the smallest possible file sizes. Some, such as PCM WAV, do not compress the data at all.

Compressed formats, such as MP3 and OGG, may use fixed bitrate (FBR) compression (also called constant bitrate [CBR] compression), variable bitrate (VBR) compression, or both (although not at the same time). FBR compresses each second of source material to the same amount of disk space, regardless of the contents of that material. For example, after FBR compression, 10 seconds of silence occupies the same amount of disk space as 10 seconds of complex chamber music. VBR dynamically varies compression ratio according to the complexity of the material being compressed. For example, after VBR compression, 10 seconds of silence may occupy only a few bytes of disk space, while 10 seconds of chamber music may occupy many kilobytes. VBR typically provides better sound quality than FBR for a given file size because VBR uses space more efficiently.

Either compression type may use selectable compression ratios or a fixed ratio. For example, standard MP3 uses FBR compression, but most MP3 compressors allow you to select among various fixed bitrates, typically from 64 kilobits/second (kb/s) to 320 kb/s. FBR compression is exact. If you compress 100 seconds of audio at 256 kb/s, the resulting file always occupies 25,600 kilobits. Conversely, VBR compression is approximate because compression varies with the complexity of the source material. If you use VBR to compress 100 seconds of audio at a nominal 256 kb/s, the resulting file will probably occupy about 25,600 kilobits, but can be larger or smaller depending on how easily the source material could be compressed.
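
The fixed-bitrate arithmetic above can be checked with a short sketch. It assumes that one kilobit is 1,000 bits and ignores file headers and tags, so real files will be slightly larger. The function names are ours.

```python
def fbr_size_kilobits(duration_seconds: int, bitrate_kbps: int) -> int:
    """Size of fixed-bitrate audio data: duration multiplied by bitrate."""
    return duration_seconds * bitrate_kbps


def kilobits_to_bytes(kilobits: int) -> int:
    """Convert kilobits to bytes, assuming 1 kilobit = 1,000 bits."""
    return kilobits * 1000 // 8


size_kb = fbr_size_kilobits(100, 256)
print(f"{size_kb:,} kilobits = {kilobits_to_bytes(size_kb):,} bytes")
# 25,600 kilobits = 3,200,000 bytes
```

For VBR, the same calculation gives only an estimate, because the actual size depends on how easily the source material compresses.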

Some VBR applications use an arbitrary number to specify compression ratio. For example, Ogg Vorbis allows you to specify quality on a scale of 0 through 10, where 0-quality is roughly equivalent to 64 kb/s FBR, 5-quality to 160 kb/s FBR, and 10-quality to 400 kb/s FBR.

The large size of uncompressed audio files means that most common waveform audio formats use some form of compression. An algorithm used to compress and decompress digital audio data is called a codec, short for coder/decoder. Windows 98, for example, includes the following codecs: CCITT G.711 A-Law and μ-Law; DSP Group TrueSpeech; GSM 6.10; IMA ADPCM; Microsoft ADPCM; and Microsoft PCM (which is technically not a codec). You needn't worry about which codec to use when playing audio; the player application automatically selects the codec appropriate for the file being played. When you record audio, the application you use allows you to select from the codecs supported by the format you choose.

Number of channels

Depending on the recording setup, one channel (monaural or mono sound), two channels (stereo sound), or more can be recorded. Additional channels provide audio separation, which increases the realism of the sound during playback. Various formats store one, two, four, or five audio channels. Some formats store only two channels, but with additional data that can be used to simulate additional channels.

Table 17-1 lists the three standard Windows recording modes for PCM WAV, which is the most common uncompressed waveform audio format, and representative MP3 and OGG modes. MP3 at 256 kb/s uses roughly 50% more storage than Windows' AM radio mode, but produces sound files that are nearly CD quality. OGG-3 produces files that average about 17.5% smaller than 128 kb/s MP3 files, but provide superior sound quality. OGG-5 produces files that average about 40% smaller than 256 kb/s MP3 files, but provide comparable sound quality. OGG-10 produces files that average a little less than one-third the size of uncompressed .WAV files, but provide sound quality that to our ears is indistinguishable from the original CD audio, even when played on a high-quality home audio system. MP3 and OGG bitrates are approximate.

Table 17-1. Uncompressed WAV modes versus compressed MP3 and OGG modes
Quality	Sample Size	Sampling Rate	Channels	Bytes/min	Compression
Telephone	8-bit	11,025 Hz	1 (mono)	661,500	PCM (1:1)
AM radio	8-bit	22,050 Hz	1 (mono)	1,323,000	PCM (1:1)
CD audio	16-bit	44,100 Hz	2 (stereo)	10,584,000	PCM (1:1)
MP3 (64 kb/s)	16-bit	44,100 Hz	2 (stereo)	~500,000	MP3 (~20:1)
MP3 (128 kb/s)	16-bit	44,100 Hz	2 (stereo)	~1,000,000	MP3 (~10:1)
MP3 (256 kb/s)	16-bit	44,100 Hz	2 (stereo)	~2,000,000	MP3 (~5:1)
OGG-0	16-bit	44,100 Hz	2 (stereo)	~500,000	OGG (~20:1)
OGG-3	16-bit	44,100 Hz	2 (stereo)	~825,000	OGG (~12:1)
OGG-5	16-bit	44,100 Hz	2 (stereo)	~1,200,000	OGG (~8:1)
OGG-10	16-bit	44,100 Hz	2 (stereo)	~3,000,000	OGG (~3:1)
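
The uncompressed figures in Table 17-1 follow directly from sample size, sampling rate, and channel count. The sketch below reproduces them and, for comparison, computes the nominal storage for a compressed stream from its bitrate. It assumes 1 kb/s means 1,000 bits per second and ignores file headers. The function names are ours.

```python
def pcm_bytes_per_minute(bits_per_sample: int, sampling_rate_hz: int, channels: int) -> int:
    """Storage for one minute of uncompressed PCM audio."""
    bytes_per_sample = bits_per_sample // 8
    return bytes_per_sample * sampling_rate_hz * channels * 60


def compressed_bytes_per_minute(bitrate_kbps: int) -> int:
    """Nominal storage for one minute of audio at a given bitrate (1 kb = 1,000 bits)."""
    return bitrate_kbps * 1000 // 8 * 60


modes = {
    "Telephone": (8, 11_025, 1),
    "AM radio": (8, 22_050, 1),
    "CD audio": (16, 44_100, 2),
}
for name, (bits, rate, channels) in modes.items():
    print(f"{name}: {pcm_bytes_per_minute(bits, rate, channels):,} bytes/min")

cd = pcm_bytes_per_minute(16, 44_100, 2)
mp3_128 = compressed_bytes_per_minute(128)
print(f"128 kb/s: {mp3_128:,} bytes/min, about {cd / mp3_128:.0f}:1 versus CD audio")
```

The computed compressed sizes come out slightly below the rounded values in the table, which is consistent with the table's figures being approximate.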

17.1.2 MIDI Audio

A MIDI file is the digital equivalent of sheet music. Rather than containing actual audio data, a MIDI file contains detailed instructions for creating the sounds represented by that file. And, just as the same sheet music played by different musicians can sound different, the exact sounds produced by a MIDI file depend on which sound card you use to play it.

Three PC MIDI standards exist. The first, General MIDI, is the official standard. It predates multimedia PCs and is the oldest and most comprehensive of the three. The other two standards are Basic MIDI and Extended MIDI. Both are Microsoft standards and, despite the name of the latter, both are subsets of the General MIDI standard. In the early days of sound cards, General MIDI support was an unrealistically high target, so many sound cards implemented only one of the Microsoft MIDI subsets. All current sound cards we know of support full General MIDI.

MIDI was developed in the early 1980s, originally as a method to provide a standard interface between electronic music keyboards and electronic sound generators such as Moog synthesizers. A MIDI interface supports 16 channels, allowing up to 16 instruments or groups of instruments (selected from a palette of 128 available instruments) to play simultaneously. MIDI interfaces can be stacked. Some MIDI devices support 16 or more interfaces simultaneously, allowing 256 or more channels.

The MIDI specification defines both a serial communication protocol and the formatting of the MIDI messages transferred via that protocol. MIDI transfers 8-bit data at 31,250 bps over a 5 mA current loop, using optoisolators to electrically isolate MIDI devices from each other. All MIDI devices use a standard 5-pin DIN connector, but the MIDI port on a sound card is simply a subset of the pins on the standard DB-15 gameport connector (see Chapter 21). That means a gameport-to-MIDI adapter is needed to connect a sound card to an external MIDI device such as a MIDI keyboard.
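
The 31,250 bps figure translates into concrete timing for MIDI messages. The sketch below assumes the usual asynchronous serial framing of one start bit, eight data bits, and one stop bit per byte, a detail the text above does not spell out. The constant and function names are ours.

```python
MIDI_BITS_PER_SECOND = 31_250
BITS_PER_BYTE_ON_WIRE = 10  # assumption: 1 start bit + 8 data bits + 1 stop bit


def bytes_per_second() -> int:
    """Maximum number of MIDI bytes the link can carry each second."""
    return MIDI_BITS_PER_SECOND // BITS_PER_BYTE_ON_WIRE


def message_time_ms(message_length_bytes: int) -> float:
    """Time needed to transmit a message of the given length, in milliseconds."""
    return message_length_bytes * BITS_PER_BYTE_ON_WIRE / MIDI_BITS_PER_SECOND * 1000


print(f"{bytes_per_second():,} bytes/second")
print(f"3-byte message: {message_time_ms(3):.2f} ms")
```

Under these assumptions a three-byte message occupies the link for just under a millisecond, which is why dense passages sent over a single MIDI cable can introduce audible timing smear.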

MIDI messages are simply strings of bytes encoded to represent the important characteristics of a musical score, including instrument to be used, note to be played, volume, and so on. Most MIDI messages comprise a status byte followed by one or two data bytes, but a MIDI feature called Running Status allows any number of additional bytes received to be treated as data bytes until a second status byte is received. Here are the functions of those byte types:

Status byte: MIDI messages always begin with a status byte, which identifies the type of message and is flagged as a status byte by having the high-order bit set to 1. The most significant (high-order) four bits (nibble) of this byte define the action to be taken, such as a command to turn a note on or off or to modify the characteristics of a note that is already playing. The least significant nibble defines the channel to which the message is addressed, which in turn determines the instrument to be used to play the note. Although represented in binary as a 4-bit value between 0 and 15, channels are actually designated 1 through 16.
Data byte: A data byte is flagged as such by having its high-order bit set to zero, which limits it to communicating 128 states. What those states represent depends on the command type of the status byte. When it follows a Note On command, for example, the first data byte defines the pitch of the note. Assuming standard Western tuning (A=440 Hz), this byte can assume any of 128 values from C (8.18 Hz) to G (12,543.85 Hz). The second data byte specifies velocity, or how hard the key was pressed, which corresponds generally to volume, depending on the MIDI device and instrument. The note continues playing until a status byte with a Note Off command for that note is received, although it may, under programmatic control, decay to inaudibility in the interim.
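
The byte layout described above can be decoded with a few bit operations. This is a minimal sketch rather than a complete MIDI parser: it handles a single three-byte message and ignores Running Status. The function names are ours. The Note On command value (hexadecimal 9) and the convention that note number 69 is the A at 440 Hz come from common MIDI practice rather than from the text above, so treat them as assumptions.

```python
NOTE_ON = 0x9  # assumption: command nibble for Note On in common MIDI practice


def parse_status(status_byte: int) -> tuple[int, int]:
    """Split a status byte into (command nibble, channel numbered 1 through 16)."""
    if not status_byte & 0x80:
        raise ValueError("high-order bit is 0, so this is a data byte")
    command = status_byte >> 4            # most significant nibble
    channel = (status_byte & 0x0F) + 1    # stored as 0-15, designated 1-16
    return command, channel


def note_to_hz(note_number: int, a_note: int = 69, a_hz: float = 440.0) -> float:
    """Equal-tempered pitch for a note number, assuming note 69 is A at 440 Hz."""
    return a_hz * 2 ** ((note_number - a_note) / 12)


message = bytes([0x90, 60, 100])          # hypothetical message: status, pitch, velocity
command, channel = parse_status(message[0])
pitch, velocity = message[1], message[2]
assert pitch < 0x80 and velocity < 0x80   # data bytes always have the high bit clear

if command == NOTE_ON:
    print(f"Note On, channel {channel}, note {pitch} "
          f"({note_to_hz(pitch):.2f} Hz), velocity {velocity}")
```

Because each step of 12 note numbers is one octave, the frequency doubles every 12 values of the pitch byte.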