Section 9.1. Multimedia Concepts

This section very quickly covers some concepts relevant to digital audio, video, and sound cards. Understanding these basics will help you follow the rest of the material in this chapter.

9.1.1. Digital Sampling

Sound is produced when waves of varying pressure travel through a medium, usually air. It is inherently an analog phenomenon, meaning that the changes in air pressure can vary continuously over a range of values.

Modern computers are digital, meaning they operate on discrete values, essentially the binary ones and zeroes that are manipulated by the central processing unit (CPU). In order for a computer to manipulate sound, then, it needs to convert the analog sound information into digital format.

A hardware device called an analog-to-digital converter converts analog signals, such as the continuously varying electrical signals from a microphone, to digital format that can be manipulated by a computer. Similarly, a digital-to-analog converter converts digital values into analog form so they can be sent to an analog output device such as a speaker. Sound cards typically contain several analog-to-digital and digital-to-analog converters.

The process of converting analog signals to digital form consists of taking measurements, or samples, of the values at regular periods of time, and storing these samples as numbers. The process of analog-to-digital conversion is not perfect, however, and introduces some loss or distortion. Two important factors that affect how accurately the analog signal is represented in digital form are the sample size and sampling rate.

The sample size is the range of values of numbers that is used to represent the digital samples, usually expressed in bits. For example, an 8-bit sample converts the analog sound values into one of 2⁸, or 256, discrete values. A 16-bit sample size represents the sound using 2¹⁶, or 65,536, different values. A larger sample size allows the sound to be represented more accurately, reducing the sampling error that occurs when the analog signal is represented as discrete values. The trade-off with using a larger sample size is that the samples require more storage (and the hardware is typically more complex and therefore expensive).
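To make the effect of sample size concrete, the following Python sketch quantizes one cycle of a sine wave to 8 and 16 bits and reports the largest error. It is only an illustration: the function names, the [-1.0, 1.0] input range, and the evenly spaced levels are assumptions made here, not a description of how any particular converter works.

```python
import math

def quantize(value, bits):
    """Map a value in [-1.0, 1.0] to the nearest of 2**bits evenly spaced levels.

    Assumption for illustration: levels are spread uniformly over [-1.0, 1.0].
    """
    levels = 2 ** bits
    step = 2.0 / (levels - 1)
    index = round((value + 1.0) / step)   # integer code in 0 .. levels - 1
    return index * step - 1.0

def max_quantization_error(bits, num_samples=1000):
    """Largest absolute error when quantizing one cycle of a sine wave."""
    worst = 0.0
    for n in range(num_samples):
        x = math.sin(2 * math.pi * n / num_samples)
        worst = max(worst, abs(x - quantize(x, bits)))
    return worst

for bits in (8, 16):
    print(bits, "bits:", 2 ** bits, "levels, max error",
          max_quantization_error(bits))
```

The error for 16-bit samples comes out far smaller than for 8-bit samples, at the cost of twice the storage per sample.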

The sample rate is the speed at which the analog signal is measured over time. It is properly expressed as samples per second, although it is sometimes informally, but less accurately, expressed in Hertz (Hz). A lower sample rate loses more information about the original analog signal; a higher sample rate represents it more accurately. The sampling theorem states that to accurately represent an analog signal, it must be sampled at a rate of at least twice the highest frequency present in the original signal.
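What goes wrong below that limit can be seen in a small Python sketch. It samples a 5000 Hz tone and a 3000 Hz tone at 8000 samples per second; because 5000 Hz is above half the sample rate, its samples turn out to be indistinguishable from those of the lower tone. The function name and the chosen frequencies are assumptions picked only for illustration.

```python
import math

def sample_cosine(freq_hz, rate, count):
    """Return `count` samples of a cosine tone taken at `rate` samples/second."""
    return [math.cos(2 * math.pi * freq_hz * n / rate) for n in range(count)]

rate = 8000                             # samples per second
high = sample_cosine(5000, rate, 16)    # above rate / 2, so undersampled
low = sample_cosine(3000, rate, 16)     # below rate / 2

# The two sample sequences match, so the 5000 Hz tone cannot be
# told apart from a 3000 Hz tone once it has been sampled.
indistinguishable = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
print(indistinguishable)
```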

The range of human hearing is from approximately 20 to 20,000 Hz under ideal situations. To accurately represent sound for human listening, then, a sample rate of twice 20,000 Hz should be adequate. CD player technology uses 44,100 samples per second, which is in agreement with this simple calculation. Human speech has little information above 4000 Hz. Digital telephone systems typically use a sample rate of 8000 samples per second, which is perfectly adequate for conveying speech. The trade-off involved with using different sample rates is the additional storage requirement and more complex hardware needed as the sample rate increases.

Other issues that arise when storing sound in digital format are the number of channels and the encoding format. To support stereo sound, two channels are required. Some audio systems use four or more channels.

Often sounds need to be combined or changed in volume. This is the process of mixing, and can be done in analog form (e.g., a volume control) or in digital form by the computer. Conceptually, two digital samples can be mixed together simply by adding them, and volume can be changed by multiplying by a constant value.
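As a minimal sketch of digital mixing, the Python functions below add two lists of signed 16-bit samples and scale a list by a constant gain. The function names are hypothetical, and the clipping step is an assumption added here so that results stay within the 16-bit range; the text above describes only the addition and multiplication.

```python
def clip16(value):
    """Limit a value to the signed 16-bit range (an assumption of this sketch)."""
    return max(-32768, min(32767, value))

def mix(a, b):
    """Mix two sample sequences by adding them sample by sample."""
    return [clip16(x + y) for x, y in zip(a, b)]

def scale_volume(samples, gain):
    """Change volume by multiplying every sample by a constant gain."""
    return [clip16(round(s * gain)) for s in samples]

print(mix([1000, 30000], [2000, 10000]))   # second sum exceeds the range and is clipped
print(scale_volume([1000, -2000], 0.5))
```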

Up to now we've discussed storing audio as digital samples. Other techniques are also commonly used. FM synthesis is an older technique that produces sound using hardware that manipulates different waveforms such as sine and triangle waves. The hardware to do this is quite simple and was popular with the first generation of computer sound cards for generating music. Many sound cards still support FM synthesis for backward compatibility. Some newer cards use a technique called wavetable synthesis that improves on FM synthesis by generating the sounds using digital samples stored in the sound card itself.
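Although FM synthesis on sound cards is done in hardware, the basic idea can be sketched in software. The Python function below uses the textbook form of frequency modulation, in which one sine wave (the modulator) varies the phase of another (the carrier). The function name, parameters, and default values are assumptions for illustration and do not describe any specific sound chip.

```python
import math

def fm_tone(carrier_hz, modulator_hz, index, rate=8000, count=8000):
    """Generate `count` samples of a simple two-operator FM tone.

    `index` controls how strongly the modulator bends the carrier;
    an index of 0 gives a plain sine wave at the carrier frequency.
    """
    samples = []
    for n in range(count):
        t = n / rate
        phase = (2 * math.pi * carrier_hz * t
                 + index * math.sin(2 * math.pi * modulator_hz * t))
        samples.append(math.sin(phase))
    return samples

tone = fm_tone(440, 220, index=2.0)
print(len(tone), min(tone) >= -1.0, max(tone) <= 1.0)
```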

MIDI stands for Musical Instrument Digital Interface. It is a standard protocol for allowing electronic musical instruments to communicate. Typical MIDI devices are music keyboards, synthesizers, and drum machines. MIDI works with events representing such things as a key on a music keyboard being pressed, rather than storing actual sound samples. MIDI events can be stored in a MIDI file, providing a way to represent a song in a very compact format. MIDI is most popular with professional musicians, although many consumer sound cards support the MIDI bus interface.
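To see why MIDI is so compact, consider how a single key press is encoded. The Python sketch below builds note-on and note-off messages using the status byte values from the MIDI 1.0 specification (0x90 and 0x80, combined with a channel number from 0 to 15). The function names are hypothetical, and the sketch ignores timing information, which a MIDI file stores separately.

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI note-on message (key pressed)."""
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note, velocity=0):
    """Build a 3-byte MIDI note-off message (key released)."""
    return bytes([0x80 | channel, note, velocity])

# Middle C is note number 60; velocity describes how hard the key was struck.
press = note_on(0, 60, 100)
release = note_off(0, 60)
print(press.hex(), release.hex())
```

Each event takes three bytes here, whereas one second of CD-quality stereo samples takes 176,400 bytes.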

9.1.2. File Formats

We've talked about sound samples, which typically come from a sound card and are stored in a computer's memory. To store them permanently, they need to be represented as files. There are various methods for doing this.

The most straightforward method is to store the samples directly as bytes in a file, often referred to as raw sound files. The samples themselves can be encoded in different formats. We've already mentioned sample size, with 8-bit and 16-bit samples being the most common. For a given sample size, they might be encoded using signed or unsigned representation. When the storage takes more than 1 byte, the ordering convention must be specified. These issues are important when transferring digital audio between programs or computers, to ensure they agree on a common format.
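The encoding choices described above can be illustrated with Python's standard struct module. This sketch packs the same 16-bit sample in both byte orders and decodes the same 8-bit byte as signed and as unsigned; the variable names and sample values are chosen only for illustration.

```python
import struct

sample = 1000   # one 16-bit sample value (0x03E8)

little = struct.pack("<h", sample)   # least significant byte first
big = struct.pack(">h", sample)      # most significant byte first
print(little.hex(), big.hex())

# Reading the bytes with the wrong byte order gives a different value.
wrong = struct.unpack(">h", little)[0]
print(wrong)

# The same 8-bit pattern means different things as signed and unsigned.
raw = b"\xff"
as_signed = struct.unpack("b", raw)[0]
as_unsigned = struct.unpack("B", raw)[0]
print(as_signed, as_unsigned)
```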

A problem with raw sound files is that the file itself does not indicate the sample size, sampling rate, or data representation. To interpret the file correctly, this information needs to be known. Self-describing formats such as WAV add additional information to the file in the form of a header to indicate this information so that applications can determine how to interpret the data from the file itself. These formats standardize how to represent sound information in a way that can be transferred between different computers and operating systems.
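As a small demonstration of a self-describing format, the following Python sketch uses the standard wave module to write a tiny WAV file into an in-memory buffer and then reads the parameters back from the header alone. The sample values are arbitrary, and the in-memory buffer is used here only to keep the example self-contained.

```python
import io
import struct
import wave

buffer = io.BytesIO()

# Write two stereo frames of 16-bit samples; the module emits the header.
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)        # stereo
    out.setsampwidth(2)        # 2 bytes = 16-bit samples
    out.setframerate(44100)    # samples per second
    out.writeframes(struct.pack("<4h", 0, 0, 1000, -1000))

# Read it back: everything needed to interpret the data is in the file.
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    params = (wav.getnchannels(), wav.getsampwidth(),
              wav.getframerate(), wav.getnframes())

print(params)
print(buffer.getvalue()[:4])   # WAV files begin with the bytes "RIFF"
```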

Storing the sound samples in the file has the advantage of making the sound data easy to work with, but has the disadvantage that the file can quickly become quite large. We earlier mentioned CD audio, which uses a 16-bit sample size and a rate of 44,100 samples per second, with two channels (stereo). One hour of this Compact Disc Digital Audio (CDDA) data represents more than 600 megabytes of data.

To make the storage of sound more manageable, various schemes for compressing audio have been devised. One approach is to simply compress the data using the same compression algorithms used for computer data. However, by taking into account the characteristics of human hearing, it is possible to compress audio more efficiently by removing components of the sound that are not audible. This is called lossy compression, because information is lost during the compression process, but when properly implemented there can be a major reduction of data size with little noticeable loss in audio quality. This is the approach that is used with MPEG-1 Audio Layer 3 (MP3), which can achieve compression levels of 10:1 over the original digital audio. Another lossy compression algorithm that achieves similar results is Ogg Vorbis, which is popular with many Linux users because it avoids patent issues with MP3 encoding. Other compression algorithms are optimized for human speech, such as the GSM encoding used by some digital telephone systems.

The algorithms used for encoding and decoding audio are sometimes referred to as codecs. Some codecs are based on open standards, such as Ogg and MP3, which can be implemented according to a published specification. Other codecs are proprietary, with the format a trade secret held by the developer and people who license the technology. Examples of proprietary codecs are Real Networks' RealAudio, Microsoft's WMA, and Apple's QuickTime.
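The storage figure quoted for CD audio can be checked with a few lines of arithmetic. The Python sketch below works out the uncompressed data rate and the size of one hour of audio, then applies the 10:1 ratio mentioned for MP3 as a rough estimate. The constant names are chosen for illustration, and real compression ratios vary with the material and encoder settings.

```python
SAMPLE_RATE = 44_100      # samples per second, per channel
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 2              # stereo

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
bytes_per_hour = bytes_per_second * 3600

# Rough estimate only: assume the 10:1 ratio quoted for MP3.
compressed_per_hour = bytes_per_hour // 10

print(bytes_per_second)      # 176400
print(bytes_per_hour)        # 635040000, a little over 600 megabytes
print(compressed_per_hour)   # 63504000
```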

We've focused mainly on audio up to now. Briefly turning to video, the storing of image data has much in common with sound files. In the case of images, the samples are pixels (picture elements), which represent color using samples of a specific bit depth. Larger bit depths can more accurately represent the shades of color, at the expense of greater storage requirements. Common image bit depths are 8, 16, 24, and 32 bits. A bitmap file simply stores the image pixels in some predefined format. As with audio, there are raw image formats and self-describing formats that contain additional information that allows the file format to be determined.

Compression of image files uses various techniques. Standard compression schemes such as zip and gzip can be used. Run-length encoding, which describes sequences of pixels having the same color, is a good choice for images that contain areas having the same color, such as line drawings. As with audio, there are lossy compression schemes, such as JPEG compression, which is optimized for photographic-type images and designed to provide high compression with little noticeable effect on the image.
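Run-length encoding is simple enough to sketch in a few lines of Python. The functions below store a row of pixel values as (count, value) pairs and expand them again. The function names and the pair representation are assumptions for illustration; real image formats define their own byte-level layouts.

```python
def rle_encode(pixels):
    """Collapse runs of identical pixel values into (count, value) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][1] == p:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, p])       # start a new run
    return [(count, value) for count, value in runs]

def rle_decode(runs):
    """Expand (count, value) pairs back into the original pixel list."""
    pixels = []
    for count, value in runs:
        pixels.extend([value] * count)
    return pixels

row = [0, 0, 0, 255, 255, 0]
encoded = rle_encode(row)
print(encoded)
print(rle_decode(encoded) == row)
```

The scheme pays off only when runs are long; a row in which every pixel differs from its neighbor would take more space encoded than raw.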

To extend still images to video, one can imagine simply stringing together many images arranged in time sequence. Clearly, this quickly generates extremely large files. Compression schemes such as that used for DVD movies use sophisticated algorithms that store some complete images, as well as a mathematical representation of the differences between adjacent frames that allows the images to be re-created. These are lossy encoding algorithms. In addition to the video, a movie also contains one or more sound tracks and other information, such as captioning.
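The idea of storing a complete image plus the differences between frames can be shown with a deliberately simplified Python sketch. It records only the pixels that changed from one frame to the next and rebuilds the later frame from them. This is a lossless toy with hypothetical function names; the schemes used for DVD movies are far more sophisticated and are lossy.

```python
def frame_delta(previous, current):
    """List the (index, new_value) pairs where two frames differ."""
    return [(i, c) for i, (p, c) in enumerate(zip(previous, current)) if p != c]

def apply_delta(previous, delta):
    """Rebuild the next frame from the previous frame and its delta."""
    frame = list(previous)
    for index, value in delta:
        frame[index] = value
    return frame

key_frame = [10, 10, 10, 10, 10, 10]     # stored in full
next_frame = [10, 10, 99, 10, 10, 10]    # only one pixel changed

delta = frame_delta(key_frame, next_frame)
print(delta)
print(apply_delta(key_frame, delta) == next_frame)
```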

We mentioned Compact Disc Digital Audio, which stores about 600 MB of sound samples on a disc. The ubiquitous CD-ROM uses the same physical format to store computer data, using a filesystem known as the ISO 9660 format. This is a simple directory structure, similar to MS-DOS. The Rock Ridge extensions to ISO 9660 were developed to allow storing of longer filenames and more attributes, making the format suitable for Unix-compatible systems. Microsoft's Joliet filesystem performs a similar function and is used on various flavors of Windows. A CD-ROM can be formatted with both the Rock Ridge and Joliet extensions, making it readable on both Unix-compatible and Windows-compatible systems.

CD-ROMs are produced in a manufacturing facility using expensive equipment. CD-R (compact disc recordable) allows recording of data on a disc using an inexpensive drive, which can be read on a standard CD-ROM drive. CD-RW (compact disc rewritable) extends this with a disc that can be blanked (erased) many times and rewritten with new data.

DVD-ROM drives allow storing of about 4.7 GB of data on the same physical format used for DVD movies. With suitable decoding hardware or software, a PC with a DVD-ROM drive can also view DVD movies. Recently, dual-layer DVD-ROM drives have become available, which double the storage capacity.

Like CD-R, DVD has been extended for recording, but with two different formats, known as DVD-R and DVD+R. At the time of writing, both formats were popular, and some combo drives supported both formats. Similarly, a rewritable DVD has been developed, or rather two different formats, known as DVD-RW and DVD+RW. Finally, a format known as DVD-RAM offers random-access read/write media similar to hard disk storage.

DVD-ROM discs can be formatted with a (large) ISO 9660 filesystem, optionally with Rock Ridge or Joliet extensions. They often, however, use the UDF (Universal Disk Format) filesystem, which is used by DVD movies and is better suited to large storage media.

For applications where multimedia is to be sent live via the Internet, often broadcast to multiple users, sending entire files is not suitable. Streaming media refers to systems where audio, or other media, is sent and played back in real time.

9.1.3. Multimedia Hardware

Now that we've discussed digital audio concepts, let's look at the hardware used. Sound cards have followed a history similar to that of other peripheral cards for PCs. The first-generation cards used the ISA bus, and most aimed to be compatible with the Sound Blaster series from Creative Labs. The introduction of the ISA Plug and Play (PNP) standard allowed many sound cards to adopt this format and simplify configuration by eliminating the need for hardware jumpers. Modern sound cards now typically use the PCI bus, either as separate peripheral cards or as on-board sound hardware that resides on the motherboard but is accessed through the PCI bus. USB sound devices are also now available, some providing traditional sound card functions as well as peripherals such as loudspeakers that can be controlled over USB.

Some sound cards now support higher-end features such as surround sound using as many as six sound channels, and digital inputs and outputs that can connect to home theater systems. This is beyond the scope of what can be covered in this chapter.

In the realm of video, there is obviously the ubiquitous video card, many of which offer 3D acceleration, large amounts of on-board memory, and sometimes more than one video output (multi-head).

TV tuner cards can decode television signals and output them to a video monitor, often via a video card so the image can be mixed with the computer video. Video capture cards can record video in real time for storage on hard disk and later playback.

Although the mouse and keyboard are the most common input devices, Linux also supports a number of touch screens, digitizing tablets, and joysticks.

Many scanners are supported on Linux. Older models generally use a SCSI or parallel port interface. Some of these use proprietary protocols and are not supported on Linux. Newer scanners tend to use USB, although some high-end professional models instead use FireWire (Apple's term for a standard also known as IEEE 1394) for higher throughput.

Digital cameras have had some support under Linux, improving over time as more drivers are developed and cameras move to more standardized protocols. Older models used serial and occasionally SCSI interfaces. Newer units employ USB if they provide a direct cable interface at all. They also generally use one of several standard flash memory modules, which can be removed and read on a computer with a suitable adapter that connects to a USB or PCMCIA port. With the adoption of a standard USB mass storage protocol, all compliant devices should be supported under Linux. The Linux kernel represents USB mass storage devices as if they were SCSI devices.