Measured Metrics: What to Measure

Cisco IP SLA is an embedded feature set in Cisco IOS Software that allows you to analyze service levels for IP applications and services. It is one of the Cisco device instrumentation features with a long history. IOS 11.2 introduced the Response Time Reporter (RTR), which supported three functions: ICMP Ping, ICMP Echo Path, and SSCP (IBM SNA native echo). In those days, many customers migrated their dedicated IBM SNA infrastructure to an IP network and realized how limited IP reporting functions were compared to IBM's SNA network. RTR addressed this issue, and its functionality increased significantly over the years. Cisco renamed RTR to Service Assurance Agent (SAA) in Cisco IOS Software Release 12.0(5)T. New features were continuously added and, in 2004, Cisco changed the name to IP SLA. Despite the name changes, the basic principle of IP SLA remained the same: an active measurement that uses injected test packets (synthetic traffic) marked with a time stamp to calculate performance metrics. The results allow indirect assessment of the network, for example against Service-Level Agreements (SLA) and QoS class definitions. IP SLA consists of two components, both implemented in Cisco devices:

  • The mandatory source device, which generates, receives, and analyzes the traffic.

  • The IP SLA Responder, which is optionally used to increase accuracy and measurement details. By adding time stamps to the measurement packets at the destination device, the IP SLA Responder allows the elimination of the measurement packet processing time on the destination device. The IP SLA Responder listens on any standard or user-defined port for UDP, TCP, and Frame Relay packets generated by the IP SLA source.
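
The way Responder time stamps remove the destination's processing time can be sketched in a few lines of Python. This is a minimal illustration, not the IP SLA packet format: the function name and the four time stamp names (t1 to t4) are assumptions chosen for the example, and the values are made up.

```python
def network_round_trip_ms(t1_sent, t2_received, t3_replied, t4_returned):
    """Round-trip time with the responder's processing time removed.

    t1_sent and t4_returned are read from the source clock;
    t2_received and t3_replied are read from the responder clock.
    Only differences taken on the same clock are used, so the two
    clocks do not have to be synchronized for this calculation.
    """
    total_ms = t4_returned - t1_sent          # elapsed time seen by the source
    processing_ms = t3_replied - t2_received  # time spent inside the responder
    return total_ms - processing_ms


# Hypothetical time stamps in milliseconds. The responder clock (50020, 50023)
# is deliberately unrelated to the source clock (1000, 1043).
rtt = network_round_trip_ms(t1_sent=1000.0, t2_received=50020.0,
                            t3_replied=50023.0, t4_returned=1043.0)
print(rtt)  # 40.0
```

Without the Responder time stamps, the source would report 43 ms, which includes the 3 ms that the destination spent handling the packet.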

IP SLA is an example of a device instrumentation technique with no overlap between accounting and performance: it is dedicated to performance measurement. Therefore, it is a complementary solution to accounting features such as NetFlow.

IP SLA allows you to measure the performance characteristics of the existing infrastructure. It offers valuable information for network architects to redesign traffic within the network and confidently build end-to-end application-aware SLAs. IP SLA is applicable to both service providers and enterprises for performance management.

Highlighted features of IP SLA are as follows:

  • Network performance monitoring— Measures delay, jitter, packet loss, packet ordering, and packet corruption in the network.

  • SLA monitoring— Provides service-level monitoring and verification.

  • IP service network health assessment— Verifies that the existing QoS settings are sufficient for new IP services.

  • Edge-to-edge network availability monitoring— Provides proactive verification and connectivity testing of network resources (for example, indicates the network availability of a web server).

  • Voice over IP (VoIP) performance monitoring— Analyzes critical parameters for a VoIP deployment: not only jitter and packet loss, but also the Mean Opinion Score (MOS) and Impairment/Calculated Planning Impairment Factor (ICPIF) values. ICPIF is defined in ITU-T G.113. The ICPIF value represents predefined combinations of loss and delay.

  • Application-aware monitoring— IP SLA can emulate traffic up to the application level—for example, DNS, DHCP, and web server requests—and can measure the related performance statistics.

  • Accuracy— Microsecond granularity for jitter delay measurements offers the required precision for business-critical applications.

  • Flexible operations— Offer various kinds of scheduling, alerting, and triggered measurements.

  • Pervasiveness— IP SLA is implemented in Cisco networking devices ranging from low-end to high-end routers and switches. This avoids the deployment and management of dedicated measurement boxes.

  • Troubleshooting of network operation— Provides consistent, reliable measurement that proactively identifies performance and connectivity problems.

Because the terms probe, operation, and packet in the official Cisco IP SLA documentation might lead to confusion, this book uses these terms as follows:

  • Test packets— Synthetic traffic generated by the IP SLA devices (Source and/or Responder). Packets can contain a single request (DHCP) or can consist of a stream of packets (jitter measurement).

  • Operation— An action performed by the IP SLA source. An operation consists of one or multiple test packets, resulting in a single result (such as round-trip time [RTT]) or a series of results (such as buckets of distributions). The Cisco documentation calls these probes.

  • Frequency of operations— The interval between two successive executions of the same operation instance. For an ad hoc measurement, an operation can run once, whereas monitoring performance trends requires continuously running operations.

The fundamentals of active versus passive monitoring were introduced in Chapter 2, "Data Collection Methodology." This section builds on the theoretical foundation and applies it to IP SLA, which uses an active monitoring approach to measure network performance. Cisco IP SLA sends test packets across the network to analyze performance between multiple network locations or across multiple network paths. It simulates network and application services and collects network performance data in real time. The operations have configurable IP and application layer options such as source and destination IP address, UDP and TCP port numbers, type of service (ToS) byte (including Differentiated Services Code Point [DSCP] and IP precedence bits), virtual private network (VPN) Virtual Routing and Forwarding (VRF), and HTTP web address. Results are stored in the Cisco device and are available through the CLI and SNMP MIBs. Multiple performance monitoring applications support IP SLA, such as CiscoWorks Internetwork Performance Monitor (IPM), IP Solution Center (ISC), and many products from other vendors.

IP SLA collects the following performance metrics:

  • Delay (both round-trip and one-way)

  • Jitter (one-way)

  • Packet loss (one-way)

  • Packet sequencing (packet ordering)

  • Packet corruption detection

  • Path (per hop)

  • Connectivity (one-way)

  • FTP server or HTTP website download time

  • Voice quality scores (MOS, ICPIF)

The various IP SLA operations can be classified as follows:

  • ICMP-based operations for Echo, Path Echo, and Path Jitter.

  • UDP-based operations, such as echo, jitter, DNS, and DHCP.

  • TCP-based operations, such as TCP Connect, FTP, HTTP, and DLSw+.

  • Layer 2 operations, such as Frame Relay, ATM, and MPLS.

  • VoIP-related operations, such as VoIP Jitter, VoIP Gatekeeper Registration Delay Monitoring, and VoIP Call Setup (Post-Dial Delay) Monitoring. The new RTP-based VoIP operation was introduced in Cisco IOS Software Release 12.4(4)T.

Identifying the correct measurement metrics can be a challenging task. When customers consider deploying SLA, they usually look for guidance on standard parameters. Unfortunately, these standard parameters exist only partially. ITU-T recommendation G.114 defines a maximum of 150-ms end-to-end transmission time (mouth to ear) for voice applications and an upper limit of 400 ms for most other applications. However, most customers' requirements are as individual as their network design; therefore, customization of SLA parameters might be required in some cases. For a jump start, some basic principles exist for choosing generic performance metrics. If SLAs are already in place, best practice suggests monitoring the agreed-upon SLA parameters if possible. If SLAs are negotiated between a service provider and a consumer, they should be measurable with services such as IP SLA.

The most common and generic metrics are delay (one-way and round-trip), jitter, and packet loss. Specific considerations are related to measurement accuracy. Finally, application-specific metrics related to network services can be included in your SLA statements, including DNS, DHCP, TCP connect, and HTTP. The following sections describe these metrics in more detail.

Network Delay

Network delay describes how long it takes packets to traverse the network and reach the destination. The total network delay consists of network transmission, serialization, and processing delay at the source and target devices. If the destination device is not heavily loaded, you can assume that for long-distance connections or a connection with low-bandwidth links, the network delay dominates the total round-trip delay. In a symmetric routing design where all packets traverse the same hops in both directions, the network delays in each direction should be almost identical. In this case, round-trip delay measurement could be appropriate. Because round-trip delay does not measure delay per direction, it has limitations in networks where asymmetric routing is applied. In this case, packets from source to destination take a different path than the return traffic. This can be measured with one-way delay operations, which might identify that the delay in one direction is considerably different from the delay in the reverse direction. However, even with a perfectly symmetric route, you may have queuing delays. Those delays are almost never symmetric; instead, they typically occur predominantly in one direction. Therefore, for problem diagnosis or troubleshooting, it is advisable to have one-way delays.

Round-trip times are much easier to measure, so many performance applications report them. Therefore, round-trip measurement is the initial choice for general monitoring, and one-way delay is the right choice for in-depth analysis.
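
The difference between the two views can be sketched as follows. This is a minimal illustration that assumes the clocks of both devices are synchronized; the function name, the time stamp names, and the values are hypothetical and serve only to show how a round-trip figure can hide a directional problem.

```python
def one_way_delays_ms(t1_sent, t2_received, t3_replied, t4_returned):
    """Return (source_to_destination, destination_to_source, round_trip).

    Assumes both devices share a synchronized clock, so time stamps taken
    on different devices can be subtracted from each other directly.
    """
    forward = t2_received - t1_sent      # source to destination
    reverse = t4_returned - t3_replied   # destination back to source
    return forward, reverse, forward + reverse


# Hypothetical time stamps in milliseconds: the forward direction is congested.
forward, reverse, round_trip = one_way_delays_ms(0.0, 35.0, 36.0, 46.0)
print(forward, reverse, round_trip)  # 35.0 10.0 45.0
```

A round-trip value of 45 ms alone would not reveal that 35 ms of it accumulates in one direction, which is the information needed for troubleshooting.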

Jitter

Jitter, also known as IP Packet Delay Variation (IPDV), measures the variation in delay between consecutive packets. It is a relevant parameter for interactive voice and video applications. When multiple packets are sent consecutively from the source to the destination device, such as 10 ms apart, under ideal circumstances the destination receives them 10 ms apart. Under realistic circumstances, delays in the network, such as queuing and arrival through alternate routes, cause the interval between arriving packets to be greater than or less than 10 ms. Using this example, a positive jitter value indicates that the packets arrived more than 10 ms apart. For example, packets arriving 12 ms apart cause positive jitter of 2 ms. If the packets arrive 8 ms apart, they cause negative jitter of 2 ms. For delay-sensitive applications such as VoIP, positive jitter values are undesirable, and a jitter value of 0 is ideal. Jitter is covered in more detail in Chapter 15, "Voice Scenarios," together with voice measurement standards, such as MOS and ICPIF scores.
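
The 10-ms example above can be reproduced with a short Python sketch. The function name and the sample time stamps are assumptions made for illustration; the calculation simply compares the spacing of packets at the receiver with their spacing at the sender.

```python
def interpacket_jitter_ms(send_times, receive_times):
    """Per-packet jitter: receive spacing minus send spacing.

    A positive value means two packets arrived further apart than they
    were sent; a negative value means they arrived closer together.
    """
    if len(send_times) != len(receive_times):
        raise ValueError("need one receive time per send time")
    jitter = []
    for i in range(1, len(send_times)):
        send_gap = send_times[i] - send_times[i - 1]
        receive_gap = receive_times[i] - receive_times[i - 1]
        jitter.append(receive_gap - send_gap)
    return jitter


# Packets sent 10 ms apart, received 12 ms, 8 ms, and 10 ms apart.
send = [0, 10, 20, 30]
receive = [50, 62, 70, 80]
print(interpacket_jitter_ms(send, receive))  # [2, -2, 0]
```

Because only differences on the same clock are used, this calculation does not depend on the sender and receiver clocks being synchronized.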

Applicability of the term jitter is much broader than packet transmission performance, with "unwanted signal variation" as a general definition. Indeed, jitter has been used to describe frequency or phase variations, such as the data stream rate variations or carrier signal phase noise. The term IP Packet Delay Variation (IPDV) is almost self-describing and is more precise. This is why both RFC 3393 (IP Packet Delay Variation Metric for IPPM) and ITU-T Y.1540 (IP packet transfer and availability performance parameters) prefer this term. To be consistent with the Cisco documentation and CLI, this chapter uses the term jitter.

Packet Loss

Packet loss happens when a network element drops packets instead of forwarding them. This could occur because of overload situations when a router or switch cannot accept any incoming data. Alternatively, based on QoS or security-related policies, the network element might intentionally drop packets with specific characteristics. The impact of packet loss differs with each type of application. TCP-based data transmission suffers from performance degradation due to packet retransmission, and voice sessions sound choppy under heavy packet loss.
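
One common way to detect loss and out-of-sequence arrival is to number the test packets and inspect the numbers that reach the receiver. The following Python sketch illustrates that idea under stated assumptions: the function name, the result keys, and the rule for counting a packet as out of order (it arrives after a packet with a higher sequence number) are choices made for this example, not a description of the IP SLA implementation.

```python
def loss_and_reordering(sent_count, received_sequence_numbers):
    """Summarize loss and out-of-order arrival from received sequence numbers.

    Sequence numbers are assumed to start at 0 and increase by 1 per packet.
    Duplicate arrivals are ignored.
    """
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    seen = set()
    out_of_order = 0
    highest = -1
    for seq in received_sequence_numbers:
        if seq in seen:
            continue                 # duplicate, already counted
        seen.add(seq)
        if seq < highest:
            out_of_order += 1        # arrived after a later packet
        else:
            highest = seq
    lost = sent_count - len(seen)
    return {
        "lost": lost,
        "loss_percent": 100.0 * lost / sent_count,
        "out_of_order": out_of_order,
    }


# Ten packets sent; packets 6 and 7 never arrive, and packet 3 arrives after 4.
print(loss_and_reordering(10, [0, 1, 2, 4, 3, 5, 8, 9]))
# {'lost': 2, 'loss_percent': 20.0, 'out_of_order': 1}
```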

Measurement Accuracy

Measurement accuracy is affected by the processing delay at the network elements (typically on the order of milliseconds), the system clock accuracy of the measurement devices, and the method of time synchronization between peers. The Network Time Protocol (NTP) and Global Positioning System (GPS) are common methods for system clock synchronization. The use of GPS is recommended over NTP for time synchronization, specifically for measurement across a WAN. NTP focuses on clock accuracy over long time scales, which can come at the expense of short-term clock skew and drift. Errors on the order of milliseconds, such as those generated by NTP-based synchronization, primarily affect the precision of one-way metrics where the accurate synchronization of the clocks between the two devices is essential. However, several performance metrics, such as round-trip time, interarrival time, and delay variation, are less sensitive to clock synchronization accuracy.
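
Why a clock offset distorts one-way metrics but not round-trip time can be shown with a small simulation. This is a sketch with hypothetical names and values: the true delay is 20 ms in each direction, and the responder clock is assumed to run a fixed number of milliseconds ahead of the source clock.

```python
def measure(forward_ms, reverse_ms, responder_offset_ms, processing_ms=1):
    """Simulate the time stamps of one test packet and the derived metrics.

    responder_offset_ms is how far the responder clock runs ahead of the
    source clock; 0 means perfectly synchronized clocks.
    """
    t1 = 0                                              # source clock
    t2 = t1 + forward_ms + responder_offset_ms          # responder clock
    t3 = t2 + processing_ms                             # responder clock
    t4 = t1 + forward_ms + processing_ms + reverse_ms   # source clock
    return {
        "one_way_forward": t2 - t1,            # mixes both clocks
        "one_way_reverse": t4 - t3,            # mixes both clocks
        "round_trip": (t4 - t1) - (t3 - t2),   # each difference uses one clock
    }


print(measure(20, 20, responder_offset_ms=0))
# {'one_way_forward': 20, 'one_way_reverse': 20, 'round_trip': 40}
print(measure(20, 20, responder_offset_ms=5))
# {'one_way_forward': 25, 'one_way_reverse': 15, 'round_trip': 40}
```

A 5-ms offset shifts the two one-way results in opposite directions by the full offset, while the round-trip result is unchanged because the offset cancels out.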

TCP Connect

TCP connect describes how long it takes a TCP request to be served at the destination server. This is an essential part of application sessions over TCP. This metric focuses on the network and application level. The result is the sum of the network delay and the processing time at the destination to serve the TCP request.
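
The same kind of measurement can be approximated from any host with the Python standard library. The sketch below is not the IP SLA TCP Connect operation itself; it times how long the operating system needs to complete a TCP connection, and the function names, the default timeout, and the use of a local listener for the demonstration are all assumptions made for illustration.

```python
import socket
import time


def tcp_connect_time_ms(host, port, timeout_s=5.0):
    """Return the time in milliseconds needed to open a TCP connection."""
    start = time.perf_counter()
    # create_connection returns once the TCP handshake has completed.
    with socket.create_connection((host, port), timeout=timeout_s):
        elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0


def demo_against_local_listener():
    """Measure against a throwaway listener on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        return tcp_connect_time_ms("127.0.0.1", port)
    finally:
        listener.close()


if __name__ == "__main__":
    print(f"TCP connect time: {demo_against_local_listener():.3f} ms")
```

Against a real server, the result includes the network delay in both directions plus the time the destination needs to accept the connection, which matches the sum described above.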

DHCP and DNS Response Time

DHCP and DNS response time are service layer metrics. Even though requesting an IP address through DHCP is usually limited to one operation per user session, it is critical, because in case of a DHCP server failure, users get no network connectivity. For cable and DSL providers that use DHCP for dynamic address allocation for users, monitoring the DHCP server is vital for the users' network access. In contrast to DHCP, DNS requests occur multiple times per session because web pages are usually designed to retrieve information from multiple servers. Even a network with underutilized high-speed links appears dramatically slow if the DNS serves requests slowly. DHCP and DNS monitoring are significant components of an SLA.

HTTP Response Time

The HTTP response time links to business services by providing the display time for specific websites. On the Internet, your competitor is just a mouse click away. Therefore, identifying performance issues with your public website can have an immediate impact on the organization's revenue.

Linking Metrics to Applications

From a practical perspective, you probably want to link the metrics to applications next. The following list offers guidance and direction:

  • Data transmission— Measure delay and packet loss—if possible, per class of service.

  • VoIP— Measure jitter and voice quality scores (such as MOS and ICPIF) and monitor voice server and gateway response times.

  • Streaming video— Measure one-way delay, packet loss, and out-of-sequence arrival of packets.

  • Network services— Measure DHCP and DNS server response times.



Part II: Implementations on the Cisco Devices