Problem Space

Although at first glance network-wide capacity planning seems to be required only for Internet service providers (ISPs), it also applies to any intranet that offers SLAs to customers. The users of an ISP's network are external customers who signed a contract, whereas the customers of an intranet are internal users with an implicit contract. In other words, with regard to capacity planning, the difference between enterprise and service provider networks is merely a question of terminology. Enterprises manage applications for internal users, for whom SLAs might have been defined. A service provider manages services for external customers, with specific SLAs negotiated. Although terminology differences between enterprises and service providers exist, the applications and services offered might actually be the same: VoIP, video, and business-critical traffic. Figure 14-1, which displays an ISP network, could easily be modified for an enterprise network as follows: the server farms are called data centers, the PoPs are called remote offices, and the core network becomes a Layer 2 or Layer 3 MPLS service offered by an ISP. One observation is that networks designed for specific SLA definitions should ensure that these SLAs are guaranteed: closely monitored, with violations reported.

Network performance monitoring requires knowledge of the core traffic matrix, specifically to determine whether the SLAs are respected during connectivity problems in the network. An ISP that wants to propose SLAs to its customers should first evaluate network performance parameters such as delay, jitter, and packet loss under normal operation. The results, adjusted with extra tolerance (for example, the 95th percentile), are turned into an SLA based on marketing and business requirements. Beyond providing specific SLAs today, network-wide capacity planning is required to evaluate whether the SLAs will still be respected in a month, 6 months, or a year, based on extrapolated traffic growth.
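As a rough illustration of the two steps described above, the following Python sketch derives a candidate SLA delay value from measurements and extrapolates a traffic volume. The sample values, the nearest-rank percentile method, the compound monthly growth model, and the function names `percentile` and `project_traffic` are all assumptions made for this example; a real deployment would use its own measurement data and growth model.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def project_traffic(current_volume, monthly_growth, months):
    """Compound growth extrapolation (an assumed model, not from the text)."""
    return current_volume * (1 + monthly_growth) ** months


# Hypothetical one-way delay samples in milliseconds under normal operation.
delay_ms = [10, 12, 11, 13, 50, 12, 11, 10, 12, 13,
            11, 12, 14, 12, 11, 10, 13, 12, 11, 30]

# Candidate SLA delay: 95 percent of the samples are at or below this value.
sla_delay_candidate = percentile(delay_ms, 95)
print("95th percentile delay (ms):", sla_delay_candidate)

# Hypothetical 400 Mbps demand growing 5 percent per month.
for months in (1, 6, 12):
    print(months, "months:", round(project_traffic(400.0, 0.05, months), 1), "Mbps")
```

The percentile discards the single worst outlier (50 ms) in this sample, which is the kind of tolerance the text refers to; the final SLA figure would still be set by marketing and business requirements.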

The next important question is whether the SLAs will still be respected in case of a link or router problem. Even if the Mean Time Between Failures (MTBF) metric improves over time, you can never completely eliminate faulty network modules, malfunctioning routers, link outages, or human errors. A relatively simple software bug can also cause severe performance degradations. Therefore, assuming 100 percent availability at the network elements level is an unrealistic scenario. Note that high-availability functions such as device and link redundancy, backup operations centers, and clear procedures can help you get closer to 100 percent availability.
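The following Python sketch shows why redundancy brings availability closer to, but never up to, 100 percent. It uses the common steady-state approximation availability = MTBF / (MTBF + MTTR). The Mean Time To Repair (MTTR) term, the independence of failures, and all numeric values are assumptions introduced for this example and do not come from the text.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one element (assumed formula)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial(*parts):
    """All elements must work: availabilities multiply."""
    return math.prod(parts)


def parallel(*parts):
    """Redundant elements: the group fails only if every element fails
    (assumes independent failures)."""
    return 1 - math.prod(1 - a for a in parts)


# Hypothetical elements: a router and a WAN link.
router = availability(mtbf_hours=50_000, mttr_hours=4)
link = availability(mtbf_hours=4_000, mttr_hours=4)

single_path = serial(router, link, router)
redundant_path = serial(router, parallel(link, link), router)

for name, value in (("single link", single_path),
                    ("redundant links", redundant_path)):
    downtime_hours = (1 - value) * 8760  # expected downtime per year
    print(f"{name}: {value:.6f} ({downtime_hours:.2f} h/year)")
```

In this sketch the redundant link pair reduces expected downtime, but the routers at each end remain single points of failure, so the path availability stays below 100 percent.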

You should evaluate the SLA during network problems by combining the core traffic matrix with a simulation tool. The core traffic matrix is a table that shows the traffic volumes between each traffic origin and destination in a network. It can either represent the current traffic (based on current measurements) or reflect future needs, by multiplying the current core traffic matrix by a factor that represents the traffic growth. By adding this information to an application that takes into account the network topology and routing information, you can visualize the traffic flows in relation to the topology. If the tool also includes the link speeds, it can deduce and display link utilization with a color scheme. This allows quick visual discovery of (future) bottlenecks in the network. The simulation function of such a tool is equally relevant. When link or router problems are simulated, the routing information is recomputed, and the data from the core traffic matrix is mapped to the new topology and routing information. Next, new potential bottlenecks are calculated, and the question "Will the traffic still reach the destination in case of a network link or router failure?" can be answered. A second important question can then be answered: "During a network or link failure, will the traffic for certain SLAs still be transmitted with the defined performance parameters?" Indeed, if during a short network outage the traffic of the gold, silver, and bronze classes is unaffected and only the best-effort traffic is affected, the SLAs are still respected, and no penalties need to be paid to the customers. Simulating catastrophic what-if scenarios in the network is the best way to proactively analyze whether the SLAs are respected during major outages.
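The what-if workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular planning tool: the three-node topology, the IGP metrics, the link capacities, the traffic volumes, and the helper names (`bidirectional`, `shortest_path`, `link_utilization`, `fail_link`) are all hypothetical. Routing is simplified to a single shortest path per demand, with no equal-cost multipath and no traffic classes.

```python
import heapq


def bidirectional(edges):
    """Expand (a, b, metric) tuples into directed links {(a, b): metric}."""
    links = {}
    for a, b, metric in edges:
        links[(a, b)] = metric
        links[(b, a)] = metric
    return links


def shortest_path(links, src, dst):
    """Dijkstra on IGP metrics; returns the list of links used, or None."""
    adjacency = {}
    for (a, b), metric in links.items():
        adjacency.setdefault(a, []).append((b, metric))
    best, previous, queue = {src: 0}, {}, [(0, src)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, metric in adjacency.get(node, []):
            new_cost = cost + metric
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if dst not in best:
        return None
    path, node = [], dst
    while node != src:
        path.append((previous[node], node))
        node = previous[node]
    return list(reversed(path))


def link_utilization(links, capacity, matrix, growth=1.0):
    """Map the (scaled) traffic matrix onto the routed paths.
    Returns (utilization per link, list of unreachable demands)."""
    load = {link: 0.0 for link in links}
    unreachable = []
    for (src, dst), volume in matrix.items():
        path = shortest_path(links, src, dst)
        if path is None:
            unreachable.append((src, dst))
            continue
        for link in path:
            load[link] += volume * growth
    return {link: load[link] / capacity[link] for link in links}, unreachable


def fail_link(links, a, b):
    """Remove both directions of a link to simulate an outage."""
    return {link: m for link, m in links.items() if link not in ((a, b), (b, a))}


# Hypothetical topology: IGP metrics, 1000 Mbps links, demands in Mbps.
links = bidirectional([("A", "B", 10), ("B", "C", 10), ("A", "C", 30)])
capacity = {link: 1000.0 for link in links}
matrix = {("A", "B"): 300.0, ("A", "C"): 400.0, ("B", "C"): 200.0}

normal, unreachable_normal = link_utilization(links, capacity, matrix)
failed, _ = link_utilization(fail_link(links, "A", "B"), capacity, matrix)
failed_grown, _ = link_utilization(fail_link(links, "A", "B"), capacity,
                                   matrix, growth=1.5)

for label, result in (("normal", normal), ("A-B down", failed),
                      ("A-B down, +50% traffic", failed_grown)):
    hot = {link: round(u, 2) for link, u in result.items() if u > 0}
    print(label, hot)
```

In this sketch the A-B failure moves both demands from A onto the A-C link. With current traffic that link stays below capacity, but with the matrix scaled by 1.5 its utilization exceeds 100 percent, which is the kind of future bottleneck the simulation is meant to reveal. A fuller model would track the gold, silver, bronze, and best-effort classes separately to check which SLAs are affected.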

After the different scenarios have identified the potential (future) weakest points in the network, the administrator should select one or several of the following actions, depending on the defined SLAs and monetary resources:

Increase the link bandwidths
Tune the IGP metrics
Invest in high-availability solutions
Introduce traffic engineering (see the section "Traffic Profiling and Engineering" in Chapter 1)
Adjust the BGP exit points (see the section "Peering and Transit Agreements" in Chapter 1)
Develop load-balancing strategies