In the general 802.1x/EAP (Extensible Authentication Protocol) and virtual private network (VPN) designs, large deployments can tax the scalability and availability of network services. Domain Name System (DNS) and Dynamic Host Configuration Protocol (DHCP) are typically already deployed in a highly available and scalable manner within enterprises because of the wired LAN's reliance on these services. The following sections discuss how to address and scale the Remote Authentication Dial-In User Service (RADIUS) and VPN gateway services.
RADIUS authentication scalability and high availability are of primary concern to any network designer who must support a WLAN of any size. Several factors impact the scalability and availability of a RADIUS system, including the following:
Authentication types being used
RADIUS client configuration for primary and secondary RADIUS servers
RADIUS logging level
Type of database used for authentication
Backup, replication, and synchronization strategy to be employed
Different EAP authentication types put varying amounts of strain on the RADIUS server. For instance, EAP Transport Layer Security (EAP-TLS) puts more of a strain on a system than the Lightweight Extensible Authentication Protocol (LEAP) because of the way each EAP type performs mutual authentication; EAP-TLS requires computationally expensive certificate operations on every authentication. Because of this strain on the server's resources, it might become necessary to load-balance RADIUS requests from APs to multiple RADIUS servers. Figure 11-2 depicts the topology of a group of RADIUS servers being load-balanced for scalability.
Several load balancers (Cisco Content Services Switch, Content Services Module, and so on) support the capability to query the RADIUS server with a fixed username, password, and shared secret to validate that the RADIUS process is still functional. This query is called a "keepalive," and it determines whether the RADIUS server should stay in rotation for load balancing. If you are looking to use the RADIUS cluster to collect RADIUS accounting information, remember that you must have a persistent, or "sticky," connection to the same authentication server so that the RADIUS accounting start/stop records from the AP go to the same RADIUS server. This persistence typically is achieved via a source IP persistence configuration. The network administrator must make sure that the load balancer installed in the network can support this persistence with RADIUS accounting.
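The keepalive-and-rotation behavior can be sketched as follows. The class and function names are illustrative, not any vendor's API, and the actual probe would be a real RADIUS Access-Request carrying the fixed test credentials:

```python
class RadiusPoolMember:
    def __init__(self, name):
        self.name = name
        self.alive = True         # result of the last keepalive probe

def apply_keepalive_result(member, probe_succeeded):
    # A real balancer sends an Access-Request with a fixed username,
    # password, and shared secret; here the probe result is passed in
    # so the rotation logic stands on its own.
    member.alive = probe_succeeded

def rotation(pool):
    # Only members whose last keepalive succeeded stay in rotation.
    return [m.name for m in pool if m.alive]

pool = [RadiusPoolMember("acs-1"), RadiusPoolMember("acs-2")]
apply_keepalive_result(pool[1], False)   # acs-2's RADIUS process died
# rotation(pool) -> ['acs-1']
```

A failed probe removes only the affected real server; the virtual IP address that the APs target stays unchanged.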
Sticky or stickiness is a term used in server load balancing to denote when a client must remain attached to a real server because there is some sort of state information on the real server with regard to the client. An example is the e-commerce "shopping cart" with which most web purchasers are familiar. The application logic of a shopping cart can sometimes rely on the client always communicating with a particular web server to keep the shopping cart contents accurate. Similarly, SSL-based VPN gateways require that a user remain connected to the same SSL VPN gateway until he logs out of the VPN session to provide up-to-date authentication, authorization, and connectivity information.
In addition to having scalability and high availability in the data center, the AP has the capability to designate multiple RADIUS servers in its configuration for local high availability and simple RADIUS scalability. These RADIUS servers are listed in order of preference; the AP tries the first server on the list and proceeds to the second server if the first is unresponsive and times out. A timeout is triggered by the AP settings that include the number of retransmission attempts (three, by default) and a transmission delay between authentication attempts (five seconds, by default). With this capability, the network designer can achieve static load balancing and local high availability through configuration alone. Figure 11-3 depicts how a network administrator can use the local RADIUS server definition to achieve this load balancing and high availability.
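The AP's try-then-fail-over behavior can be sketched as follows; `send_request` is a stand-in for one RADIUS exchange, and the retry default mirrors the AP value above:

```python
import socket

def authenticate_with_failover(servers, send_request, retries=3):
    """Walk the AP's ordered RADIUS server list.

    `send_request(server)` stands in for one RADIUS Access-Request; it
    raises socket.timeout when the server does not answer within the
    configured transmission delay (5 seconds by default).
    """
    for server in servers:
        for _ in range(retries):          # 3 retransmissions by default
            try:
                return send_request(server)
            except socket.timeout:
                continue                  # retransmit to the same server
        # server exhausted its retries; fall through to the next one
    raise RuntimeError("no configured RADIUS server responded")
```

For example, if the primary never answers, the function returns the secondary's response after the primary's three timeouts, which is exactly the static failover behavior the AP configuration provides.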
In the central switch design, the WLSM acts as the RADIUS client to the RADIUS servers. In this case, you cannot load balance via a source IP address, because only the IP address of the WLSM will be in the RADIUS authentication request. In this design, the network designer should select another unique RADIUS attribute on which to make a load-balancing decision. For instance, some load balancers can use the calling-station-id in the RADIUS request to provide stickiness to a proper RADIUS server.
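One way a balancer can derive stickiness from the calling-station-id attribute is a simple hash over the client MAC address. The function below is an illustrative sketch of the technique, not any specific product's algorithm:

```python
import hashlib

def sticky_radius_server(calling_station_id, servers):
    # Hash the RADIUS Calling-Station-Id (typically the client MAC) so
    # that every request for the same client maps to the same real
    # server, even though all requests share the WLSM's source IP.
    digest = hashlib.md5(calling_station_id.encode()).digest()
    return servers[digest[0] % len(servers)]

servers = ["acs-1", "acs-2", "acs-3"]
mac = "00-40-96-A1-B2-C3"
# Same client always maps to the same server:
assert sticky_radius_server(mac, servers) == sticky_radius_server(mac, servers)
```

Because the mapping is deterministic, authentication and accounting records for one client land on one server without the balancer keeping any per-client state.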
The network designer can achieve simple load balancing by designating that different zones within the enterprise should use different primary RADIUS servers. Zone A uses the IP address of server A as its primary and server B as its secondary. APs in Zone B have server B as their primary server and server A as their secondary server. The IP addresses for the servers that the network designer designates in the AP configuration can be the virtual IP address of a RADIUS load-balancing device in a data center, which can handle dynamic load balancing within the data center.
Another factor in server scalability is the RADIUS logging level. The RADIUS logging level is a value that determines how much information the RADIUS server writes to its log for every RADIUS transaction or server event. The higher the logging level, the more data that is written to the RADIUS log. A high logging level can significantly increase the processing load on the server, which in turn can cause a significant degradation in the server's authentication performance.
In many instances, the RADIUS server is merely a consolidation and pass-through device for WLAN authentication. With this in mind, network designers must understand that the type and location of the database storing WLAN user authentication credentials impacts both the scalability and high availability of RADIUS from the WLAN client perspective. In most instances, the backend server and underlying database have appropriate scaling and availability built into them because they are also leveraged for other network services like network operating systems or LDAP. However, in some instances, WLAN authentication and RADIUS have seemed to perform poorly because the introduction of the WLAN has overburdened the backend database or because the backend database is not located properly in the network to give timely responses to the RADIUS infrastructure.
Finally, if the user database is stored locally to the RADIUS server, network designers must understand how database replication affects the performance and availability of the RADIUS deployments. The most important thing to note when utilizing a replication strategy is that some RADIUS servers suspend the authentication service during portions of the database replication from a master to a secondary server. In the Cisco Secure Access Control Server (ACS) for Windows database, replication is a top-down design with a cascading method of replication being implemented; this minimizes the time that the authentication server is suspended. Additionally, the network designer might choose to prevent the master server from serving authentication requests directly from clients. This reduces the impact of the authentication service being suspended on the master and allows the master to replicate directly to all of its secondary servers. Finally, if the master and secondary servers reside in differing time zones, the network designer should consider using cascading servers within a time zone. The master server can replicate to a primary server within a time zone during off hours. The primary server within the time zone can then cascade the replication within the time zone. This reduces the impact of the database replication.
If VPN overlays have been selected as either the primary or secondary means of securing a WLAN, a network designer must deal with several issues with regard to large WLAN deployments.
The first issue is scalability of the VPN gateway service. Early VPN gateway devices were limited in both the encryption throughput (tens of Mbps to 100 Mbps) and the number of simultaneous sessions (several hundred) that they could support. This limitation led to the development and deployment of several load-balancing technologies. Later releases of VPN gateway products dramatically increased the encryption throughput (multiple Gbps) and simultaneous sessions (several thousand). However, these products still command a price premium and are sometimes not feasible for use as VPN gateway devices to secure a WLAN; therefore, VPN load balancing is still actively deployed for scalability purposes and high availability.
The network administrator's second issue is high availability. Because a VPN is selected to secure the WLAN, it must be available to WLAN users at all times. With this in mind, there are two methods for providing high availability: local high availability and site high availability. Local high availability refers to a way to provide VPN service to a local environment. This environment can be within a data center or even within a large campus. Typically, local high availability is delivered with some sort of local load-balancing device or some extensions to the VPN protocol. Site high availability refers to providing a VPN service for a general VPN service name. So, if a WLAN user is traveling between locations, the network designer wants to provide a similar method of offering VPN service regardless of the location from which the user might be connecting to a WLAN. Because most VPN gateways use a DNS name to establish connectivity, DNS availability becomes a concern for providing site availability. In addition, DNS can be utilized to deliver site-based load balancing for VPN services. In a large WLAN environment, the DNS-based load balancer can make DNS resolution decisions based on the source IP address of the requesting DNS server. For instance, a DNS-based load balancer might resolve a DNS request to VPN cluster A if the request comes from DNS server 1, while resolving a DNS request to VPN cluster B if the request comes from DNS server 2. Figure 11-4 depicts this DNS-based site load balancing.
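The site-selection decision reduces to a lookup keyed on the requesting DNS server's address. The table below is a hypothetical illustration; the addresses and hostnames are invented for the sketch:

```python
# Hypothetical site-selection table for a DNS-based load balancer: the
# answer returned for the VPN service name depends on which internal
# DNS server forwarded the request.
SITE_MAP = {
    "10.1.0.53": "vpn-cluster-a.example.com",   # DNS server 1 -> cluster A
    "10.2.0.53": "vpn-cluster-b.example.com",   # DNS server 2 -> cluster B
}

def resolve_vpn(requesting_dns_server,
                default="vpn-cluster-a.example.com"):
    # Unknown resolvers fall back to a default cluster so the service
    # name always resolves somewhere.
    return SITE_MAP.get(requesting_dns_server, default)
```

A roaming user keeps the same service name everywhere; only the cluster behind the name changes with the user's location.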
The following discussions deal with the idea of local high availability. Because the primary VPN deployments are IPSec and SSL-based VPN, these are the only VPN technologies discussed in this section.
You can also use IPSec load balancing built into the IPSec solution. Cisco offers IPSec client load balancing in its VPN 3000. Figure 11-5 depicts the topology of a VPN cluster using the Cisco IPSec clustering and load-balancing technology on this platform.
A VPN cluster includes two or more IPSec gateways. A Cisco VPN cluster has one master and multiple secondary devices. The master is designated via an election process among the clustering devices. IPSec clients connect to a cluster virtual IP address. The virtual IP address is an address that is shared among the cluster. The master in the cluster handles any initial IPSec client request. Upon receiving a client request to establish an IPSec tunnel, the master looks at the latest load and availability information that it receives from each secondary device in its keepalive message exchange and its own load-balancing information. Based on the information it finds, the master uses an algorithm to select an IPSec gateway with which the client should continue its IKE negotiation. The master then sends a response to the client that redirects the client to continue its IPSec negotiation with the selected VPN gateway's real IP address. The cluster handles high availability by continually having the master check on the secondaries' status. The secondaries are responsible for sending their status and load information to the master. The secondaries, in turn, expect to hear from the master device on a periodic basis. If the master device should fail, the secondary with the highest priority becomes the new master so that subsequent IPSec client requests can be serviced. If the secondary devices have identical priority settings, the device with the lowest IP address becomes the master device. All communication among the cluster can be protected with IPSec using a shared secret for authentication.
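The selection and election rules can be approximated in a few lines. This is a simplified sketch that follows the rules in the text (least-loaded gateway wins; highest priority, then lowest IP, wins the master election), not Cisco's actual implementation:

```python
def load_percent(active_sessions, max_connections):
    # Load = current active sessions as a percentage of the configured
    # maximum-allowed connections.
    return 100.0 * active_sessions / max_connections

def pick_gateway(loads):
    # loads: gateway real IP -> load percentage reported via keepalives.
    # The master redirects the client to the least-loaded gateway.
    return min(loads, key=loads.get)

def elect_master(devices):
    # devices: (priority, ip) pairs. Highest priority wins; among equal
    # priorities, the lowest IP address becomes the master.
    octets = lambda ip: [int(o) for o in ip.split(".")]
    return max(devices, key=lambda d: (d[0], [-o for o in octets(d[1])]))
```

For example, with loads of 40 percent and 15 percent, `pick_gateway` redirects the next client to the 15 percent gateway; with equal priorities, `elect_master` promotes the gateway with the numerically lowest address.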
The virtual IP address of a VPN 3000 cluster does not respond to a ping echo request. This can cause considerable frustration and confusion during troubleshooting if this behavior is not known.
The load is calculated as a percentage of current active sessions divided by the configured maximum-allowed connections.
IPSec can be load-balanced in two ways: with an external load-balancing device or with a load-balancing algorithm built into the IPSec device itself. The former typically is more useful with a mix of IPSec clients because the latter solution requires proprietary extensions to the IPSec protocol. The latter is more cost effective because you are not required to purchase or manage extra devices. In either instance, it is assumed that all the VPN gateways are configured identically to authenticate and authorize all remote clients that might be directed to connect to the VPN gateway.
When using an external load-balancing device, the primary benefits are interoperability (as previously noted) and high availability. Figure 11-6 depicts the basic topology of a load-balancing device balancing connections to a group of IPSec gateways.
When an IPSec peer needs to establish a VPN tunnel through an IPSec load-balancing device, an Encapsulating Security Payload (ESP) flow must "follow" its corresponding Internet Key Exchange (IKE) flow; otherwise, the IPSec tunnel will not work. Because of this requirement, the IPSec load-balancing device must match ESP flows to IKE flows. This means that if an IKE flow from client A goes to IPSec gateway B, the corresponding ESP packets from client A must go to IPSec gateway B. ESP flows can alternatively be encapsulated over User Datagram Protocol (UDP) to traverse Network Address Translation (NAT) and Port Address Translation (PAT) devices. The IPSec load balancer must support this UDP wrapping because this NAT transparency feature is enabled by default in many IPSec client configurations. With regard to scalability, because the IPSec session is encrypted, the load-balancing device cannot look at the content of the session and derive load information from the content stream. So, the load-balancing device can only use the least-connections algorithm to determine which VPN gateway is least loaded. With regard to high availability, most load balancers do not have the capability to conduct a full IPSec negotiation to determine whether the network behind the VPN gateway is available; therefore, most load-balancing devices determine the availability of the individual VPN gateways via a basic health check, such as a ping to the device or an initial IKE connection to UDP port 500.
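The flow-pinning plus least-connections behavior can be modeled with a small sticky table. This is a generic sketch of the technique, not any particular load balancer's implementation:

```python
def least_connections(counts):
    # counts: gateway -> current connection count (the only load signal
    # available, because the IPSec payload is encrypted).
    return min(counts, key=counts.get)

class IPsecStickyTable:
    """Pins each client to the gateway that handled its IKE exchange so
    that ESP (or UDP-encapsulated ESP) packets follow the IKE flow."""

    def __init__(self, counts):
        self.counts = counts      # gateway -> connection count
        self.pins = {}            # client source IP -> gateway

    def route(self, client_ip):
        if client_ip not in self.pins:          # new IKE flow
            gw = least_connections(self.counts)
            self.counts[gw] += 1
            self.pins[client_ip] = gw
        return self.pins[client_ip]             # ESP follows IKE
```

Every later packet from the same source IP, whatever the protocol, returns the pinned gateway, which is exactly the ESP-follows-IKE requirement described above.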
SSL VPNs have scaling and high availability challenges that are similar to those of IPSec VPNs. Because SSL VPNs emphasize the idea of a "clientless" VPN, the client does not contain intelligent code to handle load-balancing items such as the IPSec redirection of the VPN 3000. For this reason, most SSL VPN load balancing is done with the use of external load-balancing devices. Load-balancing devices have been used in heavy SSL environments such as e-commerce for many years, so there is a well-developed solution set around providing scalable and available SSL solutions. Figure 11-7 depicts the basic topology of a load-balancing device balancing connections to a group of SSL gateways.
SSL typically is load balanced with a least-connections load-balancing algorithm with a stickiness based on source IP address. High availability depends on the SSL health checks that are available on the load-balancing device. These health checks can vary from a simple health check like a ping to a more advanced health check like opening a connection to the SSL VPN device on the SSL VPN port (typically 443), starting the SSL handshake, and then disconnecting.
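A minimal version of that advanced health check can be sketched with Python's standard library: open a TCP connection to the SSL VPN port, complete the TLS handshake, and disconnect. Certificate verification is disabled here because a probe often targets the gateway by IP address; a production check might verify it:

```python
import socket
import ssl

def ssl_health_check(host, port=443, timeout=3.0):
    """Return True only if a full SSL/TLS handshake with host:port
    succeeds within the timeout; any failure marks the gateway down."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False            # probe may use a bare IP
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                pass                      # handshake completed; disconnect
        return True
    except (OSError, ssl.SSLError):
        return False
```

A simple ping-style check would miss a gateway whose SSL process has died while its IP stack still answers; this probe catches that case because it exercises the handshake itself.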