High availability is a system-design approach that guarantees a certain amount of operational uptime during a given period. The design attempts to minimize unplanned downtime, which is the time users are unable to access resources. In almost all cases, high availability is provided through the implementation of duplicate equipment (multiple servers, multiple NICs, etc.).
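To make "a certain amount of operational uptime" concrete, here is a minimal Python sketch that turns an availability percentage into a yearly downtime budget. The function name and the sample targets are illustrative assumptions, not values from these notes.

```python
# Convert an availability target into a yearly downtime budget.
# The targets below are sample values chosen for illustration only.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Return the hours of downtime per year that a target still permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours of downtime per year")
```

A 99% target leaves roughly 87.6 hours of downtime a year, while 99.99% leaves less than one hour, which is why each extra "nine" usually calls for more duplicate equipment.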
Fault tolerance means that even if one component fails, you won't lose access to the resource it provides. To implement fault tolerance, you need to employ multiple devices or connections that all provide a way to access the same resource(s).
Load balancing refers to a technique used to spread work out to multiple computers, network links, or other devices. Using load balancing, you can provide a server cluster in which several servers are active and share the incoming requests, as opposed to an active/passive cluster in which only one server is active and handling requests. For example, your favorite Internet site might actually consist of 20 servers that all appear to be the same exact site because that site's owner wants to ensure that its users always experience quick access. You can accomplish the same thing on a network by installing multiple, redundant links to ensure that network traffic is spread across several paths and to maximize the bandwidth on each link.
Think of this as similar to having two or more different freeways that will both get you to your destination equally well—if one is really busy, just take the other one.
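The idea of several servers appearing as one site can be sketched with a simple round-robin balancer. This is only one possible distribution strategy, and the class name and the server names (web1, web2, web3) are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        if not self._servers:
            raise ValueError("at least one server is required")
        self._healthy = set(self._servers)
        self._ring = cycle(self._servers)

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        if not self._healthy:
            raise RuntimeError("no healthy servers left")
        # One full trip around the ring is enough to find a healthy server.
        for _ in range(len(self._servers)):
            candidate = next(self._ring)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers left")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
print([balancer.next_server() for _ in range(4)])
balancer.mark_down("web2")  # simulate a failed server
print([balancer.next_server() for _ in range(3)])
```

Once web2 is marked down, requests keep flowing to the remaining servers, which is the freeway analogy in code form: if one road is closed, take the other.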
Multipathing is the process of configuring multiple network connections between a system and its storage device. The idea behind multipathing is to provide a backup path in case the preferred connection goes down.
Both Host A and Host B have multiple host bus adapters (HBAs) and multiple connections through multiple switches and are mapped to multiple storage processors as well. This is a highly fault-tolerant arrangement that can survive an HBA failure, a path failure, a switch failure, and a storage processor failure.
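A small Python sketch can show why this arrangement tolerates a failure in any one layer. It assumes a simplified topology in which every HBA can reach every switch and every switch can reach every storage processor; the component names are made up for illustration.

```python
from itertools import product

# Hypothetical components for one host. Full-mesh connectivity is assumed.
HBAS = ("hba0", "hba1")
SWITCHES = ("switch1", "switch2")
STORAGE_PROCESSORS = ("spA", "spB")


def surviving_paths(failed=()):
    """Return every HBA/switch/storage-processor path that avoids failed parts."""
    failed = set(failed)
    return [
        path
        for path in product(HBAS, SWITCHES, STORAGE_PROCESSORS)
        if failed.isdisjoint(path)
    ]


print(f"healthy: {len(surviving_paths())} paths")
for component in HBAS + SWITCHES + STORAGE_PROCESSORS:
    remaining = len(surviving_paths([component]))
    print(f"{component} fails: {remaining} paths remain")
```

Under these assumptions the host only loses access to storage when both components in the same layer fail at once.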
Network Interface Card (NIC) Teaming
NIC teaming allows multiple network interfaces to be placed into a team for the purposes of bandwidth aggregation and/or traffic failover to prevent connectivity loss in the event of a network component failure.
The cards can be set to active/active state, where both cards are load balancing, or active/passive, where one card is on standby in case the primary card fails. Most of the time, the NIC team will use a multicast address to send and receive data, but it can also use a broadcast address so all cards receive the data at the same time.
The major benefits of NIC teaming are load balancing (distributing traffic across the member interfaces) and failover (maintaining network continuity when a NIC, cable, or other component fails). Because the team is presented as a single logical interface, NIC teaming can increase uptime using the interfaces in the team, without further changes to how the host is addressed.
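The two team states can be sketched as a selection function. This is a toy model, not how any particular teaming driver works: the hash-based spreading in active/active mode, the function name, and the interface names are all assumptions.

```python
import zlib


def pick_nic(team, mode, flow=""):
    """Choose the team member that should carry a flow.

    team is an ordered list of (nic_name, is_up) pairs; the first entry is
    treated as the primary card.
    """
    up = [name for name, is_up in team if is_up]
    if not up:
        raise RuntimeError("every NIC in the team is down")
    if mode == "active/passive":
        # Use the primary while it is up, otherwise fall back to the standby.
        return up[0]
    if mode == "active/active":
        # Spread flows across all working cards with a stable hash.
        return up[zlib.crc32(flow.encode("utf-8")) % len(up)]
    raise ValueError(f"unknown mode: {mode}")


healthy = [("eth0", True), ("eth1", True)]
degraded = [("eth0", False), ("eth1", True)]
print(pick_nic(healthy, "active/passive"))
print(pick_nic(degraded, "active/passive"))
print(pick_nic(healthy, "active/active", flow="10.0.0.5:443"))
```

In both modes a failed card simply drops out of the candidate list, so traffic continues as long as one member of the team is still up.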
Redundant Hardware/Clusters
By now it must be clear that redundancy is a good thing. While this concept can be applied to network connections, it can also be applied to hardware components and even complete servers.
Switches
As you saw in the last section, multiple switches can be deployed to provide for failover if a switch fails. When this is done, the redundant links sometimes create what is called a switching loop. Luckily, there is a protocol called Spanning Tree Protocol that can prevent these loops from forming. There are two forms of switch redundancy: switch stacking and switch clusters.
Switch Stacking
Switch stacking is the process of connecting multiple switches together (usually in a stack) and managing them as a single switch. Figure 15.4 shows a typical configuration.
The stack members work together as a unified system. Layer 2 and layer 3 protocols present the entire switch stack as a single entity to the network.
A switch stack always has one active switch and one standby switch. If the active switch becomes unavailable, the standby switch assumes the role of the active switch and continues to keep the stack operational.
Switch Clustering
A switch cluster is another option. This is a set of connected and cluster-capable switches that are managed as a single entity without interconnecting stack cables. This is possible by using Cluster Management Protocol (CMP).
The switches in the cluster use the switch clustering technology so that you can configure and troubleshoot a group of different switch platforms through a single IP address.
In those switches, one switch plays the role of cluster command switch, and the other switches are cluster member switches that are managed by the command switch.
Routers
Routers can also be set up in a redundant fashion. When we provide router redundancy, we call it providing first-hop redundancy since the router will be the first hop from any system to get to a destination. Accomplishing first-hop redundancy requires a first-hop redundancy protocol (FHRP).
First-hop redundancy protocols (FHRPs) work by giving you a way to configure multiple physical routers to appear as one logical router that uses a virtual IP address; the same concept is seen in switch redundancy. First hop is a reference to the default router being the first router a packet reaches. When multiple routers are grouped together under the same virtual IP and MAC address, if the active router goes down, then one of the standby (passive) routers becomes the new active router. And because these routers share the same virtual IP address, the switches or hosts do not have to change their default gateway configurations.
Firewalls
A Firewall Cluster is a group of firewall nodes that work as a single logical entity to share the load of traffic processing and provide redundancy. Clustering guarantees the availability of network services to the users.
Facilities and Infrastructure Support
When infrastructure support equipment is purchased and deployed, the ultimate success of the deployment can depend on selecting the proper equipment, determining its proper location in the facility, and installing it correctly.
One risk that all organizations should prepare for is the loss of power.
Uninterruptible Power Supply (UPS)
All infrastructure systems should be connected to uninterruptible power supplies (UPSs). These devices can immediately supply power from a battery backup when a loss of power is detected.
You should keep in mind, however, that these devices are not designed as a long-term solution. They are designed to provide power long enough for you to either shut the system down gracefully or turn on a power generator.
Power Distribution Units (PDUs)
Power distribution units (PDUs) simply provide a means of distributing power from a single input to multiple outlets. They work much like an extension cord with several outlets.
Intelligent PDUs normally have an intelligence module that allows for remote management of power metering information, power outlet on/off control, and/or alarms. Some advanced PDUs allow users to manage external sensors such as temperature, humidity, and airflow.
Generator
As you learned earlier in this chapter, a UPS is not designed for long-term power supply. The battery will run out. This should be supplemented with a backup generator if more than an hour or so of backup is required. The amount of backup time supplied by a generator is limited only by the amount of fuel you keep on hand.
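The split between UPS and generator can be reasoned about with some rough arithmetic. The sketch below uses a deliberately simplified model (usable battery energy divided by load, fuel divided by burn rate); the 0.9 inverter efficiency default and every sample figure are assumptions, so treat the results as ballpark numbers and rely on vendor runtime charts for real sizing.

```python
def ups_runtime_minutes(battery_wh, load_watts, efficiency=0.9):
    """Estimate UPS runtime from battery energy and load (simplified model)."""
    if load_watts <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_watts * 60


def generator_runtime_hours(fuel_liters, liters_per_hour):
    """Generator runtime is bounded only by the fuel kept on hand."""
    if liters_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_liters / liters_per_hour


def needs_generator(required_minutes, battery_wh, load_watts):
    """True when the UPS alone cannot cover the required backup time."""
    return ups_runtime_minutes(battery_wh, load_watts) < required_minutes


# Hypothetical figures: a 1000 Wh battery carrying a 900 W load.
print(f"UPS runtime: {ups_runtime_minutes(1000, 900):.0f} minutes")
print(f"Generator needed for a 2 hour outage: {needs_generator(120, 1000, 900)}")
print(f"Generator runtime: {generator_runtime_hours(200, 10):.0f} hours")
```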
HVAC
The heating, ventilation, and air-conditioning (HVAC) systems must support the massive amounts of computing equipment deployed by most enterprises. Computing equipment and infrastructure devices like routers and switches do not like the following conditions:
Heat. Excessive heat causes reboots and crashes.
High humidity. It causes corrosion problems with connections.
Low humidity. Dry conditions encourage static electricity, which can damage equipment.
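The three conditions above map naturally onto a simple monitoring check. In this sketch the threshold defaults (27 degrees C, 20% and 60% relative humidity) are placeholder assumptions and should be replaced with the limits your equipment vendors publish.

```python
def environment_alerts(temp_c, humidity_pct,
                       max_temp_c=27.0, min_humidity=20.0, max_humidity=60.0):
    """Return a list of warnings for readings outside the configured limits.

    The default limits are illustrative placeholders, not vendor guidance.
    """
    alerts = []
    if temp_c > max_temp_c:
        alerts.append("heat: risk of reboots and crashes")
    if humidity_pct > max_humidity:
        alerts.append("high humidity: risk of corrosion on connections")
    if humidity_pct < min_humidity:
        alerts.append("low humidity: risk of static electricity damage")
    return alerts


print(environment_alerts(22, 45))   # a comfortable room
print(environment_alerts(35, 70))   # hot and damp
print(environment_alerts(21, 10))   # cool but very dry
```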
Fire Suppression
It is well worth the money to protect your data center with a fire-suppression system. The following types of systems exist:
Wet pipe systems use water contained in pipes to extinguish the fire.
Dry pipe systems hold the water in a holding tank instead of in the pipes.
Preaction systems operate like a dry pipe system except that the sprinkler head holds a thermal-fusible link that must melt before the water is released.
Deluge systems allow large amounts of water to be released into the room, which obviously makes this not a good choice where computing equipment will be located.
Redundancy and High Availability (HA) Concepts
In the following sections, you'll find a survey of topics that all relate in some way to addressing risks that can be mitigated with redundancy and high availability techniques.
Recovery Sites
Not all secondary sites are created equal. They can vary in functionality and cost. We're going to explore four types of sites: cold sites, warm sites, hot sites, and cloud sites.
Cold Site
A Cold Site is a leased facility that contains only electrical and communications wiring, air conditioning, plumbing, and raised flooring.
No communications equipment, no networking hardware, and no computers are installed at a cold site until it is necessary to bring the site to full operation. For this reason, a cold site takes much longer to restore than a hot or warm site.
A cold site provides the slowest recovery, but it is the least expensive to maintain. It is also the most difficult to test.
Warm Site
The restoration time and cost of a Warm Site are somewhere between those of a hot site and a cold site. It is the most widely implemented alternate leased location. Although it is easier to test a warm site than a cold site, a warm site requires much more effort for testing than a hot site.
A Warm Site is a leased facility that contains electrical and communications wiring, full utilities, and networking equipment. In most cases, the only thing that needs to be restored is the software and the data. A warm site takes longer to restore than a hot site but less than a cold site.
Hot Site
A Hot Site is a leased facility that contains all the resources needed for full operation. This environment includes computers, raised flooring, full utilities, electrical and communications wiring, networking equipment, and uninterruptible power supplies (UPSs). The only resource that must be restored at a hot site is the organization's data, usually only partially. It should only take a few minutes to bring a hot site to full operation.
Although a hot site provides the quickest recovery, it is the most expensive to maintain. In addition, it can be administratively hard to manage if the organization requires proprietary hardware or software. A hot site requires the same security controls as the primary facility and full redundancy, including hardware, software, and communication wiring.
Cloud Site
A Cloud Recovery Site is an extension of the cloud backup services that have developed over the years. These are sites that, while mimicking your on-premises network, are totally virtual, as shown in Figure 15.8.
Active-Active vs. Active-Passive
When systems are arranged for fault tolerance or high availability, they can be set up in either an active/active arrangement or an active/passive configuration.
Active-Active increases availability by providing more systems for work. Two or more devices are configured exactly the same and operate in unison. In these cases, both devices are running simultaneously and both are active on the network, so the work is spread across two identical systems.
Managing data flow can be challenging; you'll need to have a good grasp of the infrastructure.
Active-Passive provides fault tolerance by holding at least one system in reserve in case of a system failure. In this case, both systems are not simultaneously operating. If the active device malfunctions, the passive device will take its place.
Constant communication is required between the two to ensure both are ready to swap out if need be.
Multiple Internet Service Providers (ISPs)/Diverse Paths
Redundancy may also be beneficial when it comes to your Internet connection. There are two types of redundancy that can be implemented, and they can also be combined.
Path Redundancy is accomplished by configuring different paths to the ISP. This is shown in Figure 15.9. There is a single ISP with two paths extending to the ISP from two different routers.
ISP Redundancy protects against a failure of the ISP itself (it does happen). To achieve it, you could engage two different ISPs with a path to each from a single router.
Path + ISP Redundancy Combined is achieved whenever you combine the two by using a separate router connection to each ISP, thus protecting against an issue with a single router or path in your network, as shown in the picture below.
First-Hop Redundancy Protocols
There are three first-hop redundancy protocols in the FHRP family: HSRP, VRRP, and GLBP. HSRP and GLBP are Cisco proprietary protocols, while VRRP is a standards-based protocol. Let's look at Hot Standby Router Protocol (HSRP) and Virtual Router Redundancy Protocol (VRRP).
HSRP is a Cisco proprietary protocol that can be run on most, but not all, of Cisco's router and multilayer switch models. It defines a standby group, and each standby group that you define includes the following routers:
Active router
Standby router
Virtual router
Any other routers that may be attached to the subnet.
The problem with HSRP is that only one router is active and two or more routers just sit there in standby mode and won't be used unless a failure occurs—not very cost effective or efficient!
The standby group will always have at least two routers participating in it. The primary players in the group are the one active router and one standby router that communicate to each other using multicast Hello messages.
The Hello messages provide all of the required communication for the routers. The Hellos contain the information required to accomplish the election that determines the active and standby router positions. They also hold the key to the failover process. If the standby router stops receiving Hello packets from the active router, it then takes over the active router role, as shown in Figure 15.13.
The HSRP timers are very important to the HSRP function because they ensure communication between the routers, and if something goes wrong, they allow the standby router to take over. The HSRP timers include hello, hold, active, and standby.
Hello The hello timer is the defined interval at which each of the routers sends out Hello messages. The default interval is 3 seconds, and the messages identify the state that each router is in. This is important because the particular state determines the specific role of each router and, as a result, the actions each will take within the group.
Hold The hold timer specifies the interval the standby router uses to determine whether the active router is offline or out of communication. By default, the hold timer is 10 seconds, roughly three times the default for the hello timer. If one timer is changed for some reason, I recommend using this multiplier to adjust the other timers too. By setting the hold timer at three times the hello timer, you ensure that the standby router doesn't take over the active role every time there's a short break in communication.
Active The active timer monitors the state of the active router. The timer resets each time a router in the standby group receives a Hello packet from the active router. This timer expires based on the hold time value that's set in the corresponding field of the HSRP Hello message.
Standby The standby timer is used to monitor the state of the standby router. The timer resets anytime a router in the standby group receives a Hello packet from the standby router and expires based on the hold time value that's set in the respective Hello packet.
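The interaction between the hello and hold timers can be sketched as a toy simulation. This is not an HSRP implementation: it models only a standby router tracking Hellos from the active router, using the 3-second and 10-second defaults from the text, and the class and method names are invented for illustration.

```python
HELLO_INTERVAL = 3  # seconds, default hello timer from the text
HOLD_TIME = 10      # seconds, default hold timer from the text


class StandbyRouter:
    """Toy model of a standby router watching for Hellos from the active router."""

    def __init__(self, hold_time=HOLD_TIME):
        self.hold_time = hold_time
        self.last_hello = 0.0
        self.role = "standby"

    def receive_hello(self, now):
        # Every Hello from the active router resets the active timer.
        self.last_hello = now

    def tick(self, now):
        # If no Hello arrived within the hold time, take over the active role.
        if self.role == "standby" and now - self.last_hello >= self.hold_time:
            self.role = "active"
        return self.role


router = StandbyRouter()
for t in (0, 3, 6):          # the active router sends Hellos, then fails
    router.receive_hello(t)
for t in (9, 12, 15, 16):    # the standby keeps checking its timer
    print(f"t={t}s role={router.tick(t)}")
```

Because the hold time is about three hello intervals, a single missed Hello does not trigger a takeover; the standby promotes itself only once 10 seconds have passed since the last Hello it heard.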
VIRTUAL ROUTER REDUNDANCY PROTOCOL
Like HSRP, Virtual Router Redundancy Protocol (VRRP) allows a group of routers to form a single virtual router. In an HSRP or VRRP group, one router is elected to handle all requests sent to the virtual IP address. With HSRP, this is the active router. An HSRP group has only one active router, at least one standby router, and many listening routers. A VRRP group has one master router and one or more backup routers, and VRRP is the open standard counterpart to HSRP.
COMPARING VRRP AND HSRP
The LAN workstations are configured with the address of the virtual router as their default gateway, just as they are with HSRP, but VRRP differs from HSRP in these important ways:
VRRP is an IETF standard (originally defined in RFC 2338) for router redundancy; HSRP is a Cisco proprietary protocol.
The virtual router that represents a group of routers is known as a VRRP group.
The active router is referred to as the master virtual router.
The master virtual router may have the same IP address as the virtual router group.
Multiple routers can function as backup routers.
VRRP is supported on Ethernet, Fast Ethernet, and Gigabit Ethernet interfaces as well as on Multiprotocol Label Switching (MPLS), virtual private networks (VPNs), and VLANs.
VRRP REDUNDANCY CHARACTERISTICS
VRRP has some unique features:
VRRP provides redundancy for the real IP address of a router or for a virtual IP address shared among the VRRP group members.
If a real IP address is used, the router with that address becomes the master.
If a virtual IP address is used, the master is the router with the highest priority.
A VRRP group has one master router and one or more backup routers.
The master router uses VRRP messages to inform group members.
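The master-selection rules listed above can be sketched in a few lines of Python. The router names, addresses, and priorities are hypothetical, and breaking priority ties by the higher interface address is an extra assumption that the list above does not state.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VrrpRouter:
    name: str
    interface_ip: IPv4Address
    priority: int = 100


def elect_master(routers, virtual_ip):
    """Pick the VRRP master for a group (simplified model)."""
    if not routers:
        raise ValueError("a VRRP group needs at least one router")
    # If the group address is a router's real IP address, that router is master.
    for router in routers:
        if router.interface_ip == virtual_ip:
            return router
    # Otherwise the highest priority wins. Ties fall back to the higher
    # interface address, which is an assumption made for this sketch.
    return max(routers, key=lambda r: (r.priority, r.interface_ip))


group = [
    VrrpRouter("R1", IPv4Address("10.0.0.2"), priority=110),
    VrrpRouter("R2", IPv4Address("10.0.0.3"), priority=100),
]
print(elect_master(group, IPv4Address("10.0.0.1")).name)  # virtual IP in use
print(elect_master(group, IPv4Address("10.0.0.3")).name)  # R2's real IP in use
```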
Mean Time to Repair MTTR
This value describes the average length of time it takes a vendor to repair a device or component. How long does it take to fix your system?
Mean Time Between Failure MTBF
Another valuable metric typically provided is the mean time between failures (MTBF), which describes the amount of time that elapses between one failure and the next. How long should the system continue to operate normally until another failure occurs? It is used to forecast when the next system failure is likely to occur.
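MTBF and MTTR are often combined into a single availability figure using the common steady-state relationship availability = MTBF / (MTBF + MTTR). That formula is not stated in these notes, so treat the sketch below as supplementary; the sample hours are made up.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time a system is expected to be up, given MTBF and MTTR."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR cannot be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical device: fails about every 999 hours and takes 1 hour to repair.
availability = steady_state_availability(999, 1)
print(f"Expected availability: {availability:.1%}")
```

The formula shows the two levers you have: buy components that fail less often (raise MTBF) or get faster at fixing them (lower MTTR).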
Recovery Time Objective RTO
This is the maximum time period after a disaster or disruptive event within which a resource or function must be restored in order to avoid unacceptable consequences. RTO assumes that an acceptable period of downtime exists. In short, RTO is the amount of time you have to get your system back up and running.
Recovery Point Objective RPO
An RPO is the maximum span of time, measured back from the failure, disaster, or comparable loss-causing event, for which losing data is acceptable. RPOs measure back in time to when your data was preserved in a usable format, usually to the most recent backup. How much data must be available before we can say we're back up and running?
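Both objectives are time windows measured from the failure, one looking forward (RTO) and one looking back (RPO). The following sketch checks a single incident against both; the timestamps and objectives are invented for illustration.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup, failure_time, rpo):
    """True if the data lost since the last usable backup fits within the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time, restored_time, rto):
    """True if service was restored within the RTO."""
    return restored_time - failure_time <= rto


# Hypothetical incident: backup at 02:00, failure at 14:00, restored at 17:30.
last_backup = datetime(2024, 1, 10, 2, 0)
failure = datetime(2024, 1, 10, 14, 0)
restored = datetime(2024, 1, 10, 17, 30)

print(meets_rpo(last_backup, failure, rpo=timedelta(hours=24)))  # 12 hours of data lost
print(meets_rto(failure, restored, rto=timedelta(hours=2)))      # 3.5 hours of downtime
```

In this example the nightly backup satisfies a 24-hour RPO, but the 3.5-hour outage misses a 2-hour RTO, so the restore process rather than the backup schedule is what needs work.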
Network Device Backup/Restore
When devices are backed up it is important to know that backing up the data and the underlying system are two separate actions. We create device configurations over time that can be quite complicated, and in some cases where multiple technicians have played a role, no single person has a complete understanding of the configuration. For this reason, configurations should be backed up.
State A state backup captures what is called the system state. It backs up only the configuration of the server and not the data. To protect both, a system state backup and a data backup should be performed. It is also possible to back up the entire computer, which would include both datasets.
Configuration Every device has a configuration! Most devices will save the configuration and allow it to be uploaded to a blank machine (using a backup to restore). Configurations may be specific not only to the device but also to the firmware of that device, so it's important to confirm both before attempting to restore.
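One way to guard against restoring a configuration onto the wrong model or firmware is to store that metadata alongside the backup. The sketch below is a generic illustration using only the Python standard library; the JSON layout, function names, and the device and firmware strings are all hypothetical and do not correspond to any vendor's backup format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def save_backup(path, model, firmware, config_text):
    """Write the configuration together with the metadata needed to restore it."""
    record = {
        "model": model,
        "firmware": firmware,
        "sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "config": config_text,
    }
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")


def load_backup(path, model, firmware):
    """Return the saved configuration only if it is intact and matches the target."""
    record = json.loads(Path(path).read_text(encoding="utf-8"))
    digest = hashlib.sha256(record["config"].encode("utf-8")).hexdigest()
    if digest != record["sha256"]:
        raise ValueError("backup file is corrupted")
    if (record["model"], record["firmware"]) != (model, firmware):
        raise ValueError("backup was taken on a different model or firmware")
    return record["config"]


with tempfile.TemporaryDirectory() as tmp:
    backup_file = Path(tmp) / "edge-switch.json"
    save_backup(backup_file, "ExampleSwitch-24", "1.2.3", "hostname edge-switch\n")
    print(load_backup(backup_file, "ExampleSwitch-24", "1.2.3"))
```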