Building a resilient, secure infrastructure requires an understanding of the risks that your organization may face. Natural and human-created disasters, physical attacks, and even accidents can all have a serious impact on your organization's ability to function.
Resilience is part of the foundation of the availability leg of the CIA triad, and this chapter explores resilience as a key part of availability.
Cost, maintenance requirements, suitability to the risks that your organization faces, and other factors must all be taken into account when building cybersecurity resilience.
Redundancy
Redundancy is one method of building resilience; it simply means having more than one of a system, service, device, or other component.
Single Points of Failure are places where the failure of a single device, connection, or other element could disrupt or stop the system from functioning. It's essential to identify any single points of failure and assess any weak points when creating your design for a resilient, secure infrastructure, so that they can be compensated for in the case of a failure. Common design elements for redundancy include the following:
Geographic Dispersal
Geographic Dispersal of systems ensures that a single disaster, attack, or failure cannot disable or destroy them; in other words, datacenters are placed far enough apart that they are not affected by the same disaster. A common rule of thumb is to place datacenters at least 90 miles apart. This also helps ensure that facilities will not be impacted by the same power grid, network connectivity, or similar problems.
RAID (Redundant Array of Independent Disks)
A Redundant Array of Independent (originally Inexpensive) Disks (RAID) is a common solution that uses multiple disks with data either striped (spread across disks) or mirrored (completely copied), along with technology such as parity to ensure that data is not corrupted or lost.
RAID ensures that one or more disk failures can be handled by an array without losing data.
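As a rough illustration of how parity lets an array survive a disk failure, here is a minimal Python sketch, not a real storage implementation, that stripes data across three "disks" and rebuilds a lost block from the survivors plus parity:

```python
# Minimal sketch of RAID-style striping with XOR parity (illustrative only,
# not a real storage implementation). Three data "disks" plus one parity disk.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data_disks = [b"AAAA", b"BBBB", b"CCCC"]          # one stripe of data blocks
parity = xor_blocks(data_disks)                   # parity block for the stripe

# Simulate losing disk 1: rebuild its block from the survivors plus parity.
surviving = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(surviving)

assert rebuilt == data_disks[1]                   # the lost block is recovered
print("Rebuilt block:", rebuilt)
```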
Separation of Servers
Separation of servers and other devices in datacenters is also commonly used to avoid a single rack being a point of failure. Systems may be placed in two or more racks in case a single power distribution unit (PDU) fails, or even in case of something as simple as a leak that drips down into a rack.
Multipathing
Multipathing uses multiple network paths to ensure adequate network connectivity in the case of a severed cable or a failed device.
Using Redundant Network Devices
Redundant network devices, including multiple routers, security devices like firewalls and intrusion prevention systems, or other security appliances, are also commonly implemented to prevent a single point of failure.
Load balancers
Load Balancers provide both redundancy and an increased ability to handle load by distributing it across more than one system. Load balancers are often used during system upgrades to redirect traffic away from the systems being upgraded, then direct it back once all upgrades are complete.
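To make this concrete, here is a minimal Python sketch (illustrative only, with hypothetical server names) of round-robin distribution plus a simple drain-and-restore step like the upgrade scenario above:

```python
# Minimal sketch of round-robin load balancing with a drain step for upgrades
# (illustrative only; real load balancers are dedicated appliances or services).
from itertools import cycle

class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self._pool = cycle(self.servers)

    def next_server(self):
        return next(self._pool)

    def drain(self, server):
        """Remove a server from rotation, e.g. while it is being upgraded."""
        self.servers.remove(server)
        self._pool = cycle(self.servers)

    def restore(self, server):
        """Put an upgraded server back into rotation."""
        self.servers.append(server)
        self._pool = cycle(self.servers)

lb = LoadBalancer(["app1", "app2", "app3"])   # hypothetical server names
lb.drain("app2")                              # upgrade app2 without downtime
print([lb.next_server() for _ in range(4)])   # traffic only hits app1/app3
lb.restore("app2")
```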
NIC Teaming
NIC Teaming combines multiple network cards into a single virtual network connection. It's a common technique of grouping physical network adapters to improve performance and redundancy. The major benefits of NIC teaming are load balancing (distributing traffic across the teamed adapters) and failover (ensuring network continuity in the event of hardware failure), all while presenting the team as a single logical connection. If one of the underlying physical NICs fails, or if its cable is unplugged, the host or server detects the fault and automatically moves traffic to another NIC, reducing the chance of an outage.
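A minimal sketch of those two behaviors, assuming hypothetical interface names rather than any real teaming driver:

```python
# Minimal sketch of NIC-teaming behavior: hash each flow across healthy NICs,
# and fail over automatically when a NIC goes down (illustrative only).
import zlib

team = {"eth0": True, "eth1": True}            # hypothetical NICs: name -> healthy?

def pick_nic(flow_id: str) -> str:
    healthy = [nic for nic, up in team.items() if up]
    if not healthy:
        raise RuntimeError("all NICs in the team are down")
    # Hash the flow so packets of one flow stay on one NIC (load balancing).
    return healthy[zlib.crc32(flow_id.encode()) % len(healthy)]

print(pick_nic("10.0.0.5:443"))                # flows spread across eth0/eth1

team["eth0"] = False                           # simulate a failed NIC or pulled cable
print(pick_nic("10.0.0.5:443"))                # failover: traffic moves to eth1
```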
Power
Uninterruptible Power Supply (UPS)
An Uninterruptible Power Supply (UPS) provides battery (or other) backup power for short periods of time, covering short-term events such as blackouts, brownouts, and surges.
An Offline or Standby UPS is the simplest and least expensive version; it is not normally engaged unless power is lost. If the UPS recognizes that the power source is gone, it switches over to battery power, so there is a short gap between the moment power is lost and the moment the UPS begins supplying it.
A Line-Interactive UPS is useful when a power source degrades slowly over time. During brownouts or periods when voltage is below optimal levels, the line-interactive UPS can make up the difference.
An Online or Double-Conversion UPS is the most complex and most expensive option because it is always on, continuously providing power to your devices. If utility power goes out there is no switchover process at all, because the equipment is already running on battery power.
Generator
A Generator is a long-term power backup that can keep the power running for days or even weeks at a time. Generators depend on fuel to run and can often power an entire building, or at least a number of outlets inside that building.
Dual-Supply
A dual power supply ensures that the failure of a single power supply won't disable a server. Having a server with multiple power supplies means that if one fails, the other continues to provide power to the device, and a failed supply can be swapped out without shutting the server down.
Managed Power Distribution Units (PDUs)
PDUs are used to provide intelligent power management and remote control of power delivered inside server racks and other environments. A simple managed PDU might offer eight power outlets, connect to an Ethernet network, and allow each outlet to be controlled individually across the network. Managed PDUs also typically include monitoring capabilities, so they can report back on any power problems.
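Many managed PDUs expose a network management interface (HTTP, SNMP, or vendor APIs). The sketch below assumes a purely hypothetical REST endpoint and outlet layout, so the host, path, and payload are illustrative rather than any specific vendor's API:

```python
# Sketch of remotely switching one outlet on a managed PDU over the network.
# The PDU address, endpoint path, and payload are hypothetical examples.
import json
import urllib.request

PDU_HOST = "https://pdu-rack12.example.internal"   # hypothetical management address

def set_outlet(outlet: int, state: str) -> None:
    """Ask the PDU to switch a single outlet 'on' or 'off'."""
    body = json.dumps({"outlet": outlet, "state": state}).encode()
    req = urllib.request.Request(
        f"{PDU_HOST}/api/outlets/{outlet}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:       # requires a reachable PDU
        print(resp.status, resp.read().decode())

# Example (needs a real, reachable PDU): power-cycle outlet 3 to reboot a hung server.
# set_outlet(3, "off")
# set_outlet(3, "on")
```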
Replication
Storage Area Network (SAN)
A Storage Area Network (SAN) is a dedicated network of storage devices used to provide a pool of shared storage that multiple computers and servers can access. A network created just for accessing storage. Typically uses RAID to ensure data is not lost.
Storing data in a centralized shared storage architecture like a SAN allows organizations to manage storage from one place and apply consistent policies for security, data protection, and disaster recovery. A SAN is a way to provide users high-performance, low-latency shared access to storage, and it normally uses fiber connections to achieve the fastest speeds possible. Data can also be replicated between storage area networks. A SAN removes the storage responsibility from individual servers and collects it in a central place where it can be accessed, managed, and protected. SANs act and feel as though they're local storage and are very efficient at reading and writing.
Network-Attached Storage
Network-Attached Storage (NAS) is a storage device accessed by connecting to a network. It offers file-level access.
A network-attached storage (NAS) device is a data storage device that connects to and is accessed through a network, instead of connecting directly to a computer. NAS is a great way to store large amounts of data while also making it accessible from anywhere. A NAS creates its own small network share that any device with the right credentials (username and password) can access.
SAN vs NAS
There are multiple differences between a SAN & NAS.
A SAN typically uses Fibre Channel connectivity, while a NAS typically ties into the network through a standard Ethernet connection.
A SAN stores data at the block level, while a NAS accesses data as files.
To a client OS, a SAN typically appears as a local disk and exists as its own separate network of storage devices, while a NAS appears as a file server accessible over the network.
A SAN is associated with structured workloads such as databases, while a NAS is generally associated with unstructured data such as video and medical images.
A NAS is a single storage device that serves files over Ethernet and is relatively inexpensive and easy to set up, while a SAN is a tightly coupled network of multiple devices that is more expensive and complex to set up and manage.
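The block-versus-file distinction is visible even from application code. The sketch below contrasts the two access styles; the device path and mount point are hypothetical examples:

```python
# Sketch contrasting block-level (SAN-style) and file-level (NAS-style) access.
# The device path and mount point below are hypothetical examples.

# Block level: a SAN LUN shows up like a local disk; you read raw byte ranges.
with open("/dev/sdb", "rb") as lun:          # hypothetical SAN-backed block device
    lun.seek(4096)                           # jump to a byte offset (a "block")
    block = lun.read(512)

# File level: a NAS exports a filesystem (NFS/SMB); you work with named files.
with open("/mnt/nas-share/reports/q3.csv", "r") as f:   # hypothetical NFS/SMB mount
    first_line = f.readline()
```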
VM (Virtual Machine Redundancy)
We also have the ability to replicate virtual machines, so we can update one VM and have those updates replicate to all of the other VMs running in our environment.
Once we update the primary VM, those updates can be rolled out to every other virtual machine we're running, wherever it happens to be in the world. A replicated VM also acts as a backup: if we lose the primary virtual machine, we can spin up a new virtual machine from the replica and continue to have uptime and availability. If only one file changes on a virtual machine, only that change has to be copied to the other VMs to keep the replicated data in sync.
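A minimal sketch of that "copy only what changed" idea follows, using hypothetical source and replica directories; real VM replication typically works at the hypervisor or block level, but the principle is the same:

```python
# Sketch of change-only replication: copy a file to the replica only when its
# content hash differs (illustrative; real VM replication works at the
# hypervisor or block level). Paths are hypothetical examples.
import hashlib
import shutil
from pathlib import Path

PRIMARY = Path("/vm/primary")     # hypothetical primary VM file tree
REPLICA = Path("/vm/replica")     # hypothetical replica location

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def replicate_changes() -> None:
    for src in PRIMARY.rglob("*"):
        if not src.is_file():
            continue
        dst = REPLICA / src.relative_to(PRIMARY)
        if not dst.exists() or digest(src) != digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)            # only changed files are copied

replicate_changes()
```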
On-Premise vs Cloud
A key decision is whether you want your data replicated to a local device or replicated into the cloud.
If there's a large amount of data to replicate, devices that are local on your network provide very fast connectivity. Connections to the cloud are almost always slower than devices in a local data center, so if you're replicating large amounts of data and speed is a concern, replicating to a local device makes sense. Local device replication requires you to purchase and house all of your own equipment, which can be very expensive. Cloud storage systems tend to have a low-cost entry point, and costs scale up and down as you use more or fewer resources.
Backups
Backups and replication are frequently used to ensure that data loss does not impact an organization. Backups are a copy of the live storage system.
Backup media
The choice of backup media is also an important decision that organizations must make. Backup media decisions involve capacity, reliability, speed, cost, expected lifespan while storing data, how often the media can be reused before wearing out, and other factors, all of which can influence the backup solution that an organization chooses.
Tape
Tape has historically been one of the lowest-cost-per-capacity options for large-scale backups. Magnetic tape remains in use in large enterprises, often in the form of tape robot systems that can load and store very large numbers of tapes using a few drives and several cartridge storage slots.
Disks
Disks, in either magnetic or solid-state drive form, are typically more expensive than tape for the same backup capacity but are often faster. Disks are often used in large arrays in either a network-attached storage (NAS) device or a storage area network (SAN).
Optical media
Optical media like Blu-ray disks and DVDs, as well as specialized optical storage systems, remain in use in some circumstances, but for capacity reasons they are not in common use as a large-scale backup tool.
Flash media
Flash media like microSD cards and USB thumb drives continue to be used in many places for short-term copies and even longer-term backups. Though they aren't frequently used at an enterprise scale, they are important to note as a type of media that may be used for some backups.
Backup Types
Snapshot
A Snapshot captures the full state of a system or device at the time the backup is completed.
Snapshots are common for virtual machines (VMs), where they allow the machine state to be restored to the point in time when the snapshot was taken. Snapshots are useful for cloning systems, for going back to a point before a patch or upgrade was installed, or for restoring system state to a point before some other event occurred. Snapshots can be taken while the system is running, and, like full backups, they can consume quite a bit of space. Multiple snapshots of an application instance can be taken, giving us different versions we can revert to.
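As a conceptual sketch (not a hypervisor API), the Python below keeps a point-in-time copy of an application's state and reverts to it:

```python
# Conceptual sketch of snapshot-and-revert: keep point-in-time copies of state
# and roll back to one of them (illustrative; hypervisors do this at the VM level).
import copy

snapshots = {}                                     # name -> saved state

app_state = {"version": "1.4", "settings": {"debug": False}}

snapshots["before-upgrade"] = copy.deepcopy(app_state)   # take a snapshot

app_state["version"] = "2.0"                       # apply an upgrade...
app_state["settings"]["debug"] = True              # ...and some config changes

# The upgrade misbehaves, so revert to the snapshot taken beforehand.
app_state = copy.deepcopy(snapshots["before-upgrade"])
print(app_state)                                   # back to version 1.4
```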
Image
An Image is a more complete copy of a system or server, typically down to the bit level for the drive. This means that a restored image is a complete match to the system at the moment it was imaged.
Images are a backup method of choice for servers where complex configurations may be in use and where cloning or restoration in a short timeframe may be desired. Instead of backing up individual files on a system, we back up everything on the computer and create an exact duplicate of the entire file system: the operating system, the user files, and anything else stored on that computer.
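The sketch below shows the bit-for-bit idea by copying a source device into an image in fixed-size chunks; the source path is a hypothetical example, and production imaging is normally done with dedicated tooling:

```python
# Sketch of bit-level imaging: copy every byte of a source device/file into an
# image file in fixed-size chunks (illustrative; real imaging uses dedicated tools).

SOURCE = "/dev/sda"            # hypothetical disk to image (requires privileges)
IMAGE = "server01.img"         # hypothetical output image file
CHUNK = 4 * 1024 * 1024        # read 4 MiB at a time

with open(SOURCE, "rb") as src, open(IMAGE, "wb") as dst:
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(chunk)       # the image is a byte-for-byte match of the source
```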
Cloud Backups
Cloud backup is a service in which the data and applications on a business's servers are backed up and stored on a remote server.
A cloud-based backup service can provide us with an automatic offsite backup function, where we would be taking files on our local device and backing them up to a device that’s located somewhere else in the cloud.
A cloud backup service can often support a very large number of devices in our environment, but it requires enough bandwidth to transfer these files back and forth to the cloud-based service.
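As a hedged example of that back-up-to-the-cloud flow, the sketch below uploads a local directory to an S3-compatible bucket using boto3; the bucket name and paths are hypothetical, and bandwidth remains the practical constraint:

```python
# Sketch of an offsite/cloud backup: upload local files to an S3-compatible
# bucket (bucket name and paths are hypothetical; requires boto3 and credentials).
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-bucket"          # hypothetical bucket name
LOCAL_DIR = Path("/data/to-backup")       # hypothetical local directory

for path in LOCAL_DIR.rglob("*"):
    if path.is_file():
        key = f"backups/{path.relative_to(LOCAL_DIR)}"
        s3.upload_file(str(path), BUCKET, key)   # transfer speed depends on bandwidth
```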
Online vs Offline
The decision between tape and disk storage at the enterprise level also raises the question of whether backups will be online, and thus always available, or offline, needing to be retrieved from a storage location before they can be accessed.
Online Backups have the advantage of quick retrieval and accessibility. They help you respond to immediate issues and maintain the flow of production, and online backup services keep data accessible and protected from cybersecurity threats. You can access the files you've backed up anytime, anywhere, and even from different devices, including phones and tablets. Online backups are often referred to as 'hot backups.' Backup speed depends on how strong and fast your internet connection is.
Offline Backups are often used to ensure an organization doesn't suffer a total data loss. Copies are kept locally on the computer or on a removable storage device (USB, SSD, tape, disk), in a location that is accessible even when a computer doesn't have an internet connection. Offline backups are often referred to as 'cold backups'; they won't be impacted by power surges and don't require an internet connection to work.
Off-Site Storage
Offsite Data Storage refers to keeping data in a storage facility physically located away from your organization's offices, with the goal of creating data redundancy and supporting recovery.
Off-site storage, a form of geographic diversity, helps ensure that a single disaster cannot destroy an organization's data entirely. Distance considerations are also important to ensure that a single regional disaster is unlikely to harm the off-site storage.
Non-Persistence
Non-Persistence is the ability to have systems or services that are spun up and shut down as needed.
A cloud-based environment is constantly in motion: new application instances are created and old ones are torn down all the time. We refer to these changes as non-persistence, because in a cloud-based environment it's unusual for any particular service to be permanent. Snapshots can capture the configuration state and data at any given moment and are used to preserve the complete state or configuration of a device.
Revert to Known State
Revert to Known State is achieved by rolling back to a previous version of a system using snapshots.
Last-Known Configuration
Reverting to a last-known (good) configuration serves a similar purpose; Windows, for example, has historically offered a boot option that rolls system settings back to those recorded at the last successful boot.
Live Boot Media
Live Boot Media is a bootable operating system that can run from removable media like a thumb drive or DVD. Using live boot media means that you can boot a full operating system that can see the hardware that a system runs on and that can typically mount and access drives and other devices.
Live boot media is used when a system has been compromised, or when the operating system has been so seriously impacted by an issue that it cannot function properly. Boot sector and memory-resident viruses, bad OS patches, driver issues, and a variety of other problems can be addressed using live boot media.
High-Availability
High-availability solutions like load balancing, content distribution networks, and clustered systems provide the ability to respond to high-demand scenarios as well as to failures in individual systems.
When loads on systems and services become high, or when components in an infrastructure fail, organizations need to respond with high-availability solutions.
Scalability
Scalability is a common design element and a useful response control for many systems in modern environments, where services are designed to scale across many servers instead of requiring a larger server to handle more workload. There are two major categories of scalability:
Vertical Scalability
Vertical scalability requires a larger or more powerful system or device. It can help when all tasks or functions need to be handled on the same system or infrastructure, but it can be very expensive to increase, particularly if the event that drives the need to scale is not ongoing or frequent. There are, however, times when vertical scalability is required, such as for very large memory-footprint applications that cannot be run on smaller, less capable systems.
Horizontal Scalability
Horizontal Scalability uses smaller systems or devices but adds more of them. When designed and managed correctly, a horizontally scaled system can transparently add and remove resources, allowing it to adjust as needs grow or shrink. This approach also provides opportunities for transparent upgrades, patching, and even incident response.
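A minimal sketch of the horizontal approach, scaling a pool of identical workers up and down as load changes (illustrative only; real environments use autoscaling groups or an orchestrator):

```python
# Sketch of horizontal scaling: add or remove identical workers as load changes
# (illustrative; real environments use autoscaling groups or an orchestrator).

TARGET_PER_WORKER = 100            # requests each worker can comfortably handle

workers = ["worker-1", "worker-2"] # hypothetical worker instances

def scale(current_load: int) -> None:
    needed = max(1, -(-current_load // TARGET_PER_WORKER))   # ceiling division
    while len(workers) < needed:                              # scale out
        workers.append(f"worker-{len(workers) + 1}")
    while len(workers) > needed:                              # scale in
        workers.pop()

scale(450)
print(workers)    # five workers handle the spike
scale(120)
print(workers)    # pool shrinks when demand drops
```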
Restoration Order
Restoration order decisions balance the criticality of systems and services to the operation of the organization against the need for other infrastructure to be in place and operational so that each component can be brought online, secured, and run properly. Each organization's restoration order will be tailored to ensure that business-critical needs are met in a timely manner and loss is mitigated; every organization and infrastructure design will have slightly different restoration order decisions to make, based on criticality to the organization's functional requirements and on dependencies in the datacenter's or service's operating environment.
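As an illustration only (every organization's ordering will differ), the sketch below walks a dependency-ordered list so supporting infrastructure comes up before the services that need it:

```python
# Illustrative restoration order: bring infrastructure up in dependency order so
# each tier has what it needs (the ordering below is an example, not a standard).

restoration_order = [
    "power and cooling",
    "core network (switches, routers, firewalls)",
    "directory services / DNS / DHCP",
    "storage (SAN/NAS)",
    "virtualization hosts",
    "database servers",
    "business-critical applications",
    "remaining internal services",
]

def restore(component: str) -> None:
    print(f"Restoring {component}...")        # placeholder for real recovery steps

for component in restoration_order:
    restore(component)                        # don't start a tier before its dependencies
```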
Diversity
Diversity of Technology is another way of providing uptime and availability.
For example, a zero-day vulnerability might cause an outage with a particular operating system. But if you’re running different operating systems, you may still be able to provide uptime and availability, because that zero-day attack is only going to affect a subset of your services.