3.0 Network Operations

3.1 Given a scenario, use the appropriate statistics and sensors to ensure network availability.

Last edited 566 days ago by Makiel [Muh-Keel]
Let's imagine you were just brought from the 1800s to the present in a time machine, and on your first trip in a car, you examine the dashboard. Speed, temperature, tire inflation, tachometer, what does all that stuff mean? It would be meaningless to you and useless for monitoring the state of the car's health. Likewise, you cannot monitor the health of a device or a network unless you understand the metrics.

Device Metrics to Monitor

Temperature Heat and computers do not mix well! Many computer systems require both temperature and humidity control for reliable service. The larger servers, communications equipment, and drive arrays generate considerable amounts of heat; this is especially true of mainframes and older minicomputers.
Overheating is also a big cause of reboots. When CPUs get overheated, a cycle of reboots can ensue. Make sure the fan is working on the heat sink and the system fan is also working. If required, vacuum the dust from around the vents.
Central Processing Unit (CPU) Usage When monitoring the CPU, the specific counters you use depend on the server role. Consult the vendor's documentation for information on those counters and what they mean to the performance of the service or application. The following counters are commonly monitored:
Processor\% Processor Time—The percentage of time the CPU spends executing a non-idle thread. This should not be over 85 percent on a sustained basis.
Processor\% User Time—The percentage of time the CPU spends in user mode, which means it is doing work for an application. If this value is higher than the baseline you captured during normal operation, the service or application is dominating the CPU.
Processor\% Interrupt Time—The percentage of time the CPU receives and services hardware interrupts during specific sample intervals. If this is over 15 percent, there could be a hardware issue.
System\Processor Queue Length—The number of threads (which are smaller pieces of an overall operation) in the processor queue. If this value is over two times the number of CPUs, the server is not keeping up with the workload.
Memory Different system roles place different demands on the memory, so there may be specific counters of interest you can learn by consulting the documentation provided by the vendor of the specific service. Running out of memory is normally fatal for a server or computer. Some of the most common counters monitored by server administrators are listed here:
Memory\% Committed Bytes in Use—The amount of virtual memory in use. If this is over 80 percent, you need more memory.
Memory\Available Mbytes—The amount of physical memory, in megabytes, currently available. If this is less than 5 percent of the total physical memory, you need more memory.
Memory\Free System Page Table Entries—The number of entries in the page table not currently in use by the system. If the number is less than 5000, there may well be a memory leak.
Memory\Pool Non-Paged Bytes—The size, in bytes, of the non-paged pool, which contains objects that cannot be paged to the disk. If the value is greater than 175 MB, you may have a memory leak (an application is not releasing its allocated memory when it is done).
Memory\Pool Paged Bytes—The size, in bytes, of the paged pool, which contains objects that can be paged to disk. (If this value is greater than 250 MB, there may be a memory leak.)
Memory\Pages per Second—The rate at which pages are written to and read from the disk during paging. If the value is greater than 1000, as a result of excessive paging, there may be a memory leak.
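The CPU and memory thresholds listed above can be turned into a simple check. This is a minimal sketch, not a monitoring product: the function names and dictionary keys are hypothetical, the counter values are assumed to have been collected already, and "sustained" is approximated as the average over the sampling window.

```python
from statistics import mean

def cpu_alerts(samples, cpu_count):
    """samples: list of dicts taken at regular intervals (hypothetical key names)."""
    alerts = []
    # "Sustained" is approximated as the mean across the whole window.
    if mean(s["processor_time_pct"] for s in samples) > 85:
        alerts.append("sustained % Processor Time above 85")
    if mean(s["interrupt_time_pct"] for s in samples) > 15:
        alerts.append("% Interrupt Time above 15: possible hardware issue")
    if mean(s["queue_length"] for s in samples) > 2 * cpu_count:
        alerts.append("Processor Queue Length above 2x the CPU count")
    return alerts

def memory_alerts(available_mb, total_mb, committed_pct, pages_per_sec):
    """Apply the memory thresholds from the list above to one reading."""
    alerts = []
    # Available MBytes is an absolute value, so convert it to a share of total RAM.
    if 100.0 * available_mb / total_mb < 5:
        alerts.append("available memory below 5 percent")
    if committed_pct > 80:
        alerts.append("% Committed Bytes in Use above 80")
    if pages_per_sec > 1000:
        alerts.append("Pages per Second above 1000: excessive paging")
    return alerts
```

For example, `memory_alerts(512, 16384, 85.0, 200)` flags low available memory and high committed bytes, while a healthy reading returns an empty list.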

Network Metrics to Monitor

The health of a network's operation can also be monitored so you can maintain its performance at peak efficiency. Just as you can head off a problem on a workstation or server, you can react to network conditions before they cause an issue by monitoring these items.
Bandwidth Although bandwidth has increased to allow us to do what we do, there are still limitations that cause network performance to suffer miserably.
The following are metrics to follow for bandwidth on a system:
Network Interface\Bytes Total/Sec—The rate at which bytes are sent and received over the NIC. If this value is more than 70 percent of the bandwidth of the interface, the interface is saturated or not keeping up.
Network Interface\Output Queue Length—The number of packets in the output queue. If this value is over 2, the NIC is not keeping up with the workload.
There are many different ways to gather bandwidth information: SNMP, NetFlow, sFlow, IPFIX, protocol analysis, and software agents. Monitoring the amount of usable bandwidth can help identify fundamental issues; nothing works properly if bandwidth is highly utilized.
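Because Bytes Total/Sec is reported in bytes while link speed is quoted in bits per second, the 70 percent rule needs a unit conversion. The sketch below shows that arithmetic; the function names are hypothetical and the link speed is assumed to be known.

```python
def utilization_pct(bytes_per_sec, link_speed_bps):
    """Convert a bytes/sec counter into a percentage of the link's bit rate."""
    return 100.0 * (bytes_per_sec * 8) / link_speed_bps

def is_saturated(bytes_per_sec, link_speed_bps, threshold_pct=70.0):
    """True when utilization exceeds the threshold (70 percent by default)."""
    return utilization_pct(bytes_per_sec, link_speed_bps) > threshold_pct
```

On a 1 Gbps interface, a reading of 100,000,000 bytes per second works out to 80 percent utilization, which is over the threshold.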

Latency is the time measured between a network request and the network's response. Although some latency is expected and is normal, keeping an eye out for an abnormally high amount of latency is key to maintaining a smoothly operating network.
A low-latency network connection is one that generally experiences short delay times, while a high-latency connection generally suffers from long delays. Many security solutions may negatively affect latency. For example, routers take a certain amount of time to process and forward any communication. Configuring additional rules on a router generally increases latency, thereby resulting in longer delays. An organization may decide not to deploy certain security solutions because of the negative effects they will have on network latency.
Measuring latency is typically done using a metric called round-trip time (RTT). This metric is calculated using ping, a command-line tool that bounces a request off a server and calculates how long it takes to return to the user device.

Jitter occurs when the data flow in a connection is not consistent; that is, it increases and decreases in no discernable pattern. Jitter results from network congestion, timing drift, and route changes. Jitter is especially problematic in real-time communications like IP telephony and videoconferencing.
Data in video or live streaming is especially susceptible to jitter. You can also notice it on an IP phone call, when the audio cuts out or drops, and you can't rewind a phone call. That's why it's important to mitigate jitter as much as possible, so the amount of information missed stays as close to zero as possible.
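Latency and jitter can both be derived from a series of RTT samples, such as the times reported by successive pings. The sketch below assumes the samples are already collected, and it uses one simple definition of jitter (the mean absolute difference between consecutive RTTs); other definitions exist, and the function name is hypothetical.

```python
from statistics import mean

def latency_and_jitter(rtts_ms):
    """Return (average RTT, jitter) for a list of RTT samples in milliseconds."""
    avg = mean(rtts_ms)
    # Jitter here is how much each sample differs from the one before it, on average.
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return avg, jitter
```

With samples of 20, 22, 19, 35, and 21 ms, the average latency is 23.4 ms and the jitter is 8.75 ms; the single 35 ms spike barely moves the average but dominates the jitter figure.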

SNMP (Simple Network Management Protocol)

SNMP uses UDP ports 161 (polling) and 162 (traps) to collect and manipulate valuable network information. It gathers data by polling the devices on the network from a management station at fixed or random intervals, requiring them to disclose certain information.
Collecting this data can help IT professionals keep their finger on the pulse of all their managed devices and applications.
When all is well, SNMP receives something called a baseline, a report delimiting the operational traits of a healthy network. This protocol can also stand as a watchdog over the network, quickly notifying managers of any sudden turn of events. The network watchdogs are called agents, and when aberrations occur, agents send an alert called a trap to the management station. In addition, SNMP can help simplify the process of setting up a network as well as the administration of your entire network.
SNMP has three versions, with version 1 being rarely, if ever, implemented today. Here's a summary of these three versions:
SNMPv1—Supports plaintext authentication with community strings and uses only UDP.
SNMPv2c—Supports plaintext authentication with community strings and no encryption, but provides GET BULK, which is a way to gather many types of information at once and minimize the number of GET requests. It offers a more detailed error message reporting method, but it's not more secure than v1. It uses UDP by default, even though it can be configured to use TCP.
SNMPv3—Supports strong authentication with MD5 or SHA, and provides confidentiality (encryption) and data integrity of messages via DES or AES encryption between agents and managers. GET BULK is a supported feature of SNMPv3, and this version can also use TCP. (Note: MD5 and DES are no longer considered secure.)
What are SNMP TRAP & GET messages?
SNMP TRAP The SNMPTRAP command is a common way for devices to send alerts. These are asynchronous messages sent to the manager by an agent when something needs to be reported. A storage appliance, for example, might send a trap to the manager when it loses access to a drive. Other examples include a power-up situation or high-traffic notification that should be evaluated.
But SNMP managers don't have to sit around waiting for agents to send a message. They may prefer to ask for data proactively. This ensures devices are still active and functioning properly. Without a proactive check you may not know if a quiet device is offline or simply doesn't have anything to report.
GET Message is when the SNMPGET command retrieves one or more values from the MIB (management information base).
In addition to polling to obtain statistics, SNMP can be used for analyzing information and compiling the results in a report or even a graph. Thresholds can be used to trigger a notification process when exceeded. Graphing tools are used to monitor the CPU statistics of devices like a core router. The CPU should be monitored continuously, and the NMS can graph the statistics. Notification will be sent when any threshold you have set has been exceeded.

Object identifiers (OIDs) are an identifier mechanism standardized for naming any object, concept, or “thing” with a globally unambiguous persistent name.
Each physical component can possess a number of OIDs to describe the current state of a system. In Simple Network Management Protocol (SNMP), each node in a management information base (MIB) is identified by an OID.
Management Information Bases (MIBs) OIDs are organized into a hierarchical structure called management information bases (MIBs). A managed object (sometimes called a MIB object or object) is one of any number of specific characteristics of a managed device. Managed objects are made up of one or more object instances, which are essentially variables. An OID uniquely identifies a managed object in the MIB hierarchy.
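The hierarchy is easier to see with a concrete OID. The sketch below splits a dotted OID into the managed object and its instance suffix. The three entries come from the standard MIB-2 system group; the lookup table and function names are hypothetical, and a real manager would load compiled MIB files rather than a hand-written dictionary.

```python
# iso(1).org(3).dod(6).internet(1).mgmt(2).mib-2(1)
MIB2_PREFIX = (1, 3, 6, 1, 2, 1)

# A tiny hand-written slice of the MIB-2 system group, for illustration only.
KNOWN_OBJECTS = {
    (1, 3, 6, 1, 2, 1, 1, 1): "sysDescr",
    (1, 3, 6, 1, 2, 1, 1, 3): "sysUpTime",
    (1, 3, 6, 1, 2, 1, 1, 5): "sysName",
}

def parse_oid(text):
    """Turn '1.3.6.1.2.1.1.3.0' (with or without a leading dot) into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))

def describe(text):
    """Return (object name, instance suffix), or (None, None) if the OID is unknown."""
    oid = parse_oid(text)
    # The longest known prefix names the managed object; the rest is the instance.
    for cut in range(len(oid), 0, -1):
        name = KNOWN_OBJECTS.get(oid[:cut])
        if name:
            return name, ".".join(str(n) for n in oid[cut:])
    return None, None

def under_mib2(text):
    """True when the OID sits below the mib-2 node of the tree."""
    return parse_oid(text)[:len(MIB2_PREFIX)] == MIB2_PREFIX
```

Here `describe("1.3.6.1.2.1.1.3.0")` returns the object `sysUpTime` with instance `0`, which is the object instance (variable) an SNMP GET would retrieve.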

Network Device Logs

While SNMP should be in your toolbox when monitoring the network, there is also a wealth of information to be found in the logs on the network devices. You will now learn about the main log types and methods to manage the volume of data that exists in these logs. Baseline configurations are covered in detail in
Log Reviews
High-quality documentation should include a baseline for network performance because you and your client need to know what “normal” looks like in order to detect problems before they develop into disasters. Don't forget to verify that the network conforms to all internal and external regulations and that you've developed and itemized solid management procedures and security policies for future network administrators to refer to and follow.
Traffic Logs Some of your infrastructure devices will have logs that record the network traffic that has traversed the device. Examples include firewalls and intrusion detection and prevention devices. Those devices were covered in Chapter 5. Many organizations choose to direct the traffic logs from these devices to a syslog server or to security information and event management (SIEM) systems (both covered later in this section).
Audit Logs Audit logs record the activities of the users. Windows Server 2019 (and most other Windows operating systems) comes with a tool called Event Viewer that provides you with several logs containing vital information about events happening on your computer. Other server operating systems have similar logs, and many connectivity devices like routers and switches also have graphical logs that gather statistics on what's happening to them. These logs can go by various names, such as history logs, general logs, or server logs. Figure 13.2 shows an Event Viewer security log display from a Windows Server 2019 machine.
[Figure 13.2: Event Viewer security log display from a Windows Server 2019 machine]

Syslog
Reading system messages from a switch's or router's internal buffer is the most popular and efficient method of seeing what's going on with your network at a particular time. But the best way is to log messages to a syslog server, which stores messages for you and can even time-stamp and sequence them, and it's easy to set up and configure! Figure 13.3 shows a syslog server and client in action.
Syslog allows you to display, sort, and even search messages, all of which makes it a really great troubleshooting tool. The search feature is especially powerful because you can use keywords and even severity levels. Plus, the server can email administrators based on the severity level of the message.
[Figure 13.3: Syslog server and client in action]
Network devices can be configured to generate a syslog message and forward it to various destinations. These four examples are popular ways to gather messages from Cisco devices:
Logging buffer (on by default)
Console line (on by default)
Terminal lines (using the terminal monitor command)
Syslog server

In most cases you need to know what to filter so you get the information you really need and nothing else. For example, with syslog you can filter by the severity level. Severity levels, from the most severe level to the least severe, are explained in Table 13.1. Informational is the default and will result in all messages being sent to the buffers and console.
[Table 13.1: Syslog severity levels]
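The standard syslog severity levels run from 0 (most severe) to 7 (least severe), and a configured level passes every message at that level or more severe. The sketch below models that filter and also decodes the priority value that prefixes a syslog message, which is the facility times 8 plus the severity. The function names are hypothetical.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Standard syslog severity levels, most severe first."""
    EMERGENCY = 0
    ALERT = 1
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    NOTICE = 5
    INFORMATIONAL = 6
    DEBUG = 7

def should_log(message_severity, configured_level):
    """A message passes when it is at least as severe as the configured level."""
    return message_severity <= configured_level

def decode_pri(pri):
    """Split a syslog priority value into (facility number, Severity)."""
    facility, severity = divmod(pri, 8)
    return facility, Severity(severity)
```

With the level set to `Severity.WARNING`, an error message is kept and a debug message is dropped. A message that starts with `<34>` decodes to facility 4 and severity `CRITICAL`.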

Interface Statistics/Status

You can, of course, perform some interface monitoring as well. Monitoring the interfaces can help you spot aberrations occurring in your network. You've got to be able to analyze interface statistics to find problems there if they exist, so let's pick out the important factors relevant to meeting that challenge.
Is the link status on your switch up or down? Check the link status to confirm that the link is active. If the link status is down, that normally means a switch has failed or lost power, which creates an outage.
Typically, the most important metric on an interface is its link state. Is it up (functional) or down? While some tools can only tell you the link status, other devices and tools can tell you what the issue is. For example, Cisco routers and switches can tell you the link state along with an indication of the issue. On network interface cards (NICs), link lights can also tell the state of the connection. When the light is green, the connection is good, and when it's amber, there is an issue. Also, it will blink rapidly when data is traversing the NIC.
The first status listed is carrier detect. If this shows down, then you have a physical layer problem locally and you need to get to that port immediately and check the cable and port. The second status, line protocol, reflects keepalives from the remote end. If you see up/down, then you know your local end is good but you're not getting a digital signal from the remote end.
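The two status fields can be read as a small decision table. The sketch below encodes the interpretation given above; the function name is hypothetical, and the "administratively down" case is an addition not covered in the text (on Cisco devices it indicates the interface has been shut down in the configuration).

```python
def diagnose(line_status, protocol_status):
    """Interpret the line status / line protocol pair reported for an interface."""
    line = line_status.strip().lower()
    protocol = protocol_status.strip().lower()
    if line == "administratively down":
        # Addition to the text: the interface was disabled by configuration.
        return "interface is shut down in the configuration"
    if line == "down":
        return "physical layer problem: check the cable and port"
    if protocol == "down":
        return "local end is good, but no keepalives from the remote end"
    return "interface is operational"
```

For example, `diagnose("up", "down")` points at the remote end, while `diagnose("down", "down")` sends you to the local cable and port first.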

Are your Speed & Duplex statistics correctly configured?
In full-duplex communication, both devices can send and receive communication at the same time. This means that the effective throughput is doubled and communication is much more efficient. Full-duplex is typical in most of today's switched networks. Two interfaces on the ends of a common link should be set to both the same duplex and the same speed to function correctly. Run the show interfaces command on a router or switch to list the duplex settings.
Has the error rate on your network deviated from the norm? You can monitor the rates of CRC errors, giants or jumbos, runts, and encapsulation errors. Any deviation from the norm should be investigated immediately.
Send/Receive Traffic Sometimes you need to check how well traffic is flowing into and out of a device, without regard to the type. The show interfaces command will show this as well. If the interface is down, no traffic flows in either direction.
Cyclic Redundancy Check The cyclic redundancy check (CRC) provides error detection. But remember: this is error detection, not error correction. Just know that when CRC errors occur, something has corrupted the received packet.
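The detection-only nature of a CRC can be demonstrated in a few lines. This sketch uses the CRC-32 function from Python's standard zlib module to append a checksum to a payload and verify it on receipt. It illustrates the idea only: the helper names are hypothetical, and the byte layout is not the exact on-wire format of an Ethernet frame check sequence.

```python
import zlib

def with_fcs(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def fcs_ok(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the trailer."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == fcs

frame = with_fcs(b"hello network")
# Flip one bit in the first byte to simulate corruption in transit.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
```

The receiver learns that `corrupted` is bad because `fcs_ok` fails, but nothing in the checksum says which bit flipped, so the frame is discarded and counted as a CRC error rather than repaired.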
Protocol Packet and Byte Counts It is also possible to determine the number of packets received from protocols and the number of bytes received. This is also contained in a section of the output of the show interfaces command.
Encapsulation Errors
Encapsulation is the process of adding headers and trailers to data. When a host transmits data to another device over a network, the data is encapsulated, with protocol information at each layer of the OSI reference model. Each layer uses protocol data units (PDUs) to communicate and exchange information from the source to the destination.
A Failed Encapsulation Error message indicates that the router has a layer 3 packet to forward and is lacking some element of the layer 2 header that it needs to be able to forward the packet toward the next hop. You may see this in the logs of a router as shown here:

Environmental factors and sensors

All of the equipment discussed in this chapter—switches, routers, hubs, and so on—require proper environmental conditions to operate correctly. These devices have the same needs as any computing device.
Temperature Like any device with a CPU, infrastructure devices such as routers, switches, and specialty appliances must have a cool area to operate. When temperatures rise, servers start rebooting and appliance CPUs start overworking as well. The room(s) where these devices are located should be provided with heavy-duty HVAC systems and ample ventilation.
Humidity The air around these systems can be neither too damp nor too dry; it must be “just right.” If it is too dry, static electricity will build up in the air, making the situation ripe for damaging a system. It takes very little static electricity to fry some electrical components.
If it is too damp, connections start corroding and shorts begin to occur. A humidifying system should be used to maintain the level above 50 percent. The air conditioning should keep it within acceptable levels on the upper end.
Electrical Power is the lifeline of the data center. One of your goals is to ensure that all systems have a constant clean source of power.
Runtime vs Capacity The runtime is the amount of time the UPS can provide power at a given power level. This means you can't really evaluate this metric without knowing the amount of load you will be placing on the UPS. Documentation that comes with the UPS should reveal to you the number of minutes expected at various power levels. So if you doubled the number of similar devices attached to the UPS, you should expect the time to be cut in half (actually, it will cut more than half in reality because the batteries discharge quicker at higher loads).
Capacity, on the other hand, is the maximum amount of power the UPS can supply at any moment in time. So, if the UPS has a capacity of 650 volt amperes (VA) and you attempt to pull 800 VA from the UPS, it will probably shut itself down. So both of the values must be considered. You need to know the total amount of power the devices may require (capacity) and, based on that figure, select a UPS that can provide that for the amount of time you will need to shut all the devices down.
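Both values can be checked with simple arithmetic before buying or loading a UPS. The sketch below is a rough planning aid under stated assumptions: the function names are hypothetical, the per-device loads are assumed to be known in VA, and the runtime estimate uses plain inverse proportion, which is optimistic because batteries discharge faster at higher loads.

```python
def ups_check(device_loads_va, ups_capacity_va):
    """Return (total load in VA, True if the UPS capacity covers it)."""
    total = sum(device_loads_va)
    return total, total <= ups_capacity_va

def estimated_runtime_min(rated_runtime_min, rated_load_va, actual_load_va):
    """Naive inverse-proportional estimate; treat the result as an upper bound."""
    return rated_runtime_min * rated_load_va / actual_load_va
```

Three devices drawing 300, 250, and 250 VA total 800 VA, which exceeds a 650 VA UPS. A UPS rated for 20 minutes at 300 VA gives at most about 10 minutes at 600 VA, and less in practice.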
Automated Graceful Shutdown is when a computer is turned off by software function and the operating system (OS) is allowed to perform its tasks of safely shutting down processes and closing connections. A hard shutdown is when the computer is forcibly shut down by interruption of power.
You should also perform Periodic Testing of Batteries. Just as you would never wait until there is a loss of data to find out if the backup system is working, you should never wait until the power goes out to see whether the UPS does its job. Periodically, you should test the batteries to ensure they stand ready to provide the expected runtime.
Flooding In some parts of the country, floods are a constant source of concern. For this reason, server rooms and data centers should be located on upper floors if possible. If not, raised floors should be deployed to help prevent the water from reaching the equipment, as shown in

NetFlow Data

SNMP can be a powerful tool to help you manage and troubleshoot your network, but Cisco knew it would be very helpful for engineers to be able to track TCP/IP flows within the network as well.
That's why we have NetFlow as an application for collecting IP traffic information. Cisco compares NetFlow informational reports to receiving a phone bill with detailed call information to track calls, call frequency, and even calls that shouldn't have been made at all. A more current analogy would be the CIA and certain additional government “alphabet agencies” watching who has talked to whom, when, and for how long.
Cisco IOS NetFlow efficiently provides a key set of services for IP applications, including network traffic accounting for baselining, usage-based network billing for consumers of network services, network design and planning, general network security, and DoS and DDoS monitoring capabilities as well as general network monitoring.
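The phone-bill analogy maps onto a simple aggregation: packets that share the same endpoints and protocol are counted as one flow, and the flows are then summarized per sender. The sketch below illustrates the idea only. It is not the NetFlow export format, the packet dictionaries and function names are hypothetical, and the flow key is simplified to five fields.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets into flows keyed by (src, dst, proto, sport, dport)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["dst"], p["proto"], p["sport"], p["dport"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["size"]
    return dict(flows)

def top_talkers(flows, n=3):
    """Rank source addresses by total bytes sent across all of their flows."""
    totals = defaultdict(int)
    for (src, *_), stats in flows.items():
        totals[src] += stats["bytes"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 5000, "dport": 443, "size": 1500},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "proto": "tcp", "sport": 5000, "dport": 443, "size": 500},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "proto": "udp", "sport": 6000, "dport": 53, "size": 80},
]
flows = aggregate_flows(packets)
```

The three sample packets collapse into two flows, and `top_talkers(flows, 1)` reports 10.0.0.1 with 2000 bytes, which is the kind of summary used for baselining, billing, and spotting unexpected traffic.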

Uptime/Downtime

Uptime is the amount of time the system is up and accessible to your end users, so the more uptime you have the better. And depending on how critical the nature of your business is, you may need to provide four-nine or five-nine uptime on your network—that's a lot. Why is this a lot? Because you write out four nines as 99.99 percent, or better, you write out five nines as 99.999 percent. Now that is some serious uptime!
Downtime is the amount of time the system is down and inaccessible to your end users. This is bad and should be almost non-existent.
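To see why four and five nines are demanding, it helps to convert the percentage into a downtime budget. The sketch below assumes a 365-day year, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year

def allowed_downtime_minutes(availability_percent):
    """Yearly downtime budget, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

def availability_percent(uptime_minutes, downtime_minutes):
    """Measured availability from observed uptime and downtime."""
    return 100 * uptime_minutes / (uptime_minutes + downtime_minutes)
```

Four nines (99.99 percent) allows roughly 52.6 minutes of downtime per year, and five nines (99.999 percent) allows only about 5.3 minutes.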
