Please note that I have no current affiliation or official relationship with Baltimore County Public Schools or with the Mayor and City Council of Baltimore (Baltimore City’s municipal government). The below commentary and analysis are my own, and represent my estimation and opinions about what has happened in the
, and everyone with either a Chromebook or their own spare PC will be able to get back in. It will, however, leave families struggling if the only computer they can use is a school-issued Windows PC, which will need hands-on remediation. My current understanding is that elementary school students have Chromebooks, while middle- and upper-school students have mostly Windows devices, which will need to be brought in for cleaning.
The following discussion was written before we got this news, and runs through the alternative and more common scenario, in which the School would have had to recreate all of its user accounts. (Note also that the below will likely apply to everything that BCPS has not yet said is safe, namely Windows PCs and, presumably, non-Google accounts.)
, 12/2/2020. “Students who need a new device or assistance are being asked to visit their nearest Baltimore County public high school on Tuesday [12/1/2020] from 1-5 p.m.”
Beginning on the evening of Tuesday, November 24th, 2020, the school’s network began to suffer from a ransomware attack. By the next morning, students and teachers had lost access to everything they depended on for school.
Ransomware is a small piece of software that scrambles a computer’s files with an encryption key that only the attacker knows. The ransomware virus then copies itself across computers on the network, locking more and more files. As the virus spreads, computers are starved of the information that they need to operate (e.g. network storage, printers, web pages, email), which causes services to shut down, eventually cascading into catastrophic failure.
A rough analogy would be someone breaking into the school and throwing a combination lock 🔐 on everything in sight: doors, filing cabinets, pipe valves, elevator buttons . . . you name it. There is one master combination to all the locks, but only the attacker knows it, and you’ll only get it if you pay up in bitcoin.
As the attack unfolded, users who were logged in to a machine with an active infection would have seen a popup similar to the below, with instructions on how to pay the ransom:
🤦‍♂️ How did it happen?
Once ransomware is in the environment, it uses fairly basic methods to open the files and scramble them. It would be difficult (but not impossible) to design a computer that lets you do what you want with your own files, but not allow a program to encrypt them.
But to enter the environment and spread quickly, attackers generally need to have some sort of administrator credentials. That is, the attackers need to have penetrated the network and stolen an IT username and password with rights to do some very important work across the whole school computing environment. Either that, or they need to have found a machine with 1) an open door to the internet and 2) an unpatched vulnerability that they could exploit to gain control of the machine.
At the very root of these attacks is a phenomenon I like to call “local optimization under global resource constraints.” Depending on your point of view, this is either people making the most of what they have, or it’s simply laziness and failure to prioritize correctly.
Regardless of how the blame game ultimately shakes out, there are three manifest features of classically weak infrastructure: a flat network, flat servers, and flat accounts.
A flat network means that one machine can talk to anything else out there on the network. For example, in a flat network I could use my VoIP phone to send messages to my printer, without any network equipment stepping in to stop it. Why would I want my phone and my printer to talk? Maybe I need it for electronic faxing, but maybe not. The point is that if you don’t need to connect two things on the network, don’t.
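The opposite of a flat network is a "default deny" policy, where a connection is blocked unless someone has explicitly approved it. The sketch below is only an illustration of that idea: the zone names, ports, and the `is_allowed` helper are all hypothetical, and real segmentation is configured on firewalls and switches rather than in a script.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    source_zone: str
    dest_zone: str
    port: int


# Hypothetical zones and rules, for illustration only.
ALLOWED_FLOWS = {
    Flow("staff_pcs", "printers", 9100),        # print jobs
    Flow("staff_pcs", "file_servers", 445),     # file shares
    Flow("voip_phones", "call_manager", 5060),  # call setup
}


def is_allowed(source_zone: str, dest_zone: str, port: int) -> bool:
    """Default deny: a flow passes only if it is explicitly listed."""
    return Flow(source_zone, dest_zone, port) in ALLOWED_FLOWS


if __name__ == "__main__":
    print(is_allowed("staff_pcs", "printers", 9100))    # listed, so allowed
    print(is_allowed("voip_phones", "printers", 9100))  # not listed, so blocked
```

In this sketch the phone-to-printer conversation is refused simply because nobody asked for it, which is the whole point: every allowed path is a deliberate decision.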
Having flat servers means that your computers serve important functions and routine functions side-by-side. Think of it this way: you wouldn’t really want to put the student records filing room right next to the cafeteria, or the cafeteria next to the building security room. Better yet, you’d want security and records to be on totally separate floors from the more pedestrian parts of the school. Similarly, you shouldn’t put all user passwords on the same computer as everyone’s Word files, and if possible you should put a special security layer (think building floor) between all of those administrative computers and everything else.
God accounts 👼 give IT users the power to do everything, instead of having to pass through additional layers for risky actions. Think again of a building: a lazy manager will give everyone on staff a master key, so that they can get whatever they need, whenever they need it. But to keep those keys from falling into the wrong hands, you’d want to have a single set that people have to sign in and out in person. Similarly, you want your IT admins to have to, e.g.,
before they can create another new administrator account.
Unfortunately, these security precautions take additional time and money to map, build, and navigate day-to-day. Segmentation takes time to configure and creates communication bottlenecks, slowing network performance. Implementing “least privileged access” by definition means you have multiple sets of privileges and roles, which is costly to design, control, and use. Finally, tiering servers can increase costs nearly exponentially, as IT admins now have to deal with more machines and manage more complex relationships between them.
Like a trauma doctor, your first goal is to stop the bleeding and stabilize the patient. This means telling users not to try to log in to their accounts and computers, while your IT staff begins to isolate the network and change whatever administrative credentials they can.
As the situation stabilizes and you are no longer descending into madness, gather your key team members and establish a meeting cadence while you continue to assess your environment and plan your recovery. You should also plan to brief company leadership regularly, at intervals of no more than two hours at first.
One of the best things you can do, as early as possible, is engage a trusted cyber crisis response team, as well as outside counsel. When I was with Baltimore City, we worked with the excellent
from Clark Hill PLC. The reason you want these folks in the room with you is that in the fog of war, you don’t know your unknowns. Heck, you don’t even know the number of unknowns. Meanwhile, your network has been reduced to zero and you need to re-architect systems that took years to implement. These decisions are
, and for every question that these professionals can answer authoritatively, you’ve pruned an entire branch from your decision tree. In other words, you can reap exponential returns on that investment.
Also, there will be many, many new and existing vendors trying to get your attention. It behooves you to have your guard up, as well as to listen to trusted neutral advisors that you’ve brought in to help with recovery. Let them manage the chaos and you can focus on getting back to business as quickly as possible.
There will be any number of priorities identified as you begin recovery, but the next most important thing you can do is to restore and control communication. I emphasize control because if you do not speak authoritatively to let the public know what communication channels can be trusted, you are opening the door to imposters and/or compliance violations. As you can imagine, you would not want students sending medical documents to something like email@example.com. In the optimistic case, the school may run afoul of HIPAA. In the worst case, that email address actually belongs to a criminal who is harvesting private information.
When it comes to public communications, the pressures force you to say only what’s absolutely and minimally necessary for your stakeholders to plan their lives around. There are so many unknowns that you can only be reasonably confident that you won’t be back to normal in a matter of days. But until you complete at least 48 hours of investigation, you can’t set any realistic timelines, even for fairly basic services like email.
One factor that the school has going very much in its favor is an excellent communication infrastructure that spans Twitter, automated phone dialing, and other channels. They have done a great job using those so far to keep the public informed about the situation, and deserve great credit for that.
💵 Should the School pay a ransom?
The two major factors in deciding whether to pay a ransom are 1) whether you have good data backups and 2) the attacker’s reputation.
If you have good backups, you are going to get very little value from paying a ransom to recover your data. It is very hard to ever know exactly how long attackers have been inside your network and what they’ve been up to, including whether they’ve hidden any “back doors” to re-enter the network at some point in the future. This means you’re going to have to rebuild almost everything from scratch regardless of whether you get your data back.
So the decision to pay ransom comes down almost entirely to whether your backups are clean and available. If your backups have been compromised (by the attacker or otherwise), you will need to ensure that the attackers are 1) credible and 2) not a terrorist organization.
The latter point speaks for itself, but the former is an interesting example of economics in action through
. An attacker knows that its victims will not naturally trust that their data will be released to them, so these criminals invest time in building a reputation, so that victims can have some confidence that their data will be returned upon ransom payment. That is to say, hackers invest in their brand.
The attackers will also provide a limited decryption key so that the victim will know they are dealing with the right party - another signal of credibility. One interesting wrinkle in the City attack happened when a Twitter user demanded ransom payment and posted data that clearly had been exfiltrated from City servers. But it turned out that this hacker had stolen the data well before the attack, and they had nothing to do with the ransomware. They were simply trying to horn in on the ransomers’ racket.
🌈 Why am I optimistic in this case?
From observing our children’s teachers and speaking with my kids, I gather that most use OneDrive or Google Drive, and that the Schoology e-learning platform runs on top of SharePoint in the cloud. In fact, a friend sent me a screenshot from the morning after the attack, showing that Schoology remained up and was accessible to people whose sessions had not yet timed out. Because the cloud is generally
from on-prem resources, it’s extremely unlikely that any of the above services were compromised in the attack. It’s just that no one can get to them because the BCPS servers that authenticate users (check passwords) are hard down.
In the optimistic case, BCPS could recreate all the accounts they need for remote learning with Google’s cloud, and start handing them out in fairly short order. Theoretically, there is a minimal set of cloud “infrastructure” (i.e. settings to configure) and accounts necessary to support remote learning as we know it. As long as the new accounts get into the right hands, that would let administrators, teachers, and students log in to Schoology, Google Meet, Google Drive, and OneDrive. I would be very surprised if any of those resources were compromised in a real or direct sense.
However, your users now need clean passwords and clean devices, both of which present significant logistical hurdles.
First up are accounts. The basic tenet here is that you only want to give out passwords after verifying identity for the very first time, and that process is very hard to scale. It’s one thing to have a friend show up and get keys from you. It’s another to have everyone you know show up and ask for their password. Teachers can likely verify kids in-person, but if you want to automate the process, you need some sort of
. So, you need some other piece of information that is hard to steal (like a password handed to you on paper), or the school needs to observe something that is very hard to imitate (like your face, voice, or fingerprint).
Ordinarily, IT administrators could set up "self-service password reset" using text messages or phone calls, which are not so much a shared secret as a pre-existing
. The good news is that schools are used to handing out batches of passwords to new students every year. Plus, they’ve now been through a few rounds of material distribution during COVID, so they’re fairly well-drilled in those exercises.
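Handing out a batch of first-time passwords can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the student IDs, and the idea of storing salted hashes locally are all hypothetical, and in practice the school's directory service would handle password storage. The sketch uses Python's standard `secrets`, `hashlib`, and `hmac` modules.

```python
import hashlib
import hmac
import secrets

# Avoids look-alike characters (i, l, o, 0, 1) so printed slips are easy to read.
ALPHABET = "abcdefghjkmnpqrstuvwxyz23456789"


def make_temp_password(length: int = 12) -> str:
    """Generate a random temporary password from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def issue_batch(student_ids):
    """Return (handouts, records): plaintext slips to print, salted hashes to keep."""
    handouts, records = {}, {}
    for sid in student_ids:
        password = make_temp_password()
        salt = secrets.token_bytes(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        handouts[sid] = password
        records[sid] = (salt, digest)
    return handouts, records


def verify(sid, attempt, records) -> bool:
    """Check an attempted password against the stored salted hash."""
    if sid not in records:
        return False
    salt, digest = records[sid]
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)


if __name__ == "__main__":
    handouts, records = issue_batch(["student-001", "student-002"])  # hypothetical IDs
    for sid, password in handouts.items():
        print(sid, password)  # in practice, printed on a slip and handed over in person
```

The slip of paper only becomes trustworthy because a person checked identity at handoff; the code does nothing to solve that part, which is why the in-person step is the bottleneck.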
Next up are devices, which may prove a larger challenge, but two things will help very much: remote learning is cloud-based, and there are a large number of ChromeOS devices. First, because Schoology and Google Meet are browser-based, families can use other computers to log in. Second, Chromebooks are most likely immune to this Windows-based attack; and even if they are compromised, Chromebooks can be wiped by families at home. (While you may not have been able to sign in with your BCPS account “out of the box” before the hack, BCPS could set that up fairly easily as part of the recovery, although perhaps with a few compromises.) These two factors should leave a relatively small number of machines that need to be brought in for cleaning or exchanged. That could be done alongside password distribution at designated locations, with one station for passwords and another station for machines.
📈 How will schools manage the rest of their recovery?
In my book, a good recovery plan combines the best fundamental aspects of project and product management. At the root, you want to identify your core business drivers, and start breaking them down into independent pieces that you can rebuild quickly. As you do that, you can use the RICE formula to prioritize and “budget” your recovery efforts:
For example, you could compare email and VoIP recovery:
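RICE is usually computed as Reach × Impact × Confidence ÷ Effort. As a minimal sketch of how that comparison might look, the numbers below are purely hypothetical and are not taken from any real recovery; the `Service` class and its field scales are my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    reach: int         # people affected per week (hypothetical)
    impact: float      # rough scale, e.g. 0.5 = low, 1 = medium, 2 = high
    confidence: float  # 0.0 to 1.0, how sure you are of the estimates
    effort: float      # person-weeks to recover

    def rice(self) -> float:
        """RICE score: Reach x Impact x Confidence / Effort."""
        return self.reach * self.impact * self.confidence / self.effort


# Hypothetical estimates, for illustration only.
services = [
    Service("voip", reach=2_000, impact=1.0, confidence=0.5, effort=8.0),
    Service("email", reach=10_000, impact=2.0, confidence=0.8, effort=4.0),
]

# Highest score first: that is where recovery effort goes first.
ranked = sorted(services, key=Service.rice, reverse=True)

if __name__ == "__main__":
    for service in ranked:
        print(f"{service.name}: {service.rice():.0f}")
```

With these made-up inputs, email scores far higher than VoIP because it reaches more people and costs less to restore, so it would be budgeted first. Change the estimates and the ranking can flip, which is the value of writing them down.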
Unfortunately, many of your decision points are not so clear-cut, and are more about trading off quality for time-to-recover. Going back to our combination lock example, imagine that the school is now off-limits until some serious work is done. “For the sake of realism,” let’s add hypothetical explosive charges. Your experts are now saying you can’t use anything in the school building until it has been cleared by the bomb squad.
If you need to start teaching quickly, the fastest way to get back in business may be to grab some empty office space across the street and set up makeshift classrooms. But your teachers will then have to go beg, borrow, and steal everything they need for the new location. So your fundamental tradeoff is between doing something quickly and doing business as usual after a delay. And you face this choice at every turn.
Should you find yourself in this unfortunate situation, there are a few key principles to keep in mind when crafting a recovery plan of attack:
Don’t try to recover everything perfectly. Figure out what’s minimally necessary for each business function and move on it.
Pay attention to the marginal cost of recovering each service. For example, the cost of being without your timekeeping system is not that you miss payroll, it’s that you “round up” and pay everyone on the assumption that they earned a full paycheck. The cost of not having printing is that people print at home.
Eliminate as many dependencies as possible. If two parts of your recovery look like they influence each other, figure out the delay or expense of running them independently, and then deal with it.
Stay flexible. Teams will be delivering good and bad news by the minute. It’s important that each team knows what is driving their efforts, so that they can let leadership know when enough relevant facts have developed to consider shifting priorities.
One final point of interest is that “recovery” is a very fuzzy concept. A big chunk of recovery will happen inside of two weeks, and with a little luck, the majority can be recovered within two months. But the remaining devilish details can take a very long time, and there will be considerable pressure to pack the recovery plan. Rightly or wrongly, institutions often get into these situations because they’re not spending money, or at least not spending it the right way. So when the opportunity arises to invest in IT as part of recovery, there will be many voices in the room advocating for it. After all, if someone steals your 2010 Accord, your first thought is not