During the Purdue Airlink (PAL) outage in January, Information Technology at Purdue (ITaP) worked more than 72 consecutive hours over one weekend to get the internet back up and running for panicking students.
Mark Sonstein, executive director of IT infrastructure services at Purdue, worked with his team in ITaP’s “war room,” where they spent hours fixing the problems that plagued students who wanted to finish their homework.
Now, the issue that caused January’s outage, a bug in the Cisco system Purdue uses, has been resolved, though outages may still happen in the future.
“I can’t buy the equipment we need five years from now,” Sonstein said.
He explained that current technology cannot be expected to outperform that of the future, comparing it to expecting a phone released five years ago to do what today’s cellphones can.
“It’s really cyclical,” Sonstein said.
For major PAL 3.0 outages like the one in January, procedures are in place to find a solution as efficiently as possible.
First, all teams report to the war room for an initial assessment, which may take anywhere from two to four hours, according to Sonstein. After the issue is analyzed, whichever team is most relevant takes charge, depending on whether it is a software, hardware or infrastructure problem, and the rest are dismissed. Then workers are assigned to shifts, and those who leave are meant to sleep so they can come back ready to work in about eight hours.
“I need them to get sleep,” Sonstein stressed, especially because those working through the night produce work that incoming workers must review.
In one outage, hundreds of lines of configuration code meant to solve PAL problems were written on the spot and had to be checked for bugs by the next shift.
“My team is very focused on ‘let’s find the problem, let’s fix the problem, let’s get service back,’” Sonstein said. “But there are specific steps that need to be taken, like a communications plan. Those types of things that have to happen, that my folks that are focused on fixing the problem, we need that major incident manager to say ‘Wait, have we thought through this?’
“That’s why we put those processes in place where someone else reviews the code before it gets deployed.”
One team takes special care to alert others to any outages because of “fire and safety and environmental concerns,” said Rick Rodriguez, director of services management. “Some quick triage” ensures that all bases are covered when it comes to PAL outages.
To combat future outages, ITaP replaced its systems completely, a $30 million venture. This isn’t the last update, Sonstein said, as students will always require more bandwidth to keep up with advancing technology.
“As we start to get too close to that five year point, we’ll be doing another life cycle replacement,” Sonstein said. “Because we need new equipment that’s going to keep up with the demand.
“So I think over the next two or three years, we’re probably going to see really less utilization versus what the equipment’s capable of, but as we start getting closer to that five year point, we’re going to see where the student devices are starting to overcome the capabilities of the system.”
“This is not like an uncommon thing with IT equipment,” Greg Kline, IT communications manager, said. “We tend to think in five-year cycles.”
“So we’re good for the next five years. We’re not anticipating that the demand will, or ever, go down,” Kline said with a laugh.
After a crisis has been resolved, ITaP tries to learn from the experience to better prevent the issue from happening again.
“Once we go through it, we go back and look at it and see how our procedure worked and what we might need to do differently,” Kline said. “Also, periodically we have simulations. We play little war games and pretend that it happened and look at how we respond to it.”