The Official Blog of United Solutions

The Core

Sajed on: Data Center Upgrades

May 11, 2016

How Adding a Core Switch Gives Our Data Center the ‘Five Nines’

 

For those of us who don't know, what is a core switch?

Sajed: The core switch is the central and key component of the entire network infrastructure within the data center. It’s a telecommunication device that receives a message from any device connected to it. Then it transmits the message only to the device for which the message was meant.
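To illustrate the "only to the device for which the message was meant" behavior, here is a minimal sketch of a simplified learning switch. The class, method and address names are hypothetical and chosen for illustration; this is not how a Cisco core switch is implemented, only the general idea of forwarding by learned address.

```python
class LearningSwitch:
    """Toy model of a switch that forwards frames by destination MAC address."""

    def __init__(self, port_count):
        self.ports = list(range(port_count))
        self.mac_table = {}  # MAC address -> port where it was last seen

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        # Learn which port the sender is attached to.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the incoming one.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            return []  # destination sits on the same port; nothing to forward
        return [out_port]


switch = LearningSwitch(port_count=4)
first = switch.receive(in_port=0, src_mac="aa:aa", dst_mac="bb:bb")   # bb:bb not learned yet
reply = switch.receive(in_port=2, src_mac="bb:bb", dst_mac="aa:aa")   # aa:aa known on port 0
second = switch.receive(in_port=0, src_mac="aa:aa", dst_mac="bb:bb")  # bb:bb now known on port 2
print(first, reply, second)  # [1, 2, 3] [0] [2]
```

Once the switch has seen traffic from a device, later messages for that device go out of a single port rather than to everyone.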

United Solutions’ investment in an enterprise-level core switch has helped optimize our critical line-of-business network devices, including Cisco firewalls, Cisco WAN/voice routers and the Cisco UCS chassis.

 

Explain the importance of upgrading the network with the second core switch.

Sajed: Upgrading the network with a second core switch addresses several important network design principles, such as resiliency, redundancy, port density and increased bandwidth throughout the data center.

Petabytes of data traverse our data center. To provide maximum uptime, we invested heavily in our network backplane to ensure business continuity. Our clients run their businesses all day, every day, including off hours.

Network redundancy is a simple concept to understand. If you have a single point of access and it fails you, then you have nothing to continue operations. If you put in a secondary (or tertiary) method of access, then when the main connection goes down, you will have a way to connect to resources and keep the business operational.

You always want a backup to create redundancy and load balancing, so that if one switch goes down there is a diverse path to route traffic through the secondary switch. We want to ensure our data center delivers five nines (99.999%) uptime for our clients.
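To put numbers on this, "five nines" means 99.999% availability. The sketch below works out how much downtime per year that allows, and how a redundant pair improves on a single device. The 99.9% single-switch figure is a hypothetical value for illustration, not a measurement from our data center, and the pair calculation assumes the two switches fail independently and either one can carry the traffic.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year allowed at a given availability (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel_availability(single, count):
    """Availability of `count` redundant devices.

    Assumes failures are independent and any one device can carry the load.
    """
    return 1 - (1 - single) ** count


five_nines = downtime_minutes_per_year(0.99999)
pair = parallel_availability(0.999, 2)  # 0.999 is a hypothetical single-switch figure
print(f"{five_nines:.2f} minutes of downtime per year at five nines")
print(f"{pair:.6f} availability for a redundant pair")
```

Under these assumptions, five nines leaves a little over five minutes of downtime a year, and the service is only down when both switches are down at the same time.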

 

What challenges did you all encounter during the implementation?

Sajed: As with any critical and significant project, we encountered some challenges during the implementation process. These challenges included significant hardware failure, licensing issues, and issues with neighboring critical devices. 

The first week we thought we were ready to install and configure the switch, but we encountered a hardware issue. We started turning up the switch and rebooted twice. On the third reboot, we received a fatal hardware error. Since it was a Saturday night, and our maintenance windows are extremely small (between 1am and 3am), we needed to halt the upgrade until a replacement arrived from Cisco. To limit the amount of downtime for our clients and their members, we pushed the upgrade to the following weekend.

The following weekend, all systems were a go (in a 3-hour maintenance window). We racked, installed, and rebooted the new switch about 15 times. No hardware issues were reported. We began configuring the switch and created the vPC (virtual Port Channel), which enables the redundancy (peer link) between the two switches. We were at the final step of going live about 1 hour into the upgrade when, all of a sudden, we ran into another error! Unfortunately, the replacement switch that was shipped to us needed a GEM (Generic Expansion Module) card, or daughter card, which did not come with the replacement. We needed to roll back the update. Thankfully, we’ve fostered a great relationship with Cisco and CDW, and we were able to have a new card within a few hours, on Sunday morning.

Early Sunday morning, we made the decision to go through another maintenance window, between 1am and 3am, to continue with the upgrade as originally planned. We were back up within 2 hours with all systems tested by the USC Network/Client Services team. Bottom line: the project was a success!

 

What advice would you have for others getting ready to do this at their shop?

Sajed: The best advice I can give is to prepare for significant projects, such as this, with as much cutover time as possible. Unfortunately, in the world of IT, it is inevitable that we will run into issues. With proper preparation, we can anticipate these types of events and mitigate the impact of the unexpected issues that arise.

We found it critical to the overall success of the project to have prewritten CLI (command line interface) scripts, TAC validation and cutover work plans. The prewritten CLI scripts allowed us to take the code we needed, vet it with the engineers at Cisco's Technical Assistance Center (TAC), and make sure we made no mistakes. Having a subject-matter expert review our work ensures we’re following best practice and security compliance. This critical resource really reduces the number of errors one would make.
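One way to make a cutover work plan concrete is to check, step by step, whether a full rollback would still fit inside the maintenance window. The sketch below is a hypothetical illustration of that idea, not the plan we actually used; the step names and minute estimates are made up, and `Step` and `last_safe_step` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    minutes: int           # estimated time to perform the step
    rollback_minutes: int  # estimated time to undo the step


def last_safe_step(steps, window_minutes):
    """Index of the last step after which a full rollback still fits, or -1."""
    elapsed = 0
    rollback = 0
    safe = -1
    for i, step in enumerate(steps):
        elapsed += step.minutes
        rollback += step.rollback_minutes
        if elapsed + rollback > window_minutes:
            break  # past this point, backing out would overrun the window
        safe = i
    return safe


# Hypothetical plan for a 3-hour (180-minute) window.
plan = [
    Step("rack and power on", 30, 10),
    Step("configure peer link", 40, 15),
    Step("move uplinks", 45, 30),
    Step("validate with client services", 30, 0),
]
go_no_go = last_safe_step(plan, window_minutes=180)
print(plan[go_no_go].name)  # move uplinks
```

In this made-up plan, the go/no-go decision has to be made by the end of "move uplinks": if validation fails afterwards, rolling everything back would run past the window.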

The last thing we want to do in our short 3-hour window is spend critical time on issues we could have mitigated. Considering the planning, meetings and time we invested in this project, and that we completed the cutover in only 3 hours, it is a significant feat! It did take us a little more time and three back-to-back weekends to get this accomplished; however, overall the project was a success. We’re cognizant of the sensitivity between our clients and their members, so we try to tighten up the window.

Fostering relationships with critical partners like Cisco and CDW is so important because it ensures our success.  One of my key areas of focus, since joining USC, is to build those relationships.

 

Are we in good shape now? What are the short/long term effects of the upgrades?

Sajed: United Solutions is now in much better shape to withstand the loss of a critical network core component within the architecture. Additionally, United Solutions has laid the foundation for easy scalability and increased the overall available backbone bandwidth within the architecture. This helps ensure United Solutions provides ‘five nines’ uptime for our clients.

It also allows our credit unions to have high availability (HA), and it prepares them for business continuity and disaster recovery (BCDR). We’re mitigating our risk of any potential downtime, as most of our clients connect to our data center. By making these investments, we’re providing redundancy to ensure high availability.