Updated: 04/28/2022
Original Publish Date: 02/18/2022
The information contained in this document is provided by Cheetah Digital for general information purposes only. It is based on information available as of the date of distribution and is subject to change.
Incident Update: Root Cause Analysis and Remediation Plan
This article defines the issue (what happened, including scope and timing), explains why it happened, and describes how Cheetah Digital has worked to remove the risk of this issue recurring.
Service Disruption Incident: 9:30 am PST, November 26, 2021
Impact Period: 11/26/2021-11/27/2021
Services/Regions affected during incident
Services: CheetahMail and Cheetah Messaging (including Marketing Suite)
Regions: North America and parts of Asia-Pacific
Services/Regions NOT affected during incident
Services: Cheetah Experiences, Cheetah CES Core Services, Cheetah Engagement Data Platform (EDP), and Cheetah Loyalty (except instances where Cheetah Messaging (including Marketing Suite) is configured to send messages).
Incident Summary
On Friday, November 26, at approximately 9:30 am PST, a fault occurred in our data center that hosts CheetahMail and Cheetah Messaging (including Marketing Suite) services for North America and parts of Asia-Pacific.
The fault was a hardware failure in our Cisco switching environment and a loss of shared storage on the Hewlett Packard Enterprise (HPE) 3PAR Storage Area Network.
Cheetah Messaging (including Marketing Suite) customers were impacted and unable to use these services during the outage. CheetahMail customers experienced a service interruption following the network restart phase of the incident.
The fault was resolved Saturday, November 27, at 1:45 am PST, resulting in a service outage of 16 hours and 14 minutes for Cheetah Messaging (including Marketing Suite). CheetahMail had a service outage of 8 hours and 10 minutes, beginning Friday, November 26, at 11:43 am PST.
Root Cause Analysis (RCA)
Our monitoring detected the fault immediately, and our Holiday Response Team and a multidisciplinary Cheetah team began working on the incident within 5 minutes. The incident was declared a P1, and an investigation commenced immediately.
The fault impacted different service and platform components and created intermittent login and connectivity issues with servers and services behind our load balancers. This led to an initial determination of a network fault.
In parallel with the ongoing network investigation, our Security Team investigated the incident and determined that it was not a cyber attack. At this point, our Disaster Recovery (DR) option was evaluated. It was determined that service restoration would be a faster path for our customers than the 24-hour recovery time objective (RTO) for DR.
Our standard procedure for this type of incident is to focus on network functions and work through each tier logically. Load balancers were cycled first, then the front-end and back-end firewalls. This did not resolve the connectivity issue. Next, the access switches and core switches were reset. During this reset, the Cisco Technical Assistance Center (TAC) was engaged. One of the access switches failed to recover after the reset, and Cisco identified a hardware fault. The faulty switch was removed from the High Availability configuration. All networking functions were reset by 2:50 pm PST, but connectivity issues persisted.
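For illustration only, the sketch below shows the kind of tier-by-tier reachability check this procedure implies, probing from the load balancers inward and reporting the result at each step; the hostnames, ports, and tier ordering are hypothetical placeholders, not Cheetah Digital's actual tooling.

```python
# Hypothetical sketch: probe each network tier in order (edge inward) and
# report whether a basic TCP reachability check succeeds at each step.
# Hostnames and ports are illustrative placeholders, not real infrastructure.
import socket

TIERS = [
    ("load-balancer", "lb.example.internal", 443),
    ("front-end-firewall", "fw-front.example.internal", 22),
    ("back-end-firewall", "fw-back.example.internal", 22),
    ("access-switch", "sw-access.example.internal", 22),
    ("core-switch", "sw-core.example.internal", 22),
]

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_tiers() -> None:
    for name, host, port in TIERS:
        status = "OK" if tcp_reachable(host, port) else "UNREACHABLE"
        print(f"{name:20s} {host}:{port} -> {status}")

if __name__ == "__main__":
    check_tiers()
```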
The access switch reboot caused instability in the Network File System (NFS) cluster used by CheetahMail, leaving CheetahMail customers unable to use the service. The Cheetah team split into two workstreams, one dedicated to the connectivity issues and the other to the CheetahMail NFS issue. The Cheetah Platform Engineering Team restored NFS service first by directing NFS access to storage and subsequently restoring the NFS cluster itself. Customer access to CheetahMail was restored at 7:53 pm PST.
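As a purely illustrative sketch (the mount path is a hypothetical placeholder and this is not Cheetah Digital's tooling), a basic client-side check that an NFS mount is present and answering filesystem queries might look like this:

```python
# Hypothetical sketch: a basic client-side health check for an NFS mount.
# It confirms the path is a mount point and that the filesystem answers a
# statvfs call. Note: a hard-hung NFS server can block statvfs indefinitely,
# so production checks typically add a timeout or run out-of-process.
import os

def nfs_mount_healthy(mount_point: str) -> bool:
    """Return True if mount_point is mounted and reports filesystem stats."""
    if not os.path.ismount(mount_point):
        return False
    try:
        stats = os.statvfs(mount_point)
    except OSError:
        return False
    # A zero block count usually indicates a stale or empty export.
    return stats.f_blocks > 0

if __name__ == "__main__":
    # "/mnt/example-nfs" is an illustrative placeholder path.
    print(nfs_mount_healthy("/mnt/example-nfs"))
```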
The investigation of the connectivity issue continued. The majority of virtual hosts and the VMware Management Console were inaccessible. After further analysis of SQL database I/O traffic, it was determined that the HPE 3PAR Storage subsystem was not responding as expected. This issue was escalated to the Hewlett Packard Enterprise (HPE) Technical Team, which took ownership of the diagnosis and resolution of the storage issues at 5:15 pm PST.
The HPE Technical Team initially rebooted one of the 3PAR Storage Nodes, expecting the highly redundant 3PAR storage array to correct itself. However, after further investigation it became clear that the entire array had to be rebooted to bring it back online. As part of the reboot, a comprehensive filesystem check had to be performed. The 3PAR Storage subsystem was declared available at 1:30 am PST on Saturday, November 27.
Our virtual machines and the remainder of our systems were back online at 1:45 am PST on Saturday, November 27, after a full application system restart.
Due to the nature of the incident, a comprehensive remediation plan is underway. A cross-functional Cheetah Digital team and our partners Cisco and HPE are analyzing all possible remediation opportunities to dramatically reduce the risk of such an incident occurring in the future. Our plans will be communicated as progress is made over the coming weeks.
Remediation Plan
Immediate:
- Recover high-availability network configuration and replace failed hardware
- Create a series of Public/Customer blog posts that include monthly updates on progress
Near-term:
- Improve Out of Band (OOB) Management capability
- Audit supportability of all hardware and software in Data Centers
- Understand whether we can improve existing DR options by leveraging VMware Cloud on AWS
- Improve Incident Response process
Mid-term:
- Identify performance improvement options for HPE 3PAR Storage
- Investigate whether processing requirements can be met in the cloud
- Identify options to improve current network design
- Implement enhanced DR capabilities
Update - March/April 2022
Hewlett Packard Enterprise (HPE) 3PAR Storage Area Network Fault:
We are duplicating the storage in the NY5 data center and will be spreading the I/O load across both devices. We have reclaimed the space on the new 3PAR unit.
- The additional storage unit will be installed in Summer 2022
- The I/O load distribution will be finished by Fall 2022
We identified a number of storage optimizations for the existing 3PAR unit in the areas of backups and database storage. We have started testing these changes in our performance environment. Overall, we expect the modifications to reduce storage load by 10-20%.
Cisco Switching Environment Fault:
We have purchased the vendor-recommended replacement equipment and are awaiting delivery.
Update - Feb 2022
Hewlett Packard Enterprise (HPE) 3PAR Storage Area Network Fault:
We are duplicating the storage in the NY5 data center and will be spreading the I/O load across both devices. We are currently reclaiming space on the new 3PAR unit.
- The additional storage unit will be installed in Summer 2022
- The I/O load distribution will be finished by Fall 2022
Cisco Switching Environment Fault:
We are purchasing the vendor-recommended replacement equipment.
Out of Band (OOB):
Completed the enhancement of our OOB management capability in our NY5 data center.
Update - Jan 2022
Hewlett Packard Enterprise (HPE) 3PAR Storage Area Network Fault:
The RCA from our storage vendor, HPE, indicates that the unit locked up due to heavy sustained write traffic. The recommended remediation is to lower the write traffic on the unit. To do so, we will be duplicating the storage in the NY5 data center and will be spreading the I/O load across both devices.
- The additional storage unit will be installed in Summer 2022
- The I/O load distribution will be finished by Fall 2022
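As a conceptual sketch of spreading I/O load across two storage devices (the volume names and array labels below are hypothetical placeholders, not the actual 3PAR configuration), logical volumes can be mapped to one of two arrays with a stable hash so that write traffic divides roughly evenly between them:

```python
# Conceptual sketch only: assign logical volumes to one of two storage arrays
# using a stable hash, so write traffic splits roughly evenly across devices.
# Volume names and array labels are hypothetical placeholders.
import hashlib

ARRAYS = ["3par-a", "3par-b"]

def array_for_volume(volume_id: str) -> str:
    """Deterministically map a volume to one of the two arrays."""
    digest = hashlib.sha256(volume_id.encode("utf-8")).digest()
    return ARRAYS[digest[0] % len(ARRAYS)]

if __name__ == "__main__":
    for vol in (f"vol-{i:03d}" for i in range(10)):
        print(f"{vol} -> {array_for_volume(vol)}")
```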
Cisco Switching Environment Fault:
Replaced the remaining single network switch with a pair of redundant switches. We are reviewing further equipment replacements with our vendor.
Out of Band (OOB):
Enhanced our OOB management capability in our NY5 data center.
Update - Dec 2021
Hewlett Packard Enterprise (HPE) 3PAR Storage Area Network Fault:
Performed a configuration audit and performance analysis
Cisco Switching Environment Fault:
Purchased redundant network hardware to replace the remaining single network switch
Support Contract Audit:
Audited our support contracts for networking, storage and systems