The following report details the findings after a thorough investigation of this incident by Omnilert’s team as well as the steps taken to prevent future issues of this kind from occurring.
DESCRIPTION OF ISSUE:
Some Omnilert customers experienced a delay in sending alerts during the early morning hours of January 16th. We learned of the problem at 4:10 a.m. ET, and DevOps engineers were working on the issue within twenty minutes. Messaging services were fully restored at 6:08 a.m. ET, and SNPP trigger services were restored at 6:20 a.m. ET. No data or messages were lost, and all messages were processed normally once the cause of the delay was identified and corrected.
The major system architecture upgrade that Omnilert is completing this month (which includes the migration to the Virtual Private Cloud described below) is a "once a decade" change. We did not anticipate this configuration problem with server scripts, and we did not adequately test how the servers behaved after stopping and starting as load changed, or when other routine maintenance was performed by our SaaS cloud provider.
Immediately prior to this incident, Omnilert engineers had made system architecture changes as part of Omnilert's Information Security Initiative. No impact on production services was anticipated from this back-end maintenance. However, we discovered that we had not taken all system and network dependencies into account. During the move of servers and various services to a Virtual Private Cloud (VPC), an unexpected and unusual configuration issue was discovered in the scripts that verify the servers start the correct background processes. As part of Omnilert's Information Security Initiative to further protect our platform from cyber attacks, the configuration specifies a single approved script that can start or restart the servers involved in this incident. When those servers were moved into the VPC, this check for the correct script failed for a complex combination of reasons.
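To illustrate the failure mode, a check of this kind can be sketched as follows. This is a hypothetical sketch in Python; the paths, names, and mechanism are invented for illustration and are not Omnilert's actual configuration:

```python
# Hypothetical "approved starter script" check (illustrative only).
# A server will only start its background processes when invoked from
# an allowlisted script path. All paths below are invented examples.
APPROVED_STARTERS = {"/opt/omnilert/bin/start-queue-workers.sh"}  # assumed pre-VPC path


def may_start_services(invoking_script: str) -> bool:
    """Allow background processes to start only from an approved script."""
    return invoking_script in APPROVED_STARTERS


# Before the migration, the servers start normally:
assert may_start_services("/opt/omnilert/bin/start-queue-workers.sh")

# After the move into the VPC, the same script is invoked from a
# different location, so the unchanged allowlist silently refuses
# to start the queue workers:
assert not may_start_services("/srv/vpc/omnilert/bin/start-queue-workers.sh")
```

The point of the sketch is that the check itself is sound; it fails only because the environment around it changed, which is exactly the kind of dependency the improved change management procedures are meant to catch.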
Omnilert engineers were called in during the middle of the night to help identify the problem, which took approximately two hours to diagnose and remediate. This was an unusually challenging configuration error to find and fix. After removing the restriction on the specific script that could start background processes on the affected servers, the queue servers within our cluster were restarted and the messages in the queue were processed normally.
No further issues with how background services are started or restarted in the new Virtual Private Cloud architecture have been observed since the incident, indicating that the problem with the earlier restriction on the script used to start background processes on the affected servers has been resolved.
Eliminating unnecessary delays in processing messages is a key principle guiding the development and operation of the Omnilert platform. We have spent considerable time assessing what happened and capturing the lessons learned. A few days after the incident we held an offsite meeting with all developers responsible for the portion of Omnilert's product offering involved in this incident. We have captured those lessons and identified ways to improve our change management procedures for the network and other parts of the platform.
We’ve taken the following steps since the incident to further reduce the possibility of another occurrence:
1) Additional change management procedures have been put in place to carefully test any future changes in the VPC architecture that might affect services. This incident revealed that we need better procedures in place to test system and network dependencies inside and outside our network.
2) Additional server/services redundancy has been implemented, with more servers now available to handle queued messages. Ideas for further improving the resiliency of the platform are being reviewed and added to the Q1 development backlog.
3) Additional proactive monitoring has been implemented, with more agents running continuously to detect any delays. As we develop new versions of the application software, we will be adding additional ways to "instrument" and monitor the system. As we enhance our cloud-based network infrastructure, we will be adding even more sophisticated tools to increase the resiliency of the network and allow us to anticipate and forestall outages through proactive monitoring.
4) We have management approval to hire additional Customer Support personnel and bolster off-hours coverage of network operations. We are recruiting an additional support engineer to expand coverage of the after-hours shift in our 24/7/365 support operation. We are also enhancing our procedures for escalating calls from support personnel to engineers and developers during the middle of the night.
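The proactive delay monitoring described in item 3 above can be sketched as a simple check on the age of the oldest queued message. This is an illustrative sketch only; the threshold value and function names are assumptions, not Omnilert's actual monitoring implementation:

```python
# Illustrative monitoring check: flag the queue as delayed when the
# oldest unprocessed message is older than a threshold. The threshold
# below is an assumed example value, not a real Omnilert setting.
import time

DELAY_THRESHOLD_SECONDS = 60.0  # assumed alerting threshold


def oldest_message_age(enqueue_times: list[float], now: float) -> float:
    """Age in seconds of the oldest unprocessed message (0 if queue is empty)."""
    return (now - min(enqueue_times)) if enqueue_times else 0.0


def queue_is_delayed(enqueue_times: list[float], now: float) -> bool:
    """True when a monitoring agent should raise a delay alert."""
    return oldest_message_age(enqueue_times, now) > DELAY_THRESHOLD_SECONDS


now = time.time()
assert not queue_is_delayed([now - 5.0, now - 30.0], now)  # recent messages: healthy
assert queue_is_delayed([now - 300.0], now)                # 5-minute-old message: alert
```

An agent running a check like this continuously would have surfaced the January 16th backlog as soon as message age crossed the threshold, rather than waiting for downstream symptoms.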