Incident Report – Canvas Service Disruption (November 2019)

The following serves as an incident report directly from Instructure, the makers of Canvas, for the service disruption of Canvas on November 6th & 7th, 2019.

Technical questions can be directed to Canvas Faculty Support.
See the Help menu within Canvas.

Summary

“Users worldwide were experiencing page errors between 5:55 PM MST November 6th and 9:38 AM MST November 7th, 2019. An issue during a deploy clean-up caused the initial incident, and while it was mostly resolved the night of the 6th, there were sporadic errors until the following morning.

OVERVIEW
Canvas users started experiencing page errors when attempting to access Canvas at 5:55 PM MST November 6, 2019. Service was fully restored by 9:38 AM MST November 7th, 2019.

DETAILS
On November 6th, one of the post-deploy cleanup phases removed an unused database column that was no longer needed in the newly deployed code. However, on this occasion an instruction to now ignore the deleted column was accidentally removed. Because of this, servers were still expecting data for that column which no longer existed, resulting in page errors in Canvas. As this clean-up ran across each server, users started experiencing these page errors between 5:55 PM and 6:15 PM MST.
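
The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not Instructure's code: it assumes an application process that caches a table's column list at startup and keeps an optional list of columns to ignore. The names `CachedModel`, `drop_column`, `run_scenario`, the `courses` table, and the `legacy_flag` column are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


class CachedModel:
    """Toy stand-in for a server process that caches a table's columns at startup."""

    def __init__(self, conn, table, ignored_columns=()):
        self.conn = conn
        self.table = table
        # Column names are read once and cached, minus any ignored columns.
        names = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        self.columns = [c for c in names if c not in ignored_columns]

    def fetch_all(self):
        # Every query selects the cached columns, whether or not they still exist.
        cols = ", ".join(self.columns)
        return self.conn.execute(f"SELECT {cols} FROM {self.table}").fetchall()


def drop_column(conn, table, keep):
    """Simulate the destructive cleanup by rebuilding the table with only `keep`."""
    cols = ", ".join(keep)
    conn.execute(f"CREATE TABLE {table}_new AS SELECT {cols} FROM {table}")
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {table}_new RENAME TO {table}")


def run_scenario(ignored_columns):
    """Start a 'server', drop the column underneath it, then serve a request."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE courses (id INTEGER, name TEXT, legacy_flag INTEGER)")
    conn.execute("INSERT INTO courses VALUES (1, 'Biology', 0)")
    model = CachedModel(conn, "courses", ignored_columns)
    drop_column(conn, "courses", ["id", "name"])
    try:
        return model.fetch_all()
    except sqlite3.OperationalError as exc:
        return f"page error: {exc}"
```

With the ignore entry in place, `run_scenario(("legacy_flag",))` returns the row as usual. Without it, `run_scenario(())` returns a "page error" string, because the cached column list still names a column that no longer exists.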

At 6:03 PM MST, Support started to see error reports and began to investigate the issue. At 6:15 PM MST, Support alerted our DevOps team. After investigating, the issue was identified and the restart command to flush the server cache was issued at 6:24 PM MST. Canvas returned to full functionality between 6:30 PM and 6:47 PM MST (depending on when each server completed this step).

While this solved a majority of errors for users, some of the processes on your database server didn’t update correctly, which resulted in a small, steady trickle of errors throughout the night. Due to a monitoring oversight, our engineers were not alerted to these. Our Support team noticed this at 9:24 AM MST on November 7th, 2019, and reported it to our DevOps team who were able to deploy a fix. Canvas returned to full functionality at 9:38 AM November 7th, 2019.
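
A small, steady trickle of errors can slip under alerts that only fire on large spikes. As a rough sketch of one way to catch it, assuming error counts are collected at fixed intervals, the hypothetical `TrickleAlert` class below fires once every interval in a full window has reported at least a minimum number of errors. The window size and threshold are illustrative values, not figures from the report.

```python
from collections import deque


class TrickleAlert:
    """Fires when errors persist across every interval of a sliding window."""

    def __init__(self, window=6, min_errors_per_interval=1):
        # Keep only the most recent `window` interval counts.
        self.recent = deque(maxlen=window)
        self.min_errors = min_errors_per_interval

    def observe(self, error_count):
        """Record one interval's error count and report whether to alert."""
        self.recent.append(error_count)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(c >= self.min_errors for c in self.recent)
```

For example, with `window=3`, feeding the counts 2, 0, 1, 1, 3 raises the alert only on the last observation, when three consecutive intervals have all contained errors.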

MITIGATION
Although the incident’s cause and solution were rapidly identified by our engineering team, the issue was preventable, and could have been identified and fixed sooner.

Our teams have conducted extensive retrospective analysis and have identified a number of actions that will prevent issues like this in the future and provide faster detection and resolution:
• Making additions to our automated alerts to provide faster, failsafe notifications directly to our engineering teams for problems of this type.

• Improving notification to our Support teams of the migration schedule to aid in faster problem identification.

• Implementing automated development-time code checks to specifically detect the presence of a destructive database migration without its supporting code changes.

• Disseminating training to our teams on this specific category of migration, using this event as a case study.

• Investigating improvements to our progressive deploy environment to allow migrations to be run progressively per environment and region during the progressive rollout, rather than all at once at the completion of the deploy.

• Improving our migration-specific workflow to ensure that when changes are accelerated through the deploy process, they still have passed all migration testing checkpoints.
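
The development-time check mentioned in the list above (detecting a destructive migration without its supporting code changes) could look roughly like the following. This is a sketch under stated assumptions, not Instructure's tooling: it assumes migrations are text files that drop columns with a `remove_column :table, :column` statement, and that the set of ignored columns per table can be supplied as a dictionary. The function name `unsupported_drops` and the sample table and column names are hypothetical.

```python
import re

# Assumed migration syntax: remove_column :table_name, :column_name
REMOVE_COLUMN = re.compile(r"remove_column\s+:(\w+)\s*,\s*:(\w+)")


def unsupported_drops(migration_source, ignored_columns):
    """Return (table, column) pairs dropped without a matching ignore entry.

    `ignored_columns` maps a table name to the set of columns the
    application code has been told to ignore.
    """
    problems = []
    for table, column in REMOVE_COLUMN.findall(migration_source):
        if column not in ignored_columns.get(table, set()):
            problems.append((table, column))
    return problems
```

A check like this could run before merge and fail the build when the returned list is non-empty. For instance, a migration that drops `courses.legacy_flag` and `users.old_token` while only `legacy_flag` is ignored would be flagged for `("users", "old_token")`.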

CONCLUSION
We understand any challenges encountered while using Canvas are frustrating, and they impact your ability to serve your students and teachers. We are taking what we have learned from this incident to improve how we detect and identify these issues in the future. We are deeply sorry for the impact this had on your Canvas users.”

~ Instructure, Inc.
