Salesforce Database Fiasco
To all of our @salesforce customers, please be aware that we are experiencing a major issue with our service and apologize for the impact it is having on you. Please know that we have all hands on this issue and are resolving as quickly as possible.
– Parker Harris (@parkerharris), May 17, 2019
Current News Coverage
The Register - Salesforce database outage
May 18 Status Update
Service availability appears to be restored for the majority of users, although some are still locked out of their data. Based on initial reporting, many details remain unclear.
…a database script deployment that inadvertently gave users broader data access than intended.
– ZDNet reporting
Due to the lack of technical details in initial reporting, there are many unanswered questions. Unanswered questions naturally lead to idle speculation.
Firstly, service remediation appears to be at least partially manual. This conclusion is speculative, but it is based on evidence in the initial reports and on the careful wording of company statements since the incident broke. The recovery has spanned a long time, has apparently been intensively staffed, and service restoration has been fragmented and ongoing.
While in no way conclusive, these are signs of a potentially manual, disorganized, and unplanned-for rollback.
Secondly, a primary feature of this misconfigured deployment is that all users were granted read-write access to all databases.
How long was the misconfiguration live before it was noticed? Before access was revoked?
How did Salesforce finally notice the misconfiguration? Did they have a horrified scramble after a puzzled customer reached out?
If password hashes were leaked, it would be prudent to force password resets on users. Yet, no reporting or updates from Salesforce have mentioned as much.
A detail that has been missing so far is whether, or to what extent, any affected databases saw unauthorized access or modification. Hopefully, remediation includes rolling databases back to a known good state, which would at least ensure data integrity from before the incident. However, given the wording and the focus on “restoring” permissions and access, it’s hard to know exactly how unauthorized access or modification is being handled.
According to reports, a misconfigured deployment script ran, which modified all customer databases to allow read-write access to all users.
What was the review process for this deployment? Was it approved?
Did this deployment go to a staging environment before production? How was that environment verified?
How was the deployment monitored?
How was the new deployed state verified?
How was monitoring conducted on the affected infrastructure?
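The questions above all point at missing guardrails. One cheap guardrail is a pre-deployment audit that refuses to run any script whose intended permission grants are too broad. This is a minimal sketch under assumed data shapes (the grant format, the "broad subject" names, and both function names are illustrative, not Salesforce's actual tooling):

```python
# Hypothetical pre-deployment guard: scan a script's intended permission
# grants and refuse to run anything that opens writable access too broadly.
# The grant format and the BROAD_SUBJECTS set are assumptions for illustration.

BROAD_SUBJECTS = {"*", "all_users", "public"}

def audit_grants(grants):
    """Return the grants that look dangerously broad.

    Each grant is a dict like:
        {"subject": "dba_team", "object": "customer_db_1", "access": "rw"}
    """
    violations = []
    for grant in grants:
        broad = grant["subject"] in BROAD_SUBJECTS
        writable = "w" in grant["access"]
        if broad and writable:
            violations.append(grant)
    return violations

def safe_to_deploy(grants):
    """A deployment gate: True only when no grant trips the audit."""
    return len(audit_grants(grants)) == 0
```

A check like this would not catch every bad script, but a read-write grant to every user on every database is exactly the kind of coarse mistake a simple lint can stop before it ships.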
The Hard Truth
A mistake like this only happens when there is a lack of controls, or very weak controls, at many levels: from the organization, to its review and deployment processes, to its monitoring and alerting capabilities. Legitimate oversights and mistakes do happen, which is why mature engineering organizations rely on a multi-layered process to stop defects throughout the SDLC.
Best Practice Solutions
While no strategy guarantees safety against defects, there are practices and processes which are proven to reduce the number of defects that ultimately reach production and impact paying customers.
For the defects that do make it through to production, there are strategies which will help with response and recovery.
Code reviews are an easy, low cost, high value process to add to the SDLC. In addition to catching defects, code reviews are an excellent way to share knowledge across an organization.
Senior engineers get daily opportunities to mentor and coach their peers. It’s also a great opportunity to ask questions, gather feedback, and improve implementation.
Configuration management is critical to a few key defect-minimization strategies. Having fully automated deployment and rollback procedures almost requires some level of configuration management. Configuration versioning and secrets management are other important pieces to this puzzle.
Terraform by HashiCorp is one of many great options for configuration as code.
Vault (also by HashiCorp) is a secret management option which has a CLI, a web interface, and integrates natively with cloud providers. It’s pretty intuitive and easy to use.
Every deployment to production should be preceded by a deployment to a similar production-like environment. That environment should be monitored during and after deployment. It should be tested rigorously with smoke tests, UI tests, or a manual QA team.
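Those post-deploy checks are easiest to enforce when they run through one harness whose result gates the rollout. A minimal sketch, where the checks themselves (callables returning True/False) are stand-ins for real probes such as HTTP health endpoints or login flows:

```python
# Minimal smoke-test harness sketch: run a list of named checks against a
# freshly deployed environment and report every check that did not pass.
# A raised exception counts as a failure rather than aborting the run.

def run_smoke_tests(checks):
    """checks: list of (name, callable) pairs. Returns the names that failed."""
    failures = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures
```

A promotion to production would then be gated on `run_smoke_tests(...)` returning an empty list for the staging environment.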
Deployments should roll across clusters or partitions in phases, with each phase monitored against expectations. Rollbacks, particularly rollbacks of schema changes, can be difficult or impossible to automate; when that’s the case, backups need to be hot and ready to restore from. The deployment should be scheduled for the lowest-use time of the week.
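A phased rollout with a health gate between phases can be sketched in a few lines; `deploy`, `health_check`, and `rollback` here are placeholders for real tooling, not any particular vendor's API:

```python
# Sketch of a phased rollout: deploy one partition at a time, verify health,
# and roll back every partition deployed so far at the first failed phase.

def phased_rollout(partitions, deploy, health_check, rollback):
    """Deploy partition by partition; return True only if every phase passed."""
    deployed = []
    for partition in partitions:
        deploy(partition)
        deployed.append(partition)
        if not health_check(partition):
            # Unwind in reverse order, newest partition first.
            for p in reversed(deployed):
                rollback(p)
            return False
    return True
```

Had the Salesforce script rolled out this way, a permissions check failing on the first partition would have stopped the change from ever reaching every customer database.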
Incident Planning, Disaster Recovery
Proper planning prevents piss-poor performance.
– Coach Hanika, high school wrestling
Fail to plan, plan to fail.
Both of these statements are corny, often repeated, and completely true. Incident response requires planning, organization, practice, trust, and coordination.
Designated roles reduce confusion and response time. Roles will vary by organization and its individual requirements, but some common ones include:
- incident commander
- communications point person
- scribes (for documenting timelines and actions taken)
With these roles and an incident response plan in place, mean time to recovery (MTTR) has the best chance of being minimized. Investigators are insulated from fly-by questions and are allowed to focus on the problem. The company and its various teams can stay up to date on any news via the communications point person. And the incident commander ensures that the response stays focused and that investigators have all the resources they need to complete the recovery.