I started at Wonolo in 2019. Back then, there was no rotation, no on-call onboarding, no clear expectations, no playbooks, and no pager system, just alerts in a single Slack channel. If you were a backend engineer, you were expected to fix any alerts and bring the system back to a stable state. Without rotations and well-defined processes, backend engineers were constantly burdened with potential production issues, which put them at risk of burnout and fatigue.
Today, we have a much more defined process. We use an on-call rotation and onboard all new engineers who contribute code to the system. When an incident does occur, we hold a blameless postmortem and update the on-call playbooks to incorporate what we learned during the incident.
Rotation: Who Goes On Call
Everyone. More specifically, every Engineer, Engineering Manager, Director and VP at Wonolo is expected to join an on-call rotation three months after their start date. The following chart is taken from our on-call policy in Confluence:
How We Do On-Call Onboarding
The main goal of our on-call onboarding is to set everyone up for success and ensure they are as comfortable as possible before starting their first on-call rotation. We achieve this with well-written documentation that is sent out in advance of an onboarding session. The documentation includes:
- On-Call Overview: Policy, rotation, responsibilities, escalation, mitigation, and triage
- On-Call Setup: How to properly set up your mobile device
- On-Call Incident Response: How to handle suspected and confirmed production issues
- On-Call Playbooks: Describes symptoms, impact, action, and helpful links
The onboarding session itself outlines clear expectations for being on call and, more importantly, helps you feel like you are not alone. We try to reassure everyone that it's always okay to ask anyone for help or to escalate issues in the on-call rotation.
Clear Expectations: The Responsibilities of Being On Call
The ultimate goal of being on call is to remediate any issues with the platform deemed a P0.
A P0 means that the system is in an unhealthy state, not functioning, and should be fixed or mitigated down to a lower priority level as soon as possible. We have reduced the main responsibility to three steps:
- Alert: Acknowledge receipt of pager message.
- Assess: Determine and communicate impact to stakeholders.
- Mitigate & Triage: Resolve or reduce to P1, and triage to the appropriate team.
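As a rough illustration, the three steps above could be modeled as a small ordered workflow. This is a minimal sketch and not a description of our actual tooling: the `Incident` class, its fields, and the method names are all hypothetical, and the only rule it encodes is that the steps happen in order and that mitigation brings a P0 down to P1.

```python
from dataclasses import dataclass, field
from enum import Enum


class Priority(Enum):
    P0 = 0  # system is unhealthy or not functioning
    P1 = 1  # reduced severity after mitigation


class Step(Enum):
    ALERT = "alert"        # acknowledge receipt of the pager message
    ASSESS = "assess"      # determine and communicate impact
    MITIGATE = "mitigate"  # resolve or reduce to P1, then triage


@dataclass
class Incident:
    """Hypothetical record of one P0 incident moving through the three steps."""
    summary: str
    priority: Priority = Priority.P0
    completed: list = field(default_factory=list)
    impact: str = ""
    owner_team: str = ""

    def _advance(self, step, required_before):
        # Enforce the order Alert -> Assess -> Mitigate & Triage.
        if self.completed != required_before:
            raise RuntimeError(f"cannot run step '{step.value}' yet")
        self.completed.append(step)

    def acknowledge(self):
        self._advance(Step.ALERT, [])

    def assess(self, impact):
        self._advance(Step.ASSESS, [Step.ALERT])
        self.impact = impact  # what gets communicated to stakeholders

    def mitigate_and_triage(self, owner_team):
        self._advance(Step.MITIGATE, [Step.ALERT, Step.ASSESS])
        self.priority = Priority.P1   # mitigated down from P0
        self.owner_team = owner_team  # team that takes over the follow-up
```

Calling `acknowledge()`, then `assess(...)`, then `mitigate_and_triage(...)` walks an incident through the full sequence; calling them out of order raises an error, which mirrors the idea that impact should be communicated before anyone starts changing things.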
Creating a Useful On-Call Playbook
The On-Call Playbook was created with ease of use in mind. It is a table with one row for every alert we have created and every symptom we have encountered. Each row contains Symptom, Impact, Action, and Helpful Links. Here is a small sample of the on-call playbook:
- Symptom: Contains the exact text that maps to an alert.
- Impact: The best guess of the impact on the system, to help speed up the assessment and prioritization of the situation.
- Action: Suggested action for the on-call person to take.
- Helpful Links: Links to the exact logs, supporting metric charts, how-to documentation, etc.
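To make the row structure concrete, here is a minimal sketch of the playbook as data. Our real playbook is a table in our documentation, not code, so everything here is an assumption made for illustration: the `PlaybookEntry` type, the `find_entry` helper, and the sample row (including its alert text and the example.com link) are all made up. The one idea taken from the playbook itself is that the Symptom holds the exact alert text, so an exact-match lookup is enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybookEntry:
    """One row of the playbook table (hypothetical representation)."""
    symptom: str               # exact text that maps to an alert
    impact: str                # best guess of the impact on the system
    action: str                # suggested action for the on-call person
    helpful_links: tuple = ()  # logs, metric charts, how-to docs


# Made-up sample row; not taken from a real playbook.
PLAYBOOK = [
    PlaybookEntry(
        symptom="Example alert: background job queue latency is high",
        impact="Delayed notifications; core requests still served",
        action="Check the worker count and scale up if the queue keeps growing",
        helpful_links=("https://example.com/dashboards/job-queue",),
    ),
]


def find_entry(alert_text, playbook=PLAYBOOK):
    """Return the row whose symptom exactly matches the alert text, or None."""
    wanted = alert_text.strip()
    for entry in playbook:
        if entry.symptom == wanted:
            return entry
    return None
```

Because the lookup is an exact match, an alert whose wording changes will stop finding its row, which is one reason to update the playbook whenever an alert is added or renamed.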
The on-call process has evolved significantly since the beginning of my time here at Wonolo. As we grow and learn as a team, there will always be more for us to iterate and improve on, and I’m grateful we’ve built processes for our team around doing so. Having a shared responsibility across engineering, creating an onboarding program, defining clear expectations, and creating an easy-to-use on-call playbook are all important pieces of setting your team up for success.