What goes on when you get involved in the on-call rota?
Hi! I’m Patrick, a software engineer on the Onboarding and Payments team at Freetrade. As the name suggests, we handle anything to do with signing up new users, Know Your Customer (KYC) checks (verifying a client’s identity), subscriptions, and bank transfers.
While stock markets open at specified times during the day, all of these features (and more!) must be available and reliable 24/7. To achieve this, software engineers at Freetrade take it in turns to be on-call. This article will shed some light on how on-call is managed at Freetrade.
Being on-call means that you must be able to respond to alerts and incidents at any given time. This is necessary at any company that wants to keep its systems consistently reliable for its customers.
Some companies have dedicated teams that monitor and respond to incidents as their day-to-day job but, at Freetrade, we follow the “you build it, you run it” principle, meaning that the same engineers who implement new features must also maintain and support them in production.
A fundamental requirement for an effective on-call process is to set up monitoring and alerts.
We use Google Cloud Monitoring to set Service-Level Indicators (SLIs) and Service-Level Objectives (SLOs). The key difference is that an SLI measures how a service is behaving, while an SLO sets the target that the measurement must meet.
For example, an SLI could count the number of times a deposit fails, while an SLO could set a target where at least 99% of deposits must succeed over a period of time. If only 98.9% of deposits succeed, then this SLO is considered breached.
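To make the distinction concrete, here is a minimal Python sketch of the deposit example. It is only an illustration and not how Google Cloud Monitoring is configured; the class and method names are hypothetical, and the 99% target is the one used in the example above.

```python
from dataclasses import dataclass


@dataclass
class DepositSLO:
    """Hypothetical SLO: at least `target` of deposits must succeed."""

    target: float = 0.99  # 99%, as in the example above

    def success_ratio(self, succeeded: int, failed: int) -> float:
        """The SLI: the fraction of deposits that succeeded."""
        total = succeeded + failed
        # With no deposits in the window there is nothing to breach.
        return 1.0 if total == 0 else succeeded / total

    def is_breached(self, succeeded: int, failed: int) -> bool:
        """The SLO check: has the SLI dropped below the target?"""
        return self.success_ratio(succeeded, failed) < self.target


slo = DepositSLO()
print(slo.is_breached(succeeded=989, failed=11))  # 98.9% succeeded -> True
print(slo.is_breached(succeeded=990, failed=10))  # 99.0% succeeded -> False
```

The SLI is just a number; the SLO is the rule that turns that number into a pass or a fail.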
At Freetrade, when an SLO violation occurs, on-call engineers get alerted via a service called PagerDuty. PagerDuty allows teams to organise on-call rotas, notification rules and automatic escalation policies.
Engineers can decide how they want to be notified: via email, phone call, app notification, or SMS. They can also set what kind of notification to receive at different time intervals in case they don’t respond to an incident immediately.
We have two tiers of on-call engineers for each team: first-line and second-line. If the first-line engineer doesn’t respond within a few minutes, the escalation policy kicks in and the second-line engineer gets notified.
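The two-tier escalation can be sketched in a few lines of Python. This is an assumption-laden illustration rather than PagerDuty's API: the five-minute delay is a made-up stand-in for "a few minutes", and `tiers_notified` is a hypothetical helper.

```python
from datetime import timedelta

# Hypothetical value: the article only says "a few minutes".
ESCALATION_DELAY = timedelta(minutes=5)
TIERS = ["first-line", "second-line"]


def tiers_notified(unacknowledged_for: timedelta) -> list[str]:
    """Return the tiers that have been paged for an unacknowledged incident."""
    # One extra tier is paged each time the delay elapses without an acknowledgement.
    steps = unacknowledged_for // ESCALATION_DELAY
    return TIERS[: min(steps + 1, len(TIERS))]


print(tiers_notified(timedelta(minutes=2)))  # ['first-line']
print(tiers_notified(timedelta(minutes=6)))  # ['first-line', 'second-line']
```

Acknowledging the incident stops the clock, so in the common case the second-line engineer is never paged.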
We’re also piloting a “follow-the-sun” approach to on-call support now that we have an office in Australia.
Finally, our PagerDuty incidents are integrated with Slack. We have dedicated Slack channels for critical and non-critical issues which map to PagerDuty’s high and low urgency notifications respectively.
This improves the transparency of incidents to the rest of the company and serves as a place to discuss and follow up on any progress towards fixing the incident.
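The urgency-to-channel mapping amounts to a simple lookup. In the sketch below the channel names are hypothetical (the real ones are not given here), and `slack_channel_for` is an illustrative helper, not part of the PagerDuty or Slack integrations.

```python
# Hypothetical channel names, used purely for illustration.
CHANNEL_BY_URGENCY = {
    "high": "#incidents-critical",
    "low": "#incidents-non-critical",
}


def slack_channel_for(urgency: str) -> str:
    """Pick the Slack channel for a PagerDuty urgency ("high" or "low")."""
    try:
        return CHANNEL_BY_URGENCY[urgency]
    except KeyError:
        # Fail loudly rather than silently dropping an incident notification.
        raise ValueError(f"unknown urgency: {urgency!r}") from None
```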
When an incident comes through, the first thing the on-call engineer needs to do is acknowledge it via PagerDuty. The PagerDuty alert notification includes a description, links to the triggered alert or violated SLO, and in some cases steps on how to handle it.
At this point, the engineer needs to determine what has happened. More often than not, this simply involves checking for errors in the logs. Communication is key here, and the on-call engineer posts regular updates on their investigation and the impact in the appropriate Slack channel.
Sometimes incidents are transient and resolve themselves (some failing queries are retried and eventually succeed). In the case of a serious incident, however, the on-call engineer will need to figure out a way to either fix or work around the issue.
Depending on the severity of the issue, the on-call engineer may draw on the support of the wider engineering team.
Once the issue is resolved, depending on the severity, the on-call engineer might need to help run a post-mortem. A post-mortem documents what went wrong when an issue occurred. This involves writing down the timeline of key events, the issues and their impact, and identifying ways to avoid similar issues in the future.
How often an engineer is on-call depends entirely on how many engineers are at Freetrade. Rotas are defined per engineering team. For an engineer, on-call spans a whole week.
At the time of writing, I end up on-call every four to five weeks. Of course, the more engineers join Freetrade, the less frequently each of them will be on-call.
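The relationship between team size and on-call frequency is easy to see with a round-robin rota. The sketch below assumes a simple rotation with one engineer per week; the team members are invented, and real rotas also have to account for holidays and swaps.

```python
from itertools import cycle, islice


def weekly_rota(engineers: list[str], weeks: int) -> list[str]:
    """Assign one engineer per week, round-robin."""
    return list(islice(cycle(engineers), weeks))


team = ["ana", "ben", "cat", "dev"]  # hypothetical team of four
rota = weekly_rota(team, weeks=8)

# Each engineer comes up once every len(team) weeks.
print([week for week, name in enumerate(rota) if name == "ana"])  # [0, 4]
```

With four engineers in the rotation, each one is on-call every fourth week; adding a fifth stretches that to every fifth week.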
New engineers are also encouraged to shadow on-call to learn the ropes before eventually graduating to being first-line on-call engineers. Shadowing on-call occurs during work hours and typically involves observing the process and in some cases helping out.
Freetrade recognises the effort and commitment required by engineers on-call, and the value they provide to the company. This is why first-line on-call engineers get compensated £500 per week.
Fortunately, critical incidents do not occur frequently at Freetrade, so engineers have some other responsibilities during the day while on-call.
Freetrade releases twice a week. This is mainly an automated process. The on-call engineer is responsible for gathering release notes, checking that nightly builds have succeeded, and pressing the push-to-production button (sadly, not a physical one!).
Sometimes the operations team might need engineering support, for example to temporarily disable a potentially compromised account (e.g. after a phone has been stolen) or to perform ad-hoc investigations and analysis. This usually involves following a runbook.
Finally, if the on-call engineer isn’t busy with any of the above, they will fix bugs and make improvements to our systems for the rest of the week.
The idea of being on-call can be daunting at first, but I hope I’ve shown that it shouldn’t be. Freetrade has done a lot to make the on-call process as robust and streamlined as possible.
Experiencing on-call has some added advantages. It is an excellent way to understand how the whole system works, the sort of issues that our customers face, and the pain points in our system.
As an engineer, understanding all this helps you improve the system and makes you more aware of the consequences of your changes when maintaining the codebase or implementing new features.
Can you imagine yourself taking part in our on-call process? If so, we’re hiring and we’d love to have you join the team!