The On-Call Chronicles: Enhancing B2B P&E Defense

A journey through our strategic On-Call process ensuring system stability and swift incident resolution

In software engineering, the on-call process ensures that an engineer is responsible for responding to emergencies, monitoring alerts, troubleshooting incidents, and coordinating resolutions. Incidents are typically triaged by urgency and severity so that they are acknowledged and mitigated within set Service Level Objectives (SLOs) and Service Level Agreements (SLAs).

At Babbel, the on-call process is structured to maintain the stability and performance of our services, ensuring that our customers can seamlessly engage with our products and immerse themselves in language learning whenever they choose. On-call duties are managed through a combination of tools, including PagerDuty for alerts, Rollbar for error tracking, and JIRA for incident tracking and resolution.

The On-Call Adventure: A Weekly Rotation Story 📅

In B2B Product and Engineering (P&E), we rotate on-call responsibilities weekly among three internal teams. This ensures continuous monitoring of technical and business metrics at all levels, keeping someone always vigilant and ready to tackle obstacles. Like knights guarding castle gates, weekly handovers maintain consistent watch, prevent burnout, spread critical knowledge, and build confidence among all team members.

As our squad scaled, significant challenges emerged in our on-call defenses. The increased frequency and complexity of incidents led to partial visibility into issues, duplicated effort, and extended resolution times. This underscored the need for a more effective on-call setup, where responsibilities are clearly defined and alerts are directed to the most capable groups, ensuring rapid and efficient resolution of incidents.

Key Areas to Conquer: Incidents, Questions, and Security 🏰✨🛡️

The Realm of Incidents

The Mission: When an issue arises, the on-call engineer acknowledges it within a short span and works towards resolving it within Babbel’s established technical SLOs. Think of it as a call to defend the kingdom – swift action keeps everything running smoothly.

Our Shields and Swords: We rely on PagerDuty, Amazon CloudWatch, and Rollbar to safeguard our realm. CloudWatch monitors for anomalies and triggers alerts, which PagerDuty then prioritizes before notifying the on-call knight. Rollbar provides detailed error tracking. Together, these tools ensure high-priority incidents are acknowledged and resolved promptly, maintaining our system’s stability and resilience.

The Land of Questions

On-call Engineer’s Role: Responding promptly to questions and concerns from stakeholders and colleagues seeking technical support is a key part of the role. Politeness and proactiveness are our guiding principles.

First Responder: Each week, a designated team member from each of the three B2B P&E teams takes on the mantle of first responder in their realm, ensuring that all domain-specific queries are addressed effectively.

The Fortress of Security

Guard Duties: Just as a medieval castle requires regular upkeep to remain formidable, application software needs maintenance to stay resilient, performant, and secure. Continuous maintenance involves fixing bugs, updating dependencies, and addressing security vulnerabilities, thereby fostering a stable and reliable system.

Lessons from the Battlefield: Triumphs and Trials

Actions Taken by the Order of Metrics

Our recently formed guild, the Metrics Workgroup, set out on a quest to gather essential metrics that strengthen our defenses and strategies. They collaborated with stakeholders to ensure that B2B endeavors were in sync with company guidelines and expectations.

Our vigilant knights embark on a bi-weekly quest to gather data from PagerDuty incidents. They meticulously analyze various metrics and craft detailed charts to uncover trends and identify areas for improvement. During this process, they also keep a watchful eye out for any breaches of our SLOs, such as incidents that take longer than the targeted resolution time.

Armed with these insights, they present their findings during the bi-weekly B2B P&E standup meetings. This keeps the team informed about key metrics such as Mean Time to Resolution (MTTR) and the number of escalations and incidents, thereby driving continuous improvement and fortification. As B2B P&E continued to develop new products and explore new territories, the number of incidents increased. However, MTTR continued to drop over the last six months, indicating that our recovery process became significantly faster and more efficient.
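To make the MTTR metric concrete, here is a minimal Ruby sketch of how it could be computed from a list of incidents. It is an illustration only: the record shape (hashes with `:created_at` and `:resolved_at` ISO 8601 timestamps), the sample data, and the method name `mean_time_to_resolution` are assumptions, not the workgroup's actual code.

```ruby
require "time"

# Hypothetical incident records; the field names are assumptions for this sketch.
SAMPLE_INCIDENTS = [
  { id: "A1", created_at: "2024-03-04T09:00:00Z", resolved_at: "2024-03-04T09:30:00Z" },
  { id: "A2", created_at: "2024-03-05T22:00:00Z", resolved_at: "2024-03-05T23:30:00Z" },
  { id: "A3", created_at: "2024-03-06T10:00:00Z", resolved_at: nil } # still open
].freeze

# Mean Time to Resolution in minutes.
# Incidents that are not yet resolved are skipped; returns nil if none are resolved.
def mean_time_to_resolution(incidents)
  durations = incidents.filter_map do |incident|
    next if incident[:resolved_at].nil?

    # Subtracting two Time objects yields seconds as a Float.
    (Time.parse(incident[:resolved_at]) - Time.parse(incident[:created_at])) / 60.0
  end
  return nil if durations.empty?

  durations.sum / durations.size
end

puts mean_time_to_resolution(SAMPLE_INCIDENTS) # prints 60.0
```

The two resolved incidents took 30 and 90 minutes, so the mean is 60 minutes. Tracking this value per bi-weekly cycle is what makes a downward trend visible.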

From the early days of this expedition, charts were curated and metrics were analyzed manually. This initial approach provided a strong foundation and allowed valuable feedback to be gathered. To build iteratively on the idea and reduce the inefficiency of repeated manual effort, the workgroup is committed to building an automated bi-weekly cycle that collects and evaluates data from PagerDuty through its APIs. This approach will also make it easier to derive custom analyses and metrics that provide more meaningful insights.
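As a rough idea of what the collection step might look like, the sketch below builds a request against PagerDuty's REST API incidents endpoint for a given time window. It only constructs the request and does not send it. The method name `build_incidents_request`, its parameters, and the placeholder token are assumptions made for this example; the workgroup's automation may be structured quite differently.

```ruby
require "net/http"
require "uri"

PAGERDUTY_INCIDENTS_URL = "https://api.pagerduty.com/incidents"

# Builds (but does not send) a GET request for incidents in a time window.
# `since` and `until_time` are ISO 8601 strings; `limit` and `offset` page the results.
def build_incidents_request(api_token:, since:, until_time:, limit: 100, offset: 0)
  uri = URI(PAGERDUTY_INCIDENTS_URL)
  uri.query = URI.encode_www_form(
    "since" => since,
    "until" => until_time,
    "limit" => limit,
    "offset" => offset
  )

  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = "Token token=#{api_token}"
  request["Accept"] = "application/vnd.pagerduty+json;version=2"
  request
end

# Placeholder token and a two-week window, for illustration only.
request = build_incidents_request(
  api_token: "YOUR_API_TOKEN",
  since: "2024-03-04T00:00:00Z",
  until_time: "2024-03-18T00:00:00Z"
)
puts request.uri
```

Sending the request could then be done with `Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) { |http| http.request(request) }`, repeating with a larger `offset` while the response reports more pages.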

One step toward custom analyses and metrics is the ability to segregate incidents based on whether they occurred during business hours and on their urgency. Below is a Ruby code snippet that demonstrates how data is segregated in the automated process’s Proof of Concept (PoC), providing a foundation for in-depth analysis of the chosen metrics.
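A minimal sketch of that segregation follows. The business-hours window (Monday to Friday, 09:00 to 18:00, evaluated in the UTC offset carried by each timestamp), the record shape, the sample data, and the helper names `business_hours?` and `segregate` are illustrative assumptions and may differ from the PoC.

```ruby
require "time"

BUSINESS_DAYS  = (1..5)   # Monday to Friday (Time#wday: 0 is Sunday)
BUSINESS_HOURS = (9...18) # 09:00 up to, but not including, 18:00

# Hypothetical incidents; timestamps carry their own UTC offset.
SAMPLE_PAGES = [
  { id: "P1", created_at: "2024-03-04T10:15:00+01:00", urgency: "high" }, # Monday morning
  { id: "P2", created_at: "2024-03-04T23:40:00+01:00", urgency: "high" }, # Monday night
  { id: "P3", created_at: "2024-03-09T11:00:00+01:00", urgency: "low" },  # Saturday
  { id: "P4", created_at: "2024-03-05T14:05:00+01:00", urgency: "low" }   # Tuesday afternoon
].freeze

# True when the time falls on a business day within business hours.
def business_hours?(time)
  BUSINESS_DAYS.cover?(time.wday) && BUSINESS_HOURS.cover?(time.hour)
end

# Groups incidents under keys like [:business_hours, :high] or [:off_hours, :low].
def segregate(incidents)
  incidents.group_by do |incident|
    created = Time.parse(incident[:created_at])
    period = business_hours?(created) ? :business_hours : :off_hours
    [period, incident[:urgency].to_sym]
  end
end

segregate(SAMPLE_PAGES).each do |(period, urgency), group|
  puts "#{period}/#{urgency}: #{group.map { |incident| incident[:id] }.join(', ')}"
end
```

Each resulting group can then be fed into the same metric calculations, for example to compare resolution times for high-urgency incidents inside and outside business hours.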

Streamlining the On-Call Experience

In the past, the on-call experience for B2B services faced significant challenges, particularly in managing shared services. As our domains expanded, teams created distinct repositories and projects, setting clear boundaries and responsibilities. The shared repository became a central stronghold for one team, while other factions relied heavily on its data and processes. Engineers often had only partial visibility, duplicated each other's efforts, and relied too heavily on a single team when debugging on-call issues.

To address these challenges effectively, several strategic steps were implemented. The shared codebase was modularized into distinct packages, with each package assigned to a group of specific stewards. This brought clarity and order, ensuring that each faction had well-defined responsibilities.

To eliminate confusion and establish clear boundaries, we directed all alerts to the team best equipped to handle them, B2B Platform. This ensures that the team most familiar with the shared codebase handles the incidents, leading to faster and more effective resolutions. Any incidents outside their domain are rerouted to the appropriate team via PagerDuty’s reassign feature and escalation policy, so that the right experts address every issue promptly.

Each team is now tasked with creating its own rotation to patrol the Land of Questions. This ensures that knowledgeable engineers are always on hand to address inquiries in their domain, establishing a more efficient and autonomous defense system.

The Epic On-Call Handbook

While these are the primary areas of focus, there are always more adventures in the on-call journey. Our comprehensive internal guide includes a technical handbook of procedures, helpful tools, debugging tips for common issues, FAQs, and other resources, so that every on-call engineer is well-equipped to make swift and informed decisions when faced with alarms.

Your Adventure Awaits!

Being on-call is an exhilarating aspect of working in B2B P&E. It’s about managing incidents, providing support, and safeguarding our systems – an essential part of our team’s daily mission. This role also offers the opportunity to connect with customers and stakeholders, understand their challenges, and gather valuable feedback. It builds a developer’s skills in observability and monitoring, deepens their understanding of the intricate corners of the codebase, and often inspires ideas for refactoring and maintenance.

Embrace the challenge, stay vigilant, and let’s ensure everything runs smoothly 🥷
