Product Design Case Study — Rethinking Alert System
Consider this narrative
Aayush runs an e-commerce platform. To monitor his business, he uses a reliable platform called cliff.ai to track all the metrics.
It’s the festive season, and the website suddenly sees enormous traffic. He wants to be notified when the website records more than 65K visitors so that he can alert his technology team to prepare for scaling up.
To do so, he sets up a stream, enters a threshold value, and waits for the traffic notification to pop up.
His website traffic surpasses the 65K threshold, but he does not receive any notification from cliff.ai when it happens. He gets frustrated.
Thankfully, he raised a complaint about the issue with our customer satisfaction team, which got us thinking.
The impact we expect in the end
User’s perspective
- Time Saved: Dashboards are only helpful when there is little data. Surfacing only the relevant, key insights means users have to interact with far less data, which saves time. That time can go toward something more productive. A business owner’s time is valuable, and impactful work in the same duration generates more value.
- Business Growth: Unreliable metric observability creates friction in a business. Good tools make irregularities more predictable, which leads to better business planning and hence better growth.
Why build an incident management system?
There is always room to improve a feature and add value to it. After listening to users’ experience with the existing alert system, we realized that we had to put more work into the design of cliff.ai.
While taking proactive actions to push users to complete the journey, we always had a strong narrative of getting rid of dashboards. For that, we had an Alert Rule feature. An ideal “Happy Path” looked something like this.
But there were a couple of problems with this flow that we realized after listening to our users.
Problem 1: Aayush wasn’t able to receive the notifications whenever there was any anomaly in the metric.
Reason: He didn’t set up the “Alert Rules” properly. Now there can be many reasons for him not to set up those rules.
— a.) He didn’t know he was supposed to set up Rules to receive notifications on his device.
— b.) He just forgot to set up alert rules.
Solution: Automate the Alert Rule setup process by providing a Default Monitor (yes, we renamed Alert Rules to Monitors). Even if the user does not set up any Monitor themselves in the beginning, a base monitor makes sure they keep receiving notifications whenever anomalies occur.
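To make the fallback idea concrete, here is a minimal sketch of how a default monitor could step in when the user has configured nothing for a stream. The `Monitor` class, `DEFAULT_MONITOR`, and `monitors_for` are hypothetical names used for illustration only; they are not cliff.ai's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    name: str
    # Stream names this monitor watches; an empty set means "all streams" (assumption).
    streams: frozenset

    def covers(self, stream):
        return not self.streams or stream in self.streams


# Hypothetical base monitor that watches every stream.
DEFAULT_MONITOR = Monitor(name="Default Monitor", streams=frozenset())


def monitors_for(stream, user_monitors):
    """Return the user's monitors covering the stream, or the default one if none do."""
    matching = [m for m in user_monitors if m.covers(stream)]
    return matching or [DEFAULT_MONITOR]
```

With this shape, a user who never creates a monitor still has every anomaly evaluated by one, which is the behaviour the solution above describes.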
Problem 2: Aayush could not choose whom to notify, and when, if an incident occurred.
Reason: Cliff.ai had no way to inform specific people on the team at a specific time, for example: ‘after detecting an anomaly, notify John. If he does not acknowledge it within the given time, notify Saahil’, and so on.
Solution: Establish an escalation policy that determines how, where and when the notifications will be escalated to specific team members or individuals when anomalies occur.
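The "notify John, then Saahil" rule can be sketched as an ordered list of steps, each with an acknowledgment window. This is only an illustration under our own assumptions: `EscalationStep`, `responders_notified`, and the minute-based timing are hypothetical, not the product's real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    responder: str
    wait_minutes: int  # how long this responder has to acknowledge


def responders_notified(steps, minutes_elapsed, acknowledged_at=None):
    """Return everyone notified so far, given minutes since the incident was triggered.

    acknowledged_at is the minute at which someone acknowledged, or None.
    """
    notified = []
    step_start = 0  # minute at which the current step fires
    for step in steps:
        # Stop escalating if the incident was acknowledged before this step fires.
        if acknowledged_at is not None and acknowledged_at <= step_start:
            break
        # This step has not been reached yet.
        if minutes_elapsed < step_start:
            break
        notified.append(step.responder)
        step_start += step.wait_minutes
    return notified


policy = [EscalationStep("John", 15), EscalationStep("Saahil", 30)]
```

For example, `responders_notified(policy, 20)` includes Saahil because John's 15-minute window has passed without an acknowledgment, while `responders_notified(policy, 20, acknowledged_at=10)` stops at John.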
Problem 3: It was hard for Aayush to draw insights about incident patterns on metrics.
Reason: There was no insight page or incident “repository”, where he could draw quick critical insights from the metric incidents from a “single place”.
Solution: Provide an incident page where users can see the list of all incidents date-wise and notice key information (numbers) about incidents, such as lifecycle status (triggered/acknowledged/resolved/surrendered), time of incidence, and the people assigned to each incident. Finally, provide a dedicated incident details page.
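As a rough sketch of what such a page needs from the data, incidents can be grouped by date and counted by lifecycle status. The `Incident` fields and both helper functions below are assumptions made for illustration; only the status names come from the description above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    title: str
    status: str  # e.g. "triggered", "acknowledged", "resolved", "surrendered"
    triggered_at: datetime
    assignee: str


def group_by_date(incidents):
    """Group incidents by calendar date, newest date first."""
    grouped = defaultdict(list)
    for incident in sorted(incidents, key=lambda i: i.triggered_at, reverse=True):
        grouped[incident.triggered_at.date()].append(incident)
    return dict(grouped)


def status_counts(incidents):
    """Count incidents per lifecycle status for the headline numbers."""
    return Counter(incident.status for incident in incidents)
```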
Brainstorming
Before starting on the IMS, we thought it would take us a long time to understand the flow and decide which features to provide so that users could get the most benefit from it. However, the engineering team explained the ins and outs of all the features to us, which let us move directly to the user flow.
Creating the User Flow: Taking Inspiration from Companies in the Infrastructure Monitoring Domain
We first needed to see what others were good at and what we could do better. Our main competitors in terms of IMS were PagerDuty, Squadcast, and Opsgenie (Atlassian). As far as visual design was concerned, we already had our design guidelines and patterns established, so we focused on the UX analysis of these products. Each product had some things we could take inspiration from, but Opsgenie resonated with our flow to the largest extent. Hence, we analyzed its user flow and found it quite intuitive. Understanding other products helped us see where we could improve our own.
- Nomenclature: The naming convention is very important when it comes to accessibility. Users should know what a feature is for, or what a given action does, just by reading its name. Although our naming convention was easily understood by our team, we realized it could be improved and made more user-friendly.
- Hierarchy: We improved the overall hierarchy of our system.
After brainstorming the feasibility of all the features, we started structuring all the information we had. Our new user flow had these key steps.
How we defined our Incident management system
Monitors
A monitor is a rule that inspects the specified stream, measures, and dimensions, and generates an incident. The user has to specify which metrics they want to track and the nature of the monitor. Every time an anomaly is detected in that stream, an incident is generated automatically.
1. Monitors List page
Users can see all the details they need about the monitors they have set up, such as the type of monitor, how many incidents it has generated, who created it, when it was created, which streams it is tracking, and so on.
2. Monitors Details page
Users can see detailed information about monitors and can suppress or delete the monitor as per their requirements. They can see the escalation policies associated with it, and the top responders for the incidents that got generated due to this monitor. They can access the heatmap of the frequency of incidents happening on that monitor for any particular day of a month or year.
Incidents
An incident is an event or series of events that disrupts a service, which can happen for numerous reasons. At cliff.ai, once a Monitor has been created, we group related anomalies together to create an Incident.
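The grouping step can be sketched as follows. What counts as "related" is our assumption here: anomalies on the same stream that occur within a fixed time gap of each other. The `Anomaly` class, `group_into_incidents`, and the 30-minute default are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Anomaly:
    stream: str
    detected_at: datetime


def group_into_incidents(anomalies, max_gap=timedelta(minutes=30)):
    """Group anomalies into incidents: same stream, each within max_gap of the previous one."""
    incidents = []
    latest = {}  # stream -> the most recent incident (a list of anomalies) on that stream
    for anomaly in sorted(anomalies, key=lambda a: a.detected_at):
        current = latest.get(anomaly.stream)
        if current and anomaly.detected_at - current[-1].detected_at <= max_gap:
            # Close enough to the previous anomaly on this stream: same incident.
            current.append(anomaly)
        else:
            # Otherwise start a new incident for this stream.
            current = [anomaly]
            latest[anomaly.stream] = current
            incidents.append(current)
    return incidents
```

A real system would likely use richer signals (shared dimensions, correlated metrics), but the sketch shows why a responder sees one incident instead of a notification per anomaly.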
1. Incident List page
Users get a holistic view of all the incidents with the help of a dashboard, so they can draw insights and spot patterns easily. Incidents are organized date-wise for a better user experience.
2. Incident Details page
A page dedicated to a particular Incident makes Incident information far more accessible. It includes details such as graphical insights, the responders working on it, a timeline of activity, and a comment section for teammates to weigh in if they want to.
Everything is in one place. NEAT!
Escalation Policy
Understanding the escalation flow was simple: whom, when, where, and how to notify team members in case of an incident. Rather than sending everyone notifications for all the incidents, the escalation policy lets only specific people on the team know about specific problems. For example, notify the machine learning team and not the frontend team if an incident is related to ML.
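The "right team, not everyone" idea can be sketched as a simple routing function. The `Team` fields (`topics`, `channel`) and the `route` function are hypothetical names for illustration, and tagging incidents with a single topic is our simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    topics: frozenset  # incident topics this team handles (hypothetical field)
    channel: str       # where the team wants to be notified, e.g. "slack" or "mail"


def route(incident_topic, teams):
    """Return (team name, channel) pairs for the teams that handle this topic."""
    return [(team.name, team.channel) for team in teams if incident_topic in team.topics]


teams = [
    Team("Machine learning", frozenset({"ml"}), "slack"),
    Team("Frontend", frozenset({"ui"}), "mail"),
]
```

Here `route("ml", teams)` returns only the machine learning team and its channel, so the frontend team is not disturbed.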
Now comes the fun part: as soon as an Incident is generated, notifications are pushed to the respective members based on our Default Escalation Policy, AUTOMATICALLY!
That is, users do not need to set up anything for the Incident Management System to work properly. Our base Escalation Policy takes care of it for them. Hence the main problem is solved.
Next, we had to address whom to escalate the notification to, and when. This is where the Escalation Policy comes into the picture. A user can create Teams, set time intervals, and specify where to get notified (Slack, Viber, Mail, etc.) so that after an incident, the team members are notified based on those preset rules.
Users can set an escalation policy and assign members to that policy for every team and based on that, the team members would be notified, as shown.
After someone Acknowledges the incident and Resolves it, they can now do the Postmortem of that incident.
My design learnings:
1. Iterations: An iterative approach with active feedback is the quickest and easiest way to build digital platforms. Failing quickly and then working with the right mindset helped me do all my tasks.
2. PRD: I realized the importance of Product requirement documentation. Having a well-curated PRD saves a lot of time for all the teams.
3. Audits: A designer should iterate while auditing the same way they iterate while designing, as nothing can be built perfectly in one go.
4. Design Patterns: Getting to know cliff.ai helped me understand how to approach design systems.