Alerts in Fusion

Sending alert messages is not a new feature in Fusion. Since version 1.4, Fusion users have been able to use an integrated messaging system to log events or send email or Slack alerts in response to events such as a specific text appearing in a stream of documents being indexed, or while processing query pipelines.

If you’re not familiar with alerts, here’s a primer on Fusion’s Alert and Messaging architecture. The primer also includes practical instructions on how to set up Fusion to send email or Slack alerts while running Indexing and Query pipelines.

With the release of Fusion version 2.1, in addition to Logging, Email and Slack alerts, Fusion provides a method to send PagerDuty alerts as well.

What is PagerDuty?

PagerDuty is an incident management platform that helps IT operations professionals reduce incident resolution time, improve infrastructure-wide visibility, and improve operational performance. It collects signals from 150+ monitoring tools and connects the problem to the appropriate on-call engineer via phone, SMS, push notification, and email. In addition to IT operations team members, PagerDuty also gives support teams a unified view of all systems, no matter what tools are used and what systems are monitored.

The PagerDuty incident management platform stores team contact information and provides alert workflows, automatic escalations, on-call scheduling, and analytics for system and team performance.

The Fusion PagerDuty Integration

The integration allows a Fusion user to manage PagerDuty incidents. For every incident, PagerDuty sends alerts according to the alert workflow mentioned above. Fusion uses PagerDuty’s Incident API to communicate with PagerDuty servers in real time, so alerts that originate in Fusion reach the relevant parties within seconds.

Events that warrant an alert could include the presence of certain text in indexing streams, changes in the data that Fusion processes or manages, or health checks that detect problems (say, a Solr collection is empty, or the number of recently indexed documents is lower than expected). As with Slack or email alerting, Fusion can trigger a PagerDuty alert while processing an indexing or query pipeline, based on user-configurable conditions. Since PagerDuty is strongly support oriented, it makes sense to use it for alerts about Fusion service health and data processing or integrity problems that require immediate attention.

Note that Fusion’s support for PagerDuty does not allow users to see or configure escalation policies, or to see the alert history; however, we are considering creating a PagerDuty connector for Fusion in the future (so that, for example, the PagerDuty alert history can be indexed and searched in Fusion).

Setting Up the Integration

PagerDuty uses “services” to integrate with monitoring tools. Each “service” has its own alerting and escalation rules, called “escalation policies”. This feature is used to route alerts to the people best able to handle them. So the first thing you need to do to enable the PagerDuty integration is to establish a Fusion-related service in your PagerDuty account. Once that’s done, all alerts generated by Fusion will be associated with this service and dispatched to the staff who support Fusion (and Fusion-related services such as Solr).

To create a Service in PagerDuty (PD), you first need to create the Escalation Policy that will be associated with this service. Go to Configuration → Escalation Policies in your PagerDuty account and create one to be used for Fusion-related incidents. To create a new escalation policy, choose the New Escalation Policy button in the top right corner.

Once the escalation policy is configured (let’s say it is named “Fusion Support”), go to the Configuration → Services menu item and create a new Service that will be associated and integrated with your Fusion instance. Use the Add New Service button in the top right corner; it brings you to the screen for adding a new Service. Fill in the details for the new service. Note that for the Integration Type radio button you should choose the Use our API directly option. Configure the Notification Urgency and Incident Behavior policies as desired and complete the Service creation with the Add Service button. All done!

Now look at the important key you’ll find in the Integration Settings section of the newly created Service:

image002

This is the unique Service Key you will need to enter in your Fusion UI to configure PD integration.

Here is how:

In the Fusion UI, go to the Applications → System screen and select Messaging Services.

From the Configure Messaging Service combo box, pick the Pager Duty Message Service entry. Now enter the Integration Key for the Service you configured in PagerDuty into the Pager Duty Service Key field. For the Pager Duty Service API URL, just keep the default value. Save the changes.

That’s it: the PD integration is now configured in Fusion, and it is time to start using it.

PagerDuty Message Stage Configuration

The PagerDuty integration code in Fusion interacts with a PagerDuty service every time a Send PagerDuty Message stage executes as part of a pipeline. The name is a bit misleading: no actual message is sent to PagerDuty. The stage makes an HTTP API call, and the resulting PagerDuty alert is not necessarily a message either. The naming reflects the general concept Fusion uses for alerts, which is to send them via the messaging system. In a sense, you can think of Fusion sending an alert message via PagerDuty, similar to sending an email message.

This stage triggers, acknowledges, or resolves a PagerDuty incident: an event that requires attention until it is resolved or expires. The incident either gets resolved by a human (via the PagerDuty web site or app, once the work on the incident is done), or it expires and is resolved automatically according to the configured PD timeout rules. Until the incident is resolved, PD continues sending alerts according to the escalation policy. The incident can be acknowledged to indicate that someone is already working on the issue and to silence alerts for some time. Once the incident is resolved, it becomes history; if a trigger event with the same Incident Key arrives later, PagerDuty finds no open incident with that key and creates a new one.

The stage configuration defines two things. The first is the incident details: what data is presented, and how, to the people dealing with the issue once they receive the PagerDuty alert. The second is the condition that must be fulfilled for the stage to execute, which is set in the stage’s Conditional Script. Typically that involves evaluating the Fusion objects available in the stage context at the moment of processing (for a Query pipeline stage, the Fusion/Solr query Request and Response; for an Index pipeline stage, the Fusion Pipeline Document). Examples would be checking whether the Query response has 0 documents, or whether the pipeline document has some particular data in some field (see below).

Let’s talk about setting the incident details first, and then about how to trigger the stage to be executed depending on Conditional Script outcome.

Setting Incident Details

There are 3 required fields to configure for this stage: the Event Type, the Description, and the Incident Key. The Event Type is one of trigger, acknowledge, or resolve, and defines whether the message will trigger, acknowledge, or resolve the PagerDuty incident. The Description is a short description of the event. This field (or a truncated version; the maximum length allowed by PD is 1024 characters) is used when generating phone calls, SMS messages, and alert emails. It also appears in the incidents table in the PagerDuty UI. The Incident Key field identifies the incident to which this trigger event should be applied. If there’s no open (i.e. unresolved) incident with this key, a new one will be created. If there’s already an open incident with a matching key, this event will be appended to that incident’s log. So the Incident Key lets you de-duplicate events and group everything related to the same problem into one incident.
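The three required fields map naturally onto a small event payload. The sketch below is only an illustration and not Fusion's actual implementation: the function name buildEvent and the sample values are hypothetical, and the JSON key names (service_key, event_type, description, incident_key) are assumed to follow PagerDuty's generic Events API. The truncation mirrors the 1024-character limit mentioned above.

```javascript
// Hypothetical helper: assemble the required fields of a PagerDuty event.
// Key names are an assumption based on PagerDuty's generic Events API.
var MAX_DESCRIPTION_LENGTH = 1024; // limit mentioned in the text above
var EVENT_TYPES = ["trigger", "acknowledge", "resolve"];

function buildEvent(serviceKey, eventType, description, incidentKey) {
  if (EVENT_TYPES.indexOf(eventType) === -1) {
    throw new Error("Unknown event type: " + eventType);
  }
  return {
    service_key: serviceKey,
    event_type: eventType,
    // Truncate long descriptions to the documented maximum length.
    description: String(description).slice(0, MAX_DESCRIPTION_LENGTH),
    incident_key: incidentKey
  };
}

// Placeholder values for illustration only.
var triggerEvent = buildEvent(
  "YOUR_SERVICE_KEY",
  "trigger",
  "No documents indexed in the last hour",
  "fusion-index-flow"
);
console.log(JSON.stringify(triggerEvent, null, 2));
```

Reusing the same incident key ("fusion-index-flow" here) for later acknowledge or resolve events is what ties them to the same incident.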

Besides those 3 fields, the rest of the fields that define the incident context are optional.

The Client field is an optional field that holds the name of the monitoring client that is triggering this event, for example just Fusion. The Client URL field holds a callback URL that can be opened to see more details about the event on some site other than PagerDuty; for example, this URL can open a page of the Fusion UI or a page of your search application to help solve the original problem.

The stage may also have a list of Incident Details, a list of name-value pairs of arbitrary data. It becomes part of the incident description on the PD site and is included in the incident log. The same goes for Incident Context Links and Incident Context Details: those are lists of arbitrary data that could be helpful to the people working on the incident. A Context Link is a pair consisting of an arbitrary URL (clickable in the PagerDuty UI) and a text describing the URL. A Context Image is a set of fields defining an arbitrary clickable image (the image src URL must use a secure protocol, i.e. start with https://) that becomes part of the incident context visible to someone investigating the incident.
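As a rough illustration of how these optional fields could sit alongside the required ones, here is a sketch that enriches an event with client information, details, and context entries. The helper names, the example.com URLs, and the payload key names (client, client_url, details, contexts, and the link/image entry shapes) are assumptions modeled on PagerDuty's generic Events API, not something taken from Fusion's code.

```javascript
// Hypothetical helpers for the optional incident context fields.
function contextLink(href, text) {
  return { type: "link", href: href, text: text };
}

function contextImage(src, href, alt) {
  // The text above notes that the image source must use a secure protocol.
  if (src.indexOf("https://") !== 0) {
    throw new Error("Image src must start with https://: " + src);
  }
  return { type: "image", src: src, href: href, alt: alt };
}

function addContext(event, options) {
  // Copy the base event so the caller's object is left untouched.
  var result = {};
  Object.keys(event).forEach(function (key) { result[key] = event[key]; });
  result.client = options.client;
  result.client_url = options.clientUrl;
  result.details = options.details || {};
  result.contexts = options.contexts || [];
  return result;
}

var baseEvent = {
  event_type: "trigger",
  description: "Solr collection is empty",
  incident_key: "empty-collection"
};

var enriched = addContext(baseEvent, {
  client: "Fusion",
  clientUrl: "https://fusion.example.com/admin", // placeholder URL
  details: { "Fusion Document Id": "345" },
  contexts: [
    contextLink("https://fusion.example.com/admin", "Open the Fusion UI"),
    contextImage("https://fusion.example.com/chart.png",
                 "https://fusion.example.com/admin", "Indexing rate")
  ]
});
console.log(JSON.stringify(enriched, null, 2));
```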

All value fields for those list entries can be parameterized, i.e. their values can be expressed as String Templates (see www.stringtemplate.org), and Fusion injects the actual values from the objects available to the stage at the moment of execution. For example, the template expression <doc.id> can be used in an Incident Detail value field; the actual value will be the id of the Fusion pipeline document that the stage was processing:

image003

This will be shown on the PagerDuty site as part of the incident context, like this:

Fusion Document Id: 345
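To make the substitution concrete, here is a toy renderer that resolves <name.property> expressions against a set of objects. It is only a sketch of the idea and is not the StringTemplate library itself; renderTemplate and the sample scope object are hypothetical.

```javascript
// Toy stand-in for template substitution: replaces <a.b.c> with the
// matching property value from `scope`, leaving unknown names untouched.
function renderTemplate(template, scope) {
  var pattern = /<([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*)>/g;
  return template.replace(pattern, function (match, path) {
    var value = scope;
    var parts = path.split(".");
    for (var i = 0; i < parts.length; i++) {
      if (value === null || value === undefined) {
        return match; // nothing to resolve against, keep the placeholder
      }
      value = value[parts[i]];
    }
    return (value === null || value === undefined) ? match : String(value);
  });
}

var scope = { doc: { id: "345" } };
console.log(renderTemplate("<doc.id>", scope)); // prints 345
```

The same helper applied to "Query: <request.q>" with a request object in scope would yield the kind of parameterized Description discussed for query pipelines.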

Setting Stage Execution Conditions

Now, let’s talk about how to configure the stage so that it runs only when we actually want to execute the PagerDuty stage (and therefore trigger or resolve an incident). We definitely do not want to trigger a PagerDuty incident for every Fusion pipeline document processed by an Index pipeline, or for every query executed by a Query pipeline. One part of the stage configuration, the Conditional Script field, defines when the stage is executed. For the stage to execute, the JavaScript expression in the Conditional Script field must evaluate to true (1). If the expression evaluates to false (0), the stage is not executed.

For example, we may want to run this stage only when we see a pipeline document whose first_name field has the value John.

The expression in the Conditional Script field checks the data in the first_name field:

image005
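A condition of this kind might look like the sketch below. It assumes the pipeline document exposes a getFirstFieldValue accessor; the makeDoc stub is hypothetical and exists only so the expression can be exercised outside Fusion. In the stage itself, only the expression inside shouldAlert would go into the Conditional Script field.

```javascript
// Hypothetical stand-in for a Fusion pipeline document, for local testing only.
function makeDoc(id, fields) {
  return {
    getId: function () { return id; },
    getFirstFieldValue: function (name) {
      var values = fields[name];
      return values && values.length > 0 ? values[0] : null;
    }
  };
}

// The condition itself: true runs the stage, false skips it.
// Loose equality is used in case the value is not a native JavaScript string.
function shouldAlert(doc) {
  return doc.getFirstFieldValue("first_name") == "John";
}

console.log(shouldAlert(makeDoc("345", { first_name: ["John"] }))); // prints true
console.log(shouldAlert(makeDoc("346", { first_name: ["Jane"] }))); // prints false
```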

Similarly, for a Query pipeline stage we may want to trigger a PagerDuty alert when some important query returns zero results (something that should not normally happen). The conditional script for the Query pipeline’s Send PagerDuty Message stage should look like this:

image007

If the incident Description and/or Incident Key are parameterized to use <request.q>, the incident will be self-describing in the incidents list of the PD UI (assuming the query string is not too long):

image009

Note that to successfully process response.initialEntity.query() in the Conditional Script field, the wt parameter of the Solr query should be set to json. A convenient place to do this is the Set Query Params stage. And, of course, the Query Solr stage should come before the Send Pager Duty Message stage in your Query pipeline.
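The reason wt=json matters is that the check then operates on a JSON body with the hit count under response.numFound. The sketch below shows that check on raw JSON text; it deliberately does not use Fusion's response.initialEntity API, and the function name hasZeroHits and the sample body are hypothetical.

```javascript
// Hypothetical check on a Solr response body returned with wt=json.
function hasZeroHits(solrJsonText) {
  var parsed = JSON.parse(solrJsonText);
  // A missing "response" section is treated as "cannot tell", not as zero hits.
  return Boolean(parsed.response) && parsed.response.numFound === 0;
}

var emptyResult =
  '{"responseHeader":{"status":0},"response":{"numFound":0,"start":0,"docs":[]}}';
console.log(hasZeroHits(emptyResult)); // prints true
```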

A Solr query like fetchedDate_dt:[NOW-1HOUR TO NOW], in combination with a Conditional Script that checks for zero hits, makes a good check for a constant flow of documents indexed by your Fusion application, assuming the query pipeline is executed on a regular basis (i.e., registered with the Fusion Scheduler).
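For reference, the parameters of such a freshness query could be assembled as in this sketch. The helper names are hypothetical, and rows=0 is an added assumption: only the hit count matters for this check, so no documents need to be returned.

```javascript
// Hypothetical helper: Solr parameters for a "was anything indexed recently?" check.
function buildFreshnessQuery(dateField, window) {
  return {
    q: dateField + ":[NOW-" + window + " TO NOW]",
    rows: "0",   // assumption: only numFound is needed, not the documents
    wt: "json"   // required by the zero-hits condition described above
  };
}

// URL-encode the parameters into a query string.
function toQueryString(params) {
  return Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
  }).join("&");
}

var freshness = buildFreshnessQuery("fetchedDate_dt", "1HOUR");
console.log(freshness.q);              // prints fetchedDate_dt:[NOW-1HOUR TO NOW]
console.log(toQueryString(freshness));
```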

Multiple Stages in Pipelines

The Send PagerDuty Message stage does not change the Fusion objects passed to it (i.e. the Fusion pipeline document or the Request and Response); it only evaluates them and, if the conditional script evaluates to true, sends the message to PagerDuty. So it is possible to have multiple Send PagerDuty Message stages in the same Fusion pipeline. For example, you may configure the pipeline to have 2 PD stages, one that triggers the PD incident and another that resolves it, based on certain pipeline document data. Take care to use the same Incident Key in both stages and to avoid sending duplicate Resolve PD messages.
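The trigger/resolve pairing can be pictured with the following sketch, in which two stage-like entries share one incident key and have mutually exclusive conditions. Everything here is hypothetical: the status field, its values, the incident key, and the plain-object document are illustrative stand-ins rather than Fusion objects.

```javascript
var INCIDENT_KEY = "fusion-feed-status"; // shared by both stages

// Each entry mimics one Send PagerDuty Message stage:
// an event type plus the condition under which the stage runs.
var stages = [
  { eventType: "trigger", condition: function (doc) { return doc.status == "failed"; } },
  { eventType: "resolve", condition: function (doc) { return doc.status == "recovered"; } }
];

// Returns the events the two stages would emit for one document.
function eventsFor(doc) {
  var events = [];
  stages.forEach(function (stage) {
    if (stage.condition(doc)) {
      events.push({ event_type: stage.eventType, incident_key: INCIDENT_KEY });
    }
  });
  return events;
}

console.log(eventsFor({ status: "failed" }));    // one trigger event
console.log(eventsFor({ status: "recovered" })); // one resolve event
console.log(eventsFor({ status: "ok" }));        // no events
```

Because the conditions cannot both be true for the same document, each document produces at most one event, and both event types refer to the same incident.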

Also, for an Index pipeline, you may use a Set Property stage before the Send PagerDuty Message stage to set a property on the pipeline context or pipeline document, which keeps the PagerDuty stage’s Conditional Script simpler.

Conclusion

The PagerDuty integration allows Fusion users who are interested in monitoring managed data to add notification and alert functionality to their search applications, typically by triggering PagerDuty incidents when something significant is discovered while executing Fusion indexing or query pipelines. The Send PagerDuty Message stage is used to notify PagerDuty services. Minimal configuration is required on both the PagerDuty and Fusion sides to enable the integration.