Service Level Objectives

Concept

DevOps practitioners and business owners face the difficult task of managing a service so that it meets business requirements. To keep service levels consistent with those requirements, you first need to know the critical user journeys for a service: the sets of interactions a user has with the service to achieve some end result, as described in Guide to setting SLOs.

Instana users can model their critical user journeys by setting up an Application Perspective. For details, see Create an application perspective.

Once the critical user journeys are identified and ordered by business impact, owners need to determine the metrics to use as key indicators. The configuration of such indicators is described in detail in SLI configuration.

Once the indicators are set, the objectives, that is, the targets for the selected indicators, have to be defined. A service level objective is a target for a service level indicator over a specified time window. It helps to measure whether the reliability of a service during that window meets the expectations of the majority of its users, as described in the blog post Guide to setting SLOs.

SLOs are necessary because they define the quality of service (QoS) and reliability goals in concrete, measurable, objective terms. They are not intended to define the best performance level but a range of best possible and least acceptable performance standards.

Service Level Terminology

Indicators: A service level indicator (SLI) is the defined quantitative measure of one characteristic of the level of service that is provided to a customer. Common examples of such indicators are error rate or response latency of a service.

Objectives: A service level objective (SLO) defines the target value for the service level that is measured by a service level indicator. As an example, the SLO could specify that a given SLI must be fulfilled 99.9% of the time.

Error Budget: The specified target value of an SLO implicitly defines a small budget within which the service is allowed to not work reliably. This error budget accommodates the planned or unplanned downtime of the service that is unavoidable in practice.

SLO Widget

Instana enables you to create custom dashboard widgets for your SLOs, in order to display and analyze the performance of your services over time. The widget can show information for either Time-based or Event-based SLI configurations.

SLO Widget

The image above illustrates an example SLO widget called Robot Shop SLO, which was configured using a Time-based SLI configuration and an SLO target value of 95%. The SLI in this example is based on the 90th percentile of the latency metric and a threshold value of 2 seconds. For the displayed time window of 7 days (10,080 minutes), as shown on the right-hand side of the widget, the 5% allowance translates to an error budget of 504 minutes. The example SLO was violated in the selected timeframe because the spent error budget of 565 minutes exceeds that limit.
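The arithmetic behind this example can be reproduced with a short Python sketch. The function and variable names are illustrative assumptions and not part of Instana; only the numbers (95% target, 7-day window, 565 minutes spent) come from the example above.

```python
def error_budget_minutes(target_percent: float, window_minutes: int) -> float:
    """Minutes in the window during which the SLI threshold may be violated."""
    return window_minutes * (100 - target_percent) / 100


window_minutes = 7 * 24 * 60   # 7-day time window expressed in minutes
budget = error_budget_minutes(95, window_minutes)
spent = 565                    # spent error budget from the example widget
remaining = budget - spent     # negative when the budget is overspent
violated = spent > budget

print(f"budget={budget:.0f} min, remaining={remaining:.0f} min, violated={violated}")
```

A negative remaining budget means the SLO was violated for the selected window.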

Configuration

The following sections describe how to set up an SLO widget for any of your Application Perspectives.

Adding SLO Widgets

In order to add an SLO widget, navigate to one of your custom dashboards to open the dialog for adding a new widget. Next, follow the steps below:

  1. Click SLO in the dialog sidebar. This will bring up the configuration section to set up an SLO widget.
  2. Provide the Widget Details:

    • Enter a title for the widget.
  3. Select the Application Perspective / User journey for the SLO from the dropdown.
  4. Select the Service Level Indicator for the given Application Perspective from the dropdown. If no SLI is available for the previously selected Application Perspective, head over to SLI configuration.
  5. Enter the desired SLO Target value, for example 99.9%.
  6. Select the Time Window Type which defines the context and shown timeframe of the widget:

    • Dynamic time window: The SLO is calculated for the time window selected in the global time picker.
    • Rolling time window: A time window with a fixed window size, where the end is defined by the global time picker’s end date/time selection. As an example, this lets you always see the last week without having to adjust the global time picker.
    • Fixed time interval: A time window with a defined start and duration. As an example, you can configure a fixed one month window starting 2020-01-01. The time window will be automatically reset to the next month (2020-02-01) once the month is completed.
  7. Verify your widget in the preview at the bottom of the configuration page.

    • If no preview is shown, click the Highlight missing configuration button to immediately see what is missing.
  8. Click the Create button at the end of the dialog to save the SLO widget configuration.
  9. Save the changes to your custom dashboard by clicking Save changes.
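To make the three time window types from step 6 more concrete, the following Python sketch resolves each of them to a start and end time. It is an illustration under stated assumptions, not Instana's implementation: the function names are hypothetical, the fixed interval is assumed to be exactly one calendar month, and its start date is assumed to fall on day 28 or earlier of the month.

```python
from datetime import datetime, timedelta


def dynamic_window(picker_start: datetime, picker_end: datetime):
    """Dynamic: use the global time picker's selection unchanged."""
    return picker_start, picker_end


def rolling_window(picker_end: datetime, size: timedelta):
    """Rolling: fixed size, ending at the time picker's end date/time."""
    return picker_end - size, picker_end


def add_months(moment: datetime, months: int) -> datetime:
    """Shift by whole months (assumes a day of month that exists in every month)."""
    index = moment.year * 12 + (moment.month - 1) + months
    return moment.replace(year=index // 12, month=index % 12 + 1)


def fixed_monthly_window(first_start: datetime, now: datetime):
    """Fixed: one-month windows from first_start; return the one containing now."""
    elapsed = (now.year - first_start.year) * 12 + (now.month - first_start.month)
    if now < add_months(first_start, elapsed):
        elapsed -= 1  # now lies before this month's window start
    start = add_months(first_start, max(elapsed, 0))
    return start, add_months(start, 1)


# The window configured to start on 2020-01-01 has moved on by mid-February.
print(fixed_monthly_window(datetime(2020, 1, 1), datetime(2020, 2, 15)))
```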

SLI configuration

Manage SLIs via the UI

In order to create a new SLI configuration or clone an existing SLI configuration, navigate to the SLI Management dialog by clicking on the Manage SLIs button on the SLO Widget.

Manage SLIs

Next, follow the steps below to create a configuration, either from scratch or by cloning an existing configuration.

Create a new SLI configuration

  1. Click the Create SLI button in the SLI Management dialog.
  2. Provide the SLI configuration Details:

    • Enter a name for the SLI configuration to identify the configuration uniquely.
  3. Select the type of the SLI configuration, either Time-based or Event-based.
  4. Select the boundary scope, either Inbound Calls or All Calls.

    • Inbound calls: Inbound calls are calls initiated from outside the application and where the destination service is part of the selected application perspective.
    • All calls: Includes not only calls at the application perspective boundary, but also calls occurring within the application perspective.
  5. Provide the necessary configuration depending on the type of SLI configuration.

  6. Click Create to save the new SLI configuration.

Clone an existing SLI configuration

The parameters of an existing SLI cannot be modified, because doing so would invalidate the spent budgets that have already been calculated. To change any parameter, clone the SLI configuration instead.

  1. In the SLI Management dialog, click the View/Clone SLI configuration icon on the SLI configuration that you want to clone.
  2. Edit the SLI configuration Details:

    • Change the name of the SLI configuration so that the clone can be distinguished from the existing configuration.
  3. Edit the SLI configuration as needed.
  4. Click Clone to create a clone of the SLI configuration.

Time-based SLI

Configuration of Time-based SLI:

  1. Optionally, select a service to narrow the SLI down to that service. Leave the selection empty to apply the SLI to the whole Application Perspective.
  2. Optionally, select an endpoint from the dropdown to narrow the SLI down further. Leave the selection empty to apply the SLI to the whole service.
  3. Choose the metric on which the SLI configuration is evaluated from the list of supported metrics.

    • The following metrics are supported:

      • Latency
      • Call Count
      • Error rate
      • Erroneous Calls
  4. Select the aggregation for the selected metric.
  5. Enter the threshold value for the selected metric.

After the metric and threshold are selected, the SLI is computed as follows:

SLI = (1 - #minutes_where_threshold_is_violated / #minutes_in_time_window) * 100%
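As a minimal sketch, the formula above can be written in Python as shown below. The per-minute values are made-up sample data, and the sketch assumes that a minute counts as violated when its aggregated metric value exceeds the threshold, which fits a latency metric; how Instana evaluates the other metrics is not covered here.

```python
def time_based_sli(minute_values, threshold):
    """Percentage of minutes in the window that did not violate the threshold."""
    if not minute_values:
        raise ValueError("the time window contains no minutes")
    # Assumption: a value above the threshold violates it.
    violated = sum(1 for value in minute_values if value > threshold)
    return (1 - violated / len(minute_values)) * 100


# Hypothetical 90th percentile latency per minute, in milliseconds.
p90_latency_ms = [12, 30, 20, 18, 24, 22, 40, 19]
sli = time_based_sli(p90_latency_ms, threshold=25)
print(sli)  # 2 of 8 minutes exceed 25 ms, so the SLI is 75.0
```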

The image below shows an example Time-based SLI configuration for the application k8s_demo, service cart, and endpoint POST /shipping/:id. It is scoped on Inbound Calls and uses the 90th percentile of the latency metric with a threshold value of 25 milliseconds.

Time-based SLI

Event-based SLI

Event-based SLI configuration gives you the full flexibility of the Unbounded Analytics query builder to select a subset of good events and a subset of bad events.

  • Good events: As the name indicates, these are the set of calls that indicate the success criteria of a particular service.

    • For example: All HTTP requests of an HTTP service that have the status code 2XX.
  • Bad events: These are the set of calls that indicate the failure criteria of a particular service.

    • For example: All HTTP requests of an HTTP service that have the status code 5XX.

After the good and bad events are defined, the SLI is computed as follows:

SLI = #good_events / (#good_events + #bad_events) * 100%

The image below shows an example Event-based SLI configuration for the application k8s_demo, scoped on Inbound Calls. The set of good events consists of all calls with the status code 2XX, and the set of bad events consists of all calls with the status code 5XX.

Event-based SLI

The resulting widget for this configuration shows the error budget in terms of calls rather than minutes.
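The event-based formula, together with an error budget counted in calls, can be sketched in Python as follows. The call counts are made-up sample data and the function names are illustrative. The budget calculation is an assumption made by analogy with the time-based case (the share of calls that may be bad for a given target); it is not taken from Instana's implementation.

```python
def event_based_sli(good_events: int, bad_events: int) -> float:
    """Percentage of good events among all good and bad events."""
    total = good_events + bad_events
    if total == 0:
        raise ValueError("no good or bad events in the time window")
    return good_events / total * 100


def error_budget_calls(target_percent: float, total_events: int) -> float:
    """Assumed budget: number of calls that may be bad without missing the target."""
    return total_events * (100 - target_percent) / 100


good, bad = 99_850, 150      # hypothetical 2XX and 5XX call counts
sli = event_based_sli(good, bad)
budget = error_budget_calls(99.9, good + bad)
violated = bad > budget      # more bad calls than the budget allows

print(f"SLI={sli:.2f}%, budget={budget:.0f} calls, spent={bad} calls, violated={violated}")
```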

Create SLIs via API

Our SLI Configuration API provides endpoints to create, read, update and delete SLI configurations.

As an example, the following curl command creates a new Time-based SLI configuration named My first SLI for the application ID appId, service ID serviceId, and endpoint ID endpointId. It uses the latency metric with the 90th percentile aggregation and a threshold value of 25 ms, and it considers all calls:

curl --location --request POST "{{base}}/api/settings/sli" \
  --header "Authorization: apiToken {{apiToken}}" \
  --header "Content-Type: application/json" \
  --data '{
    "id": "<ARBITRARY_UNIQUE_SLI_ID>",
    "sliName": "My first SLI",
    "metricConfiguration": {
        "metricName": "latency",
        "metricAggregation": "P90",
        "threshold": 25
    },
    "sliEntity": {
        "sliType": "application",
        "applicationId": "appId",
        "serviceId": "serviceId",
        "endpointId": "endpointId",
        "boundaryScope": "ALL"
    }
  }'

Please note:

  • The API token that you use requires the permission "Configuration of service level indicators".
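For scripted setups, the same request can be prepared in Python with the standard library. This is a sketch that mirrors the curl command above: the endpoint path, headers, and payload fields are copied from it, while the base URL, token, and SLI ID values are placeholders, and the helper name build_create_sli_request is hypothetical. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_create_sli_request(base_url: str, api_token: str, sli: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates an SLI configuration."""
    return urllib.request.Request(
        url=f"{base_url}/api/settings/sli",
        data=json.dumps(sli).encode("utf-8"),
        headers={
            "Authorization": f"apiToken {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same payload as in the curl example; all ID values are placeholders.
sli_config = {
    "id": "<ARBITRARY_UNIQUE_SLI_ID>",
    "sliName": "My first SLI",
    "metricConfiguration": {
        "metricName": "latency",
        "metricAggregation": "P90",
        "threshold": 25,
    },
    "sliEntity": {
        "sliType": "application",
        "applicationId": "appId",
        "serviceId": "serviceId",
        "endpointId": "endpointId",
        "boundaryScope": "ALL",
    },
}

# Placeholder base URL and token; replace both with your own values.
request = build_create_sli_request("https://instana.example.invalid", "YOUR_API_TOKEN", sli_config)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to urllib.request.urlopen(request) once the base URL and token are real.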

Grafana SLO Plugin

As an alternative, if you have other custom dashboarding needs, we offer a Grafana plugin that displays specific SLO information based on data provided by Instana.