Skip to the content.

附录A SLO文档示例



本文介绍了示例游戏服务的SLO。

Status Published
Author Steven Thurgood
Date 2018-02-19
Reviewers David Ferguson
Approvers Betsy Beyer
Approval Date 2018-02-20
Revisit Date 2019-02-01

服务概述

示例游戏服务允许Android和iPhone用户彼此玩游戏。该应用程序在用户的手机上运行,并且移动通过REST API发送回API数据存储区包含所有当前和先前游戏的状态。得分管道读取该表并生成今天,本周和所有时间的最新联赛表。排行榜结果可在应用程序中通过API获得,也可以在公共HTTP服务器上获得。

SLO使用四个星期的滚动窗口。

**SLI和SLO **

Category SLI SLO
API    
Availability The proportion of successful requests, as measured from the load balancer metrics. Any HTTP status other than 500–599 is considered successful.
count of "api" http_requests which do not have a 5XX status code divided by count of all "api" http_requests
97% success
延迟 The proportion of sufficiently fast requests, as measured from the load balancer metrics.
"Sufficiently fast" is defined as < 400 ms, or < 850 ms. count of "api" http_requests with a duration less than or equal to "0.4" seconds divided by count of all "api" http_requests

count of "api" http_requests with a duration less than or equal to "0.85" seconds divided by count of all "api" http_requests
90% of requests < 400 ms
99% of requests < 850 ms
HTTP server    
可用性 根据负载平衡器指标衡量的成功请求的比例。除500–599以外的任何HTTP状态均被视为成功。
count of "web" http_requests which do not have a 5XX status code divided by count of all "web" http_requests
99%
Latency The proportion of sufficiently fast requests, as measured from the load balancer metrics.
“Sufficiently fast” is defined as < 200 ms, or < 1,000 ms.
count of "web" http_requests with a duration less than or equal to "0.2" seconds divided by count of all "web" http_requests

count of "web" http_requests with a duration less than or equal to "1.0" seconds divided by count of all "web" http_requests
90% of requests < 200 ms
99% of requests < 1,000 ms
Score pipeline    
Freshness The proportion of records read from the league table that were updated recently. “Recently” is defined as within 1 minute, or within 10 minutes.
Uses metrics from the API and HTTP server: count of all data_requests for "api" and "web" with freshness less than or equal to 1 minute divided by count of all data_requests

count of all data_requests for "api" and "web" with freshness less than or equal to 10 minutes divided by count of all data_requests
90% of reads use data written within the previous 1 minute. 99% of reads use data written within the previous 10 minutes.
Correctness The proportion of records injected into the state table by a correctness prober that result in the correct data being read from the league table.
A correctness prober injects synthetic data, with known correct outcomes, and exports a success metric: count of all data_requests which were correct divided by count of all data_requests
99.99999% of records injected by the prober result in the correct output.
Completeness The proportion of hours in which 100% of the games in the data store were processed (no records were skipped).
Uses metrics exported by the score pipeline:count of all pipeline runs that processed 100% of the records divided by count of all pipeline runs
99% of pipeline runs cover 100% of the data.

理论

可用性和延迟SLI基于2018年1月1日至2018年1月28日之间的测量。可用性SLO向下舍入到最接近的1%,而延迟SLO时序被舍入到最接近的50 ms。作者选择了所有其他数字,并验证了这些服务正在或高于这些级别运行。

尚未尝试验证这些数字是否与用户体验密切相关。1

错误预算

每个目标都有一个单独的错误预算,定义为该目标的100%减去(-)目标。例如,如果在前四周中有1,000,000个请求发送到API服务器,则API可用性错误预算为1,000,000中的3%(100%-97%):30,000个错误。

当我们的任何目标用尽错误预算时,我们将制定错误预算政策(请参阅附录B)。

说明和警告



  1. 即使SLO中的数字不是严格依据证据的,也有必要对此进行记录,以使将来的读者可以理解这一事实,并做出适当的决定。他们可能会决定值得收集更多证据的投资。