Appendix A. Example SLO Document
This document describes the SLOs for the Example Game Service.
Status | Published |
---|---|
Author | Steven Thurgood |
Date | 2018-02-19 |
Reviewers | David Ferguson |
Approvers | Betsy Beyer |
Approval Date | 2018-02-20 |
Revisit Date | 2019-02-01 |
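Service Overview
The Example Game Service allows Android and iPhone users to play a game with each other. The app runs on users' phones, and moves are sent back to the API via REST. The data store contains the states of all current and previous games. A score pipeline reads the data store and generates up-to-date league tables for today, this week, and all time. League table results are available in the app, via the API, and also on a public HTTP server.
The SLOs use a four-week rolling window.
**SLIs and SLOs**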
Category | SLI | SLO |
---|---|---|
API | ||
Availability | The proportion of successful requests, as measured from the load balancer metrics. Any HTTP status other than 500–599 is considered successful. Computed as: count of "api" http_requests which do not have a 5XX status code, divided by count of all "api" http_requests | 97% success |
Latency | The proportion of sufficiently fast requests, as measured from the load balancer metrics. "Sufficiently fast" is defined as < 400 ms, or < 850 ms. Computed as: count of "api" http_requests with a duration less than or equal to "0.4" seconds, divided by count of all "api" http_requests; and count of "api" http_requests with a duration less than or equal to "0.85" seconds, divided by count of all "api" http_requests | 90% of requests < 400 ms; 99% of requests < 850 ms |
HTTP server | ||
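Availability | The proportion of successful requests, as measured from the load balancer metrics. Any HTTP status other than 500–599 is considered successful. Computed as: count of "web" http_requests which do not have a 5XX status code, divided by count of all "web" http_requests | 99% success |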
Latency | The proportion of sufficiently fast requests, as measured from the load balancer metrics. "Sufficiently fast" is defined as < 200 ms, or < 1,000 ms. Computed as: count of "web" http_requests with a duration less than or equal to "0.2" seconds, divided by count of all "web" http_requests; and count of "web" http_requests with a duration less than or equal to "1.0" seconds, divided by count of all "web" http_requests | 90% of requests < 200 ms; 99% of requests < 1,000 ms |
Score pipeline | ||
Freshness | The proportion of records read from the league table that were updated recently. "Recently" is defined as within 1 minute, or within 10 minutes. Uses metrics from the API and HTTP server: count of all data_requests for "api" and "web" with freshness less than or equal to 1 minute, divided by count of all data_requests; and count of all data_requests for "api" and "web" with freshness less than or equal to 10 minutes, divided by count of all data_requests | 90% of reads use data written within the previous 1 minute; 99% of reads use data written within the previous 10 minutes |
Correctness | The proportion of records injected into the state table by a correctness prober that result in the correct data being read from the league table. A correctness prober injects synthetic data, with known correct outcomes, and exports a success metric: count of all data_requests which were correct, divided by count of all data_requests | 99.99999% of records injected by the prober result in the correct output |
Completeness | The proportion of hours in which 100% of the games in the data store were processed (no records were skipped). Uses metrics exported by the score pipeline: count of all pipeline runs that processed 100% of the records, divided by count of all pipeline runs | 99% of pipeline runs cover 100% of the data |
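Every SLI in the table has the same shape: a count of good events divided by a count of all events. As a minimal sketch, assuming a hypothetical in-memory list of load balancer records (the `Request` type, its field names, and the sample data are illustrative and not part of the service), the API availability and latency SLIs might be computed like this:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Request:
    """Hypothetical load balancer record; field names are assumptions."""
    service: str       # "api" or "web"
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds


def sli(requests: Iterable[Request], service: str,
        is_good: Callable[[Request], bool]) -> float:
    """Return good events / all events for one service."""
    scoped = [r for r in requests if r.service == service]
    if not scoped:
        return 1.0  # assumption: no traffic means nothing failed
    return sum(1 for r in scoped if is_good(r)) / len(scoped)


def availability(requests, service):
    # Any status outside 500-599 counts as a success.
    return sli(requests, service, lambda r: not 500 <= r.status <= 599)


def latency(requests, service, threshold_s):
    # "Sufficiently fast" means duration <= threshold, regardless of status.
    return sli(requests, service, lambda r: r.duration_s <= threshold_s)


# Made-up sample data, for illustration only.
sample = [
    Request("api", 200, 0.12),
    Request("api", 404, 0.30),
    Request("api", 503, 0.90),
    Request("api", 200, 0.70),
    Request("web", 200, 0.05),
]
print(availability(sample, "api"))   # 0.75
print(latency(sample, "api", 0.4))   # 0.5
print(latency(sample, "api", 0.85))  # 0.75
```

In a real deployment these ratios would come from the monitoring system's own query language over the four-week window rather than from a list held in memory.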
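Rationale
Availability and latency SLIs were based on measurements taken between 2018-01-01 and 2018-01-28. Availability SLOs were rounded down to the nearest 1%, and latency SLO timings were rounded to the nearest 50 ms. All other numbers were picked by the author, and the services were verified to be running at or above those levels.
No attempt has yet been made to verify that these numbers correlate strongly with user experience. [1]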
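The rounding described above can be sketched as follows. The measured values are hypothetical (the document does not list the raw measurements), and rounding latency up to the next 50 ms boundary is an assumption; the text only says the timings were rounded to the nearest 50 ms.

```python
import math


def round_down_percent(measured_percent: float) -> int:
    """Round an availability measurement down to a whole percent."""
    return math.floor(measured_percent)


def round_up_ms(measured_ms: float, step_ms: int = 50) -> int:
    """Round a latency up to the next multiple of step_ms (assumed direction)."""
    return math.ceil(measured_ms / step_ms) * step_ms


# Hypothetical measurements, for illustration only.
print(round_down_percent(97.6))  # 97
print(round_up_ms(812))          # 850
print(round_up_ms(400))          # 400
```

Rounding availability down and latency up both loosen the target slightly, which leaves a little headroom relative to what was measured.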
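Error Budget
Each objective has a separate error budget, defined as 100% minus the goal for that objective. For example, if there have been 1,000,000 requests to the API server in the previous four weeks, the API availability error budget is 3% (100% - 97%) of 1,000,000: 30,000 errors.
We will enact the error budget policy (see Appendix B) when any of our objectives has exhausted its error budget.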
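The arithmetic above can be written out as a short sketch. The function names are illustrative, and treating the budget as exhausted once the remaining budget reaches zero is an assumption about how the policy trigger would be evaluated.

```python
def error_budget(total_events: int, slo_percent: int) -> int:
    """Allowed bad events: (100% - SLO) of all events in the window."""
    return total_events * (100 - slo_percent) // 100


def budget_remaining(bad_events: int, total_events: int, slo_percent: int) -> int:
    """Bad events still allowed before the budget is spent (may go negative)."""
    return error_budget(total_events, slo_percent) - bad_events


def budget_exhausted(bad_events: int, total_events: int, slo_percent: int) -> bool:
    # Assumption: the budget counts as exhausted once nothing remains.
    return budget_remaining(bad_events, total_events, slo_percent) <= 0


# The API availability example: 1,000,000 requests against a 97% SLO.
print(error_budget(1_000_000, 97))              # 30000
print(budget_remaining(12_000, 1_000_000, 97))  # 18000
print(budget_exhausted(30_000, 1_000_000, 97))  # True
```

This version takes whole-number percentages only; an SLO such as 99.99999% would need exact fractional arithmetic instead.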
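Clarifications and Caveats
- Request metrics are measured at the load balancer. This measurement may fail to accurately capture cases where user requests never reached the load balancer.
- We count only HTTP 5XX status messages as error codes; everything else is counted as success.
- The test data used by the correctness prober contains approximately 200 tests, which are injected every 1 second. Our error budget is 48 errors every four weeks.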
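The figure of 48 errors can be checked with a short calculation. This sketch assumes that all of the roughly 200 tests are injected each second for the full four-week window, and it uses exact fractions to avoid floating-point error at seven nines.

```python
from fractions import Fraction

SECONDS_PER_WEEK = 7 * 24 * 60 * 60
WINDOW_WEEKS = 4
TESTS_PER_SECOND = 200            # assumption: ~200 tests injected every second
SLO = Fraction("99.99999") / 100  # correctness SLO from the table

records = TESTS_PER_SECOND * SECONDS_PER_WEEK * WINDOW_WEEKS
budget = records * (1 - SLO)

print(records)        # 483840000
print(float(budget))  # 48.384
print(int(budget))    # 48
```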
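[1] Even if the numbers in the SLO are not strongly evidence-based, it is necessary to document this so that future readers can understand this fact and make their decisions appropriately. They may decide that it is worth the investment to collect more evidence.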