Practice PCDE Applying Site Reliability Engineering Practices to a Service questions with full explanations on every answer.
Click any question to see the full explanation and answer options, or start a focused practice session above.
1. An SRE team wants to define an SLI for service availability. Which metric correctly represents the availability SLI?
2. A service has an SLO of 99.9% availability over a 30-day month. What is the error budget in minutes for that month?
3. An engineer needs to set up alerting for error budget burn rate. For a fast burn alert, which burn rate multiplier and evaluation window are recommended?
4. A team wants to reduce toil by automating a manual process that generates a report from Cloud Logging logs and emails it weekly. Which solution is most cost-effective and requires minimal operational overhead?
5. During an incident, the incident commander identifies a need to scale up a managed instance group. Which IAM role should be granted to the on-call engineer to allow them to modify the instance group?
6. A service has an SLO based on request latency: 99% of requests must complete under 500ms over a 28-day window. The team wants to monitor the error budget burn rate. Which Cloud Monitoring SLO type and configuration should be used?
7. Which of the following best describes 'toil' in SRE?
8. A team wants to implement chaos engineering on Google Kubernetes Engine (GKE) to test resilience against pod failures. Which tool is designed for injecting faults into GKE clusters?
9. An SRE team uses Cloud Monitoring SLOs with an error budget policy. They want to receive an alert when the error budget is consumed at a rate that would exhaust it in 1 hour (fast burn). The SLO is 99.9% over 30 days. The error budget is 43.2 minutes. What burn rate threshold should be used?
10. During a blameless postmortem, the team uses the '5 Whys' technique to identify root causes. What is the primary purpose of this technique?
11. A company wants to implement a service mesh with fault injection for HTTP services running on Google Kubernetes Engine. They need to inject artificial delays and errors into requests to test resilience. Which GCP service should they use?
12. An SRE team defines an SLO for a batch processing pipeline. Which SLI is most appropriate for pipeline freshness?
13. Which TWO components are essential for setting up an incident management on-call rotation in Google Cloud? (Choose 2)
14. Which THREE are key principles of a blameless postmortem culture in SRE? (Choose 3)
15. Which TWO metrics are appropriate for defining a request-based SLI for a web service? (Choose 2)
16. A site reliability engineer is defining an SLO for a service that processes user uploads. The team wants to measure success as the proportion of uploads completed within 2 seconds. Which type of SLI should they use?
17. A team's SLO for availability is 99.9% over a 30-day window. They have consumed 80% of their error budget halfway through the month. What is the remaining allowed downtime for the rest of the month?
18. You are configuring error budget burn rate alerts for an SLO with a 30-day window. The SLO target is 99.9%. You want to alert if the error budget is projected to be fully consumed in 2 hours, using a fast burn rate. Which alerting policy configuration should you use?
19. An SRE team wants to reduce toil associated with manual database schema migrations. They currently run SQL scripts manually during maintenance windows. Which Google Cloud service is most appropriate to automate this process in a repeatable way?
20. During a blameless postmortem after an incident, the team identified that the root cause was a misconfigured load balancer health check. Which practice should the team prioritize to prevent recurrence?
21. Which Google Cloud service can be used to inject artificial delays into HTTP traffic to test service resilience?
22. A team uses Cloud Monitoring SLO monitoring with a request-based SLI. The SLO is defined as the proportion of requests returning HTTP 200 with latency under 500ms over a 30-day window. They notice that the SLO is being violated due to a slow increase in latency from a specific backend. Which alerting strategy will best detect this gradual degradation early?
23. An SRE team wants to reduce toil by automating the response to common alert notifications. For example, when a disk usage alert fires, they want to automatically run a script to clean up temporary files. Which Google Cloud service is best suited for this?
24. What is the primary purpose of an error budget?
25. An e-commerce platform wants to implement chaos engineering on its Google Kubernetes Engine cluster to test resilience against network latency. Which tool is specifically designed for this purpose on GKE?
26. Your microservice has an SLO of 99.95% availability over 30 days. A 5-minute outage occurs. The error budget consumption rate for that hour is extremely high. You want to alert on this quick consumption using a fast burn alert. What burn rate threshold and lookback window should you configure to detect if the budget would be exhausted in under 2 hours?
27. A team wants to create an SLO for a batch data pipeline that processes files hourly. They want to measure whether each batch completes successfully within the hour. Which type of SLI should they use?
28. An SRE team wants to implement error budget burn rate alerts for a service with SLO 99.9% over 30 days. They need to be notified both when the error budget is being consumed rapidly (full consumption in ~2 hours) and when it is being consumed slowly (full consumption in ~6 days). Which two alert configurations should they use? (Choose 2)
29. A team is conducting a blameless postmortem after a production incident. Which three actions are part of an effective blameless postmortem process? (Choose 3)
30. A team wants to reduce toil in their operations. Which two of the following are characteristics of toil according to Google SRE principles? (Choose 2)
31. An SRE team wants to define an SLO for a microservice that processes HTTP requests. They need an SLI that measures the proportion of requests that are answered within 200ms with a non-5xx status code. Which type of SLI should they use?
32. A team has an SLO of 99.9% availability over a 30-day month. They burn through their entire error budget in the first 10 days. Which of the following is the MOST appropriate immediate action according to SRE principles?
33. An SRE team uses Cloud Monitoring to create an SLO for a service with a 99.9% availability target over 28 days. They set up a fast burn-rate alert on the error budget with a lookback window of 1 hour and a burn rate factor of 14. At what error budget consumption percentage will the alert fire?
34. A team is automating toil reduction and needs to identify tasks that qualify as toil. Which of the following is a defining characteristic of toil according to SRE principles?
35. A site reliability engineer wants to implement chaos engineering on a Google Kubernetes Engine (GKE) cluster by injecting network latency into pods of a specific deployment. Which tool or service should they use?
36. An SRE team uses Cloud Monitoring to alert on error budget burn rate. They configure a slow burn alert with a 6-hour lookback window and a burn rate factor of 5. What is the purpose of this slow burn alert?
37. An organization wants to reduce toil by automating a recurring process: every night, a script must run Cloud Build to rebuild a Docker image and deploy it to a GKE cluster. The script currently requires manual invocation by an engineer. Which Google Cloud service can trigger this automation on a schedule without manual intervention?
38. An SRE team is creating a postmortem after a service outage. They want to ensure the process is blameless and focuses on systemic improvements. Which practice is central to a blameless postmortem?
39. A team uses Traffic Director with Envoy proxies to manage traffic in a service mesh on Compute Engine. They want to introduce fault injection to test resilience by injecting a 5-second delay in 10% of requests to a specific backend service. Which resource should they configure?
40. An SRE team is setting up an on-call rotation for incident response. They want to use a tool that integrates with Cloud Monitoring and can escalate incidents if not acknowledged. Which service should they integrate with Cloud Monitoring?
41. A team defines an SLO for a data pipeline: 99.9% of data records should be processed within 1 hour of ingestion. They need an SLI to measure this. Which SLI is most appropriate?
42. An engineer needs to create an SLO in Cloud Monitoring for a service that processes requests. The SLO should measure the proportion of requests that complete successfully within 300ms. The metric type for successful requests is custom.googleapis.com/myapp/success_count and for total requests is custom.googleapis.com/myapp/total_count. Which type of SLO should they create?
43. An SRE team wants to implement a blameless postmortem culture after incidents. Which TWO practices are essential for a blameless postmortem?
44. A team needs to automate the reduction of toil in their operations. Which THREE of the following are valid strategies to reduce toil according to SRE principles?
45. An SRE team is designing an incident management process. Which TWO components are part of a typical incident command structure?
46. An SRE team defines an SLI as the proportion of good requests to valid requests over a 1-minute window. They set an SLO of 99.9% availability over 30 days. Which error budget burn rate alert configuration should they use to detect rapid consumption of the error budget within 1 hour?
47. A team wants to reduce toil from manual database backups. They estimate the toil takes 10 hours per week. What is the maximum amount of toil they should allow to keep toil under 50% of their time according to SRE best practices?
48. During an incident, the incident commander delegates tasks to multiple teams. After the incident is resolved, the team holds a postmortem. Which of the following is a key principle of a blameless postmortem culture?
49. A service has an SLO of 99.9% availability over a 30-day rolling window. The team wants to use Cloud Monitoring to create a request-based SLO. Which configuration is correct?
50. A team notices that a critical microservice often fails when the downstream database is slow. They want to test the service's resilience by injecting latency into the database dependency. Which GCP tool should they use?
51. What is the primary purpose of an error budget?
52. An SRE team wants to automate a repetitive manual task that involves moving files from Cloud Storage to BigQuery and then deleting the source files. Which GCP service is BEST suited for this toil reduction?
53. A service has an SLO of 99.99% availability over 28 days. The team wants to set up a slow burn alert that will notify them within 6 hours if error budget consumption is at 5x the budgeted rate. How much error budget has been consumed after 6 hours at this rate? The total error budget for 28 days is 4 minutes and 2 seconds (242 seconds).
54. Which of the following is an example of toil according to SRE principles?
55. During an incident, the incident commander notices that multiple teams are working on the same issue without coordination. Which structure should be implemented to improve incident response?
56. A team wants to implement Chaos Engineering on GKE to test the resilience of their microservices by randomly killing pods and injecting network latency. Which tool is specifically designed for this purpose on GKE?
57. A team has a service with an SLO of 99.5% uptime over 30 days. They track error budget and want to alert when the error budget is almost exhausted. What is their total error budget in minutes per month?
58. An SRE team is defining SLIs for a data pipeline that ingests events from a Pub/Sub topic and writes to BigQuery. Which two metrics are good SLIs for pipeline freshness? (Choose TWO.)
59. A team wants to implement an on-call rotation using Cloud Monitoring and third-party tools. Which three components are essential for setting up on-call alerting? (Choose THREE.)
60. Which two practices are characteristic of a blameless postmortem? (Choose TWO.)
61. A team wants to define an SLO for a microservice that processes batch jobs. The service is considered healthy if each batch completes within 60 minutes. There are 100 batches per day. Which SLI should be used?
62. A site reliability engineer needs to alert when the error budget burn rate exceeds 14x the budget over a 1-hour window. Which type of alert should be configured in Cloud Monitoring?
63. An SRE team uses Cloud Monitoring SLOs with request-based SLI for a microservice. They want to alert when the error budget is projected to be exhausted within 2 hours at current burn rate. The SLO target is 99.9% over 30 days. Which approach should they use?
64. A service has an SLO of 99.9% availability over a 30-day window. The team wants to automate a deployment rollback if the error budget burn rate exceeds 10x over a 30-minute window. Which combination of Cloud Monitoring and Cloud Build should be used?
65. A Site Reliability Engineer is tasked with reducing toil in their team. They identify that resetting expired database connections manually is a common task. What is the best way to automate this toil?
66. During a postmortem for a service outage, the team identifies that the root cause was a configuration change that disabled TLS on a critical internal service. The change was made by an automated deployment pipeline. Which tool or practice should be implemented to prevent this in the future?
67. A team wants to improve incident response by creating a runbook for a common failure scenario: a database replication lag exceeds 5 seconds. Which Cloud Monitoring feature should be used to automatically trigger the runbook?
68. An SRE team wants to ensure that no single person can deploy to production without a peer review. Which Google Cloud service or feature should they use?
69. A service expects to receive 10,000 requests per second. The team needs to monitor request latency with an SLI that measures the proportion of requests that complete in under 100 ms. The latency distribution is right-skewed. Which approach should be used to define the SLI in Cloud Monitoring?
70. A team runs a service on Google Kubernetes Engine (GKE) and wants to inject faults to test resilience. They need to introduce latency into requests to a specific microservice without modifying code. Which tool should they use?
71. An SRE team wants to track the amount of toil their team performs each week and set a goal to keep it under 50% of working time. Which approach should they use?
72. A company has an SLA of 99.95% availability for its service. The SRE team defines an SLO of 99.99% availability. The error budget is calculated as 0.01% over a 30-day window. How much downtime is allowed per month according to the error budget?
73. An SRE team is implementing a chaos engineering practice on GKE. They want to test the resilience of a microservice by injecting failures. Which TWO tools or services can they use? (Choose 2.)
74. A team uses Cloud Monitoring SLOs for a service that has an SLO of 99.9% availability. They want to create alerts that notify the on-call engineer when the error budget is burning too fast. Which TWO conditions should they configure? (Choose 2.)
75. A site reliability engineer is leading a blameless postmortem for an incident. Which THREE practices should be included? (Choose 3.)
76. A team wants to define an SLO for their service based on availability. Over a 30-day window, the service handled 1,000,000 requests, of which 999,500 succeeded. What is the achieved availability, and what is the error budget consumed if the SLO is 99.95%?
77. An SRE team wants to alert when the error budget burn rate exceeds 14x the allowed rate over a 1-hour window. Which Cloud Monitoring alert policy configuration is appropriate?
78. A service has an SLO of 99.9% availability over 30 days. In the first 10 days, the service has already consumed 60% of the error budget. Which action best aligns with SRE principles?
79. An engineer needs to reduce toil. Which of the following tasks is considered toil according to SRE principles?
80. A team wants to automate the response to a common incident: restarting a service when it becomes unhealthy. Which GCP service is best suited to trigger a Cloud Function based on a Cloud Monitoring alert?
81. What is the primary purpose of a blameless postmortem in incident management?
82. A service uses Cloud SQL for MySQL. To test resilience, you want to inject latency into database queries. Which chaos engineering approach is most suitable on Google Cloud?
83. An SRE team wants to track the amount of toil each week and ensure it does not exceed 50% of the team's time. Which approach is most aligned with SRE best practices?
84. Which of the following is NOT a characteristic of toil as defined by SRE?
85. During an incident, the incident commander decides to escalate to a higher severity level. Which of the following best describes the incident commander's primary responsibility?
86. A team uses Cloud Monitoring SLO monitoring with a request-based SLI for availability. They define good requests as those returning HTTP 200. Which configuration correctly creates this SLO?
87. A team wants to implement a slow burn alert for error budget consumption. Which configuration should they use?
88. Which TWO are valid methods to define an SLI for a request-driven service? (Choose 2.)
89. An SRE team wants to conduct chaos engineering on a GKE cluster to test resilience. Which TWO tools or services can be used? (Choose 2.)
90. Which THREE are components of an effective incident management process? (Choose 3.)
91. A site reliability engineer defines a service's availability SLI as the percentage of successful requests. Which of the following is the correct formula for this SLI?
92. A team has an SLO of 99.9% availability over a 30-day period. How many minutes of downtime does the error budget allow per month?
93. An SRE team implements error budget alerting using Cloud Monitoring. They want to set a 'fast burn' alert that triggers when the error budget burn rate exceeds 14x over a 1-hour window. What is the purpose of this alert?
94. A DevOps team wants to reduce toil by automating manual, repetitive tasks that have no enduring value and scale with service growth. Which two characteristics define toil according to SRE principles?
95. An organization wants to implement a blameless postmortem culture after incidents. Which of the following is a key practice in blameless postmortems?
96. During an incident, an SRE team uses an incident command system. Which role is responsible for coordinating communication and resources, but not for technical debugging?
97. A company uses Cloud Monitoring to create an SLO for a service. They want to define a request-based SLO with a ratio of good requests to valid requests. Which of the following is a valid way to define the SLI in Cloud Monitoring SLOs?
98. A team wants to inject latency faults into their microservices running on Google Kubernetes Engine (GKE) to test resilience. Which tool or service should they use?
99. An SRE team wants to track the amount of toil their team performs each week. According to SRE best practices, what is the recommended maximum percentage of time that should be spent on toil?
100. An SRE team uses PagerDuty for on-call rotation. They receive a critical alert at 2 AM. According to incident management best practices, what should the on-call engineer do first?
101. A company uses Cloud Monitoring SLO monitoring with error budget alerts. They set a slow burn alert with a 5x burn rate over a 6-hour window. If the error budget is 0.1% over 30 days, approximately how long would it take to exhaust the budget at a 5x burn rate?
102. Which Google Cloud service can be used to automate repetitive operational tasks such as restarting a VM or clearing a cache, as part of toil reduction?
103. An SRE team wants to implement chaos engineering on their GKE cluster. Which TWO options are valid tools or services for injecting faults into GKE workloads?
104. A company wants to set up error budget alerting in Cloud Monitoring for a service with a 99.9% SLO over 30 days. They want to receive alerts when the error budget burn rate reaches certain thresholds. Which TWO of the following are typical recommendations for alerting thresholds?
105. During a postmortem, an SRE team identifies several contributing factors. Which THREE items should be included in the action items section of a blameless postmortem?
106. A site reliability team wants to define an SLO for a service with a target availability of 99.9% over a 30-day window. The error budget is exhausted. Which action MUST the team take?
107. An SRE wants to measure latency SLI for a web service. Which metric is the BEST indicator of user-perceived performance?
108. A team uses Cloud Monitoring to create an SLO for a request-based service. They want to alert when the error budget burn rate exceeds 14x the budget for a short window. Which alert type and window should they configure?
109. An SRE team notices that a routine database cleanup task takes 30 minutes of manual effort each week. The task does not add enduring value and scales linearly with the number of databases. How should the team classify this work?
110. After a major incident, the SRE team conducts a postmortem. Which practice is ESSENTIAL for a blameless culture?
111. An SRE team wants to perform fault injection testing on a GKE cluster by injecting network latency into a specific set of pods. Which tool should they use?
112. A team uses Cloud Monitoring to track availability SLI as good-request-count / valid-request-count. They want to create a window-based SLO. Which metric filter should they use for the numerator?
113. An SLA guarantees 99.9% monthly uptime. The team's SLO is 99.95% and error budget is 0.05%. What is the maximum allowed downtime per month according to the SLA?
114. During an incident, the incident commander delegates tasks to multiple teams. Which communication model is MOST effective to reduce noise?
115. An SRE team wants to automate a manual process that involves multiple steps and conditional logic (e.g., if a backup fails, retry with a different method). Which TWO Google Cloud services can they use to orchestrate this workflow? (Choose 2 answers)
116. A team wants to implement error budget alerts in Cloud Monitoring. They need TWO policies to detect both rapid and gradual budget consumption. Which TWO alert policies should they configure? (Choose 2 answers)
117. Which TWO statements correctly describe characteristics of toil in SRE? (Choose 2 answers)
118. A team wants to reduce toil by automating a recurring cloud resource update. Which THREE Google Cloud services can be used together to build an automated pipeline? (Choose 3 answers)
119. An SRE team is conducting a blameless postmortem after an outage. Which THREE elements should be included in the postmortem document? (Choose 3 answers)
120. A team wants to implement chaos engineering on GKE to test resilience. Which THREE fault types can be injected using Chaos Mesh? (Choose 3 answers)
121. An SRE team has defined a service's availability SLI as the proportion of successful requests over a 5-minute window. They set an SLO of 99.9% over 30 days. What is the error budget for a 30-day period?
122. A team wants to set up alerting on error budget burn rate for a service with an SLO of 99.9%. They want to detect when the error budget is being consumed at a rate that would exhaust it in less than 24 hours, using a 1-hour assessment window. What is the appropriate burn rate threshold for a fast burn alert?
123. You are implementing a chaos engineering experiment on a GKE cluster using Chaos Mesh. You want to test the resilience of a microservice by injecting a 5-second delay into 50% of HTTP requests to a specific service. Which Chaos Mesh resource should you use?
124. A site reliability team wants to reduce toil in their incident management process. They currently receive alerts via email and manually create tickets, page the on-call engineer, and update a shared spreadsheet. Which TWO Google Cloud services can help automate these tasks and reduce toil?
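Many of the questions above rest on the same error budget and burn-rate arithmetic. The following is a minimal Python sketch of that arithmetic, assuming the usual definitions: the error budget is (1 − SLO) × window length, and a burn rate of N means the budget is being consumed N times faster than a rate that would use it up exactly at the end of the window. The function names are hypothetical helpers for illustration, not part of any Google Cloud library, and the sketch is not the question bank's answer key.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Allowed downtime in minutes: (1 - SLO) * window length."""
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes


def burn_rate_to_exhaust(window_days: int, hours_to_exhaust: float) -> float:
    """Constant burn rate that uses up the whole budget in `hours_to_exhaust`."""
    return window_days * 24 / hours_to_exhaust


def hours_until_exhausted(burn_rate: float, window_days: int) -> float:
    """Time for a constant burn rate to consume the entire budget."""
    return window_days * 24 / burn_rate


def budget_fraction_consumed(burn_rate: float, lookback_hours: float,
                             window_days: int) -> float:
    """Share of the total budget consumed during the lookback window."""
    return burn_rate * lookback_hours / (window_days * 24)


if __name__ == "__main__":
    # 99.9% over 30 days: about 43.2 minutes of budget
    print(round(error_budget_minutes(99.9, 30), 2))
    # Exhausting a 30-day budget in 2 hours needs a burn rate of 360
    print(burn_rate_to_exhaust(30, 2))
    # A 5x burn rate exhausts a 30-day budget in 144 hours (6 days)
    print(hours_until_exhausted(5, 30))
    # 14.4x sustained for 1 hour consumes about 2% of a 30-day budget
    print(round(budget_fraction_consumed(14.4, 1, 30), 4))
```

Working in hours and fractions of the budget rather than absolute minutes keeps the burn-rate helpers independent of the SLO target, since only the window length matters for how quickly a given multiplier exhausts the budget.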
The Applying Site Reliability Engineering Practices to a Service domain covers the key concepts tested in this area of the PCDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PCDE domains — no account required.
The Courseiva PCDE question bank contains 124 questions in the Applying Site Reliability Engineering Practices to a Service domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Applying Site Reliability Engineering Practices to a Service domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.