Free PCDE Applying Site Reliability Engineering Practices to a Service Practice Questions (2026)

Q: How can I practice Applying Site Reliability Engineering Practices to a Service questions for PCDE?

Click any of the 124 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Applying Site Reliability Engineering Practices to a Service domain.

All PCDE Applying Site Reliability Engineering Practices to a Service questions (124)

Click any question to see the full explanation and answer options, or start a focused practice session above.

An SRE team wants to define an SLI for service availability. Which metric correctly represents the availability SLI?

A service has an SLO of 99.9% availability over a 30-day month. What is the error budget in minutes for that month?

An engineer needs to set up alerting for error budget burn rate. For a fast burn alert, which burn rate multiplier and evaluation window are recommended?

A team wants to reduce toil by automating a manual process that generates a report from Cloud Logging logs and emails it weekly. Which solution is most cost-effective and requires minimal operational overhead?

During an incident, the incident commander identifies a need to scale up a managed instance group. Which IAM role should be granted to the on-call engineer to allow them to modify the instance group?

A service has an SLO based on request latency: 99% of requests must complete under 500ms over a 28-day window. The team wants to monitor the error budget burn rate. Which Cloud Monitoring SLO type and configuration should be used?

Which of the following best describes 'toil' in SRE?

A team wants to implement chaos engineering on Google Kubernetes Engine (GKE) to test resilience against pod failures. Which tool is designed for injecting faults into GKE clusters?

An SRE team uses Cloud Monitoring SLOs with an error budget policy. They want to receive an alert when the error budget is being consumed at a rate that would exhaust it in 1 hour (fast burn). The SLO is 99.9% over 30 days, which gives an error budget of 43.2 minutes. What burn rate threshold should be used?

During a blameless postmortem, the team uses the '5 Whys' technique to identify root causes. What is the primary purpose of this technique?

A company wants to implement a service mesh with fault injection for HTTP services running on Google Kubernetes Engine. They need to inject artificial delays and errors into requests to test resilience. Which GCP service should they use?

An SRE team defines an SLO for a batch processing pipeline. Which SLI is most appropriate for pipeline freshness?

Which TWO components are essential for setting up an on-call rotation for incident management in Google Cloud? (Choose 2)

Which THREE are key principles of a blameless postmortem culture in SRE? (Choose 3)

Which TWO metrics are appropriate for defining a request-based SLI for a web service? (Choose 2)

A site reliability engineer is defining an SLO for a service that processes user uploads. The team wants to measure success as the proportion of uploads completed within 2 seconds. Which type of SLI should they use?

A team's SLO for availability is 99.9% over a 30-day window. They have consumed 80% of their error budget halfway through the month. What is the remaining allowed downtime for the rest of the month?

You are configuring error budget burn rate alerts for an SLO with a 30-day window. The SLO target is 99.9%. You want to alert if the error budget is projected to be fully consumed in 2 hours, using a fast burn rate. Which alerting policy configuration should you use?

An SRE team wants to reduce toil associated with manual database schema migrations. They currently run SQL scripts manually during maintenance windows. Which Google Cloud service is most appropriate to automate this process in a repeatable way?

During a blameless postmortem after an incident, the team identified that the root cause was a misconfigured load balancer health check. Which practice should the team prioritize to prevent recurrence?

Which Google Cloud service can be used to inject artificial delays into HTTP traffic to test service resilience?

A team uses Cloud Monitoring SLO monitoring with a request-based SLI. The SLO is defined as the proportion of requests returning HTTP 200 with latency under 500ms over a 30-day window. They notice that the SLO is being violated due to a slow increase in latency from a specific backend. Which alerting strategy will best detect this gradual degradation early?

An SRE team wants to reduce toil by automating the response to common alert notifications. For example, when a disk usage alert fires, they want to automatically run a script to clean up temporary files. Which Google Cloud service is best suited for this?

What is the primary purpose of an error budget?

An e-commerce platform wants to implement chaos engineering on its Google Kubernetes Engine cluster to test resilience against network latency. Which tool is specifically designed for this purpose on GKE?

Your microservice has an SLO of 99.95% availability over 30 days. A 5-minute outage occurs. The error budget consumption rate for that hour is extremely high. You want to alert on this quick consumption using a fast burn alert. What burn rate threshold and lookback window should you configure to detect if the budget would be exhausted in under 2 hours?

A team wants to create an SLO for a batch data pipeline that processes files hourly. They want to measure whether each batch completes successfully within the hour. Which type of SLI should they use?

An SRE team wants to implement error budget burn rate alerts for a service with SLO 99.9% over 30 days. They need to be notified both when the error budget is being consumed rapidly (full consumption in ~2 hours) and when it is being consumed slowly (full consumption in ~6 days). Which two alert configurations should they use? (Choose 2)

A team is conducting a blameless postmortem after a production incident. Which three actions are part of an effective blameless postmortem process? (Choose 3)

A team wants to reduce toil in their operations. Which two of the following are characteristics of toil according to Google SRE principles? (Choose 2)

An SRE team wants to define an SLO for a microservice that processes HTTP requests. They need an SLI that measures the proportion of requests that are answered within 200ms with a non-5xx status code. Which type of SLI should they use?

A team has an SLO of 99.9% availability over a 30-day month. They burn through their entire error budget in the first 10 days. Which of the following is the MOST appropriate immediate action according to SRE principles?

An SRE team uses Cloud Monitoring to create an SLO for a service with a 99.9% availability target over 28 days. They set up a fast burn-rate alert on the error budget with a lookback window of 1 hour and a burn rate factor of 14. At what error budget consumption percentage will the alert fire?

A team is automating toil reduction and needs to identify tasks that qualify as toil. Which of the following is a defining characteristic of toil according to SRE principles?

A site reliability engineer wants to implement chaos engineering on a Google Kubernetes Engine (GKE) cluster by injecting network latency into pods of a specific deployment. Which tool or service should they use?

An SRE team uses Cloud Monitoring to alert on error budget burn rate. They configure a slow burn alert with a 6-hour lookback window and a burn rate factor of 5. What is the purpose of this slow burn alert?

An organization wants to reduce toil by automating a recurring process: every night, a script must run Cloud Build to rebuild a Docker image and deploy it to a GKE cluster. The script currently requires manual invocation by an engineer. Which Google Cloud service can trigger this automation on a schedule without manual intervention?

An SRE team is creating a postmortem after a service outage. They want to ensure the process is blameless and focuses on systemic improvements. Which practice is central to a blameless postmortem?

A team uses Traffic Director with Envoy proxies to manage traffic in a service mesh on Compute Engine. They want to introduce fault injection to test resilience by injecting a 5-second delay in 10% of requests to a specific backend service. Which resource should they configure?

An SRE team is setting up an on-call rotation for incident response. They want to use a tool that integrates with Cloud Monitoring and can escalate incidents if not acknowledged. Which service should they integrate with Cloud Monitoring?

A team defines an SLO for a data pipeline: 99.9% of data records should be processed within 1 hour of ingestion. They need an SLI to measure this. Which SLI is most appropriate?

An engineer needs to create an SLO in Cloud Monitoring for a service that processes requests. The SLO should measure the proportion of requests that complete successfully within 300ms. The metric type for successful requests is custom.googleapis.com/myapp/success_count and for total requests is custom.googleapis.com/myapp/total_count. Which type of SLO should they create?

An SRE team wants to implement a blameless postmortem culture after incidents. Which TWO practices are essential for a blameless postmortem?

A team needs to automate the reduction of toil in their operations. Which THREE of the following are valid strategies to reduce toil according to SRE principles?

An SRE team is designing an incident management process. Which TWO components are part of a typical incident command structure?

An SRE team defines an SLI as the proportion of good requests to valid requests over a 1-minute window. They set an SLO of 99.9% availability over 30 days. Which error budget burn rate alert configuration should they use to detect rapid consumption of the error budget within 1 hour?

A team wants to reduce toil from manual database backups. They estimate the toil takes 10 hours per week. What is the maximum amount of toil they should allow to keep toil under 50% of their time according to SRE best practices?

During an incident, the incident commander delegates tasks to multiple teams. After the incident is resolved, the team holds a postmortem. Which of the following is a key principle of a blameless postmortem culture?

A service has an SLO of 99.9% availability over a 30-day rolling window. The team wants to use Cloud Monitoring to create a request-based SLO. Which configuration is correct?

A team notices that a critical microservice often fails when the downstream database is slow. They want to test the service's resilience by injecting latency into the database dependency. Which GCP tool should they use?

What is the primary purpose of an error budget?

An SRE team wants to automate a repetitive manual task that involves moving files from Cloud Storage to BigQuery and then deleting the source files. Which GCP service is BEST suited for this toil reduction?

A service has an SLO of 99.99% availability over 28 days. The team wants to set up a slow burn alert that will notify them within 6 hours if error budget consumption is at 5x the budgeted rate. How much error budget has been consumed after 6 hours at this rate? The total error budget for 28 days is 4 minutes and 2 seconds (242 seconds).

Which of the following is an example of toil according to SRE principles?

During an incident, the incident commander notices that multiple teams are working on the same issue without coordination. Which structure should be implemented to improve incident response?

A team wants to implement Chaos Engineering on GKE to test the resilience of their microservices by randomly killing pods and injecting network latency. Which tool is specifically designed for this purpose on GKE?

A team has a service with an SLO of 99.5% uptime over 30 days. They track error budget and want to alert when the error budget is almost exhausted. What is their total error budget in minutes per month?

An SRE team is defining SLIs for a data pipeline that ingests events from a Pub/Sub topic and writes to BigQuery. Which two metrics are good SLIs for pipeline freshness? (Choose TWO.)

A team wants to implement an on-call rotation using Cloud Monitoring and third-party tools. Which three components are essential for setting up on-call alerting? (Choose THREE.)

Which two practices are characteristic of a blameless postmortem? (Choose TWO.)

A team wants to define an SLO for a microservice that processes batch jobs. The service is considered healthy if each batch completes within 60 minutes. There are 100 batches per day. Which SLI should be used?

A site reliability engineer needs to alert when the error budget burn rate exceeds 14x the budget over a 1-hour window. Which type of alert should be configured in Cloud Monitoring?

An SRE team uses Cloud Monitoring SLOs with request-based SLI for a microservice. They want to alert when the error budget is projected to be exhausted within 2 hours at current burn rate. The SLO target is 99.9% over 30 days. Which approach should they use?

A service has an SLO of 99.9% availability over a 30-day window. The team wants to automate a deployment rollback if the error budget burn rate exceeds 10x over a 30-minute window. Which combination of Cloud Monitoring and Cloud Build should be used?

A Site Reliability Engineer is tasked with reducing toil in their team. They identify that resetting expired database connections manually is a common task. What is the best way to automate this toil?

During a postmortem for a service outage, the team identifies that the root cause was a configuration change that disabled TLS on a critical internal service. The change was made by an automated deployment pipeline. Which tool or practice should be implemented to prevent this in the future?

A team wants to improve incident response by creating a runbook for a common failure scenario: a database replication lag exceeds 5 seconds. Which Cloud Monitoring feature should be used to automatically trigger the runbook?

An SRE team wants to ensure that no single person can deploy to production without a peer review. Which Google Cloud service or feature should they use?

A service expects to receive 10,000 requests per second. The team needs to monitor request latency with an SLI that measures the proportion of requests that complete in under 100 ms. The latency distribution is right-skewed. Which approach should be used to define the SLI in Cloud Monitoring?

A team runs a service on Google Kubernetes Engine (GKE) and wants to inject faults to test resilience. They need to introduce latency into requests to a specific microservice without modifying code. Which tool should they use?

An SRE team wants to track the amount of toil their team performs each week and set a goal to keep it under 50% of working time. Which approach should they use?

A company has an SLA of 99.95% availability for its service. The SRE team defines an SLO of 99.99% availability. The error budget is calculated as 0.01% over a 30-day window. How much downtime is allowed per month according to the error budget?

An SRE team is implementing a chaos engineering practice on GKE. They want to test the resilience of a microservice by injecting failures. Which TWO tools or services can they use? (Choose 2.)

A team uses Cloud Monitoring SLOs for a service that has an SLO of 99.9% availability. They want to create alerts that notify the on-call engineer when the error budget is burning too fast. Which TWO conditions should they configure? (Choose 2.)

A site reliability engineer is leading a blameless postmortem for an incident. Which THREE practices should be included? (Choose 3.)

A team wants to define an SLO for their service based on availability. Over a 30-day window, the service handled 1,000,000 requests, of which 999,500 succeeded. What is the achieved availability, and what is the error budget consumed if the SLO is 99.95%?

An SRE team wants to alert when the error budget burn rate exceeds 14x the allowed rate over a 1-hour window. Which Cloud Monitoring alert policy configuration is appropriate?

A service has an SLO of 99.9% availability over 30 days. In the first 10 days, the service has already consumed 60% of the error budget. Which action best aligns with SRE principles?

An engineer needs to reduce toil. Which of the following tasks is considered toil according to SRE principles?

A team wants to automate the response to a common incident: restarting a service when it becomes unhealthy. Which GCP service is best suited to trigger a Cloud Function based on a Cloud Monitoring alert?

What is the primary purpose of a blameless postmortem in incident management?

A service uses Cloud SQL for MySQL. To test resilience, you want to inject latency into database queries. Which chaos engineering approach is most suitable on Google Cloud?

An SRE team wants to track the amount of toil each week and ensure it does not exceed 50% of the team's time. Which approach is most aligned with SRE best practices?

Which of the following is NOT a characteristic of toil as defined by SRE?

During an incident, the incident commander decides to escalate to a higher severity level. Which of the following best describes the incident commander's primary responsibility?

A team uses Cloud Monitoring SLO monitoring with a request-based SLI for availability. They define good requests as those returning HTTP 200. Which configuration correctly creates this SLO?

A team wants to implement a slow burn alert for error budget consumption. Which configuration should they use?

Which TWO are valid methods to define an SLI for a request-driven service? (Choose 2.)

An SRE team wants to conduct chaos engineering on a GKE cluster to test resilience. Which TWO tools or services can be used? (Choose 2.)

Which THREE are components of an effective incident management process? (Choose 3.)

A site reliability engineer defines a service's availability SLI as the percentage of successful requests. Which of the following is the correct formula for this SLI?

A team has an SLO of 99.9% availability over a 30-day period. How many minutes of downtime does the error budget allow per month?

An SRE team implements error budget alerting using Cloud Monitoring. They want to set a 'fast burn' alert that triggers when the error budget burn rate exceeds 14x over a 1-hour window. What is the purpose of this alert?

A DevOps team wants to reduce toil by automating manual, repetitive tasks that have no enduring value and scale with service growth. Which two characteristics define toil according to SRE principles?

An organization wants to implement a blameless postmortem culture after incidents. Which of the following is a key practice in blameless postmortems?

During an incident, an SRE team uses an incident command system. Which role is responsible for coordinating communication and resources, but not for technical debugging?

A company uses Cloud Monitoring to create an SLO for a service. They want to define a request-based SLO with a ratio of good requests to valid requests. Which of the following is a valid way to define the SLI in Cloud Monitoring SLOs?

A team wants to inject latency faults into their microservices running on Google Kubernetes Engine (GKE) to test resilience. Which tool or service should they use?

An SRE team wants to track the amount of toil their team performs each week. According to SRE best practices, what is the recommended maximum percentage of time that should be spent on toil?

An SRE team uses PagerDuty for on-call rotation. They receive a critical alert at 2 AM. According to incident management best practices, what should the on-call engineer do first?

A company uses Cloud Monitoring SLO monitoring with error budget alerts. They set a slow burn alert with a 5x burn rate over a 6-hour window. If the error budget is 0.1% over 30 days, approximately how long would it take to exhaust the budget at a 5x burn rate?

Which Google Cloud service can be used to automate repetitive operational tasks such as restarting a VM or clearing a cache, as part of toil reduction?

An SRE team wants to implement chaos engineering on their GKE cluster. Which TWO options are valid tools or services for injecting faults into GKE workloads?

A company wants to set up error budget alerting in Cloud Monitoring for a service with a 99.9% SLO over 30 days. They want to receive alerts when the error budget burn rate reaches certain thresholds. Which TWO of the following are typical recommendations for alerting thresholds?

During a postmortem, an SRE team identifies several contributing factors. Which THREE items should be included in the action items section of a blameless postmortem?

A site reliability team wants to define an SLO for a service with a target availability of 99.9% over a 30-day window. The error budget is exhausted. Which action MUST the team take?

An SRE wants to measure latency SLI for a web service. Which metric is the BEST indicator of user-perceived performance?

A team uses Cloud Monitoring to create an SLO for a request-based service. They want to alert when the error budget burn rate exceeds 14x the budget for a short window. Which alert type and window should they configure?

An SRE team notices that a routine database cleanup task takes 30 minutes of manual effort each week. The task does not add enduring value and scales linearly with the number of databases. How should the team classify this work?

After a major incident, the SRE team conducts a postmortem. Which practice is ESSENTIAL for a blameless culture?

An SRE team wants to perform fault injection testing on a GKE cluster by injecting network latency into a specific set of pods. Which tool should they use?

A team uses Cloud Monitoring to track availability SLI as good-request-count / valid-request-count. They want to create a window-based SLO. Which metric filter should they use for the numerator?

An SLA guarantees 99.9% monthly uptime. The team's SLO is 99.95% and error budget is 0.05%. What is the maximum allowed downtime per month according to the SLA?

During an incident, the incident commander delegates tasks to multiple teams. Which communication model is MOST effective to reduce noise?

An SRE team wants to automate a manual process that involves multiple steps and conditional logic (e.g., if a backup fails, retry with a different method). Which TWO Google Cloud services can they use to orchestrate this workflow? (Choose 2 answers)

A team wants to implement error budget alerts in Cloud Monitoring. They need TWO policies to detect both rapid and gradual budget consumption. Which TWO alert policies should they configure? (Choose 2 answers)

Which TWO statements correctly describe characteristics of toil in SRE? (Choose 2 answers)

A team wants to reduce toil by automating a recurring cloud resource update. Which THREE Google Cloud services can be used together to build an automated pipeline? (Choose 3 answers)

An SRE team is conducting a blameless postmortem after an outage. Which THREE elements should be included in the postmortem document? (Choose 3 answers)

A team wants to implement chaos engineering on GKE to test resilience. Which THREE fault types can be injected using Chaos Mesh? (Choose 3 answers)

An SRE team has defined a service's availability SLI as the proportion of successful requests over a 5-minute window. They set an SLO of 99.9% over 30 days. What is the error budget for a 30-day period?

A team wants to set up alerting on error budget burn rate for a service with an SLO of 99.9%. They want to detect when the error budget is being consumed at a rate that would exhaust it in less than 24 hours, using a 1-hour assessment window. What is the appropriate burn rate threshold for a fast burn alert?

You are implementing a chaos engineering experiment on a GKE cluster using Chaos Mesh. You want to test the resilience of a microservice by injecting a 5-second delay into 50% of HTTP requests to a specific service. Which Chaos Mesh resource should you use?

A site reliability team wants to reduce toil in their incident management process. They currently receive alerts via email and manually create tickets, page the on-call engineer, and update a shared spreadsheet. Which TWO Google Cloud services can help automate these tasks and reduce toil?
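
Many of the questions above rest on the same error budget and burn rate arithmetic. The following is a minimal Python sketch of that arithmetic, not part of the question bank and not a Cloud Monitoring API. The function names are hypothetical, and it assumes the usual definition of burn rate: how fast the budget is being consumed relative to the rate that would use it up exactly at the end of the SLO window.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full downtime allowed by an availability SLO.

    slo is a fraction (0.999 for 99.9%); window_days is the SLO window.
    """
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate_to_exhaust(window_days: int, hours_to_exhaust: float) -> float:
    """Burn rate at which the whole budget is gone in hours_to_exhaust."""
    return window_days * 24 / hours_to_exhaust


def budget_fraction_consumed(burn_rate: float, lookback_hours: float,
                             window_days: int) -> float:
    """Fraction of the total budget used after lookback_hours at burn_rate."""
    return burn_rate * lookback_hours / (window_days * 24)


def budget_fraction_from_requests(total: int, good: int, slo: float) -> float:
    """Fraction of a request-based budget used, given request counts."""
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    return (total - good) / allowed_bad


if __name__ == "__main__":
    # 99.9% over 30 days: roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999, 30), 2))
    # Exhausting a 30-day budget in 1 hour corresponds to a 720x burn rate.
    print(burn_rate_to_exhaust(30, 1))
    # A 14.4x burn rate over 1 hour uses about 2% of a 30-day budget.
    print(round(budget_fraction_consumed(14.4, 1, 30), 4))
    # 999,500 good out of 1,000,000 against a 99.95% SLO uses the full budget.
    print(round(budget_fraction_from_requests(1_000_000, 999_500, 0.9995), 4))
```

The same helpers cover the slow burn cases: at a 5x burn rate a 30-day budget lasts 30 / 5 = 6 days, which is `burn_rate_to_exhaust` read in reverse.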

Practice all 124 Applying Site Reliability Engineering Practices to a Service questions

Other PCDE exam domains

Building and Implementing CI/CD Pipelines for a Service

Bootstrapping a Google Cloud Organisation for DevOps

Implementing Service Monitoring Strategies

Optimising Service Performance

Plan and manage database infrastructure

Define data structures and implement SQL for Business Intelligence

Design and implement database schemas

Monitor and optimize database performance

Frequently asked questions

What does the Applying Site Reliability Engineering Practices to a Service domain cover on the PCDE exam?

The Applying Site Reliability Engineering Practices to a Service domain covers the key concepts tested in this area of the PCDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PCDE domains — no account required.

How many Applying Site Reliability Engineering Practices to a Service questions are in the PCDE question bank?

The Courseiva PCDE question bank contains 124 questions in the Applying Site Reliability Engineering Practices to a Service domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Applying Site Reliability Engineering Practices to a Service for PCDE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Applying Site Reliability Engineering Practices to a Service questions for PCDE?

Yes — the session launcher on this page draws questions exclusively from the Applying Site Reliability Engineering Practices to a Service domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
