PCDE Applying Site Reliability Engineering Practices to a Service — All Questions With Answers

Question 1 (easy, multiple choice)

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to define an SLI for service availability. Which metric correctly represents the availability SLI?

Question 2 (easy, multiple choice)

A service has an SLO of 99.9% availability over a 30-day month. What is the error budget in minutes for that month?
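
Worked sketch: for a time-based availability SLO, the error budget is the fraction of the window that may be unavailable. The helper below is illustrative only (the function name is an assumption, not something from the question bank) and simply multiplies the allowed failure fraction by the window length in minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for a time-based availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# 99.9% over a 30-day month, as stated in the question.
print(round(error_budget_minutes(0.999, 30), 1))
```

The same helper covers the other downtime questions in this set by changing the target (for example 0.995 or 0.9999).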

Question 3 (medium, multiple choice)

An engineer needs to set up alerting for error budget burn rate. For a fast burn alert, which burn rate multiplier and evaluation window are recommended?

Question 4 (medium, multiple choice)

A team wants to reduce toil by automating a manual process that generates a report from Cloud Logging logs and emails it weekly. Which solution is most cost-effective and requires minimal operational overhead?

Question 5 (medium, multiple choice)

During an incident, the incident commander identifies a need to scale up a managed instance group. Which IAM role should be granted to the on-call engineer to allow them to modify the instance group?

Question 6 (hard, multiple choice)

A service has an SLO based on request latency: 99% of requests must complete under 500ms over a 28-day window. The team wants to monitor the error budget burn rate. Which Cloud Monitoring SLO type and configuration should be used?

Question 7 (easy, multiple choice)

Which of the following best describes 'toil' in SRE?

Question 8 (medium, multiple choice)

A team wants to implement chaos engineering on Google Kubernetes Engine (GKE) to test resilience against pod failures. Which tool is designed for injecting faults into GKE clusters?

Question 9 (hard, multiple choice)

An SRE team uses Cloud Monitoring SLOs with an error budget policy. They want to receive an alert when the error budget is being consumed at a rate that would exhaust it in 1 hour (fast burn). The SLO is 99.9% over 30 days. The error budget is 43.2 minutes. What burn rate threshold should be used?
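
Worked sketch: a burn rate of 1 consumes the budget exactly over the compliance window, so the burn rate that exhausts the budget in a given time is the window length divided by that time. The function name below is illustrative, and the listed answer options may rest on a different convention: the commonly cited 14.4x fast-burn threshold corresponds to consuming 2% of a 30-day budget in one hour, not to exhausting it.

```python
def burn_rate_to_exhaust(window_hours: float, hours_to_exhaustion: float) -> float:
    """Burn rate at which the whole error budget is gone after the given time."""
    return window_hours / hours_to_exhaustion


window_hours = 30 * 24  # 30-day compliance window

# Literal reading of the question: budget fully consumed in 1 hour.
print(burn_rate_to_exhaust(window_hours, 1))

# Share of the budget consumed by one hour at 14.4x, for comparison.
print(round(14.4 * 1 / window_hours, 4))
```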

Question 10 (medium, multiple choice)

During a blameless postmortem, the team uses the '5 Whys' technique to identify root causes. What is the primary purpose of this technique?

Question 11 (hard, multiple choice)

A company wants to implement a service mesh with fault injection for HTTP services running on Google Kubernetes Engine. They need to inject artificial delays and errors into requests to test resilience. Which GCP service should they use?

Question 12 (easy, multiple choice)

An SRE team defines an SLO for a batch processing pipeline. Which SLI is most appropriate for pipeline freshness?

Question 13 (medium, multi select)

Which TWO components are essential for setting up an on-call rotation for incident management in Google Cloud? (Choose 2)

Question 14 (hard, multi select)

Which THREE are key principles of a blameless postmortem culture in SRE? (Choose 3)

Question 15 (medium, multi select)

Which TWO metrics are appropriate for defining a request-based SLI for a web service? (Choose 2)

Question 16 (medium, multiple choice)

A site reliability engineer is defining an SLO for a service that processes user uploads. The team wants to measure success as the proportion of uploads completed within 2 seconds. Which type of SLI should they use?

Question 17 (easy, multiple choice)

A team's SLO for availability is 99.9% over a 30-day window. They have consumed 80% of their error budget halfway through the month. What is the remaining allowed downtime for the rest of the month?
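
Worked sketch: the remaining allowance is the total budget multiplied by the unconsumed fraction. Note that the elapsed time (halfway through the month) does not enter the arithmetic. The function name is illustrative.

```python
def remaining_budget_minutes(slo_target: float, window_days: int,
                             consumed_fraction: float) -> float:
    """Minutes of error budget left after a given fraction has been spent."""
    total = (1.0 - slo_target) * window_days * 24 * 60
    return total * (1.0 - consumed_fraction)


# 99.9% over 30 days with 80% of the budget already consumed.
print(round(remaining_budget_minutes(0.999, 30, 0.80), 2))
```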

Question 18 (hard, multiple choice)

You are configuring error budget burn rate alerts for an SLO with a 30-day window. The SLO target is 99.9%. You want to alert if the error budget is projected to be fully consumed in 2 hours, using a fast burn rate. Which alerting policy configuration should you use?

Question 19 (medium, multiple choice)

An SRE team wants to reduce toil associated with manual database schema migrations. They currently run SQL scripts manually during maintenance windows. Which Google Cloud service is most appropriate to automate this process in a repeatable way?

Question 20 (medium, multiple choice)

During a blameless postmortem after an incident, the team identified that the root cause was a misconfigured load balancer health check. Which practice should the team prioritize to prevent recurrence?

Question 21 (easy, multiple choice)

Which Google Cloud service can be used to inject artificial delays into HTTP traffic to test service resilience?

Question 22 (hard, multiple choice)

A team uses Cloud Monitoring SLO monitoring with a request-based SLI. The SLO is defined as the proportion of requests returning HTTP 200 with latency under 500ms over a 30-day window. They notice that the SLO is being violated due to a slow increase in latency from a specific backend. Which alerting strategy will best detect this gradual degradation early?

Question 23 (medium, multiple choice)

An SRE team wants to reduce toil by automating the response to common alert notifications. For example, when a disk usage alert fires, they want to automatically run a script to clean up temporary files. Which Google Cloud service is best suited for this?

Question 24 (easy, multiple choice)

What is the primary purpose of an error budget?

Question 25 (medium, multiple choice)

An e-commerce platform wants to implement chaos engineering on its Google Kubernetes Engine cluster to test resilience against network latency. Which tool is specifically designed for this purpose on GKE?

Question 26 (hard, multiple choice)

Your microservice has an SLO of 99.95% availability over 30 days. A 5-minute outage occurs. The error budget consumption rate for that hour is extremely high. You want to alert on this quick consumption using a fast burn alert. What burn rate threshold and lookback window should you configure to detect if the budget would be exhausted in under 2 hours?

Question 27 (medium, multiple choice)

A team wants to create an SLO for a batch data pipeline that processes files hourly. They want to measure whether each batch completes successfully within the hour. Which type of SLI should they use?

Question 28 (medium, multi select)

An SRE team wants to implement error budget burn rate alerts for a service with SLO 99.9% over 30 days. They need to be notified both when the error budget is being consumed rapidly (full consumption in ~2 hours) and when it is being consumed slowly (full consumption in ~6 days). Which two alert configurations should they use? (Choose 2)

Question 29 (hard, multi select)

A team is conducting a blameless postmortem after a production incident. Which three actions are part of an effective blameless postmortem process? (Choose 3)

Question 30 (medium, multi select)

A team wants to reduce toil in their operations. Which two of the following are characteristics of toil according to Google SRE principles? (Choose 2)

Question 31 (easy, multiple choice)

An SRE team wants to define an SLO for a microservice that processes HTTP requests. They need an SLI that measures the proportion of requests that are answered within 200ms with a non-5xx status code. Which type of SLI should they use?

Question 32 (medium, multiple choice)

A team has an SLO of 99.9% availability over a 30-day month. They burn through their entire error budget in the first 10 days. Which of the following is the MOST appropriate immediate action according to SRE principles?

Question 33 (hard, multiple choice)

An SRE team uses Cloud Monitoring to create an SLO for a service with a 99.9% availability target over 28 days. They set up a fast burn-rate alert on the error budget with a lookback window of 1 hour and a burn rate factor of 14. At what error budget consumption percentage will the alert fire?
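
Worked sketch: assuming the alert fires once the burn rate has been sustained across the whole lookback window, the share of budget consumed by then is the burn rate multiplied by the lookback, divided by the compliance window. The function name is illustrative.

```python
def budget_consumed_fraction(burn_rate: float, lookback_hours: float,
                             window_days: int) -> float:
    """Fraction of the error budget spent after lookback_hours at burn_rate."""
    return burn_rate * lookback_hours / (window_days * 24)


# Burn rate factor 14, 1-hour lookback, 28-day window, as in the question.
pct = budget_consumed_fraction(14, 1, 28) * 100
print(f"{pct:.2f}%")
```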

Question 34 (medium, multiple choice)

A team is automating toil reduction and needs to identify tasks that qualify as toil. Which of the following is a defining characteristic of toil according to SRE principles?

Question 35 (hard, multiple choice)

A site reliability engineer wants to implement chaos engineering on a Google Kubernetes Engine (GKE) cluster by injecting network latency into pods of a specific deployment. Which tool or service should they use?

Question 36 (easy, multiple choice)

An SRE team uses Cloud Monitoring to alert on error budget burn rate. They configure a slow burn alert with a 6-hour lookback window and a burn rate factor of 5. What is the purpose of this slow burn alert?

Question 37 (medium, multiple choice)

An organization wants to reduce toil by automating a recurring process: every night, a script must run Cloud Build to rebuild a Docker image and deploy it to a GKE cluster. The script currently requires manual invocation by an engineer. Which Google Cloud service can trigger this automation on a schedule without manual intervention?

Question 38 (medium, multiple choice)

An SRE team is creating a postmortem after a service outage. They want to ensure the process is blameless and focuses on systemic improvements. Which practice is central to a blameless postmortem?

Question 39 (hard, multiple choice)

A team uses Traffic Director with Envoy proxies to manage traffic in a service mesh on Compute Engine. They want to introduce fault injection to test resilience by injecting a 5-second delay in 10% of requests to a specific backend service. Which resource should they configure?

Question 40 (easy, multiple choice)

An SRE team is setting up an on-call rotation for incident response. They want to use a tool that integrates with Cloud Monitoring and can escalate incidents if not acknowledged. Which service should they integrate with Cloud Monitoring?

Question 41 (medium, multiple choice)

A team defines an SLO for a data pipeline: 99.9% of data records should be processed within 1 hour of ingestion. They need an SLI to measure this. Which SLI is most appropriate?

Question 42 (hard, multiple choice)

An engineer needs to create an SLO in Cloud Monitoring for a service that processes requests. The SLO should measure the proportion of requests that complete successfully within 300ms. The metric type for successful requests is custom.googleapis.com/myapp/success_count and for total requests is custom.googleapis.com/myapp/total_count. Which type of SLO should they create?
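
Illustrative sketch: when separate "good" and "total" counter metrics exist, one way to express the SLI is a request-based good/total ratio in the Cloud Monitoring `ServiceLevelObjective` resource. The snippet only builds the request body as a Python dict. The goal, rolling period, and display name are hypothetical values not given in the question, and it assumes `success_count` already counts only requests that finished within 300 ms. Real filters may also need a `resource.type` clause.

```python
import json

# Metric types are taken from the question; everything else is a placeholder.
GOOD_METRIC = "custom.googleapis.com/myapp/success_count"
TOTAL_METRIC = "custom.googleapis.com/myapp/total_count"


def request_based_slo(goal: float, rolling_days: int) -> dict:
    """Build a ServiceLevelObjective body using a good/total ratio SLI."""
    return {
        "displayName": f"{goal:.1%} of requests succeed within 300 ms",
        "goal": goal,
        "rollingPeriod": f"{rolling_days * 86400}s",
        "serviceLevelIndicator": {
            "requestBased": {
                "goodTotalRatio": {
                    "goodServiceFilter": f'metric.type="{GOOD_METRIC}"',
                    "totalServiceFilter": f'metric.type="{TOTAL_METRIC}"',
                }
            }
        },
    }


# Hypothetical target: 99.9% over a 30-day rolling window.
slo = request_based_slo(goal=0.999, rolling_days=30)
print(json.dumps(slo, indent=2))
```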

Question 43 (medium, multi select)

An SRE team wants to implement a blameless postmortem culture after incidents. Which TWO practices are essential for a blameless postmortem?

Question 44 (hard, multi select)

A team needs to automate the reduction of toil in their operations. Which THREE of the following are valid strategies to reduce toil according to SRE principles?

Question 45 (medium, multi select)

An SRE team is designing an incident management process. Which TWO components are part of a typical incident command structure?

Question 46 (medium, multiple choice)

An SRE team defines an SLI as the proportion of good requests to valid requests over a 1-minute window. They set an SLO of 99.9% availability over 30 days. Which error budget burn rate alert configuration should they use to detect rapid consumption of the error budget within 1 hour?

Question 47 (easy, multiple choice)

A team wants to reduce toil from manual database backups. They estimate the toil takes 10 hours per week. What is the maximum amount of toil they should allow to keep toil under 50% of their time according to SRE best practices?
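
Worked sketch: the question does not state the working time it measures against, so the 40-hour week below is an assumption made only for illustration, as are the function names. Under that assumption the 50% cap and the current toil share follow directly.

```python
def toil_cap_hours(working_hours: float, cap: float = 0.5) -> float:
    """Maximum weekly toil hours that stays within the given cap."""
    return working_hours * cap


def toil_share(toil_hours: float, working_hours: float) -> float:
    """Fraction of working time currently spent on toil."""
    return toil_hours / working_hours


WEEK_HOURS = 40  # assumed, not given in the question
print(toil_cap_hours(WEEK_HOURS))   # cap in hours per week
print(toil_share(10, WEEK_HOURS))   # current share for 10 hours of toil
```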

Question 48 (medium, multiple choice)

During an incident, the incident commander delegates tasks to multiple teams. After the incident is resolved, the team holds a postmortem. Which of the following is a key principle of a blameless postmortem culture?

Question 49 (hard, multiple choice)

A service has an SLO of 99.9% availability over a 30-day rolling window. The team wants to use Cloud Monitoring to create a request-based SLO. Which configuration is correct?

Question 50 (medium, multiple choice)

A team notices that a critical microservice often fails when the downstream database is slow. They want to test the service's resilience by injecting latency into the database dependency. Which GCP tool should they use?

Question 51 (easy, multiple choice)

What is the primary purpose of an error budget?

Question 52 (medium, multiple choice)

An SRE team wants to automate a repetitive manual task that involves moving files from Cloud Storage to BigQuery and then deleting the source files. Which GCP service is BEST suited for this toil reduction?

Question 53 (hard, multiple choice)

A service has an SLO of 99.99% availability over 28 days. The team wants to set up a slow burn alert that will notify them within 6 hours if error budget consumption is at 5x the budgeted rate. How much error budget has been consumed after 6 hours at this rate? The total error budget for 28 days is 4 minutes and 2 seconds (242 seconds).

Question 54 (easy, multiple choice)

Which of the following is an example of toil according to SRE principles?

Question 55 (medium, multiple choice)

During an incident, the incident commander notices that multiple teams are working on the same issue without coordination. Which structure should be implemented to improve incident response?

Question 56 (hard, multiple choice)

A team wants to implement Chaos Engineering on GKE to test the resilience of their microservices by randomly killing pods and injecting network latency. Which tool is specifically designed for this purpose on GKE?

Question 57 (medium, multiple choice)

A team has a service with an SLO of 99.5% uptime over 30 days. They track error budget and want to alert when the error budget is almost exhausted. What is their total error budget in minutes per month?

Question 58 (medium, multi select)

An SRE team is defining SLIs for a data pipeline that ingests events from a Pub/Sub topic and writes to BigQuery. Which two metrics are good SLIs for pipeline freshness? (Choose TWO.)

Question 59 (hard, multi select)

A team wants to implement an on-call rotation using Cloud Monitoring and third-party tools. Which three components are essential for setting up on-call alerting? (Choose THREE.)

Question 60 (medium, multi select)

Which two practices are characteristic of a blameless postmortem? (Choose TWO.)

Question 61 (medium, multiple choice)

A team wants to define an SLO for a microservice that processes batch jobs. The service is considered healthy if each batch completes within 60 minutes. There are 100 batches per day. Which SLI should be used?

Question 62 (easy, multiple choice)

A site reliability engineer needs to alert when the error budget burn rate exceeds 14x the budget over a 1-hour window. Which type of alert should be configured in Cloud Monitoring?

Question 63 (hard, multiple choice)

An SRE team uses Cloud Monitoring SLOs with a request-based SLI for a microservice. They want to alert when the error budget is projected to be exhausted within 2 hours at the current burn rate. The SLO target is 99.9% over 30 days. Which approach should they use?

Question 64 (medium, multiple choice)

A service has an SLO of 99.9% availability over a 30-day window. The team wants to automate a deployment rollback if the error budget burn rate exceeds 10x over a 30-minute window. Which combination of Cloud Monitoring and Cloud Build should be used?

Question 65 (easy, multiple choice)

A Site Reliability Engineer is tasked with reducing toil in their team. They identify that resetting expired database connections manually is a common task. What is the best way to automate this toil?

Question 66 (hard, multiple choice)

During a postmortem for a service outage, the team identifies that the root cause was a configuration change that disabled TLS on a critical internal service. The change was made by an automated deployment pipeline. Which tool or practice should be implemented to prevent this in the future?

Question 67 (medium, multiple choice)

A team wants to improve incident response by creating a runbook for a common failure scenario: a database replication lag exceeds 5 seconds. Which Cloud Monitoring feature should be used to automatically trigger the runbook?

Question 68 (easy, multiple choice)

An SRE team wants to ensure that no single person can deploy to production without a peer review. Which Google Cloud service or feature should they use?

Question 69 (medium, multiple choice)

A service expects to receive 10,000 requests per second. The team needs to monitor request latency with an SLI that measures the proportion of requests that complete in under 100 ms. The latency distribution is right-skewed. Which approach should be used to define the SLI in Cloud Monitoring?

Question 70 (hard, multiple choice)

A team runs a service on Google Kubernetes Engine (GKE) and wants to inject faults to test resilience. They need to introduce latency into requests to a specific microservice without modifying code. Which tool should they use?

Question 71 (medium, multiple choice)

An SRE team wants to track the amount of toil their team performs each week and set a goal to keep it under 50% of working time. Which approach should they use?

Question 72 (easy, multiple choice)

A company has an SLA of 99.95% availability for its service. The SRE team defines an SLO of 99.99% availability. The error budget is calculated as 0.01% over a 30-day window. How much downtime is allowed per month according to the error budget?

Question 73 (medium, multi select)

An SRE team is implementing a chaos engineering practice on GKE. They want to test the resilience of a microservice by injecting failures. Which TWO tools or services can they use? (Choose 2.)

Question 74 (hard, multi select)

A team uses Cloud Monitoring SLOs for a service that has an SLO of 99.9% availability. They want to create alerts that notify the on-call engineer when the error budget is burning too fast. Which TWO conditions should they configure? (Choose 2.)

Question 75 (medium, multi select)

A site reliability engineer is leading a blameless postmortem for an incident. Which THREE practices should be included? (Choose 3.)

Question 76 (medium, multiple choice)

A team wants to define an SLO for their service based on availability. Over a 30-day window, the service handled 1,000,000 requests, of which 999,500 succeeded. What is the achieved availability, and what is the error budget consumed if the SLO is 99.95%?
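
Worked sketch: for a request-based SLO, achieved availability is good requests divided by total requests, and the budget consumed is the number of failed requests divided by the number the SLO allows to fail. Function names are illustrative.

```python
def availability(good: int, total: int) -> float:
    """Achieved availability as a fraction of successful requests."""
    return good / total


def budget_consumed(good: int, total: int, slo_target: float) -> float:
    """Fraction of the request-based error budget already spent."""
    allowed_failures = (1.0 - slo_target) * total
    return (total - good) / allowed_failures


# Figures from the question: 999,500 good out of 1,000,000, SLO 99.95%.
print(f"{availability(999_500, 1_000_000):.2%}")
print(f"{budget_consumed(999_500, 1_000_000, 0.9995):.0%}")
```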

Question 77 (easy, multiple choice)

An SRE team wants to alert when the error budget burn rate exceeds 14x the allowed rate over a 1-hour window. Which Cloud Monitoring alert policy configuration is appropriate?
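
Illustrative sketch: in Cloud Monitoring, a burn-rate condition is typically written as a metric-threshold condition whose filter uses the `select_slo_burn_rate` time-series selector. The snippet only assembles the condition as a Python dict. The SLO resource name is a placeholder, and the zero-second duration and display name are assumptions made for the example.

```python
# Placeholder resource name; substitute real project, service, and SLO IDs.
SLO_NAME = "projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives/SLO_ID"


def burn_rate_condition(slo_name: str, lookback: str, threshold: float) -> dict:
    """Build an alert-policy condition that fires above a burn-rate threshold."""
    return {
        "displayName": f"Burn rate above {threshold}x over {lookback}",
        "conditionThreshold": {
            "filter": f'select_slo_burn_rate("{slo_name}", "{lookback}")',
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": "0s",  # assumed: fire as soon as the threshold is crossed
        },
    }


# 14x over a 1-hour lookback, as described in the question.
condition = burn_rate_condition(SLO_NAME, "60m", 14)
print(condition["conditionThreshold"]["filter"])
```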

Question 78 (hard, multiple choice)

A service has an SLO of 99.9% availability over 30 days. In the first 10 days, the service has already consumed 60% of the error budget. Which action best aligns with SRE principles?

Question 79 (medium, multiple choice)

An engineer needs to reduce toil. Which of the following tasks is considered toil according to SRE principles?

Question 80 (medium, multiple choice)

A team wants to automate the response to a common incident: restarting a service when it becomes unhealthy. Which GCP service is best suited to trigger a Cloud Function based on a Cloud Monitoring alert?

Question 81 (easy, multiple choice)

What is the primary purpose of a blameless postmortem in incident management?

Question 82 (hard, multiple choice)

A service uses Cloud SQL for MySQL. To test resilience, you want to inject latency into database queries. Which chaos engineering approach is most suitable on Google Cloud?

Question 83 (medium, multiple choice)

An SRE team wants to track the amount of toil each week and ensure it does not exceed 50% of the team's time. Which approach is most aligned with SRE best practices?

Question 84 (easy, multiple choice)

Which of the following is NOT a characteristic of toil as defined by SRE?

Question 85 (medium, multiple choice)

During an incident, the incident commander decides to escalate to a higher severity level. Which of the following best describes the incident commander's primary responsibility?

Question 86 (hard, multiple choice)

A team uses Cloud Monitoring SLO monitoring with a request-based SLI for availability. They define good requests as those returning HTTP 200. Which configuration correctly creates this SLO?

Question 87 (medium, multiple choice)

A team wants to implement a slow burn alert for error budget consumption. Which configuration should they use?

Question 88 (medium, multi select)

Which TWO are valid methods to define an SLI for a request-driven service? (Choose 2.)

Question 89 (hard, multi select)

An SRE team wants to conduct chaos engineering on a GKE cluster to test resilience. Which TWO tools or services can be used? (Choose 2.)

Question 90 (easy, multi select)

Which THREE are components of an effective incident management process? (Choose 3.)

Question 91 (easy, multiple choice)

A site reliability engineer defines a service's availability SLI as the percentage of successful requests. Which of the following is the correct formula for this SLI?

Question 92 (medium, multiple choice)

A team has an SLO of 99.9% availability over a 30-day period. How many minutes of downtime does the error budget allow per month?

Question 93 (hard, multiple choice)

An SRE team implements error budget alerting using Cloud Monitoring. They want to set a 'fast burn' alert that triggers when the error budget burn rate exceeds 14x over a 1-hour window. What is the purpose of this alert?

Question 94 (medium, multiple choice)

A DevOps team wants to reduce toil by automating manual, repetitive tasks that have no enduring value and scale with service growth. Which two characteristics define toil according to SRE principles?

Question 95 (easy, multiple choice)

An organization wants to implement a blameless postmortem culture after incidents. Which of the following is a key practice in blameless postmortems?

Question 96 (medium, multiple choice)

During an incident, an SRE team uses an incident command system. Which role is responsible for coordinating communication and resources, but not for technical debugging?

Question 97 (hard, multiple choice)

A company uses Cloud Monitoring to create an SLO for a service. They want to define a request-based SLO with a ratio of good requests to valid requests. Which of the following is a valid way to define the SLI in Cloud Monitoring SLOs?

Question 98 (medium, multiple choice)

A team wants to inject latency faults into their microservices running on Google Kubernetes Engine (GKE) to test resilience. Which tool or service should they use?

Question 99 (easy, multiple choice)

An SRE team wants to track the amount of toil their team performs each week. According to SRE best practices, what is the recommended maximum percentage of time that should be spent on toil?

Question 100 (medium, multiple choice)

An SRE team uses PagerDuty for on-call rotation. They receive a critical alert at 2 AM. According to incident management best practices, what should the on-call engineer do first?

Question 101 (hard, multiple choice)

A company uses Cloud Monitoring SLO monitoring with error budget alerts. They set a slow burn alert with a 5x burn rate over a 6-hour window. If the error budget is 0.1% over 30 days, approximately how long would it take to exhaust the budget at a 5x burn rate?
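
Worked sketch: at a constant burn rate, the time to exhaust the whole budget is the compliance window divided by the burn rate. The 6-hour lookback only determines how quickly the alert notices the burn and does not change this figure. The function name is illustrative.

```python
def days_to_exhaustion(window_days: float, burn_rate: float) -> float:
    """Days until the error budget is fully spent at a constant burn rate."""
    return window_days / burn_rate


# 30-day window at a sustained 5x burn rate, as in the question.
print(days_to_exhaustion(30, 5))
```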

Question 102 (easy, multiple choice)

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which Google Cloud service can be used to automate repetitive operational tasks such as restarting a VM or clearing a cache, as part of toil reduction?

Question 103 (medium, multi-select)

An SRE team wants to implement chaos engineering on their GKE cluster. Which TWO options are valid tools or services for injecting faults into GKE workloads?

Question 104 (hard, multi-select)

A company wants to set up error budget alerting in Cloud Monitoring for a service with a 99.9% SLO over 30 days. They want to receive alerts when the error budget burn rate reaches certain thresholds. Which TWO of the following are typical recommendations for alerting thresholds?

Question 105 (medium, multi-select)

During a postmortem, an SRE team identifies several contributing factors. Which THREE items should be included in the action items section of a blameless postmortem?

Question 106 (medium, multiple choice)

A site reliability team wants to define an SLO for a service with a target availability of 99.9% over a 30-day window. The error budget is exhausted. Which action MUST the team take?

Question 107 (easy, multiple choice)

An SRE wants to measure latency SLI for a web service. Which metric is the BEST indicator of user-perceived performance?

Question 108 (hard, multiple choice)

A team uses Cloud Monitoring to create an SLO for a request-based service. They want to alert when the error budget burn rate exceeds 14x the budget for a short window. Which alert type and window should they configure?

Question 109 (medium, multiple choice)

An SRE team notices that a routine database cleanup task takes 30 minutes of manual effort each week. The task does not add enduring value and scales linearly with the number of databases. How should the team classify this work?

Question 110 (easy, multiple choice)

After a major incident, the SRE team conducts a postmortem. Which practice is ESSENTIAL for a blameless culture?

Question 111 (hard, multiple choice)

An SRE team wants to perform fault injection testing on a GKE cluster by injecting network latency into a specific set of pods. Which tool should they use?

Question 112 (medium, multiple choice)

A team uses Cloud Monitoring to track availability SLI as good-request-count / valid-request-count. They want to create a window-based SLO. Which metric filter should they use for the numerator?

Question 113 (easy, multiple choice)

An SLA guarantees 99.9% monthly uptime. The team's SLO is 99.95% and error budget is 0.05%. What is the maximum allowed downtime per month according to the SLA?
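
As a rough sketch of the downtime arithmetic, allowed downtime is the unavailable fraction multiplied by the length of the window. The example assumes a 30-day month (the question says only "monthly"), and the function name is hypothetical. Note that the SLA figure and the stricter internal SLO give different allowances.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (100 - target_percent) / 100 * window_minutes


# Rounded for display; the raw values carry floating-point noise.
print(round(allowed_downtime_minutes(99.9), 2))   # SLA target
print(round(allowed_downtime_minutes(99.95), 2))  # internal SLO target
```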

Question 114 (medium, multiple choice)

During an incident, the incident commander delegates tasks to multiple teams. Which communication model is MOST effective to reduce noise?

Question 115 (medium, multi-select)

An SRE team wants to automate a manual process that involves multiple steps and conditional logic (e.g., if a backup fails, retry with a different method). Which TWO Google Cloud services can they use to orchestrate this workflow? (Choose 2 answers)

Question 116 (hard, multi-select)

A team wants to implement error budget alerts in Cloud Monitoring. They need TWO policies to detect both rapid and gradual budget consumption. Which TWO alert policies should they configure? (Choose 2 answers)

Question 117 (easy, multi-select)

Which TWO statements correctly describe characteristics of toil in SRE? (Choose 2 answers)

Question 118 (medium, multi-select)

A team wants to reduce toil by automating a recurring cloud resource update. Which THREE Google Cloud services can be used together to build an automated pipeline? (Choose 3 answers)

Question 119 (hard, multi-select)

An SRE team is conducting a blameless postmortem after an outage. Which THREE elements should be included in the postmortem document? (Choose 3 answers)

Question 120 (medium, multi-select)

A team wants to implement chaos engineering on GKE to test resilience. Which THREE fault types can be injected using Chaos Mesh? (Choose 3 answers)

Question 121 (easy, multiple choice)

An SRE team has defined a service's availability SLI as the proportion of successful requests over a 5-minute window. They set an SLO of 99.9% over 30 days. What is the error budget for a 30-day period?

Question 122 (medium, multiple choice)

A team wants to set up alerting on error budget burn rate for a service with an SLO of 99.9%. They want to detect when the error budget is being consumed at a rate that would exhaust it in less than 24 hours, using a 1-hour assessment window. What is the appropriate burn rate threshold for a fast burn alert?
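
One way to reason about a threshold like this is to divide the SLO window by the desired time to exhaustion. The sketch below assumes a 30-day SLO window, which the question does not state, and uses a hypothetical function name; it shows the general relationship rather than a value prescribed by Cloud Monitoring.

```python
def burn_rate_for_exhaustion(window_days: float, hours_to_exhaust: float) -> float:
    """Constant burn rate that empties the error budget in the given time."""
    if hours_to_exhaust <= 0:
        raise ValueError("hours_to_exhaust must be positive")
    return (window_days * 24) / hours_to_exhaust


def budget_share_consumed(burn_rate: float, lookback_hours: float, window_days: float) -> float:
    """Fraction of the total budget spent during the lookback at that rate."""
    return burn_rate * lookback_hours / (window_days * 24)


# Assumed 30-day window, budget gone in 24 hours
rate = burn_rate_for_exhaustion(30, 24)
print(rate)
# Share of the budget spent within a 1-hour assessment window at that rate
print(round(budget_share_consumed(rate, 1, 30), 4))
```

The SLO target itself (99.9% here) does not enter the calculation, because burn rate is expressed relative to the budget rather than in absolute error counts.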

Question 123 (hard, multiple choice)

You are implementing a chaos engineering experiment on a GKE cluster using Chaos Mesh. You want to test the resilience of a microservice by injecting a 5-second delay into 50% of HTTP requests to a specific service. Which Chaos Mesh resource should you use?

Question 124 (medium, multi-select)

A site reliability team wants to reduce toil in their incident management process. They currently receive alerts via email and manually create tickets, page the on-call engineer, and update a shared spreadsheet. Which TWO Google Cloud services can help automate these tasks and reduce toil?

Question 1easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to define an SLI for service availability. Which metric correctly represents the availability SLI?

Question 2easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A service has an SLO of 99.9% availability over a 30-day month. What is the error budget in minutes for that month?

Question 3mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An engineer needs to set up alerting for error budget burn rate. For a fast burn alert, which burn rate multiplier and evaluation window are recommended?

Question 4mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to reduce toil by automating a manual process that generates a report from Cloud Logging logs and emails it weekly. Which solution is most cost-effective and requires minimal operational overhead?

Question 5mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

During an incident, the incident commander identifies a need to scale up a managed instance group. Which IAM role should be granted to the on-call engineer to allow them to modify the instance group?

Question 6hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A service has an SLO based on request latency: 99% of requests must complete under 500ms over a 28-day window. The team wants to monitor the error budget burn rate. Which Cloud Monitoring SLO type and configuration should be used?

Question 7easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which of the following best describes 'toil' in SRE?

Question 8mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to implement chaos engineering on Google Kubernetes Engine (GKE) to test resilience against pod failures. Which tool is designed for injecting faults into GKE clusters?

Question 9hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team uses Cloud Monitoring SLOs with an error budget policy. They want to receive an alert when the error budget is exhausted at a rate that would exhaust it in 1 hour (fast burn). The SLO is 99.9% over 30 days. The error budget is 43.2 minutes. What burn rate threshold should be used?

Question 10mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

During a blameless postmortem, the team uses the '5 Whys' technique to identify root causes. What is the primary purpose of this technique?

Question 11hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A company wants to implement a service mesh with fault injection for HTTP services running on Google Kubernetes Engine. They need to inject artificial delays and errors into requests to test resilience. Which GCP service should they use?

Question 12easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team defines an SLO for a batch processing pipeline. Which SLI is most appropriate for pipeline freshness?

Question 13mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which TWO components are essential for setting up an incident management on-call rotation in Google Cloud? (Choose 2)

Question 14hardmulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which THREE are key principles of a blameless postmortem culture in SRE? (Choose 3)

Question 15mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which TWO metrics are appropriate for defining a request-based SLI for a web service? (Choose 2)

Question 16mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A site reliability engineer is defining an SLO for a service that processes user uploads. The team wants to measure success as the proportion of uploads completed within 2 seconds. Which type of SLI should they use?

Question 17easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team's SLO for availability is 99.9% over a 30-day window. They have consumed 80% of their error budget halfway through the month. What is the remaining allowed downtime for the rest of the month?

Question 18hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

You are configuring error budget burn rate alerts for an SLO with a 30-day window. The SLO target is 99.9%. You want to alert if the error budget is projected to be fully consumed in 2 hours, using a fast burn rate. Which alerting policy configuration should you use?

Question 19mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to reduce toil associated with manual database schema migrations. They currently run SQL scripts manually during maintenance windows. Which Google Cloud service is most appropriate to automate this process in a repeatable way?

Question 20mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

During a blameless postmortem after an incident, the team identified that the root cause was a misconfigured load balancer health check. Which practice should the team prioritize to prevent recurrence?

Question 21easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which Google Cloud service can be used to inject artificial delays into HTTP traffic to test service resilience?

Question 22hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team uses Cloud Monitoring SLO monitoring with a request-based SLI. The SLO is defined as the proportion of requests returning HTTP 200 with latency under 500ms over a 30-day window. They notice that the SLO is being violated due to a slow increase in latency from a specific backend. Which alerting strategy will best detect this gradual degradation early?

Question 23mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to reduce toil by automating the response to common alert notifications. For example, when a disk usage alert fires, they want to automatically run a script to clean up temporary files. Which Google Cloud service is best suited for this?

Question 24easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

What is the primary purpose of an error budget?

Question 25mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An e-commerce platform wants to implement chaos engineering on its Google Kubernetes Engine cluster to test resilience against network latency. Which tool is specifically designed for this purpose on GKE?

Question 26hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Your microservice has an SLO of 99.95% availability over 30 days. A 5-minute outage occurs. The error budget consumption rate for that hour is extremely high. You want to alert on this quick consumption using a fast burn alert. What burn rate threshold and lookback window should you configure to detect if the budget would be exhausted in under 2 hours?

Question 27mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to create an SLO for a batch data pipeline that processes files hourly. They want to measure whether each batch completes successfully within the hour. Which type of SLI should they use?

Question 28mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to implement error budget burn rate alerts for a service with SLO 99.9% over 30 days. They need to be notified both when the error budget is being consumed rapidly (full consumption in ~2 hours) and when it is being consumed slowly (full consumption in ~6 days). Which two alert configurations should they use? (Choose 2)

Question 29hardmulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team is conducting a blameless postmortem after a production incident. Which three actions are part of an effective blameless postmortem process? (Choose 3)

Question 30mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to reduce toil in their operations. Which two of the following are characteristics of toil according to Google SRE principles? (Choose 2)

Question 31easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to define an SLO for a microservice that processes HTTP requests. They need an SLI that measures the proportion of requests that are answered within 200ms with a non-5xx status code. Which type of SLI should they use?

Question 32mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team has an SLO of 99.9% availability over a 30-day month. They burn through their entire error budget in the first 10 days. Which of the following is the MOST appropriate immediate action according to SRE principles?

Question 33hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team uses Cloud Monitoring to create an SLO for a service with a 99.9% availability target over 28 days. They set up a fast burn-rate alert on the error budget with a lookback window of 1 hour and a burn rate factor of 14. At what error budget consumption percentage will the alert fire?

Question 34mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team is automating toil reduction and needs to identify tasks that qualify as toil. Which of the following is a defining characteristic of toil according to SRE principles?

Question 35hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A site reliability engineer wants to implement chaos engineering on a Google Kubernetes Engine (GKE) cluster by injecting network latency into pods of a specific deployment. Which tool or service should they use?

Question 36easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team uses Cloud Monitoring to alert on error budget burn rate. They configure a slow burn alert with a 6-hour lookback window and a burn rate factor of 5. What is the purpose of this slow burn alert?

Question 37mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An organization wants to reduce toil by automating a recurring process: every night, a script must run Cloud Build to rebuild a Docker image and deploy it to a GKE cluster. The script currently requires manual invocation by an engineer. Which Google Cloud service can trigger this automation on a schedule without manual intervention?

Question 38mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team is creating a postmortem after a service outage. They want to ensure the process is blameless and focuses on systemic improvements. Which practice is central to a blameless postmortem?

Question 39hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team uses Traffic Director with Envoy proxies to manage traffic in a service mesh on Compute Engine. They want to introduce fault injection to test resilience by injecting a 5-second delay in 10% of requests to a specific backend service. Which resource should they configure?

Question 40easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team is setting up an on-call rotation for incident response. They want to use a tool that integrates with Cloud Monitoring and can escalate incidents if not acknowledged. Which service should they integrate with Cloud Monitoring?

Question 41mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team defines an SLO for a data pipeline: 99.9% of data records should be processed within 1 hour of ingestion. They need an SLI to measure this. Which SLI is most appropriate?

Question 42hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An engineer needs to create an SLO in Cloud Monitoring for a service that processes requests. The SLO should measure the proportion of requests that complete successfully within 300ms. The metric type for successful requests is custom.googleapis.com/myapp/success_count and for total requests is custom.googleapis.com/myapp/total_count. Which type of SLO should they create?

Question 43mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to implement a blameless postmortem culture after incidents. Which TWO practices are essential for a blameless postmortem?

Question 44hardmulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team needs to automate the reduction of toil in their operations. Which THREE of the following are valid strategies to reduce toil according to SRE principles?

Question 45mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team is designing an incident management process. Which TWO components are part of a typical incident command structure?

Question 46mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team defines an SLI as the proportion of good requests to valid requests over a 1-minute window. They set an SLO of 99.9% availability over 30 days. Which error budget burn rate alert configuration should they use to detect rapid consumption of the error budget within 1 hour?

Question 47easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to reduce toil from manual database backups. They estimate the toil takes 10 hours per week. What is the maximum amount of toil they should allow to keep toil under 50% of their time according to SRE best practices?

Question 48mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

During an incident, the incident commander delegates tasks to multiple teams. After the incident is resolved, the team holds a postmortem. Which of the following is a key principle of a blameless postmortem culture?

Question 49hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A service has an SLO of 99.9% availability over a 30-day rolling window. The team wants to use Cloud Monitoring to create a request-based SLO. Which configuration is correct?

Question 50mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team notices that a critical microservice often fails when the downstream database is slow. They want to test the service's resilience by injecting latency into the database dependency. Which GCP tool should they use?

Question 51easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

What is the primary purpose of an error budget?

Question 52mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to automate a repetitive manual task that involves moving files from Cloud Storage to BigQuery and then deleting the source files. Which GCP service is BEST suited for this toil reduction?

Question 53hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A service has an SLO of 99.99% availability over 28 days. The team wants to set up a slow burn alert that will notify them within 6 hours if error budget consumption is at 5x the budgeted rate. How much error budget has been consumed after 6 hours at this rate? The total error budget for 28 days is 4 minutes and 2 seconds (242 seconds).

Question 54easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which of the following is an example of toil according to SRE principles?

Question 55mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

During an incident, the incident commander notices that multiple teams are working on the same issue without coordination. Which structure should be implemented to improve incident response?

Question 56hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to implement Chaos Engineering on GKE to test the resilience of their microservices by randomly killing pods and injecting network latency. Which tool is specifically designed for this purpose on GKE?

Question 57mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team has a service with an SLO of 99.5% uptime over 30 days. They track error budget and want to alert when the error budget is almost exhausted. What is their total error budget in minutes per month?

Question 58mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team is defining SLIs for a data pipeline that ingests events from a pub/sub topic and writes to BigQuery. Which two metrics are good SLIs for pipeline freshness? (Choose TWO.)

Question 59hardmulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to implement an on-call rotation using Cloud Monitoring and third-party tools. Which three components are essential for setting up on-call alerting? (Choose THREE.)

Question 60mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

Which two practices are characteristic of a blameless postmortem? (Choose TWO.)

Question 61mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to define an SLO for a microservice that processes batch jobs. The service is considered healthy if each batch completes within 60 minutes. There are 100 batches per day. Which SLI should be used?

Question 62easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A site reliability engineer needs to alert when the error budget burn rate exceeds 14x the budget over a 1-hour window. Which type of alert should be configured in Cloud Monitoring?

Question 63hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team uses Cloud Monitoring SLOs with request-based SLI for a microservice. They want to alert when the error budget is projected to be exhausted within 2 hours at current burn rate. The SLO target is 99.9% over 30 days. Which approach should they use?

Question 64mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A service has an SLO of 99.9% availability over a 30-day window. The team wants to automate a deployment rollback if the error budget burn rate exceeds 10x over a 30-minute window. Which combination of Cloud Monitoring and Cloud Build should be used?

Question 65easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A Site Reliability Engineer is tasked with reducing toil in their team. They identify that resetting expired database connections manually is a common task. What is the best way to automate this toil?

Question 66hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

During a postmortem for a service outage, the team identifies that the root cause was a configuration change that disabled TLS on a critical internal service. The change was made by an automated deployment pipeline. Which tool or practice should be implemented to prevent this in the future?

Question 67mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to improve incident response by creating a runbook for a common failure scenario: a database replication lag exceeds 5 seconds. Which Cloud Monitoring feature should be used to automatically trigger the runbook?

Question 68easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to ensure that no single person can deploy to production without a peer review. Which Google Cloud service or feature should they use?

Question 69mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A service expects to receive 10,000 requests per second. The team needs to monitor request latency with an SLI that measures the proportion of requests that complete in under 100 ms. The latency distribution is right-skewed. Which approach should be used to define the SLI in Cloud Monitoring?

Question 70hardmultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team runs a service on Google Kubernetes Engine (GKE) and wants to inject faults to test resilience. They need to introduce latency into requests to a specific microservice without modifying code. Which tool should they use?

Question 71mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team wants to track the amount of toil their team performs each week and set a goal to keep it under 50% of working time. Which approach should they use?

Question 72easymultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A company has an SLA of 99.95% availability for its service. The SRE team defines an SLO of 99.99% availability. The error budget is calculated as 0.01% over a 30-day window. How much downtime is allowed per month according to the error budget?

Question 73mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

An SRE team is implementing a chaos engineering practice on GKE. They want to test the resilience of a microservice by injecting failures. Which TWO tools or services can they use? (Choose 2.)

Question 74hardmulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team uses Cloud Monitoring SLOs for a service that has an SLO of 99.9% availability. They want to create alerts that notify the on-call engineer when the error budget is burning too fast. Which TWO conditions should they configure? (Choose 2.)

Question 75mediummulti select

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A site reliability engineer is leading a blameless postmortem for an incident. Which THREE practices should be included? (Choose 3.)

Question 76mediummultiple choice

Read the full Applying Site Reliability Engineering Practices to a Service explanation →

A team wants to define an SLO for their service based on availability. Over a 30-day window, the service handled 1,000,000 requests, of which 999,500 succeeded. What is the achieved availability, and what is the error budget consumed if the SLO is 99.95%?

Question 77easymultiple choice

An SRE team wants to alert when the error budget burn rate exceeds 14x the allowed rate over a 1-hour window. Which Cloud Monitoring alert policy configuration is appropriate?

Question 78 (hard, multiple choice)

A service has an SLO of 99.9% availability over 30 days. In the first 10 days, the service has already consumed 60% of the error budget. Which action best aligns with SRE principles?

Question 79 (medium, multiple choice)

An engineer needs to reduce toil. Which of the following tasks is considered toil according to SRE principles?

Question 80 (medium, multiple choice)

A team wants to automate the response to a common incident: restarting a service when it becomes unhealthy. Which GCP service is best suited to trigger a Cloud Function based on a Cloud Monitoring alert?

Question 81 (easy, multiple choice)

What is the primary purpose of a blameless postmortem in incident management?

Question 82 (hard, multiple choice)

A service uses Cloud SQL for MySQL. To test resilience, you want to inject latency into database queries. Which chaos engineering approach is most suitable on Google Cloud?

Question 83 (medium, multiple choice)

An SRE team wants to track the amount of toil each week and ensure it does not exceed 50% of the team's time. Which approach is most aligned with SRE best practices?

Question 84 (easy, multiple choice)

Which of the following is NOT a characteristic of toil as defined by SRE?

Question 85 (medium, multiple choice)

During an incident, the incident commander decides to escalate to a higher severity level. Which of the following best describes the incident commander's primary responsibility?

Question 86 (hard, multiple choice)

A team uses Cloud Monitoring SLO monitoring with a request-based SLI for availability. They define good requests as those returning HTTP 200. Which configuration correctly creates this SLO?

Question 87 (medium, multiple choice)

A team wants to implement a slow burn alert for error budget consumption. Which configuration should they use?

Question 88 (medium, multi select)

Which TWO are valid methods to define an SLI for a request-driven service? (Choose 2.)

Question 89 (hard, multi select)

An SRE team wants to conduct chaos engineering on a GKE cluster to test resilience. Which TWO tools or services can be used? (Choose 2.)

Question 90 (easy, multi select)

Which THREE are components of an effective incident management process? (Choose 3.)

Question 91 (easy, multiple choice)

A site reliability engineer defines a service's availability SLI as the percentage of successful requests. Which of the following is the correct formula for this SLI?

Question 92 (medium, multiple choice)

A team has an SLO of 99.9% availability over a 30-day period. How many minutes of downtime does the error budget allow per month?

Question 93 (hard, multiple choice)

An SRE team implements error budget alerting using Cloud Monitoring. They want to set a 'fast burn' alert that triggers when the error budget burn rate exceeds 14x over a 1-hour window. What is the purpose of this alert?

Question 94 (medium, multiple choice)

A DevOps team wants to reduce toil by automating manual, repetitive tasks that have no enduring value and scale with service growth. Which two characteristics define toil according to SRE principles?

Question 95 (easy, multiple choice)

An organization wants to implement a blameless postmortem culture after incidents. Which of the following is a key practice in blameless postmortems?

Question 96 (medium, multiple choice)

During an incident, an SRE team uses an incident command system. Which role is responsible for coordinating communication and resources, but not for technical debugging?

Question 97 (hard, multiple choice)

A company uses Cloud Monitoring to create an SLO for a service. They want to define a request-based SLO with a ratio of good requests to valid requests. Which of the following is a valid way to define the SLI in Cloud Monitoring SLOs?

Question 98 (medium, multiple choice)

A team wants to inject latency faults into their microservices running on Google Kubernetes Engine (GKE) to test resilience. Which tool or service should they use?

Question 99 (easy, multiple choice)

An SRE team wants to track the amount of toil their team performs each week. According to SRE best practices, what is the recommended maximum percentage of time that should be spent on toil?

Question 100 (medium, multiple choice)

An SRE team uses PagerDuty for on-call rotation. They receive a critical alert at 2 AM. According to incident management best practices, what should the on-call engineer do first?

Question 101 (hard, multiple choice)

A company uses Cloud Monitoring SLO monitoring with error budget alerts. They set a slow burn alert with a 5x burn rate over a 6-hour window. If the error budget is 0.1% over 30 days, approximately how long would it take to exhaust the budget at a 5x burn rate?
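
The relationship between burn rate and time to exhaustion can be sketched as below. A burn rate of 1x consumes the whole budget in exactly one SLO window, so a higher rate shortens that time proportionally. Both helper names are hypothetical, and the sketch assumes the burn rate stays constant.

```python
def days_to_exhaust(window_days: float, burn_rate: float) -> float:
    """Days until the error budget is gone at a constant burn rate."""
    return window_days / burn_rate


def budget_used_in_alert_window(burn_rate: float, alert_hours: float, window_days: float) -> float:
    """Fraction of the total budget consumed during one alert evaluation window."""
    return burn_rate * alert_hours / (window_days * 24)


# Values from the question: 5x burn rate, 6-hour alert window, 30-day SLO window.
print(f"{days_to_exhaust(30, 5):.1f} days")               # prints "6.0 days"
print(f"{budget_used_in_alert_window(5, 6, 30):.1%}")     # prints "4.2%"
```

Note that the size of the budget (0.1% here) does not affect the result; only the SLO window length and the burn rate do.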

Question 102 (easy, multiple choice)

Which Google Cloud service can be used to automate repetitive operational tasks such as restarting a VM or clearing a cache, as part of toil reduction?

Question 103 (medium, multi select)

An SRE team wants to implement chaos engineering on their GKE cluster. Which TWO options are valid tools or services for injecting faults into GKE workloads? (Choose 2.)

Question 104 (hard, multi select)

A company wants to set up error budget alerting in Cloud Monitoring for a service with a 99.9% SLO over 30 days. They want to receive alerts when the error budget burn rate reaches certain thresholds. Which TWO of the following are typical recommendations for alerting thresholds? (Choose 2.)

Question 105 (medium, multi select)

During a postmortem, an SRE team identifies several contributing factors. Which THREE items should be included in the action items section of a blameless postmortem? (Choose 3.)

Question 106 (medium, multiple choice)

A site reliability team has defined an SLO for a service with a target availability of 99.9% over a 30-day window, and the error budget is now exhausted. Which action MUST the team take?

Question 107 (easy, multiple choice)

An SRE wants to measure a latency SLI for a web service. Which metric is the BEST indicator of user-perceived performance?

Question 108 (hard, multiple choice)

A team uses Cloud Monitoring to create an SLO for a request-based service. They want to alert when the error budget burn rate exceeds 14x the allowed rate over a short window. Which alert type and window should they configure?

Question 109 (medium, multiple choice)

An SRE team notices that a routine database cleanup task takes 30 minutes of manual effort each week. The task does not add enduring value and scales linearly with the number of databases. How should the team classify this work?

Question 110 (easy, multiple choice)

After a major incident, the SRE team conducts a postmortem. Which practice is ESSENTIAL for a blameless culture?

Question 111 (hard, multiple choice)

An SRE team wants to perform fault injection testing on a GKE cluster by injecting network latency into a specific set of pods. Which tool should they use?

Question 112 (medium, multiple choice)

A team uses Cloud Monitoring to track availability SLI as good-request-count / valid-request-count. They want to create a window-based SLO. Which metric filter should they use for the numerator?

Question 113 (easy, multiple choice)

An SLA guarantees 99.9% monthly uptime. The team's SLO is 99.95% and error budget is 0.05%. What is the maximum allowed downtime per month according to the SLA?

Question 114 (medium, multiple choice)

During an incident, the incident commander delegates tasks to multiple teams. Which communication model is MOST effective to reduce noise?

Question 115 (medium, multi select)

An SRE team wants to automate a manual process that involves multiple steps and conditional logic (e.g., if a backup fails, retry with a different method). Which TWO Google Cloud services can they use to orchestrate this workflow? (Choose 2 answers)

Question 116 (hard, multi select)

A team wants to implement error budget alerts in Cloud Monitoring. They need TWO policies to detect both rapid and gradual budget consumption. Which TWO alert policies should they configure? (Choose 2 answers)

Question 117 (easy, multi select)

Which TWO statements correctly describe characteristics of toil in SRE? (Choose 2 answers)

Question 118 (medium, multi select)

A team wants to reduce toil by automating a recurring cloud resource update. Which THREE Google Cloud services can be used together to build an automated pipeline? (Choose 3 answers)

Question 119 (hard, multi select)

An SRE team is conducting a blameless postmortem after an outage. Which THREE elements should be included in the postmortem document? (Choose 3 answers)

Question 120 (medium, multi select)

A team wants to implement chaos engineering on GKE to test resilience. Which THREE fault types can be injected using Chaos Mesh? (Choose 3 answers)

Question 121 (easy, multiple choice)

An SRE team has defined a service's availability SLI as the proportion of successful requests over a 5-minute window. They set an SLO of 99.9% over 30 days. What is the error budget for a 30-day period?

Question 122 (medium, multiple choice)

A team wants to set up alerting on error budget burn rate for a service with an SLO of 99.9%. They want to detect when the error budget is being consumed at a rate that would exhaust it in less than 24 hours, using a 1-hour assessment window. What is the appropriate burn rate threshold for a fast burn alert?
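
One way to reason about such thresholds is to invert the exhaustion-time relationship. The sketch below assumes a 30-day SLO window, which the question does not state, and uses a hypothetical helper name. It is an arithmetic illustration, not a statement of what Cloud Monitoring requires.

```python
def burn_rate_for_exhaustion(hours_to_exhaust: float, window_days: float = 30) -> float:
    """Constant burn rate that would consume the full budget in the given number of hours."""
    return window_days * 24 / hours_to_exhaust


# Assumption: 30-day SLO window (not given in the question).
print(f"{burn_rate_for_exhaustion(24):.1f}x")  # prints "30.0x": budget gone in 24 hours
print(f"{burn_rate_for_exhaustion(50):.1f}x")  # prints "14.4x": budget gone in 50 hours
```

For comparison, the widely cited 14.4x fast-burn threshold corresponds to spending 2% of a 30-day budget in one hour, which would exhaust the budget in about 50 hours rather than 24.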

Question 123 (hard, multiple choice)

You are implementing a chaos engineering experiment on a GKE cluster using Chaos Mesh. You want to test the resilience of a microservice by injecting a 5-second delay into 50% of HTTP requests to a specific service. Which Chaos Mesh resource should you use?

Question 124 (medium, multi select)

A site reliability team wants to reduce toil in their incident management process. They currently receive alerts via email and manually create tickets, page the on-call engineer, and update a shared spreadsheet. Which TWO Google Cloud services can help automate these tasks and reduce toil? (Choose 2 answers)