CCNA PCDE SRE Practices Questions


1
Multi-Selectmedium

During a postmortem, an SRE team identifies several contributing factors. Which THREE items should be included in the action items section of a blameless postmortem?

Select 3 answers
A.Verification steps to confirm the fix is effective
B.Due dates for each action item
C.Generic recommendations like 'improve testing'
D.Assign blame to the engineer who caused the incident
E.Specific actions to address root causes, each with a single owner
AnswersA, B, E

Verification ensures the action item resolves the issue.

Why this answer

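Each action item should be specific, have a single owner and a due date, and include a way to verify that the fix worked. Assigning blame and vague recommendations such as 'improve testing' do not belong in the action items, which should address systemic issues.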

2
MCQeasy

An SRE team wants to ensure that no single person can deploy to production without a peer review. Which Google Cloud service or feature should they use?

A.Cloud Build with a trigger that runs only on merges to main branch that have passed pull request approval in Cloud Source Repositories
B.Cloud Audit Logs to monitor who deploys
C.Cloud Scheduler to run deployments at fixed times
D.IAM roles to restrict deployment to a single user
AnswerA

This enforces peer review before deployment.

Why this answer

Option A gates the production build on a merge to the main branch that has already passed pull request approval, so no change can ship without a second reviewer. Cloud Audit Logs only record who deployed after the fact, Cloud Scheduler controls timing rather than review, and restricting deployment to a single user removes peer review entirely.

Binary Authorization attestations are another way to enforce review at deploy time, but that is not among the options. The most direct choice offered is a Cloud Build trigger that runs only on merges with the required approvals.

3
MCQeasy

An organization wants to implement a blameless postmortem culture after incidents. Which of the following is a key practice in blameless postmortems?

A.Firing the responsible engineer to prevent recurrence
B.Identifying contributing factors using techniques like the 5 Whys
C.Assigning blame to the engineer who made the mistake
D.Immediately implementing a fix without documentation
AnswerB

5 Whys is a root cause analysis technique used in blameless postmortems.

Why this answer

Blameless postmortems focus on identifying contributing factors and systemic issues, not individual blame. The 5 Whys technique helps uncover root causes.

4
MCQeasy

An SRE wants to measure latency SLI for a web service. Which metric is the BEST indicator of user-perceived performance?

A.Proportion of requests served in under 200ms.
B.Maximum latency observed in the last 5 minutes.
C.Average latency over the last hour.
D.99th percentile latency.
AnswerA

This directly tracks whether user requests meet a performance target, which is a good SLI.

Why this answer

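The proportion of requests that complete within a defined threshold (e.g., 200ms) directly measures how many users received an acceptably fast response. An average hides slow outliers, a maximum reflects a single worst request, and a 99th percentile is a latency value rather than a proportion measured against a target.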
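The difference between a threshold-based proportion and an average can be seen in a minimal sketch. The latency samples below are made up for illustration, and `proportion_under` is a hypothetical helper, not part of any monitoring API.

```python
from statistics import mean


def proportion_under(latencies_ms, threshold_ms):
    """Fraction of requests that completed strictly faster than the threshold."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Hypothetical sample: nine fast requests and one very slow one.
samples = [50, 60, 70, 80, 90, 100, 110, 120, 130, 3000]

sli = proportion_under(samples, 200)
avg = mean(samples)
print(f"SLI (under 200 ms): {sli:.0%}")
print(f"Mean latency: {avg:.0f} ms")
```

In this sample, nine of ten requests meet the target, yet the mean (381 ms) suggests a typical request misses it, because one outlier dominates the average.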

5
MCQmedium

During an incident, the incident commander notices that multiple teams are working on the same issue without coordination. Which structure should be implemented to improve incident response?

A.Use a chatbot to broadcast updates without a commander
B.Have each team work independently and report after resolution
C.Escalate to the VP of Engineering to make decisions
D.Assign a single incident commander to coordinate all teams
AnswerD

The incident commander delegates tasks and ensures coordinated response.

Why this answer

An incident command system (ICS) establishes a clear hierarchy with an incident commander who coordinates teams, assigns roles (e.g., operations lead, communications lead), and ensures focused response.

6
MCQeasy

An SRE team wants to define an SLI for service availability. Which metric correctly represents the availability SLI?

A.Total requests that succeed / Total requests
B.Total requests that complete within 200 ms / Total requests
C.Total minutes the service is up / Total minutes in the window
D.Number of requests that return a 5xx status code
AnswerA

Correct definition of availability SLI.

Why this answer

Availability as an SLI is measured as the proportion of successful requests to total requests. Option A correctly defines this as 'Total requests that succeed / Total requests'. Option B is incorrect because it measures latency (response time within 200 ms), not availability.

Option C is a measure of uptime, not service availability from the user's perspective. Option D only counts errors, which is not a ratio.

7
MCQhard

Your microservice has an SLO of 99.95% availability over 30 days. A 5-minute outage occurs. The error budget consumption rate for that hour is extremely high. You want to alert on this quick consumption using a fast burn alert. What burn rate threshold and lookback window should you configure to detect if the budget would be exhausted in under 2 hours?

A.Burn rate: 5, Window: 1 hour
B.Burn rate: 14, Window: 6 hours
C.Burn rate: 5, Window: 6 hours
D.Burn rate: 14, Window: 1 hour
AnswerD

Standard fast burn alert; it will trigger as the burn rate far exceeds 14.

Why this answer

The standard fast burn alert in Google SRE practice uses a burn rate of 14 over a 1-hour window. A sustained burn rate of 14 exhausts a 30-day budget in 30/14 ≈ 2.14 days, so the threshold does not itself mean "under 2 hours" (that would need a burn rate of about 360); it is the conventional threshold for catching rapid consumption. A 99.95% SLO over 30 days allows 21.6 minutes of errors. The 5-minute outage consumes 5 of those minutes (23%) within one hour, a burn rate of 5 / (21.6/720) ≈ 166.7, far exceeding 14.

Thus, the fast burn alert would trigger immediately. Options A and C use a threshold of 5, which is the slow burn value; they would also fire here, but on a short window such a low threshold pages on much milder consumption. Option B (6-hour window) averages the spike over a longer period, so the measured burn rate drops to about 27.8 and the alert responds more slowly.

Therefore, option D (burn rate 14, window 1 hour) is the correct configuration, as it is the standard fast burn alert that catches such rapid consumption.
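The arithmetic in this explanation can be checked with a short sketch. It assumes a full outage counts as "bad minutes" (a time-based approximation of the error budget); the function names are illustrative only.

```python
def error_budget_minutes(slo, window_days):
    """Total allowed 'bad' minutes in the SLO window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(bad_minutes, lookback_hours, slo, window_days):
    """Observed consumption divided by the evenly spread, budgeted consumption."""
    budget = error_budget_minutes(slo, window_days)
    budgeted_in_lookback = budget * lookback_hours / (window_days * 24)
    return bad_minutes / budgeted_in_lookback


budget = error_budget_minutes(0.9995, 30)
rate_1h = burn_rate(5, 1, 0.9995, 30)
rate_6h = burn_rate(5, 6, 0.9995, 30)
print(f"budget: {budget:.1f} min")
print(f"burn rate over 1 hour: {rate_1h:.1f}")
print(f"burn rate over 6 hours: {rate_6h:.1f}")
```

The same 5-minute outage yields a much lower burn rate over the 6-hour lookback, which is why a longer window reacts less sharply to a short spike.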

8
Multi-Selecthard

A team wants to implement error budget alerts in Cloud Monitoring. They need TWO policies to detect both rapid and gradual budget consumption. Which TWO alert policies should they configure? (Choose 2 answers)

Select 2 answers
A.Single alert with a burn rate of 10x over 1 hour.
B.Alert on error budget remaining < 10%.
C.Fast burn alert with 14x burn rate over 1 hour.
D.Slow burn alert with 5x burn rate over 6 hours.
E.Fast burn alert with 5x burn rate over 6 hours.
AnswersC, D

Catches rapid consumption.

Why this answer

The standard practice is to create a fast burn alert (e.g., 14x burn rate over 1 hour) for rapid consumption, and a slow burn alert (e.g., 5x burn rate over 6 hours) for gradual consumption. The other options are not standard.
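As a rough sketch of how the two policies complement each other, the snippet below evaluates both against observed error ratios. The thresholds are the ones used in this question set; the `BurnRatePolicy` type, the `fires` helper, and the error ratios are hypothetical and do not represent the Cloud Monitoring API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRatePolicy:
    name: str
    lookback_hours: int
    threshold: float


# Thresholds as quoted in this question set.
FAST_BURN = BurnRatePolicy("fast burn", 1, 14.0)
SLOW_BURN = BurnRatePolicy("slow burn", 6, 5.0)


def fires(policy, error_ratio_in_lookback, slo):
    """True when the observed error ratio exceeds threshold x the allowed ratio."""
    allowed_error_ratio = 1 - slo
    return error_ratio_in_lookback / allowed_error_ratio > policy.threshold


slo = 0.999
# Hypothetical observations: 1% errors in the last hour, 0.6% over six hours.
print(FAST_BURN.name, fires(FAST_BURN, 0.01, slo))
print(SLOW_BURN.name, fires(SLOW_BURN, 0.006, slo))
```

Here the fast burn policy stays quiet (10x is below 14x) while the slow burn policy fires (6x is above 5x), which is the gradual-consumption case a single fast alert would miss.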

9
MCQmedium

An engineer needs to set up alerting for error budget burn rate. For a fast burn alert, which burn rate multiplier and evaluation window are recommended?

A.6-hour window, 5x burn rate
B.1-hour window, 5x burn rate
C.1-hour window, 14x burn rate
D.6-hour window, 14x burn rate
AnswerC

Fast burn uses 1-hour window and 14x burn rate.

Why this answer

Fast burn alert uses a 1-hour window and 14x burn rate. Slow burn uses 6-hour window and 5x burn rate.

10
MCQmedium

A team wants to set up alerting on error budget burn rate for a service with an SLO of 99.9%. They want to detect when the error budget is being consumed at a rate that would exhaust it in less than 24 hours, using a 1-hour assessment window. What is the appropriate burn rate threshold for a fast burn alert?

A.5x burn rate
B.14x burn rate
C.100x burn rate
D.2x burn rate
AnswerB

Correct for fast burn alert (1-hour window).

Why this answer

A 30-day window contains 720 hours (43200 min), so the burn rate needed to exhaust the whole budget in a given time is 720 divided by that time in hours. At the budgeted rate, one hour consumes only 60/43200 = 0.00139 (0.139%) of the budget.

Exhausting the budget in 24 hours would therefore take a burn rate of 720 / 24 = 30, and exhausting it in 1 hour would take 1 / (60/43200) = 720. Neither value is offered.

The intended answer is the conventional fast burn threshold paired with a 1-hour window: 14x (14.4 in Google's SRE Workbook, which corresponds to 2% of a 30-day budget consumed in one hour). At 14x the budget lasts 30 / 14 ≈ 2.1 days, and the alert also fires for any faster rate, including one that would exhaust the budget within 24 hours. The slow burn counterpart is a 6-hour window with a 5x burn rate. So the correct answer is 14x.
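The relationship between burn rate and time to exhaustion can be worked through in a few lines. This is a sketch with illustrative function names, assuming consumption stays constant at the given burn rate.

```python
def burn_rate_to_exhaust(window_days, hours_to_exhaust):
    """Burn rate at which the whole budget is gone after the given number of hours."""
    return window_days * 24 / hours_to_exhaust


def days_to_exhaust(window_days, burn_rate):
    """Days until the budget is gone at a constant burn rate."""
    return window_days / burn_rate


print(burn_rate_to_exhaust(30, 24))         # burn rate for exhaustion in 24 hours
print(burn_rate_to_exhaust(30, 1))          # burn rate for exhaustion in 1 hour
print(round(days_to_exhaust(30, 14), 2))    # lifetime of the budget at 14x
```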

11
MCQmedium

A DevOps team wants to reduce toil by automating manual, repetitive tasks that have no enduring value and scale with service growth. Which two characteristics define toil according to SRE principles?

A.It is strategic and non-repetitive
B.It is automated and provides enduring value
C.It is manual and has enduring value
D.It is repetitive and scales with service growth
AnswerD

Repetitiveness and scaling with growth are key characteristics of toil.

Why this answer

According to Google's SRE principles, toil is manual, repetitive, automatable, tactical (no enduring value), and scales with service growth. Option D is the only choice that pairs two of these traits; each of the other options includes a property that toil does not have.

12
Multi-Selectmedium

Which two practices are characteristic of a blameless postmortem? (Choose TWO.)

Select 2 answers
A.Creating action items with owners and deadlines
B.Assigning punitive measures to prevent recurrence
C.Identifying the employee who made the error
D.Focusing on contributing factors in the system and process
E.Keeping the postmortem confidential within the incident response team
AnswersA, D

Action items drive improvement.

Why this answer

Blameless postmortems focus on systemic causes and create action items to prevent recurrence. They avoid blaming individuals.

13
MCQmedium

A team wants to reduce toil by automating a manual process that generates a report from Cloud Logging logs and emails it weekly. Which solution is most cost-effective and requires minimal operational overhead?

A.Cloud Scheduler + Cloud Functions
B.Cloud Run job manually started
C.Cloud Build trigger on schedule
D.Cloud Composer (Airflow) DAG
AnswerA

Cloud Scheduler triggers a Cloud Function that queries logs and sends email. Minimal overhead.

Why this answer

Cloud Scheduler invoking a Cloud Function on a weekly schedule is the most straightforward serverless option: nothing runs between invocations and there is no infrastructure to manage. Cloud Composer is heavyweight for a single simple report. A manually started Cloud Run job still depends on a person to start it, so the toil remains. Cloud Build is meant for CI/CD rather than scheduled reporting.

14
MCQmedium

A team has an SLO of 99.9% availability over a 30-day month. They burn through their entire error budget in the first 10 days. Which of the following is the MOST appropriate immediate action according to SRE principles?

A.Reduce the error budget by lowering the SLO to 99%
B.Freeze all feature releases and focus on reliability improvements
C.Deploy a new feature to increase user engagement
D.Increase the SLO to 99.99% to act as a buffer
AnswerB

SRE practice dictates that when error budget is exhausted, releases are halted to restore reliability.

Why this answer

When error budget is exhausted, the team should halt all feature releases and focus on reliability improvements to prevent further degradation.

15
MCQmedium

An SRE team wants to automate a repetitive manual task that involves moving files from Cloud Storage to BigQuery and then deleting the source files. Which GCP service is BEST suited for this toil reduction?

A.Cloud Build
B.Cloud Scheduler
C.Compute Engine
D.Cloud Functions
AnswerD

Cloud Functions can be triggered by Cloud Storage events (e.g., object finalize) to run code that moves data to BigQuery and deletes the source.

Why this answer

Cloud Functions is ideal for event-driven automation. It can trigger on Cloud Storage events, process data, and delete files.

16
MCQmedium

A service expects to receive 10,000 requests per second. The team needs to monitor request latency with an SLI that measures the proportion of requests that complete in under 100 ms. The latency distribution is right-skewed. Which approach should be used to define the SLI in Cloud Monitoring?

A.Use the 99th percentile latency as the SLI
B.Use the median latency as the SLI
C.Use a window-based SLI that counts good minutes
D.Use a histogram metric and create a request-based SLI with good request count filtered by latency < 100 ms
AnswerD

This is the correct way: using a histogram or a pre-computed good request count.

Why this answer

A latency SLI is normally request-based: good requests (those under the threshold) divided by total requests. In Cloud Monitoring this can be built from a distribution (histogram) latency metric by counting the requests that fall in the range below 100 ms. Alternatively, the application can be instrumented to emit its own counter of good requests (latency < 100 ms) alongside a total request counter.

Percentiles and medians report a latency value rather than a proportion, and a window-based SLI counts good minutes rather than good requests.
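To make the histogram approach concrete, here is a minimal sketch that derives the good-request ratio from bucket counts. The bucket boundaries and counts are invented, and `good_ratio_from_histogram` is a hypothetical helper rather than a Cloud Monitoring call.

```python
def good_ratio_from_histogram(bucket_upper_bounds_ms, bucket_counts, threshold_ms):
    """Ratio of requests in buckets whose upper bound is at or below the threshold."""
    total = sum(bucket_counts)
    good = sum(
        count
        for upper, count in zip(bucket_upper_bounds_ms, bucket_counts)
        if upper <= threshold_ms
    )
    return good / total


# Hypothetical one-second snapshot of 10,000 requests, right-skewed.
bounds = [25, 50, 100, 250, 1000, float("inf")]
counts = [4000, 3500, 2000, 400, 90, 10]

ratio = good_ratio_from_histogram(bounds, counts, 100)
print(f"requests under 100 ms: {ratio:.1%}")
```

The result is exact only when the threshold coincides with a bucket boundary, so bucket layout is worth choosing with the SLI threshold in mind.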

17
MCQmedium

An SRE team notices that a routine database cleanup task takes 30 minutes of manual effort each week. The task does not add enduring value and scales linearly with the number of databases. How should the team classify this work?

A.Technical debt.
B.Operational overhead.
C.Toil.
D.Engineering project.
AnswerC

Matches all characteristics of toil: manual, repetitive, no enduring value, scales with growth.

Why this answer

This work is manual, repetitive, does not add enduring value, and scales with service growth — the classic definition of toil. SREs should automate it and track it against the toil budget.

18
Multi-Selectmedium

Which TWO are valid methods to define an SLI for a request-driven service? (Choose 2.)

Select 2 answers
A.Proportion of requests completed within a latency threshold
B.CPU utilization of the server
C.Number of code commits per week
D.Number of running instances
E.Proportion of requests returning a successful HTTP status
AnswersA, E

This is latency SLI.

Why this answer

Common SLIs: proportion of successful requests (availability) and proportion of requests meeting a latency threshold.

19
MCQeasy

A site reliability engineer needs to alert when the error budget burn rate exceeds 14x the budget over a 1-hour window. Which type of alert should be configured in Cloud Monitoring?

A.Error budget burn rate alert with a 30-day window and 1x burn rate
B.Slow burn rate alert with a 6-hour window and 5x burn rate
C.SLI threshold alert when latency exceeds 500 ms
D.Fast burn rate alert with a 1-hour window and 14x burn rate
AnswerD

This matches the requirement for fast burn alerting.

Why this answer

Fast burn rate alerts use a short window (1 hour) and a high burn rate threshold (14x). Slow burn rate alerts use a longer window (6 hours) and a lower threshold (5x).

20
MCQmedium

During a blameless postmortem after an incident, the team identified that the root cause was a misconfigured load balancer health check. Which practice should the team prioritize to prevent recurrence?

A.Create a rollback plan for health check changes
B.Schedule a weekly manual review of load balancer settings
C.Add a test in CI/CD to validate health check configuration
D.Write a runbook for future load balancer incidents
AnswerC

Automated validation prevents misconfiguration from reaching production.

Why this answer

A blameless postmortem focuses on finding contributing factors and creating action items that improve the system. Here the action item is to fix the health check configuration and validate it automatically before deployment. A runbook and a rollback plan are both reactive: they help only after a bad configuration has reached production. A weekly manual review adds toil and can still miss a change made between reviews.

Exam trap

A common misconception is that writing a runbook is sufficient. A runbook only helps after the failure has occurred, whereas automated validation in the CI/CD pipeline stops a misconfigured health check before it reaches production.
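A CI check of this kind might look like the sketch below. The field names are modelled on the Compute Engine health check resource but should be treated as assumptions, and the rules themselves are illustrative rather than a complete policy.

```python
def validate_health_check(config):
    """Return a list of problems found in a health check definition."""
    problems = []
    interval = config.get("checkIntervalSec", 0)
    timeout = config.get("timeoutSec", 0)
    if interval <= 0:
        problems.append("checkIntervalSec must be positive")
    if timeout <= 0 or timeout > interval:
        problems.append("timeoutSec must be positive and no greater than checkIntervalSec")

    http = config.get("httpHealthCheck", {})
    if not str(http.get("requestPath", "")).startswith("/"):
        problems.append("requestPath must start with '/'")
    port = http.get("port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("port must be an integer between 1 and 65535")
    return problems


# Hypothetical definitions, as they might be loaded from a repository file.
good = {"checkIntervalSec": 10, "timeoutSec": 5,
        "httpHealthCheck": {"port": 8080, "requestPath": "/healthz"}}
bad = {"checkIntervalSec": 5, "timeoutSec": 30,
       "httpHealthCheck": {"port": 0, "requestPath": "healthz"}}

print(validate_health_check(good))
print(validate_health_check(bad))
```

A pipeline step could fail the build whenever the returned list is non-empty, so the misconfiguration is rejected before deployment.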

21
MCQmedium

A team wants to improve incident response by creating a runbook for a common failure scenario: a database replication lag exceeds 5 seconds. Which Cloud Monitoring feature should be used to automatically trigger the runbook?

A.Write a cron job on a VM that checks Cloud Monitoring metrics via API and runs the runbook
B.Configure a Cloud Monitoring alert to send a notification to a Cloud Function that executes the runbook steps
C.Create a Cloud Monitoring dashboard that displays replication lag and expect the on-call engineer to follow the runbook manually
D.Use Cloud Scheduler to run the runbook every 5 minutes
AnswerB

This automates the runbook execution in response to the alert.

Why this answer

Cloud Monitoring alert policies send notifications to channels such as webhooks, Pub/Sub, or incident management tools like PagerDuty or OpsGenie, which can then invoke runbooks automatically. A webhook or Pub/Sub channel can also invoke a Cloud Function or Cloud Run service that executes the runbook steps.

The key is that the alert triggers an automated response.

22
MCQeasy

An SRE team wants to define an SLO for a microservice that processes HTTP requests. They need an SLI that measures the proportion of requests that are answered within 200ms with a non-5xx status code. Which type of SLI should they use?

A.Request-based SLI counting good requests divided by total requests
B.Availability-based SLI using a rolling window of 30 days
C.Window-based SLI counting good minutes
D.Throughput-based SLI measuring requests per second
AnswerA

A request-based SLI where a good request is one with latency <=200ms and a non-5xx status fits this scenario.

Why this answer

A latency-based SLI with a threshold (200ms) combined with availability (non-5xx) is a request-based SLI measuring good requests (fast+successful) over total requests.
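A combined "fast and successful" SLI can be sketched as follows. The `Request` type and the sample data are hypothetical; the threshold follows the question (200 ms, non-5xx).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int
    latency_ms: float


def is_good(request, threshold_ms=200):
    """A request is good when it is both non-5xx and within the latency threshold."""
    return request.status < 500 and request.latency_ms <= threshold_ms


def request_based_sli(requests):
    """Good requests divided by total requests."""
    good = sum(1 for request in requests if is_good(request))
    return good / len(requests)


# Hypothetical sample: fast success, slow success, fast 5xx, fast 4xx.
sample = [Request(200, 120), Request(200, 350), Request(503, 40), Request(404, 90)]
print(request_based_sli(sample))
```

Note that the 404 counts as good under this definition, since only 5xx responses are excluded.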

23
MCQhard

An SRE team uses Cloud Monitoring SLOs with an error budget policy. They want to receive an alert when the error budget is being consumed at a rate that would exhaust it in 1 hour (fast burn). The SLO is 99.9% over 30 days. The error budget is 43.2 minutes. What burn rate threshold should be used?

A.5
B.14
C.720
D.1
AnswerB

Standard fast burn alert uses a 14x burn rate over a 1-hour window.

Why this answer

In Google SRE practice, a fast burn alert uses a 1-hour window and a burn rate threshold of 14. This means the alert fires when the error rate exceeds 14 times the allowable rate, indicating that the budget is being consumed rapidly. At a burn rate of 14, the budget would be exhausted in 30 days / 14 ≈ 2.14 days.

The question's reference to 'exhaust it in 1 hour' refers to the alert's lookback window (1 hour), not the actual time to budget exhaustion. A burn rate of 720 would exhaust the budget in 1 hour, but that is not a standard alert threshold. Therefore, the correct answer is 14.

24
MCQmedium

A team wants to automate the response to a common incident: restarting a service when it becomes unhealthy. Which GCP service is best suited to trigger a Cloud Function based on a Cloud Monitoring alert?

A.Cloud Endpoints
B.Cloud Scheduler
C.Cloud Tasks
D.Cloud Monitoring alert with notification channel to Pub/Sub
AnswerD

Alert policies can send notifications to Pub/Sub, which can trigger a Cloud Function.

Why this answer

Cloud Monitoring alert notification channels can trigger a Cloud Function via Pub/Sub or webhook. This is the standard pattern for automated incident response.
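The receiving side of this pattern could be sketched as below. It assumes the Pub/Sub message arrives as a dictionary with base64-encoded `data`; the payload fields (`incident`, `state`, `resource_name`) and the `restart_service` callback are assumptions for illustration, so check them against the actual notification format before relying on them.

```python
import base64
import json


def handle_alert(event, restart_service):
    """Decode a Pub/Sub message carrying an alert notification and react to it.

    `restart_service` is a hypothetical callback that performs the remediation.
    Returns True when a restart was requested.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})
    # Only act when the incident opens, not when it closes.
    if incident.get("state") != "open":
        return False
    restart_service(incident.get("resource_name", "unknown"))
    return True


# Local simulation of an incoming message (no Google Cloud libraries needed).
restarted = []
message = {"incident": {"state": "open", "resource_name": "web-frontend"}}
event = {"data": base64.b64encode(json.dumps(message).encode("utf-8"))}
print(handle_alert(event, restarted.append), restarted)
```

Passing the remediation in as a callback keeps the decoding logic testable without touching real infrastructure.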

25
MCQeasy

A site reliability engineer defines a service's availability SLI as the percentage of successful requests. Which of the following is the correct formula for this SLI?

A.good-request-count / total-requests (including invalid) * 100
B.error-request-count / valid-request-count * 100
C.valid-request-count / error-request-count * 100
D.good-request-count / valid-request-count * 100
AnswerD

This is the standard formula for request-based availability SLI.

Why this answer

Availability SLI is typically defined as the count of successful requests divided by total valid requests, measured over a rolling window.
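The effect of the denominator can be shown with a small sketch. What counts as an "invalid" request is an assumption here (for example, malformed requests the service is not expected to serve), and the counts are made up.

```python
def availability_sli(good_count, total_count, invalid_count=0):
    """Percentage of valid requests that succeeded."""
    valid_count = total_count - invalid_count
    if valid_count <= 0:
        raise ValueError("no valid requests in the window")
    return good_count / valid_count * 100


# Hypothetical window: 10,000 requests, 20 of them invalid, 9,970 good.
print(round(availability_sli(9970, 10000, invalid_count=20), 2))  # option D
print(round(availability_sli(9970, 10000), 2))                    # option A
```

Including invalid requests in the denominator, as option A does, understates availability for traffic the service was never meant to serve.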

26
MCQhard

A service has an SLO of 99.9% availability over a 30-day rolling window. The team wants to use Cloud Monitoring to create a request-based SLO. Which configuration is correct?

A.Use an error budget metric: total allowed errors over 30 days
B.Use a request-based metric: good request count / valid request count
C.Use a latency metric: proportion of requests under threshold
D.Use a window-based metric: good minutes / total minutes
AnswerB

Request-based SLOs directly measure the ratio of good requests to total valid requests.

Why this answer

For a request-based SLO, you define two metrics: good request count and valid request count. The SLO is the ratio of good to valid requests.

27
Multi-Selectmedium

A team wants to reduce toil by automating a recurring cloud resource update. Which THREE Google Cloud services can be used together to build an automated pipeline? (Choose 3 answers)

Select 3 answers
A.Cloud Pub/Sub
B.Cloud Deployment Manager
C.Cloud Scheduler
D.Cloud Functions
E.Cloud Build
AnswersC, D, E

Can trigger periodic jobs.

Why this answer

Cloud Build can run automation scripts, Cloud Functions can be triggered by events, and Cloud Scheduler can trigger periodic jobs. Together they can form a pipeline. Pub/Sub and Deployment Manager are also possible but the question asks for THREE from the given set that are commonly used together.

28
MCQeasy

A team's SLO for availability is 99.9% over a 30-day window. They have consumed 80% of their error budget halfway through the month. What is the remaining allowed downtime for the rest of the month?

A.About 17 minutes 17 seconds
B.About 34 minutes 34 seconds
C.About 43 minutes 12 seconds
D.About 8 minutes 38 seconds
AnswerD

20% of 43.2 minutes = 8.64 minutes = 8 minutes 38 seconds.

Why this answer

Total error budget = 100% - 99.9% = 0.1% of 30 days = 43.2 minutes (0.1% * 30 * 1440). 80% consumed means 20% remains: 0.2 * 43.2 = 8.64 minutes, approximately 8 minutes 38 seconds.
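The same calculation as a short sketch, with an illustrative function name and the assumption that availability is measured in time (downtime) rather than requests:

```python
def remaining_downtime(slo, window_days, consumed_fraction):
    """Remaining allowed downtime as (minutes, seconds), rounded to the nearest second."""
    total_seconds = (1 - slo) * window_days * 24 * 60 * 60
    remaining_seconds = total_seconds * (1 - consumed_fraction)
    return divmod(round(remaining_seconds), 60)


minutes, seconds = remaining_downtime(0.999, 30, 0.80)
print(f"{minutes} min {seconds} s remaining")
```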

29
Multi-Selecteasy

Which TWO statements correctly describe characteristics of toil in SRE? (Choose 2 answers)

Select 2 answers
A.Toil is always automatable.
B.Toil is repetitive and manually performed.
C.Toil decreases as the service scales.
D.Toil adds enduring value to the service.
E.Toil provides no enduring value.
AnswersB, E

Manual and repetitive are key traits of toil.

Why this answer

Toil in SRE is defined as manual, repetitive work that does not produce enduring value. As a service scales, toil typically increases rather than decreases. Option B correctly identifies toil as repetitive and manually performed.

Option E correctly states that toil provides no enduring value. Option A is incorrect because toil is not always automatable; some toil may be difficult or impractical to automate. Option C is incorrect because toil scales with service growth, not decreases.

Option D is incorrect because toil does not add enduring value; that is a characteristic of engineering work.

30
MCQmedium

An SRE team uses PagerDuty for on-call rotation. They receive a critical alert at 2 AM. According to incident management best practices, what should the on-call engineer do first?

A.Ignore it until morning because it's off-hours
B.Start a postmortem immediately
C.Immediately escalate to the incident commander
D.Acknowledge the alert and begin triage according to the runbook
AnswerD

Acknowledging and following runbook procedures is the correct initial response.

Why this answer

The first step is to acknowledge the alert to signal that someone is responding, then assess severity and begin response per the runbook.

31
Multi-Selecthard

A team is conducting a blameless postmortem after a production incident. Which three actions are part of an effective blameless postmortem process? (Choose 3)

Select 3 answers
A.Define action items with specific owners and due dates
B.Identify contributing factors using the 5 Whys technique
C.Assign blame to the individual who made the error
D.Punish the on-call engineer for missing the alert
E.Focus on the system and process failures, not individuals
AnswersA, B, E

Action items ensure follow-through on improvements.

Why this answer

Blameless postmortems focus on identifying contributing factors, using techniques like 5 Whys, and defining action items with owners and due dates. Assigning blame and punishing individuals are not part of the process.

32
MCQmedium

A team wants to implement chaos engineering on Google Kubernetes Engine (GKE) to test resilience against pod failures. Which tool is designed for injecting faults into GKE clusters?

A.GKE Node Auto-Repair
B.Traffic Director fault injection
C.Chaos Mesh on GKE
D.Cloud Armor
AnswerC

Chaos Mesh is specifically designed for chaos engineering on Kubernetes.

Why this answer

Chaos Mesh is an open-source chaos engineering platform for Kubernetes, including GKE. It can inject pod failures, network delays, etc. Traffic Director fault injection is for service mesh, not GKE-native.

33
Multi-Selecthard

A company wants to set up error budget alerting in Cloud Monitoring for a service with a 99.9% SLO over 30 days. They want to receive alerts when the error budget burn rate reaches certain thresholds. Which TWO of the following are typical recommendations for alerting thresholds?

Select 2 answers
A.100x burn rate over 1 minute
B.14x burn rate over 1 hour
C.5x burn rate over 6 hours
D.2x burn rate over 1 day
E.1x burn rate over 30 days
AnswersB, C

Fast burn alert for rapid consumption.

Why this answer

Common SRE practice uses a fast burn alert (e.g., 14x burn rate over 1 hour) and a slow burn alert (e.g., 5x burn rate over 6 hours) to cover both rapid and gradual budget consumption.

34
MCQeasy

What is the primary purpose of a blameless postmortem in incident management?

A.Identify the individual who caused the incident
B.Update the SLA to exclude the incident period
C.Document the timeline and technical root cause
D.Assign monetary penalties to the responsible team
AnswerC

This is the core purpose: learn from the incident.

Why this answer

Blameless postmortems focus on understanding contributing factors and improving systems, not assigning fault.

35
Multi-Selecteasy

Which THREE are components of an effective incident management process? (Choose 3.)

Select 3 answers
A.Automatic rollback of all changes during incidents
B.Blameless postmortem with action items
C.Monthly performance bonuses for on-call
D.On-call rotation with escalation paths
E.Incident commander role
AnswersB, D, E

Learning and improvement.

Why this answer

Key components: on-call rotation, incident command system, and postmortems with action items.

36
MCQeasy

An SRE team is setting up an on-call rotation for incident response. They want to use a tool that integrates with Cloud Monitoring and can escalate incidents if not acknowledged. Which service should they integrate with Cloud Monitoring?

A.Cloud Pub/Sub
B.PagerDuty
C.Cloud Functions
D.Cloud Run
AnswerB

PagerDuty integrates with Cloud Monitoring for alerting and on-call management.

Why this answer

Cloud Monitoring can send notifications to PagerDuty, OpsGenie, and other incident management tools. PagerDuty is a common choice for on-call rotations and escalation policies.

37
Multi-Selectmedium

A site reliability team wants to reduce toil in their incident management process. They currently receive alerts via email and manually create tickets, page the on-call engineer, and update a shared spreadsheet. Which TWO Google Cloud services can help automate these tasks and reduce toil?

Select 2 answers
A.Cloud Build
B.Cloud Monitoring
C.Cloud Functions
D.Cloud Scheduler
E.Cloud Run
AnswersB, C

Can send alert notifications to Pub/Sub, triggering automation.

Why this answer

Cloud Monitoring can send alerts to notification channels that trigger Cloud Functions via webhook or Pub/Sub. Cloud Functions can then automate ticket creation and page on-call via PagerDuty API. Cloud Build is for CI/CD, not incident management.

Cloud Scheduler is for cron jobs, not event-driven. Cloud Run is for stateless containers, not ideal for event-driven automation of incidents.

38
MCQeasy

Which of the following is NOT a characteristic of toil as defined by SRE?

A.No enduring value
B.Requires complex problem-solving
C.Repetitive
D.Manual
AnswerB

Toil does not require complex problem-solving; it's mundane.

Why this answer

Toil is manual, repetitive, automatable, and has no enduring value. Complex design work is the opposite.

39
MCQmedium

During an incident, the incident commander decides to escalate to a higher severity level. Which of the following best describes the incident commander's primary responsibility?

A.Managing the incident response process and communications
B.Debugging the root cause of the incident
C.Writing the postmortem document
D.Implementing the fix
AnswerA

This is the key role of incident commander.

Why this answer

The incident commander is responsible for coordinating response, communication, and prioritization, not necessarily fixing the issue.

40
MCQhard

An SRE team implements error budget alerting using Cloud Monitoring. They want to set a 'fast burn' alert that triggers when the error budget burn rate exceeds 14x over a 1-hour window. What is the purpose of this alert?

A.To detect gradual, long-term trends in error budget consumption
B.To alert when the error budget is completely exhausted
C.To alert immediately when error budget is being consumed at a rate that would exhaust the budget in approximately 51 hours (14x faster than allowed)
D.To trigger a page for on-call engineers when the service is likely to exceed SLO within the next 6 hours
AnswerC

Correct: 14x burn rate means the budget would be consumed in (30 days / 14) ≈ 51 hours. Fast burn alerts trigger within 1 hour.

Why this answer

Fast burn alerts (e.g., 14x burn rate over 1h) are designed to detect severe, rapid consumption of error budget, prompting immediate investigation. They complement slow burn alerts (5x over 6h) which catch gradual budget erosion.

41
MCQhard

An SRE team uses Cloud Monitoring to create an SLO for a service with a 99.9% availability target over 28 days. They set up a fast burn-rate alert on the error budget with a lookback window of 1 hour and a burn rate factor of 14. At what error budget consumption percentage will the alert fire?

A.When >10% of error budget is consumed in the last 1 hour
B.When >50% of error budget is consumed in the last 1 hour
C.When >2% of error budget is consumed in the last 1 hour
D.When >0.1% of error budget is consumed in the last 1 hour
AnswerC

With a 14x burn rate and 1-hour window, the alert triggers when consumption exceeds 2.08% (rounded to 2%).

Why this answer

A burn-rate alert fires when the budget consumed over the lookback window exceeds the burn rate threshold times the budget allotted to that window. A 28-day window has 672 hours, so one hour is allotted 1/672 of the total budget. At 14x, the alert fires when more than 14/672 ≈ 2.08% of the budget is consumed in 1 hour.
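A one-line formula covers this conversion from burn rate to budget percentage. The function name is illustrative; the 30-day case is included only for comparison.

```python
def budget_fraction_at_threshold(burn_rate, lookback_hours, window_days):
    """Fraction of the total error budget consumed in the lookback at the given burn rate."""
    return burn_rate * lookback_hours / (window_days * 24)


print(f"28-day window: {budget_fraction_at_threshold(14, 1, 28):.2%}")
print(f"30-day window: {budget_fraction_at_threshold(14, 1, 30):.2%}")
```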

42
MCQhard

A team uses Cloud Monitoring SLO monitoring with a request-based SLI. The SLO is defined as the proportion of requests returning HTTP 200 with latency under 500ms over a 30-day window. They notice that the SLO is being violated due to a slow increase in latency from a specific backend. Which alerting strategy will best detect this gradual degradation early?

A.Alert on any single request exceeding 500ms
B.Fast burn alert with burn rate 14 and 1-hour window
C.Slow burn alert with burn rate 5 and 6-hour window
D.Alert when average latency exceeds 500ms for 1 minute
AnswerC

Slow burn alerts are designed to detect gradual budget consumption.

Why this answer

Slow burn alerts (burn rate threshold ~5, lookback window ~6 hours) detect gradual budget consumption that would exhaust the budget over several days. Fast burn alerts (14x, 1h) catch rapid budget consumption. The scenario describes a slow increase, so a slow burn alert is appropriate.

43
MCQhard

A team uses Cloud Monitoring to create an SLO for a request-based service. They want to alert when the error budget burn rate exceeds 14x the budget for a short window. Which alert type and window should they configure?

A.Slow burn alert with a 6-hour window.
B.Fast burn alert with a 6-hour window.
C.Fast burn alert with a 1-hour window.
D.Slow burn alert with a 1-hour window.
AnswerC

This matches the standard recommendation for fast burn alerts with 14x burn rate over 1 hour.

Why this answer

A fast burn alert fires when the burn rate is very high over a short window (1 hour). 14x burn rate means the entire budget would be consumed in ~1/14 of the SLO window. The correct configuration is a fast burn alert with a 1-hour window.

44
MCQhard

An SRE team wants to perform fault injection testing on a GKE cluster by injecting network latency into a specific set of pods. Which tool should they use?

A.Cloud Functions with network disruption script.
B.Traffic Director fault injection.
C.Google Cloud Armor.
D.Chaos Mesh on GKE.
AnswerD

Chaos Mesh is designed for injecting various faults (latency, failures) into Kubernetes workloads.

Why this answer

Chaos Mesh is an open-source chaos engineering platform specifically designed for Kubernetes. It can inject faults like network latency into targeted pods.
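
As a rough illustration, a Chaos Mesh network-delay experiment is described by a NetworkChaos resource. The sketch below builds one as a Python dictionary and prints it as JSON. The experiment name, namespace, label, latency, and duration are all hypothetical values, and the field layout should be checked against the schema of the installed Chaos Mesh version.

```python
import json


def network_delay_manifest(name, namespace, pod_labels, latency, duration):
    """Build a NetworkChaos manifest that delays traffic for the selected pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",          # inject latency rather than loss or partition
            "mode": "all",              # apply to every pod matched by the selector
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": pod_labels,
            },
            "delay": {"latency": latency},
            "duration": duration,       # experiment stops after this time
        },
    }


# Hypothetical target: pods labelled app=checkout in namespace "shop".
manifest = network_delay_manifest(
    name="checkout-latency",
    namespace="shop",
    pod_labels={"app": "checkout"},
    latency="200ms",
    duration="5m",
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed document could be piped to kubectl apply -f - on a cluster where Chaos Mesh is installed.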

45
MCQhard

A service uses Cloud SQL for MySQL. To test resilience, you want to inject latency into database queries. Which chaos engineering approach is most suitable on Google Cloud?

A.Use gcloud sql instances patch to add artificial delay
B.Use Traffic Director fault injection filter on the Envoy proxy sidecar
C.Use Cloud SQL's built-in maintenance to simulate latency
D.Deploy Cloud Functions to throttle the database connection
AnswerB

Traffic Director with Envoy can inject latency into outbound requests, including to Cloud SQL via a sidecar.

Why this answer

Neither gcloud sql instances patch nor Cloud SQL maintenance offers a way to add artificial query latency, so the delay has to be injected on the path between the application and the database. Of the options listed, only fault injection at a proxy sidecar does this. A Chaos Mesh network fault is an alternative for workloads running on GKE, but it is not among the choices.

46
Multi-Selectmedium

A team wants to implement chaos engineering on GKE to test resilience. Which THREE fault types can be injected using Chaos Mesh? (Choose 3 answers)

Select 3 answers
A.Pod failure (kill pods).
B.Code injection into running containers.
C.Whole cluster deletion.
D.CPU stress.
E.Network latency.
AnswersA, D, E

Common fault type in Chaos Mesh.

Why this answer

Chaos Mesh supports many fault types including pod failure (kill), network latency, and CPU stress. A whole cluster deletion is not a typical Chaos Mesh experiment (too destructive), and code injection is not a built-in fault type.

47
MCQhard

A site reliability engineer wants to implement chaos engineering on a Google Kubernetes Engine (GKE) cluster by injecting network latency into pods of a specific deployment. Which tool or service should they use?

A.GKE Sandbox
B.Cloud CDN
C.Traffic Director with HTTP fault filter
D.Chaos Mesh
AnswerD

Chaos Mesh provides fault injection capabilities for Kubernetes, including network latency.

Why this answer

Chaos Mesh is a Kubernetes-native chaos engineering platform that supports injecting faults like network latency, pod failures, and more on GKE.

48
MCQmedium

A team wants to create an SLO for a batch data pipeline that processes files hourly. They want to measure whether each batch completes successfully within the hour. Which type of SLI should they use?

A.Window-based SLI measuring good minutes over total minutes
B.Latency-based SLI measuring p99 of file processing time
C.Request-based SLI measuring good requests over total requests
D.Availability SLI based on successful API calls
AnswerA

Window-based SLIs are designed for batch or streaming pipelines where success is measured over time windows.

Why this answer

For batch pipelines, a window-based SLI is appropriate: it counts good minutes (or windows) in which the pipeline succeeded within the time limit, divided by total minutes (or windows). Request-based SLIs are for individual request/response pairs.
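
A minimal sketch of the calculation, assuming each hourly batch is treated as one window and a window is good when the batch succeeded within 60 minutes. The BatchRun type and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    succeeded: bool
    duration_minutes: float


def window_sli(runs, deadline_minutes=60):
    """Good windows divided by total windows."""
    if not runs:
        raise ValueError("at least one window is required")
    good = sum(
        1 for run in runs
        if run.succeeded and run.duration_minutes <= deadline_minutes
    )
    return good / len(runs)


# One day of hourly batches: 22 on time, 1 failed, 1 finished late.
runs = [BatchRun(True, 35.0)] * 22 + [BatchRun(False, 20.0), BatchRun(True, 75.0)]
sli = window_sli(runs)
print(f"{sli:.1%}")  # prints 91.7%
```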

49
Multi-Selectmedium

An SRE team is designing an incident management process. Which TWO components are part of a typical incident command structure?

Select 2 answers
A.Incident Commander
B.Developer
C.On-call engineer
D.Communications Lead
E.Product Manager
AnswersA, D

The Incident Commander directs all response activities.

Why this answer

The incident command structure includes an Incident Commander (overall lead) and a Communications Lead (liaison for stakeholders). The SRE lead may be Incident Commander. Developer is a generic role; on-call engineer is a role but not part of the command structure per se.

50
MCQhard

A team uses Cloud Monitoring SLO monitoring with a request-based SLI for availability. They define good requests as those returning HTTP 200. Which configuration correctly creates this SLO?

A.good-request-count = requests with latency < 500ms, valid-request-count = total requests
B.good-request-count = total requests, valid-request-count = successful requests
C.good-request-count = 200 responses, valid-request-count = 400+500 responses
D.good-request-count = successful requests, valid-request-count = total requests
AnswerD

Correct definition.

Why this answer

For a request-based availability SLO, good-request-count is the number of successful requests (here, HTTP 200 responses) and valid-request-count is the total number of requests. Cloud Monitoring computes the SLI as the ratio of the two.

51
Multi-Selecthard

Which THREE are key principles of a blameless postmortem culture in SRE? (Choose 3)

Select 3 answers
A.Creating action items with owners and due dates
B.Ensuring incidents are never discussed outside the team
C.Identifying and documenting contributing factors
D.Assigning responsibility to individuals for errors
E.Focusing on systemic improvements rather than individual mistakes
AnswersA, C, E

Ensures follow-up.

Why this answer

Blameless postmortems focus on systemic improvements, not individual blame. They involve documenting contributing factors, creating action items with owners and dates, and fostering a culture of learning. Assigning blame is avoided.

52
MCQmedium

A team notices that a critical microservice often fails when the downstream database is slow. They want to test the service's resilience by injecting latency into the database dependency. Which GCP tool should they use?

A.Cloud Load Balancing logging
B.VPC Flow Logs
C.Traffic Director with HTTP fault filter
D.Cloud Armor
AnswerC

Traffic Director can inject latency or errors into traffic for testing resilience.

Why this answer

Chaos Engineering practices on GCP can be implemented using Traffic Director's fault injection or Chaos Mesh. Traffic Director's HTTP fault filter can inject latency into traffic.

53
Multi-Selectmedium

A site reliability engineer is leading a blameless postmortem for an incident. Which THREE practices should be included? (Choose 3.)

Select 3 answers
A.List contributing factors
B.Determine the incident commander for future incidents
C.Assign action items with owners and due dates
D.Use the 5 Whys technique to find root cause
E.Identify the person who caused the incident
AnswersA, C, D

Identifying contributing factors helps prevent recurrence.

Why this answer

Blameless postmortems focus on understanding contributing factors and creating action items. The five whys technique helps find root cause. Action items should have owners and due dates.

Assigning blame is explicitly avoided. Choosing an incident commander belongs to incident management, not to the postmortem. A timeline of events is also a standard part of a postmortem, although it is not among the options.

54
MCQeasy

A company has an SLA of 99.95% availability for its service. The SRE team defines an SLO of 99.99% availability. The error budget is calculated as 0.01% over a 30-day window. How much downtime is allowed per month according to the error budget?

A.432 minutes
B.0.432 minutes
C.43.2 minutes
D.4.32 minutes
AnswerD

Correct calculation: 0.01% of 43200 minutes = 4.32 minutes.

Why this answer

Error budget = 100% - SLO. For 99.99% SLO, error budget = 0.01%. Over 30 days (720 hours), 0.01% of 720 hours = 0.072 hours = 4.32 minutes.

Alternatively, compute: 30 days * 24 * 60 minutes = 43200 minutes; 0.01% of that = 4.32 minutes.
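
The same calculation as a small sketch. The helper name is illustrative, and results are formatted to two decimals because percentages such as 99.99 are not exact in floating point.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Error budget expressed as minutes of full downtime over the period."""
    total_minutes = period_days * 24 * 60  # 30 days -> 43,200 minutes
    return total_minutes * (100 - slo_percent) / 100


print(f"{allowed_downtime_minutes(99.99):.2f}")  # prints 4.32
print(f"{allowed_downtime_minutes(99.9):.2f}")   # prints 43.20
```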

55
MCQmedium

During a blameless postmortem, the team uses the '5 Whys' technique to identify root causes. What is the primary purpose of this technique?

A.To estimate the cost of the incident
B.To iteratively ask 'why' to uncover the root cause
C.To identify the immediate superficial cause of the incident
D.To determine which team member made an error
AnswerB

The technique involves asking 'why' repeatedly to drill down to root cause.

Why this answer

5 Whys is a root cause analysis technique. By asking 'why' repeatedly, the team moves past the immediate, superficial cause to the underlying causes of an incident. It is not used to assign blame.

56
MCQmedium

A team is automating toil reduction and needs to identify tasks that qualify as toil. Which of the following is a defining characteristic of toil according to SRE principles?

A.Work that has no enduring value
B.Work that requires deep domain expertise
C.Work that is performed by junior engineers only
D.Work that is critical to business operations
AnswerA

Toil does not produce lasting improvement; once done, it needs to be done again.

Why this answer

Toil is work that is manual, repetitive, automatable, has no enduring value, and scales linearly with service growth. One key characteristic is that it provides no enduring value: after the task is finished, the service is in no better state than before, and the same task will have to be done again.

57
MCQhard

An engineer needs to create an SLO in Cloud Monitoring for a service that processes requests. The SLO should measure the proportion of requests that complete successfully within 300ms. The metric type for successful requests is custom.googleapis.com/myapp/success_count and for total requests is custom.googleapis.com/myapp/total_count. Which type of SLO should they create?

A.Availability SLO using a single metric for uptime
B.Request-based SLO using a single distribution metric
C.Window-based SLO with metric distribution
D.Request-based SLO using two metrics (good count / valid count)
AnswerD

The engineer has two metrics: good and total. A request-based SLO calculates the ratio.

Why this answer

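Cloud Monitoring SLOs can be request-based using a ratio of good requests to total requests taken from two metrics. Because the engineer has separate success and total counters, this is a request-based SLO with a good/total ratio, not one based on a single distribution metric.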
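
As a sketch, the request body for such an SLO in the Cloud Monitoring API uses a requestBased indicator with a goodTotalRatio. The metric types below come from the question; the display name, the 99% goal, and the 30-day rolling period are assumptions added for illustration, and a real filter would usually also constrain resource.type.

```python
import json

# Metric types taken from the question text.
GOOD_METRIC = "custom.googleapis.com/myapp/success_count"
TOTAL_METRIC = "custom.googleapis.com/myapp/total_count"

slo_body = {
    "displayName": "Requests succeed within 300ms",  # hypothetical name
    "goal": 0.99,                                    # assumed target, not given in the question
    "rollingPeriod": "2592000s",                     # assumed 30-day rolling window
    "serviceLevelIndicator": {
        "requestBased": {
            "goodTotalRatio": {
                "goodServiceFilter": f'metric.type="{GOOD_METRIC}"',
                "totalServiceFilter": f'metric.type="{TOTAL_METRIC}"',
            }
        }
    },
}
print(json.dumps(slo_body, indent=2))
```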

58
MCQhard

A service has an SLO based on request latency: 99% of requests must complete under 500ms over a 28-day window. The team wants to monitor the error budget burn rate. Which Cloud Monitoring SLO type and configuration should be used?

A.Request-based SLO: good request count / valid request count with threshold 500ms
B.Window-based SLO: good minutes / total minutes with threshold 500ms
C.Request-based SLO: good request count / total request count with threshold 500ms
D.Window-based SLO: good requests / total requests with threshold 500ms
AnswerA

Correct: request-based SLO with a latency threshold.

Why this answer

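A request-based SLO that compares good-request-count to valid-request-count suits a latency SLI of this kind: a request counts as good when it completes under the 500ms threshold. Window-based SLOs count good time windows instead, so they do not directly express a percentage of individual requests.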

59
MCQmedium

During an incident, the incident commander delegates tasks to multiple teams. After the incident is resolved, the team holds a postmortem. Which of the following is a key principle of a blameless postmortem culture?

A.Analyze contributing factors and implement action items with owners
B.Focus on human error and retraining
C.Identify the individual responsible and assign corrective action
D.Share the postmortem only with the incident commander
AnswerA

This is the correct approach: identify contributing factors and create action items with owners and deadlines.

Why this answer

Blameless postmortems focus on identifying system and process failures, not individual mistakes. The goal is to improve reliability.

60
Multi-Selectmedium

An SRE team wants to implement error budget burn rate alerts for a service with SLO 99.9% over 30 days. They need to be notified both when the error budget is being consumed rapidly (full consumption in ~2 hours) and when it is being consumed slowly (full consumption in ~6 days). Which two alert configurations should they use? (Choose 2)

Select 2 answers
A.Burn rate threshold: 2, lookback window: 1 hour
B.Burn rate threshold: 14, lookback window: 1 hour
C.Burn rate threshold: 5, lookback window: 6 hours
D.Burn rate threshold: 10, lookback window: 1 hour
E.Burn rate threshold: 14, lookback window: 6 hours
AnswersB, C

Fast burn alert for rapid consumption.

Why this answer

The options contain the standard fast and slow burn pair. Fast burn: burn rate 14 over a 1-hour window. Slow burn: burn rate 5 over a 6-hour window. A sustained burn rate of 5 exhausts a 30-day budget in 30/5 = 6 days, which matches the slow case. A sustained burn rate of 14 exhausts it in 30/14 ≈ 2.14 days, not 2 hours; exhausting the budget in 2 hours would need a much higher threshold than any option offers, so the question evidently expects the standard pair.
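
The exhaustion times can be worked out directly. This is a sketch with illustrative function names, assuming the burn rate stays constant.

```python
def days_to_exhaust(slo_period_days, burn_rate):
    """Days until the whole budget is gone at a constant burn rate."""
    return slo_period_days / burn_rate


def burn_rate_for_exhaustion(slo_period_days, hours_to_exhaust):
    """Constant burn rate needed to spend the whole budget in the given hours."""
    return slo_period_days * 24 / hours_to_exhaust


print(f"{days_to_exhaust(30, 14):.2f} days")  # prints 2.14 days
print(f"{days_to_exhaust(30, 5):.2f} days")   # prints 6.00 days
print(burn_rate_for_exhaustion(30, 2))        # prints 360.0
```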

61
MCQeasy

An SRE team wants to alert when the error budget burn rate exceeds 14x the allowed rate over a 1-hour window. Which Cloud Monitoring alert policy configuration is appropriate?

A.Alert on error budget burn rate > 14 over a 1-hour window
B.Alert on error budget burn rate > 14 over a 6-hour window
C.Alert on error budget burn rate > 5 over a 1-hour window
D.Alert on error budget burn rate > 8 over a 2-hour window
AnswerA

Correct: fast burn alert uses 1-hour window with burn rate > 14.

Why this answer

For a fast burn alert, use a 1-hour window with a burn rate threshold of 14. The burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period.
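
A minimal sketch of the burn rate calculation for a request-based SLI. The request counts are made-up sample values.

```python
def burn_rate(bad_requests, total_requests, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    observed_error_rate = bad_requests / total_requests
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


# Hypothetical last hour: 1,500 failed requests out of 100,000 against a 99.9% SLO.
rate = burn_rate(bad_requests=1_500, total_requests=100_000, slo_target=0.999)
print(f"{rate:.1f}")  # prints 15.0
print(rate > 14)      # prints True, so a 14x fast burn condition would be met
```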

62
MCQeasy

A Site Reliability Engineer is tasked with reducing toil in their team. They identify that resetting expired database connections manually is a common task. What is the best way to automate this toil?

A.Use Cloud Scheduler to invoke a Cloud Function that resets connections
B.Create a Cloud Workflow that runs every hour
C.Train all team members to reset connections manually
D.Write a cron job on a Compute Engine instance
AnswerA

This is serverless, event-driven, and reduces toil.

Why this answer

Automating toil typically involves using serverless automation or workflow services. Cloud Functions are ideal for event-driven automation like resetting connections on a schedule or in response to an alert. Cloud Workflows is for orchestrating longer running tasks.

Cloud Scheduler supplies the trigger and the Cloud Function performs the reset, so there is no server to maintain, unlike a cron job on a Compute Engine instance.

63
MCQmedium

A team wants to define an SLO for a microservice that processes batch jobs. The service is considered healthy if each batch completes within 60 minutes. There are 100 batches per day. Which SLI should be used?

A.Latency SLI: proportion of batches under 60 minutes
B.Request-based SLI with good-request-count / valid-request-count
C.Window-based SLI with good-minutes / total-minutes
D.Availability SLI: successful batch completions / total batch completions
AnswerC

Window-based SLI is correct for time-bounded processing like batch jobs.

Why this answer

A window-based SLI measures the proportion of good minutes (or windows) over total minutes. Since the service has a time-bounded objective (60 minutes per batch), a window-based SLI that counts each minute as 'good' if the oldest incomplete batch is less than 60 minutes old is appropriate. Request-based SLIs are for request/response patterns.

64
MCQeasy

An SRE team defines an SLO for a batch processing pipeline. Which SLI is most appropriate for pipeline freshness?

A.Percentage of records with correct values
B.Number of records processed per second
C.Age of the most recent output record
D.Number of failed pipeline runs
AnswerC

Correct: indicates how fresh the data is.

Why this answer

Pipeline freshness measures how up-to-date the output data is compared to the input. It's best measured by the age of the most recent output record.

65
MCQmedium

A team wants to inject latency faults into their microservices running on Google Kubernetes Engine (GKE) to test resilience. Which tool or service should they use?

A.Cloud Build to orchestrate failure scenarios
B.Chaos Mesh
C.Traffic Director fault injection with HTTP fault filter
D.Cloud Functions to simulate latency by delaying responses
AnswerB

Chaos Mesh is purpose-built for chaos engineering on Kubernetes, including latency injection.

Why this answer

Chaos Mesh is a popular open-source chaos engineering platform for Kubernetes, capable of injecting faults like latency, pod failures, and network partitions. It integrates well with GKE.

66
MCQmedium

An SRE team defines an SLI as the proportion of good requests to valid requests over a 1-minute window. They set an SLO of 99.9% availability over 30 days. Which error budget burn rate alert configuration should they use to detect rapid consumption of the error budget within 1 hour?

A.Alert on burn rate of 14x over a 1-hour window
B.Alert on burn rate of 2x over a 24-hour window
C.Alert on burn rate of 10x over a 30-minute window
D.Alert on burn rate of 5x over a 6-hour window
AnswerA

14x over 1h is the recommended fast burn alert configuration for SLO monitoring.

Why this answer

A fast burn alert uses a short window (1h) and a high burn rate (14x) to detect rapid error budget consumption quickly.

67
MCQmedium

An engineer needs to reduce toil. Which of the following tasks is considered toil according to SRE principles?

A.Writing a postmortem after an incident
B.Manually rotating service account keys every week
C.Designing a new microservice architecture
D.Reviewing code changes for a new feature
AnswerB

Repetitive, manual, and automatable: this fits the definition of toil.

Why this answer

Toil is manual, repetitive, automatable, and has no enduring value. Manually rotating service account keys weekly is a classic example of toil.

68
MCQmedium

An SRE team wants to reduce toil by automating the response to common alert notifications. For example, when a disk usage alert fires, they want to automatically run a script to clean up temporary files. Which Google Cloud service is best suited for this?

A.Cloud Build
B.Cloud Functions
C.Workflows
D.Cloud Scheduler
AnswerB

Cloud Functions can be triggered by Cloud Monitoring alert notifications via Pub/Sub to execute a cleanup script.

Why this answer

Cloud Functions can be triggered by Cloud Monitoring alerts via Pub/Sub and execute arbitrary code (e.g., script to clean disk). This is a common pattern for automated remediation. Workflows could orchestrate multiple steps but adds complexity.

Cloud Build is CI/CD, not event-driven. Cloud Scheduler is for scheduled tasks.
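
A sketch of the remediation pattern follows, written as a Pub/Sub-triggered function with an (event, context) signature. The payload shape (an incident object with state and policy_name fields) is an assumption about the notification format, and the cleanup itself is left as a placeholder because how it reaches the affected disk depends on the environment.

```python
import base64
import json


def parse_incident(event):
    """Decode the Pub/Sub message and return the incident object (assumed shape)."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return payload.get("incident", {})


def handle_disk_alert(event, context=None):
    """Entry point: act only on newly opened incidents."""
    incident = parse_incident(event)
    if incident.get("state") != "open":
        return "ignored"
    # Placeholder: a real function would start the cleanup script here.
    return f"cleanup requested for {incident.get('policy_name', 'unknown policy')}"


# Local demonstration with a hand-built message (hypothetical policy name).
sample = {"incident": {"state": "open", "policy_name": "disk-usage"}}
event = {"data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")}
print(handle_disk_alert(event))  # prints cleanup requested for disk-usage
```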

69
MCQeasy

What is the primary purpose of an error budget?

A.To track the number of bugs in production
B.To allocate budget for cloud resources
C.To measure customer satisfaction
D.To determine how much risk the service can tolerate while maintaining the SLO
AnswerD

Error budget = 100% - SLO; it defines the maximum allowed downtime/errors before violating the SLO.

Why this answer

An error budget is the amount of acceptable unreliability (e.g., downtime) within an SLO. It allows teams to balance reliability with innovation by permitting some failures.

70
MCQmedium

An SRE team wants to track the amount of toil each week and ensure it does not exceed 50% of the team's time. Which approach is most aligned with SRE best practices?

A.Use Cloud Monitoring to automatically classify all tasks as toil
B.Use Cloud Tasks to queue up toil items and measure completion time
C.Have each team member log their time and estimate toil percentage weekly
D.Set a Cloud Scheduler job to remind team members to automate toil
AnswerC

This is the standard practice: self-estimation and tracking.

Why this answer

SRE best practice is to have team members estimate and track toil time weekly, aiming for under 50%. Using a simple spreadsheet is a valid approach to start.
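
The weekly check is simple enough to sketch in a few lines. The names and hours below are hypothetical log entries, with each engineer reporting (toil hours, total hours).

```python
TOIL_CAP_PERCENT = 50


def toil_percentage(toil_hours, total_hours):
    """Share of working time spent on toil, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100 * toil_hours / total_hours


# Hypothetical self-reported week: (toil hours, total hours) per engineer.
weekly_log = {"alice": (22, 40), "bob": (14, 40), "chen": (25, 40)}

team_toil = sum(toil for toil, _ in weekly_log.values())
team_total = sum(total for _, total in weekly_log.values())
team_pct = toil_percentage(team_toil, team_total)

print(f"{team_pct:.1f}%")           # prints 50.8%
print(team_pct > TOIL_CAP_PERCENT)  # prints True, so the team is over the cap
```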

71
Multi-Selecthard

A team uses Cloud Monitoring SLOs for a service that has an SLO of 99.9% availability. They want to create alerts that notify the on-call engineer when the error budget is burning too fast. Which TWO conditions should they configure? (Choose 2.)

Select 2 answers
A.30-minute window with 20x burn rate
B.24-hour window with 2x burn rate
C.1-hour window with 14x burn rate
D.6-hour window with 5x burn rate
E.1-hour window with 5x burn rate
AnswersC, D

Fast burn alert: 1h window, 14x burn rate.

Why this answer

Standard SRE practice recommends a multi-window, multi-burn-rate alerting strategy: a fast burn rate alert with a 1-hour window and 14x burn rate, and a slow burn rate alert with a 6-hour window and 5x burn rate. These two alerts cover different exhaustion times.
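
As a sketch, the two conditions can be expressed in one alert policy body whose filters use the select_slo_burn_rate selector. The SLO resource name is a placeholder, the display names are hypothetical, and the body is only printed here, not sent to any API.

```python
import json

# Placeholder resource name; substitute real project, service, and SLO IDs.
SLO_NAME = "projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives/SLO_ID"


def burn_rate_condition(slo_name, lookback_seconds, threshold):
    """One threshold condition on the SLO burn rate over a lookback window."""
    return {
        "displayName": f"Burn rate > {threshold} over {lookback_seconds}s",
        "conditionThreshold": {
            "filter": f'select_slo_burn_rate("{slo_name}", "{lookback_seconds}s")',
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": "0s",
        },
    }


policy = {
    "displayName": "Error budget burn",  # hypothetical name
    "combiner": "OR",                    # either condition alone raises the alert
    "conditions": [
        burn_rate_condition(SLO_NAME, 3600, 14),   # fast burn: 1 hour, 14x
        burn_rate_condition(SLO_NAME, 21600, 5),   # slow burn: 6 hours, 5x
    ],
}
print(json.dumps(policy, indent=2))
```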

72
MCQmedium

During an incident, the incident commander delegates tasks to multiple teams. Which communication model is MOST effective to reduce noise?

A.Use email for updates to avoid real-time noise.
B.Use a single incident channel where all teams post updates.
C.Each team communicates in separate channels.
D.All updates go through the incident commander only.
AnswerB

A single channel ensures everyone sees updates and reduces cross-talk.

Why this answer

The recommended approach is to use a single communication channel (e.g., a dedicated chat room) for all incident-related updates, and the incident commander coordinates via that channel.

73
MCQeasy

An SLA guarantees 99.9% monthly uptime. The team's SLO is 99.95% and error budget is 0.05%. What is the maximum allowed downtime per month according to the SLA?

A.7.2 hours.
B.43.8 minutes.
C.21.9 minutes.
D.4.38 hours.
AnswerB

0.1% of 43,800 minutes = 43.8 minutes, which is the SLA allowance.

Why this answer

99.9% uptime allows 0.1% downtime. Using an average month of 730 hours (43,800 minutes), 0.1% is 43.8 minutes; a 30-day month of 43,200 minutes would give 43.2 minutes. The SLO is stricter, but the question asks for the SLA allowance.
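
The effect of the month-length convention can be seen in a short sketch. The helper name is illustrative, and unavailability is passed in parts per thousand to keep the arithmetic exact.

```python
MINUTES_30_DAYS = 30 * 24 * 60  # 43,200 minutes
MINUTES_AVG_MONTH = 730 * 60    # 43,800 minutes (365 days / 12 is about 730 hours)


def downtime_minutes(total_minutes, unavailability_per_mille):
    """Allowed downtime when unavailability is given in parts per thousand."""
    return total_minutes * unavailability_per_mille / 1000


# 99.9% uptime leaves 0.1% downtime, which is 1 part per thousand.
print(downtime_minutes(MINUTES_30_DAYS, 1))    # prints 43.2
print(downtime_minutes(MINUTES_AVG_MONTH, 1))  # prints 43.8
```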

74
MCQmedium

A team has an SLO of 99.9% availability over a 30-day period. How many minutes of downtime does the error budget allow per month?

A.4.32 minutes
B.43 minutes
C.432 minutes
D.4,320 minutes
AnswerB

Correct calculation: 43,200 * 0.001 = 43.2 minutes, approximately 43 minutes.

Why this answer

Error budget = 100% - SLO target = 0.1% of total time. 30 days = 43,200 minutes. 0.1% of 43,200 = 43.2 minutes. Rounding gives 43 minutes.

75
Multi-Selectmedium

A team wants to reduce toil in their operations. Which two of the following are characteristics of toil according to Google SRE principles? (Choose 2)

Select 2 answers
A.Work that provides enduring value for the service
B.Work that is manual and repetitive
C.Work that scales linearly with service growth
D.Work that is completely automated already
E.Work that requires creative problem-solving
AnswersB, C

Toil is manual and repetitive.

Why this answer

Toil is manual, repetitive, automatable, devoid of enduring value, and scales linearly with service growth. Work that is creative or strategic is not toil. Tasks that require deep analysis are not toil.
