CCNA Reliability Ops Questions

24 of 99 questions · Page 2/2 · Reliability Ops topic · Answers revealed

76
MCQhard

A company runs a stateful application on Compute Engine with persistent disks. They want to ensure data durability across a zone failure. What is the best approach?

A.Replicate data at application level to another instance in a different zone
B.Use Google Cloud NetApp Volumes with replication
C.Use regional persistent disks
D.Take regular snapshots of the persistent disks and store them in a multiregional bucket
AnswerC

Regional PDs replicate data across zones with synchronous writes, ensuring durability.

Why this answer

Regional persistent disks (RPDs) synchronously replicate data between two zones in the same region, so every acknowledged write already exists in both zones (an RPO of effectively zero) and no application-level changes are needed. If one zone fails, the disk can be force-attached to a VM in the surviving zone, so recovery does not depend on a backup taken earlier.

Exam trap

Google Cloud often tests the distinction between synchronous replication (regional persistent disks) and asynchronous backup (snapshots), leading candidates to choose snapshots for durability when they actually need zero RPO across a zone failure.

How to eliminate wrong answers

Option A is wrong because replicating data at the application level adds complexity, latency, and custom code, whereas Compute Engine offers a managed, synchronous replication solution. Option B is wrong because Google Cloud NetApp Volumes is a separate managed file storage service; adopting it would mean moving the application's data off its persistent disks, and it adds cost and management overhead that the scenario does not call for. Option D is wrong because regular snapshots stored in a multiregional bucket provide point-in-time recovery but have an RPO of minutes to hours and do not offer synchronous replication, so data written after the last snapshot is lost during a zone failure.
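The difference between snapshot-based recovery and synchronous replication can be made concrete with a small calculation. The sketch below is only an illustration: the function name and the minute-based timeline are hypothetical and not part of any Google Cloud API.

```python
def data_loss_window(snapshot_minutes, failure_minute):
    """Minutes of writes lost when restoring from the latest snapshot.

    snapshot_minutes: minutes (since start) at which snapshots completed.
    failure_minute:   minute at which the zone failed.
    """
    taken = [t for t in snapshot_minutes if t <= failure_minute]
    if not taken:
        # No snapshot exists yet, so everything written so far is lost.
        return failure_minute
    return failure_minute - max(taken)


# Hourly snapshots, zone fails 50 minutes after the last one completed.
print(data_loss_window([0, 60, 120], 110))  # 50

# Synchronous replication behaves like a snapshot at the moment of failure.
print(data_loss_window([110], 110))  # 0
```

With a fixed snapshot interval the worst case approaches the full interval, which is why snapshots alone cannot deliver a zero RPO.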

77
Multi-Selectmedium

A company runs a web application on Compute Engine behind an HTTP load balancer. They want to improve reliability by implementing failover across two regions. Which TWO actions should they take?

Select 2 answers
A.Deploy a global external HTTP load balancer with backends in both regions.
B.Configure a backend service with a failover policy pointing to primary and secondary backends.
C.Configure DNS-based failover using Cloud DNS with health checks.
D.Use an internal load balancer to route traffic between regions.
E.Use a regional external HTTP load balancer with a multi-region backend.
AnswersA, B

Global load balancer automatically routes to healthy backends, providing cross-region failover.

Why this answer

A global external HTTP load balancer is required for cross-region failover because it uses a single anycast IP address and routes traffic to the closest healthy backend. By deploying backends in both regions, the load balancer automatically fails over to the secondary region if the primary region's backends become unhealthy, improving reliability without DNS propagation delays.

Exam trap

The trap here is that candidates confuse DNS-based failover (which is slow and not recommended for HTTP load balancing) with the instant, anycast-based failover of a global load balancer, or mistakenly think a regional load balancer can span multiple regions.
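As a rough mental model of the behavior described above, the toy function below picks the closest backend that is still healthy. It is a simplified sketch, not how the load balancer is implemented; the dictionary fields and region names are assumptions made for illustration.

```python
def pick_backend(backends):
    """Return the region of the lowest-latency healthy backend, or None."""
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        return None
    return min(healthy, key=lambda b: b["latency_ms"])["region"]


backends = [
    {"region": "us-central1", "latency_ms": 20, "healthy": True},
    {"region": "europe-west1", "latency_ms": 110, "healthy": True},
]
print(pick_backend(backends))  # us-central1

# The primary region fails its health checks, so traffic shifts.
backends[0]["healthy"] = False
print(pick_backend(backends))  # europe-west1
```

Because the client keeps using the same IP address, the shift needs no DNS change.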

78
Multi-Selectmedium

Your organization is implementing a Disaster Recovery plan for a critical database. Which THREE components are essential for a robust DR strategy? (Choose 3)

Select 3 answers
A.A single global load balancer for both regions.
B.Automated failover process to switch traffic to the DR region.
C.Data replication strategy (synchronous or asynchronous) to a secondary region.
D.Regular DR drills (testing failover at least once per quarter).
E.Using a single zone for the primary region.
AnswersB, C, D

Automation minimizes manual errors and reduces RTO.

Why this answer

Option B is correct because an automated failover process is essential for minimizing Recovery Time Objective (RTO) in a Disaster Recovery strategy. Without automation, manual intervention introduces delays and risks of human error, which can extend downtime significantly. In cloud or on-premises environments, automated failover typically relies on health checks, DNS updates, or traffic manager rules to seamlessly redirect traffic to the DR region when the primary fails.

Exam trap

Google Cloud often tests the misconception that placing a single global load balancer in front of both regions is a DR strategy in itself. The load balancer only routes traffic; without data replication, an automated failover process, and regular drills, the DR region may have nothing current to serve.

79
MCQhard

An organization is migrating a legacy monolithic application to Google Cloud. The application currently runs on a single server with an on-premises database. The application is stateful and requires low-latency access to the database. The migration must minimize downtime and ensure high availability. Which architecture should the company adopt?

A.Deploy on GKE with StatefulSets and use Cloud Spanner for global consistency.
B.Deploy on Compute Engine with a regional persistent disk and use Cloud SQL for PostgreSQL with regional high availability.
C.Deploy on App Engine Standard Environment and use Cloud Firestore in Datastore mode.
D.Deploy on Cloud Run and use Cloud SQL with read replicas.
AnswerB

This provides HA and low-latency access needed for the stateful monolithic app.

Why this answer

Option B is correct because it combines Compute Engine with a regional persistent disk for synchronous replication across zones, ensuring high availability with minimal downtime during a zonal failure. Cloud SQL for PostgreSQL with regional high availability provides a managed, low-latency database with automatic failover, meeting the stateful application's need for low-latency access and high availability without the complexity of container orchestration.

Exam trap

The trap here is that candidates often overcomplicate the solution by choosing containerized or serverless options (GKE, Cloud Run, App Engine) without recognizing that a legacy monolithic stateful application with low-latency requirements is best served by a simple, proven VM-based architecture with regional persistent disks and a managed relational database with synchronous replication.

How to eliminate wrong answers

Option A is wrong because GKE with StatefulSets introduces orchestration overhead and potential downtime during cluster upgrades or node failures, and Cloud Spanner, while globally consistent, adds latency and cost overkill for a single-region low-latency requirement. Option C is wrong because App Engine Standard Environment is stateless by design and does not support stateful applications with persistent local storage, and Cloud Firestore in Datastore mode is a NoSQL database that does not provide the relational consistency and low-latency access expected from a legacy monolithic database. Option D is wrong because Cloud Run is stateless and ephemeral, requiring external storage for state, and Cloud SQL with read replicas does not provide synchronous replication for high availability; read replicas are asynchronous and cannot guarantee zero data loss during a failover.

80
MCQmedium

A company runs a critical application on Compute Engine instances in a managed instance group (MIG) with autoscaling. During a traffic spike, some instances become unhealthy but are not automatically replaced. What is the most likely cause?

A.The MIG is regional and one zone failed.
B.The autohealing health check is misconfigured.
C.The instance template has a startup script error.
D.The HTTP load balancer's health check is failing.
AnswerB

MIG autohealing relies on a health check to detect unhealthy instances and replace them; a misconfiguration prevents detection.

Why this answer

The most likely cause is that the autohealing health check is misconfigured. In a managed instance group, autohealing relies on a health check to detect unhealthy instances and trigger replacement. If the health check is misconfigured (e.g., wrong port, path, or protocol), the MIG will not recognize instances as unhealthy and will not automatically replace them, even during a traffic spike.

Exam trap

Google Cloud often tests the distinction between the MIG's autohealing health check and the load balancer's health check, leading candidates to incorrectly attribute instance replacement failures to load balancer issues rather than the MIG's own health check configuration.

How to eliminate wrong answers

Option A is wrong because a regional MIG with a single zone failure would still trigger autohealing in the remaining healthy zones, and the MIG would replace instances in the failed zone if the health check is correctly configured. Option C is wrong because a startup script error would cause instances to fail at boot, but the MIG would still attempt to replace them based on the health check; the issue is not about the template but the detection mechanism. Option D is wrong because the HTTP load balancer's health check is separate from the MIG's autohealing health check; a failing load balancer health check does not prevent the MIG from replacing unhealthy instances if its own health check is properly configured.

81
MCQmedium

An application uses Cloud Pub/Sub for asynchronous processing. Subscribers occasionally fail to acknowledge messages within the ack deadline, causing redelivery. How can the team improve reliability and prevent message buildup?

A.Increase the ack deadline to the maximum value
B.Set max delivery attempts to 1 to avoid redelivery
C.Implement exponential backoff in the subscriber retry logic
D.Use a dead-letter topic to capture failed messages
AnswerC

Exponential backoff allows the subscriber to retry after increasing delays, handling transient failures effectively.

Why this answer

Option C is correct because implementing exponential backoff in the subscriber retry logic allows the subscriber to gradually increase the delay between retries when messages are not acknowledged, reducing the likelihood of overwhelming the system and preventing message buildup. This approach aligns with Cloud Pub/Sub's recommended practices for handling transient failures, as it gives the subscriber time to recover without exhausting the ack deadline or causing excessive redelivery.

Exam trap

Google Cloud often tests the misconception that increasing the ack deadline or using a dead-letter topic alone solves reliability issues, but the key is implementing retry logic with backoff to handle transient failures without losing messages or causing buildup.

How to eliminate wrong answers

Option A is wrong because increasing the ack deadline to the maximum value (e.g., 600 seconds) does not address the root cause of subscriber failures; it only delays redelivery, potentially leading to message buildup if the subscriber never recovers. Option B is wrong because limiting delivery to a single attempt would mean that any message not acknowledged the first time is never retried, undermining the reliability of asynchronous processing; in Pub/Sub, the maximum delivery attempts setting also applies only together with a dead-letter topic. Option D is wrong because a dead-letter topic captures failed messages only after all delivery attempts are exhausted, so it does not prevent message buildup during the retry process; it is a last-resort mechanism, not a proactive reliability improvement.
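The backoff schedule itself is simple to compute. The following is a minimal sketch in plain Python, assuming a 1-second base delay and a 60-second cap; the function name and defaults are hypothetical, and the code does not call the Pub/Sub client library.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   jitter: bool = False,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the delay (in seconds) to wait before each retry attempt."""
    delays = []
    for attempt in range(attempts):
        # Double the delay on every attempt, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            # "Full jitter": pick a random point in [0, delay] so that many
            # subscribers do not retry at the same instant.
            delay = (rng or random).uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

Enabling jitter is usually worthwhile when many subscribers fail at once, since it spreads their retries over time instead of producing synchronized bursts.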

82
Multi-Selecthard

Which THREE options are valid strategies for disaster recovery (DR) in Google Cloud?

Select 3 answers
A.Store hourly snapshots of Compute Engine disks in the same region.
B.Deploy a mirrored environment in another region and use Traffic Director to fail over.
C.Enable Cloud CDN to cache static content from multiple origins.
D.Use a Cloud Storage bucket in a different region with Object Versioning enabled.
E.Configure a cross-region replica for Cloud SQL and promote it during failover.
AnswersB, D, E

Traffic Director can route traffic to the DR environment.

Why this answer

Option B is correct because Traffic Director, based on the xDS API (Envoy), can manage traffic routing across regions. By deploying a mirrored environment in another region and configuring Traffic Director with failover policies, you can redirect traffic to the secondary region if the primary fails, enabling a robust active-passive or active-active DR strategy.

Exam trap

The trap here is confusing same-region backups or content caching (such as snapshots kept in the primary region, or Cloud CDN) with true disaster recovery, which requires geographic separation and a tested failover mechanism.

83
MCQeasy

A company runs a critical application on Compute Engine instances in a managed instance group (MIG) with autoscaling. Users report intermittent 503 errors during traffic spikes. Which action should the company take to improve reliability?

A.Change the load balancer from regional to global
B.Configure a health check with a sufficient initial delay (grace period) in the MIG
C.Increase the autoscaling cool-down period from 60s to 120s
D.Increase the maximum number of instances in the MIG
AnswerB

Correct: ensures instances are healthy before traffic is sent.

Why this answer

Intermittent 503 errors during traffic spikes often indicate that new VM instances are being started but are not yet ready to serve traffic, causing the load balancer to forward requests to them prematurely. Configuring a health check with a sufficient initial delay (grace period) in the MIG ensures that newly created instances are given time to fully initialize and pass health checks before they receive traffic, preventing 503 errors. This directly addresses the root cause by allowing the application to become healthy before being added to the load balancer's backend.

Exam trap

Google Cloud often tests the misconception that scaling-related errors are always solved by increasing capacity or adjusting scaling parameters, when in fact the root cause is often a misconfigured health check or insufficient initialization time for new instances.

How to eliminate wrong answers

Option A is wrong because changing the load balancer from regional to global does not address the timing issue of new instances receiving traffic before they are ready; global load balancers improve cross-region routing but do not affect instance readiness. Option C is wrong because the cool-down (initialization) period only tells the autoscaler how long to wait before trusting metrics from a newly started instance; lengthening it from 60s to 120s changes scaling decisions but does not by itself keep the load balancer from sending traffic to instances that are still initializing. Option D is wrong because increasing the maximum number of instances in the MIG allows more capacity but does not fix the problem of instances being added to the backend pool before they are ready; it may even exacerbate the issue by creating more instances that are not yet ready.
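The role of the initial delay can be sketched as a small decision function. This is a toy model under stated assumptions, not the MIG's actual logic: the function name, the return values, and the 300-second delay are all hypothetical.

```python
def autohealing_action(seconds_since_boot, passing_health_check,
                       initial_delay_s=300):
    """Decide what a toy autohealer does with one instance."""
    if passing_health_check:
        return "keep"
    if seconds_since_boot < initial_delay_s:
        # Still inside the grace period: a failing check is expected
        # while the application finishes starting up.
        return "wait"
    return "recreate"


print(autohealing_action(45, False))   # wait
print(autohealing_action(400, False))  # recreate
print(autohealing_action(400, True))   # keep
```

If the delay is shorter than the application's real startup time, instances can be recreated before they ever become healthy, which makes the capacity problem worse during a spike.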

84
Matchingmedium

Match each GCP networking concept to its definition.

Concepts
Matches

Virtual Private Cloud for isolated network

Regional IP address range within a VPC

Controls ingress/egress traffic

Dynamically exchange routes using BGP

Connect two VPCs privately

Why these pairings

These are fundamental networking concepts in GCP.

85
MCQhard

You are responsible for incident management for a production service. You want to reduce manual toil during the initial response to common issues like high latency. What is the best approach?

A.Use Cloud Monitoring to trigger a Cloud Function that performs automated checks and rolls back the last deployment if latency spikes.
B.Set up Cloud Monitoring alerts with email notifications to the on-call engineer.
C.Create detailed runbooks and require the on-call to follow them step by step.
D.Enable Cloud Logging and set up a custom dashboard for the on-call.
AnswerA

Automated actions reduce manual toil and speed up response.

Why this answer

Option A is correct because it directly reduces manual toil by automating the initial response to common issues like high latency. Cloud Monitoring triggers a Cloud Function that performs automated checks and, if latency spikes, rolls back the last deployment, eliminating the need for human intervention during the critical first response phase.

Exam trap

Google Cloud often tests the distinction between 'alerting' (which still requires manual action) and 'automated remediation' (which reduces toil), so candidates mistakenly choose options that provide visibility or documentation instead of automation.

How to eliminate wrong answers

Option B is wrong because email notifications alone still require the on-call engineer to manually investigate and respond, which does not reduce toil; it merely alerts them. Option C is wrong because requiring the on-call to follow runbooks step by step still involves manual effort and does not automate the response, leaving toil unchanged. Option D is wrong because enabling Cloud Logging and setting up a custom dashboard provides visibility but does not automate any action, so the on-call must still manually diagnose and respond to the issue.
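The automated first response in Option A could be sketched roughly as follows. This is an illustration only: the payload shape (an `incident` object with `state` and `condition_name` fields), the condition name, and the `rollback` callback are assumptions, and a real function would need to parse whatever notification format its alerting channel actually delivers.

```python
def handle_alert(payload, rollback, latency_condition="high-latency"):
    """React to one alert notification; return what was done."""
    incident = payload.get("incident", {})
    if incident.get("state") != "open":
        return "ignored"  # e.g. the incident has already closed
    if incident.get("condition_name") != latency_condition:
        return "ignored"  # not the condition this automation handles
    rollback()  # hypothetical hook that reverts the last deployment
    return "rolled_back"


calls = []
alert = {"incident": {"state": "open", "condition_name": "high-latency"}}
print(handle_alert(alert, lambda: calls.append("rollback")))  # rolled_back
print(calls)  # ['rollback']
```

Keeping the decision logic separate from the rollback action makes the automation easy to test without touching production.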

86
Multi-Selectmedium

A company runs a critical application on a Compute Engine instance. They want to ensure that the application remains available even if the instance crashes. Which two GCP features should they use? (Choose two.)

Select 2 answers
A.Regular snapshots of the persistent disk.
B.A load balancer distributing traffic to a single instance.
C.Instance template with automatic restart.
D.Managed Instance Group with autohealing.
E.A Cloud CDN to cache static content.
AnswersC, D

Automatic restart restarts the instance on host failure.

Why this answer

Option C is correct because an instance template with automatic restart enables Compute Engine to automatically restart a VM instance if it crashes or is terminated due to a non-user-initiated failure. This feature is configured at the instance level and ensures that the application recovers quickly without manual intervention, improving availability for a single-instance workload.

Exam trap

The trap here is that candidates often confuse automatic restart (which handles VM crashes) with autohealing (which handles application-level failures), or incorrectly assume a load balancer alone provides high availability without a redundant backend.

87
MCQmedium

Your organization uses Cloud Spanner for a customer database with a 99.999% availability SLA. You need a Disaster Recovery plan that ensures data consistency with zero RPO in case of a region failure. What should you do?

A.Use a single-region instance configuration and enable read replicas.
B.Export the database periodically to Cloud Storage and set up a cross-region load balancer.
C.Configure daily backups and store them in Cloud Storage in a different region.
D.Use a multi-region instance configuration (e.g., nam-eur-asia) for the Spanner instance.
AnswerD

Multi-region configs use synchronous replication across regions, providing automatic failover with zero RPO.

Why this answer

Option D is correct because Cloud Spanner multi-region instance configurations (e.g., nam-eur-asia) provide synchronous replication across multiple regions, ensuring strong global consistency and zero RPO. This architecture uses Paxos-based replication to commit a write only after a majority of voting replicas have durably stored it, so the loss of one region does not lose any committed data. The 99.999% availability SLA is met by automatic failover within the multi-region setup without manual intervention.

Exam trap

Google Cloud often tests the misconception that read replicas or periodic exports can achieve zero RPO, but only synchronous multi-region replication (as in Spanner's multi-region configurations) guarantees no data loss during a region failure.

How to eliminate wrong answers

Option A is wrong because a single-region instance configuration keeps all of its replicas in one region, so it cannot survive a full region failure no matter how many replicas serve reads, and therefore cannot achieve zero RPO in that scenario. Option B is wrong because exporting the database periodically to Cloud Storage introduces a non-zero RPO (the time between exports) and does not guarantee data consistency at the point of failure; cross-region load balancers do not handle Spanner's transactional consistency. Option C is wrong because daily backups stored in a different region leave an RPO of up to 24 hours, not zero, and cannot capture transactions committed after the last backup.
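The majority rule behind this reasoning can be checked with a few lines of arithmetic. The sketch below uses a hypothetical replica layout (the region names and replica counts are made up for illustration and are not the layout of any real Spanner configuration).

```python
def can_commit(voting_replicas_by_region, failed_region):
    """True if a majority of voting replicas survives the region failure."""
    total = sum(voting_replicas_by_region.values())
    surviving = total - voting_replicas_by_region.get(failed_region, 0)
    return surviving > total // 2


multi_region = {"region-a": 2, "region-b": 2, "region-c": 1}
print(can_commit(multi_region, "region-a"))  # True  (3 of 5 remain)

single_region = {"region-a": 3}
print(can_commit(single_region, "region-a"))  # False (0 of 3 remain)
```

Because every committed write was already acknowledged by a majority, any surviving majority contains at least one replica with that write, which is what makes the RPO zero.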

88
MCQhard

Refer to the exhibit. A Deployment Manager template deploys a GKE cluster and a job that publishes to Pub/Sub. The job fails with a permission error. Which change would fix the issue?

A.Set the job's serviceAccountName to the default compute service account.
B.Change the oauthScopes to include https://www.googleapis.com/auth/cloud-platform.
C.Add dependsOn: [my-job] to the cluster resource to ensure the cluster is ready.
D.Add a serviceAccount field to nodeConfig with a custom service account that has roles/pubsub.publisher.
AnswerD

This ensures the nodes (and thus the job) have the required Pub/Sub publish permission.

Why this answer

The node pool's service account needs the Pub/Sub Publisher role. The exhibit shows the nodes are using the default compute engine service account with only pubsub scope (no roles). The fix is to assign a service account with the necessary IAM role.

89
MCQeasy

An organization needs to meet an RTO of 1 hour for a critical application running on GCE with persistent disks. What is the most cost-effective approach?

A.Use regional persistent disks.
B.Replica of compute instance in another zone.
C.Frequent disk image exports.
D.Regular snapshots to a regional bucket.
AnswerA

Synchronous replication, fast failover.

Why this answer

Regional persistent disks (PDs) synchronously replicate data between two zones in the same region. If the primary zone fails, the disk can be force-attached to a replacement VM in the other zone, a step that fits well within a 1-hour RTO. This is more cost-effective than maintaining a full replica instance because you pay only for the replicated disk storage, not for an idle compute instance.

Exam trap

The trap here is that candidates often confuse regional persistent disks with snapshots or image exports, assuming that any backup method can meet a strict RTO. They overlook that a regional PD already holds a synchronously replicated copy in a second zone, so recovery requires only attaching it to a new VM, which makes it the most cost-effective way to meet this requirement.

How to eliminate wrong answers

Option B is wrong because maintaining a replica of the compute instance in another zone incurs additional compute costs for the idle replica, which is less cost-effective than using regional PDs that only replicate the disk. Option C is wrong because frequent disk image exports are time-consuming (exporting an image can take longer than 1 hour) and incur storage costs for each image, making it impractical for a 1-hour RTO and not cost-effective. Option D is wrong because regular snapshots to a regional bucket provide asynchronous backup, not synchronous replication; restoring from a snapshot requires creating a new disk and instance, which can exceed the 1-hour RTO due to snapshot export and disk creation times.

90
MCQhard

A company runs a multi-tier application on Google Cloud: a frontend on App Engine Standard, a backend on Cloud Run, and a Cloud SQL database. The application experiences intermittent 500 errors when users submit forms. The errors correlate with high CPU usage on the Cloud SQL instance (db-n1-standard-2, 7.5 GB memory). The Cloud Run service has a concurrency setting of 80 and a maximum of 10 instances. The App Engine service uses automatic scaling. The team has verified that the application code is not the issue. They suspect the database is hitting connection limits. Current max_connections on Cloud SQL is 250. The Cloud Run service uses a connection pool of 10 connections per instance. The App Engine service uses a connection pool of 5 connections per instance. They also have a few batch jobs that run occasionally, using up to 10 connections. The team wants to resolve the errors with minimal cost and complexity. Which course of action should they take?

A.Increase the maximum number of Cloud Run instances to 20 to handle more requests.
B.Upgrade the Cloud SQL instance to db-n1-standard-4 (15 GB memory) to handle more connections.
C.Increase the max_connections parameter on Cloud SQL to 500.
D.Reduce the concurrency setting on Cloud Run from 80 to 40.
AnswerC

This directly addresses the connection limit issue with minimal cost and no code changes.

Why this answer

The intermittent 500 errors are attributed to the Cloud SQL instance hitting its max_connections limit of 250. Cloud Run can hold up to 100 connections (10 per instance across 10 instances) and batch jobs up to 10 more, leaving 140 for App Engine; at 5 connections per instance, automatic scaling beyond 28 App Engine instances pushes the total past 250. Increasing max_connections to 500 directly addresses the connection limit without changing instance size or scaling behavior, which is the simplest and most cost-effective fix.

Exam trap

Google Cloud often tests the misconception that upgrading the instance tier (more memory/CPU) automatically increases connection limits, when in fact max_connections is a configurable parameter that can be increased independently without changing the instance size.

How to eliminate wrong answers

Option A is wrong because increasing Cloud Run instances to 20 would raise Cloud Run's potential connections from 100 to 200, worsening the connection limit issue and potentially causing more 500 errors. Option B is wrong because upgrading to db-n1-standard-4 adds ongoing cost to work around a limit that can be raised with a flag change; the team's suspected bottleneck is the connection count, so a larger tier is not the minimal-cost fix. Option D is wrong because reducing concurrency on Cloud Run from 80 to 40 would decrease the number of concurrent requests per instance but does not reduce the number of connections per instance (still 10), and could lead to more instances being spun up, potentially increasing total connections.
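The connection budget in this scenario can be worked out directly from the numbers in the question. The sketch below is plain arithmetic; the function name is hypothetical, and the App Engine instance count is the unknown being solved for.

```python
def peak_connections(run_instances, run_pool, gae_instances, gae_pool, batch):
    """Worst-case number of simultaneous database connections."""
    return run_instances * run_pool + gae_instances * gae_pool + batch


# Fixed consumers: 10 Cloud Run instances x 10 connections, plus batch jobs.
fixed = peak_connections(10, 10, 0, 5, 10)
print(fixed)  # 110

# App Engine instances that fit under each limit at 5 connections apiece.
print((250 - fixed) // 5)  # 28
print((500 - fixed) // 5)  # 78

# Option A: doubling Cloud Run to 20 instances leaves room for only 8.
print((250 - peak_connections(20, 10, 0, 5, 10)) // 5)  # 8
```

The same calculation shows why Option A makes matters worse: it consumes most of the remaining headroom before App Engine scales at all.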

91
MCQhard

You are investigating a Vertex AI Workbench instance (instance-2) that is showing UNHEALTHY status. Based on the exhibit, what is the most likely cause of the issue?

A.The container image gcr.io/my-project/my-image:latest does not exist, or the service account used by the Workbench instance does not have storage.objectViewer access to the container registry.
B.The container registry endpoint is blocked by a firewall rule that does not allow egress to gcr.io.
C.The instance's underlying Compute Engine resources are exhausted, causing the container creation to timeout.
D.The Workbench instance is using an outdated custom image that is not compatible with the latest runtime version.
AnswerA

Option A is correct because the error is an image pull failure, which is typically due to a missing image or insufficient permissions.

Why this answer

The UNHEALTHY status in Vertex AI Workbench typically occurs when the instance fails to start its container. The most likely cause is that the specified container image (gcr.io/my-project/my-image:latest) does not exist in the Container Registry, or the service account attached to the instance lacks the storage.objectViewer role on the registry bucket. Without this permission, the instance cannot pull the image, leading to a container creation failure and an UNHEALTHY state.

Exam trap

Google Cloud often tests the distinction between container image availability/permissions and network-level issues; the trap here is that candidates may assume a firewall or resource exhaustion is the cause, but the exhibit's focus on a specific container image points directly to a missing image or insufficient IAM permissions on the Container Registry.

How to eliminate wrong answers

Option B is wrong because while a firewall blocking egress to gcr.io could cause a pull failure, the exhibit does not mention any firewall rules, and the question asks for the 'most likely' cause based on the exhibit—lack of image existence or permissions is a more common and direct issue. Option C is wrong because Compute Engine resource exhaustion (e.g., CPU/memory) would typically cause a timeout or error during instance creation, not a persistent UNHEALTHY status after the instance is running; Vertex AI Workbench handles resource allocation separately. Option D is wrong because an outdated custom image would likely cause compatibility warnings or startup failures, but the exhibit shows a specific container image reference (gcr.io/my-project/my-image:latest), not a custom image issue; the UNHEALTHY status is tied to container pull failures, not image version mismatches.

92
MCQeasy

You manage a batch data processing workload on Compute Engine that runs daily on a single VM. The VM uses a standard persistent disk (pd-standard) for input data and output results. Recently, the VM crashed due to a hardware failure, and the job failed. You need to implement a solution that automatically recovers from VM failures with minimal data loss. The job is idempotent and can restart from the beginning if necessary. Which solution should you choose?

A.Take a snapshot of the persistent disk every hour and create a new VM from the latest snapshot on failure
B.Use Cloud Scheduler to restart the VM every hour until the job completes
C.Add a startup script to the existing VM to rerun the job on boot, and enable automatic restart
D.Create a managed instance group (MIG) with an instance template that includes a startup script to run the job, and enable autohealing
AnswerD

Correct: MIG autohealing recreates VM on failure.

Why this answer

Option D is correct because a managed instance group (MIG) with autohealing automatically recreates the VM from its instance template when it fails or stops passing its health check, and the startup script reruns the idempotent job from the beginning on the new VM. Recovery needs no manual intervention, and because the job can safely restart, nothing depends on state held by the failed VM.

Exam trap

The trap here is that candidates confuse automatic restart (which restarts the same VM after a host error but performs no health checking) with autohealing (which uses a health check and recreates the VM from its template), leading them to pick Option C instead of D.

How to eliminate wrong answers

Option A is wrong because hourly snapshots introduce up to 1 hour of potential data loss and require manual steps to create a new VM from the snapshot, which does not provide automatic recovery. Option B is wrong because Cloud Scheduler restarting the VM every hour does not detect actual VM failure; it blindly restarts on a schedule, which could interrupt a running job and does not address hardware failure recovery. Option C is wrong because automatic restart only restarts the existing VM after it is terminated by a host error; it has no health check, so it cannot detect a hung job or an unresponsive guest, and it cannot rebuild the VM from a known-good template if the VM itself is unusable.

93
MCQeasy

An application running on Compute Engine instances behind a load balancer experiences intermittent failures. Health checks show instances passing, but some users get errors. What should be the first troubleshooting step?

A.Increase instance size.
B.Review the application logs for errors.
C.Enable HTTP health checks.
D.Check the load balancer's backend service configuration for session affinity.
AnswerB

Logs reveal application-level errors.

Why this answer

The correct first step is to review the application logs (Option B) because the issue is intermittent failures despite healthy load balancer health checks. Since health checks confirm the instances are reachable and responding correctly at the health check endpoint, the problem likely lies within the application itself—such as request handling errors, timeouts, or resource contention. Application logs provide the most direct evidence of what is happening when users encounter errors, enabling targeted debugging before modifying infrastructure.

Exam trap

The trap here is that candidates assume health check failures are the cause of user errors, but Google Cloud tests the distinction between infrastructure-level health (passing) and application-level errors (logged), leading them to incorrectly adjust health checks or backend configuration instead of inspecting application logs.

How to eliminate wrong answers

Option A is wrong because increasing instance size addresses resource constraints (CPU/memory) but does not target the root cause of intermittent errors when health checks pass; it is a reactive scaling action, not a diagnostic step. Option C is wrong because enabling HTTP health checks (if not already enabled) would only change the health check protocol from TCP to HTTP, but the instances are already passing health checks, so the issue is not with health check configuration. Option D is wrong because checking the load balancer's backend service configuration for session affinity is premature; session affinity (sticky sessions) could cause uneven load distribution but would not explain intermittent errors if health checks are passing—this is a configuration review step, not the first troubleshooting action.
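As a rough illustration of what reviewing the application logs can look like, the Python sketch below counts 5xx responses per instance from a few log lines. The log format, the instance names, and the `count_server_errors` helper are all made up for this example; real application logs will differ.

```python
import re
from collections import Counter

# Assumed line format: "<instance> <HTTP status> <path>" (hypothetical).
LINE_RE = re.compile(r"^(?P<instance>\S+) (?P<status>\d{3}) (?P<path>\S+)$")


def count_server_errors(lines):
    """Return a Counter of 5xx responses keyed by instance name."""
    errors = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("status").startswith("5"):
            errors[match.group("instance")] += 1
    return errors


sample = [
    "web-1 200 /checkout",
    "web-2 500 /checkout",
    "web-2 200 /healthz",
    "web-2 503 /checkout",
    "web-1 200 /healthz",
]
print(count_server_errors(sample))
```

In this made-up sample the health check path returns 200 on both instances while `/checkout` fails on one of them, which mirrors the scenario: health checks pass, yet some users see errors.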

94
MCQmedium

A company uses Cloud Logging to monitor their application logs. They notice that some logs from their Compute Engine instances are missing. The instances have the required logging permission. What is the most likely cause?

A.The log sink is not configured correctly.
B.The logging agent is not configured to send logs to Cloud Logging.
C.The instances are using a custom image without the logging agent.
D.The log bucket is in a different project.
E.The log entries are being filtered by the exclusion filter.
AnswerB

The logging agent must be installed and configured to forward logs.

Why this answer

Compute Engine instances do not automatically send application logs to Cloud Logging. They require the Cloud Logging agent (based on fluentd) to be installed and configured to forward logs. Even with correct IAM permissions, logs are not collected if the agent is missing or is not configured to forward them.

Option B correctly identifies the agent's configuration as the most likely cause.

Exam trap

Google Cloud often tests the distinction between log collection (agent) and log routing (sinks) — the trap here is that candidates assume IAM permissions alone are sufficient, overlooking the mandatory agent installation and configuration step.

How to eliminate wrong answers

Option A is wrong because a log sink controls where logs are routed (e.g., to BigQuery or Pub/Sub), not whether logs are collected from instances; missing logs are a collection issue, not a routing issue. Option C is wrong because while a custom image might lack the agent, the question states the instances have the required logging permission, implying the agent could be installed separately; the most likely cause is the agent not being configured, not the image itself. Option D is wrong because log buckets in a different project would still receive logs if the sink is configured correctly; the issue is logs not appearing at all, not appearing in the wrong project.

Option E is wrong because exclusion filters remove logs after they are ingested; if logs are missing entirely, they were never ingested, so exclusion is not the cause.

95
Multi-Selecteasy

A company deploys a critical application on Google Kubernetes Engine (GKE) and wants to ensure high availability during cluster upgrades. Which TWO practices should they follow?

Select 2 answers
A.Use a single-zone node pool with multiple replicas.
B.Use multiple node pools across different zones within the cluster.
C.Configure PodDisruptionBudgets to allow only a small number of pods to be unavailable during upgrades.
D.Enable cluster autoscaling to add nodes during upgrades.
E.Enable regional clusters for multi-zone control plane.
AnswersB, C

Multi-zone node pools allow pods to be rescheduled in other zones during upgrades.

Why this answer

Option B is correct because deploying multiple node pools across different zones ensures that if one zone fails or is taken down for maintenance, the application can continue serving from the other zones. This aligns with GKE's best practice for high availability by distributing workloads across failure domains. Option C is correct because PodDisruptionBudgets (PDBs) define the minimum number of pods that must remain available during voluntary disruptions like cluster upgrades, preventing the upgrade from taking down too many replicas at once.

Exam trap

The trap here is that candidates often confuse control plane high availability (regional clusters) with application-level high availability, or they assume autoscaling can compensate for disruption during upgrades, when in fact PDBs and multi-zone node pools are the correct mechanisms.
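To make the PodDisruptionBudget behavior concrete, here is a small Python sketch of the arithmetic involved. With a `minAvailable` style budget, the number of pods that may be voluntarily evicted at a given moment is the count of currently healthy pods minus the required minimum, never below zero. The function name and the replica counts are illustrative assumptions, not values from the question.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that may be evicted right now under a minAvailable-style budget."""
    return max(0, healthy_pods - min_available)


# 6 healthy replicas with minAvailable=5: a node drain may evict one pod at a time.
print(allowed_disruptions(healthy_pods=6, min_available=5))  # 1
# If one replica is already unhealthy, evictions are blocked until it recovers.
print(allowed_disruptions(healthy_pods=5, min_available=5))  # 0
```

Spreading the replicas over node pools in several zones (Option B) gives evicted pods somewhere to be rescheduled, so the two practices complement each other.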

96
MCQeasy

A startup runs a web application on App Engine standard environment. They want to ensure the application can handle sudden traffic spikes without manual intervention. Which App Engine feature should they configure?

A.Manual scaling with a fixed number of instances.
B.Basic scaling with automatic instance creation.
C.Resident instances with a minimum number of always-on instances.
D.Custom scaling based on CPU utilization.
E.Automatic scaling with a maximum number of idle instances.
AnswerE

Automatic scaling dynamically creates instances to handle traffic spikes.

Why this answer

Option E is correct because App Engine's automatic scaling with a maximum number of idle instances is designed to handle sudden traffic spikes by dynamically creating and removing instances based on request load. This configuration allows the application to scale up quickly when traffic increases, ensuring responsiveness without manual intervention, while the maximum idle instances setting prevents over-provisioning and controls costs.

Exam trap

The trap here is that candidates often confuse 'basic scaling' with 'automatic scaling' because both involve dynamic instance creation, but basic scaling does not maintain idle instances and is unsuitable for handling sudden traffic spikes without latency.

How to eliminate wrong answers

Option A is wrong because manual scaling with a fixed number of instances requires manual intervention to adjust capacity, which does not handle sudden traffic spikes automatically. Option B is wrong because basic scaling creates instances only when a request is received and shuts them down after processing, leading to cold starts and latency under sudden spikes, and it does not maintain a pool of idle instances for immediate handling. Option C is wrong because resident instances with a minimum number of always-on instances are a feature of manual scaling, not automatic scaling, and they do not dynamically scale up or down in response to traffic spikes.

Option D is wrong because custom scaling based on CPU utilization is not a native App Engine scaling type; App Engine offers automatic, basic, and manual scaling, and custom scaling is not a supported configuration option.

97
MCQhard

Refer to the exhibit. The SLO for the payments-api service is 99.9% availability over 30 days. The current compliance is 99.89% and the error budget is exhausted. Which action should the SRE team take FIRST?

A.Increase the SLO target to 99.99% to reduce future burn rate.
B.Pause all non-critical deployments and investigate the cause of the increased error rate.
C.Trigger a rollback of the latest deployment to stabilize the service.
D.Scale up the service to handle more traffic and reduce error rate.
AnswerB

This aligns with error budget policy: when budget is exhausted, slow down or stop deployments to prevent further errors.

Why this answer

With the error budget exhausted and a high burn rate, the team should immediately stop all non-critical deployments to prevent further degradation and allow the error budget to recover.
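The numbers in the question can be checked with a short calculation. The Python sketch below assumes the simplest reading of the SLO, where the error budget is the allowed bad percentage (100 minus the target) and compliance is measured the same way. The function names are illustrative.

```python
from decimal import Decimal


def error_budget_consumed(slo_pct: str, compliance_pct: str) -> Decimal:
    """Fraction of the error budget used (1 means fully spent)."""
    budget = Decimal(100) - Decimal(slo_pct)        # allowed bad percentage
    spent = Decimal(100) - Decimal(compliance_pct)  # observed bad percentage
    return spent / budget


def allowed_bad_minutes(slo_pct: str, window_days: int) -> Decimal:
    """Time-based view of the budget: minutes of full unavailability allowed."""
    window_minutes = Decimal(window_days * 24 * 60)
    return window_minutes * (Decimal(100) - Decimal(slo_pct)) / Decimal(100)


print(error_budget_consumed("99.9", "99.89"))  # 1.1, i.e. 110% of the budget
print(allowed_bad_minutes("99.9", 30))         # 43.2 minutes per 30 days
```

At 99.89% compliance the service has used about 110% of a 99.9% budget, which is why the policy response is to pause non-critical deployments rather than to change the target.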

98
Multi-Selecthard

A team is designing a disaster recovery (DR) plan for a critical application. Which THREE components are essential for a robust DR plan? (Choose 3)

Select 3 answers
A.Failover procedures and runbooks
B.Regular backups to a separate region
C.A single-region deployment for consistency
D.Monitoring and alerting for disaster events
E.Load testing to validate performance
AnswersA, B, D

Well-documented failover steps ensure quick recovery.

Why this answer

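Failover procedures and runbooks (A) are essential because they provide step-by-step instructions for executing a controlled transition to the secondary site, ensuring minimal downtime and consistent recovery actions. Without documented runbooks, teams risk misconfigurations during a disaster, which can push recovery time past the recovery time objective (RTO). Regular backups to a separate region (B) keep a recoverable copy of the data outside the failure domain of the primary region. Monitoring and alerting for disaster events (D) lets the team detect the failure promptly and start the failover procedures.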

Exam trap

Google Cloud often tests the misconception that a single-region deployment is acceptable for DR if it has high availability within that region, but the exam emphasizes that DR requires geographic separation to survive a full regional failure.

99
MCQeasy

A developer wants to monitor the CPU usage of a single Compute Engine VM and receive alerts when it exceeds 80%. What is the simplest way to achieve this?

A.Query the Compute Engine API periodically and check CPU usage.
B.Configure a Cloud Logging sink to BigQuery and set a scheduled query to detect high CPU.
C.Install the Cloud Monitoring agent and create an alerting policy based on the metric 'cpu.utilization'.
D.Use the managed instance group's autoscaling metric to trigger a notification.
AnswerC

The Monitoring agent collects CPU utilization from the OS and sends it to Cloud Monitoring, where you can set alerts.

Why this answer

Option C is correct because the Cloud Monitoring agent (formerly Stackdriver agent) collects CPU utilization metrics from Compute Engine VMs and sends them to Cloud Monitoring. You can then create an alerting policy directly on the metric 'cpu.utilization' with a threshold of 80% without any custom scripting or additional infrastructure. This is the simplest and most native approach for a single VM.

Exam trap

Google Cloud often tests the misconception that you need to export logs to BigQuery or query APIs manually, when in fact the Cloud Monitoring agent provides a built-in, agent-based metric that can be alerted on directly.

How to eliminate wrong answers

Option A is wrong because periodically querying the Compute Engine API for CPU usage is inefficient, requires custom code, and does not provide real-time alerting; the API does not expose high-frequency CPU metrics natively. Option B is wrong because exporting logs to BigQuery and running scheduled queries adds unnecessary complexity, latency, and cost; Cloud Logging sinks are for log data, not for real-time metric-based alerting. Option D is wrong because managed instance group autoscaling metrics are designed for scaling groups of VMs, not for alerting on a single VM's CPU usage; they do not trigger notifications directly.
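Conceptually, a threshold alerting policy fires when the metric stays above the limit for a configured duration. The toy Python function below mimics that check on a list of samples. It only illustrates the condition logic and is not the Cloud Monitoring API; the sample values, the three-sample window, and the `breaches_threshold` name are assumptions.

```python
def breaches_threshold(samples, threshold=0.80, min_consecutive=3):
    """True if `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# CPU utilization as fractions of 1.0, one sample per minute (made-up data).
print(breaches_threshold([0.55, 0.85, 0.70, 0.90, 0.83]))  # False: no 3 in a row
print(breaches_threshold([0.60, 0.82, 0.91, 0.88, 0.40]))  # True
```

Requiring several consecutive samples above 80% avoids alerting on a single short spike, which is the usual reason an alerting condition carries a duration.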
