Free PCA Ensure solution and operations reliability Practice Questions (2026)

Q: How many Ensure solution and operations reliability questions are on the PCA exam?

The Ensure solution and operations reliability domain is one of the weighted domains on the PCA exam. The Courseiva question bank has 99 practice questions for this domain.

Q: How can I practice Ensure solution and operations reliability questions for PCA?

Click any of the 99 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Ensure solution and operations reliability domain.


All PCA Ensure solution and operations reliability questions (99)


Click any question to see the full explanation and answer options, or start a focused practice session above.

A company runs a critical application on Compute Engine instances in a managed instance group (MIG) with autoscaling. During a traffic spike, some instances become unhealthy but are not automatically replaced. What is the most likely cause?

A company is designing a disaster recovery plan for a Cloud SQL for PostgreSQL instance. They want to failover to a different region with minimal data loss and recovery time under 10 minutes. The database is 500 GB and experiences 2,000 write transactions per second. Which solution should they use?

A company uses Cloud Spanner for a global financial application. They experience increased latency and transaction aborts during peak hours. Which measure should they take first to improve reliability?

A company deploys a microservices application on Google Kubernetes Engine (GKE). Pods in one deployment are frequently OOMKilled. The team sets memory requests and limits, but pods still crash. What is the most likely remaining cause?

An organization uses Cloud Functions (2nd gen) for event-driven processing. They notice that some functions fail with 'memory limit exceeded' errors during peak load. The function processes messages from Pub/Sub and writes to Firestore. What should they do to improve reliability without sacrificing throughput?

A company deploys a stateful workload using StatefulSets on GKE. They want to ensure that if a pod is evicted, its persistent volume claim (PVC) is reattached to the replacement pod in the same zone. Which configuration achieves this?

A company monitors their application with Cloud Monitoring. They set up an alerting policy to notify the on-call team when the 99th percentile latency exceeds 500 ms for 5 minutes. However, they receive false positive alerts due to short bursts. How should they refine the policy?

A company runs a web application on Compute Engine behind an HTTP load balancer. They want to improve reliability by implementing failover across two regions. Which TWO actions should they take?

A company uses Cloud CDN to accelerate content delivery. They notice that some users receive stale content even after purging the cache. Which THREE factors could cause this?

A company deploys a critical application on Google Kubernetes Engine (GKE) and wants to ensure high availability during cluster upgrades. Which TWO practices should they follow?

A company runs a multi-tier application on Google Cloud: a frontend on App Engine Standard, a backend on Cloud Run, and a Cloud SQL database. The application experiences intermittent 500 errors when users submit forms. The errors correlate with high CPU usage on the Cloud SQL instance (db-n1-standard-2, 7.5 GB memory). The Cloud Run service has a concurrency setting of 80 and a maximum of 10 instances. The App Engine service uses automatic scaling. The team has verified that the application code is not the issue. They suspect the database is hitting connection limits. Current max_connections on Cloud SQL is 250. The Cloud Run service uses a connection pool of 10 connections per instance. The App Engine service uses a connection pool of 5 connections per instance. They also have a few batch jobs that run occasionally, using up to 10 connections. The team wants to resolve the errors with minimal cost and complexity. Which course of action should they take?

A company runs a web application on Google Kubernetes Engine (GKE) with Cluster Autoscaler enabled. During a traffic spike, the application becomes slow and some requests time out. The cluster has sufficient CPU and memory headroom. What is the most likely cause and solution?

An organization is migrating a legacy monolithic application to Google Cloud. The application currently runs on a single server with an on-premises database. The application is stateful and requires low-latency access to the database. The migration must minimize downtime and ensure high availability. Which architecture should the company adopt?

A company uses Cloud SQL for MySQL to host its production database. The database experiences high read traffic. The team wants to improve read performance without modifying the application. What should they do?

A company is running a critical application on Compute Engine. The application writes logs to a local persistent disk. The operations team wants to ensure logs are not lost if the VM fails. What should they do?

Which TWO options are best practices for ensuring high availability of an application running on Google Kubernetes Engine (GKE)?

Which THREE options are valid strategies for disaster recovery (DR) in Google Cloud?

A company runs a batch processing workload on Compute Engine that processes financial transactions. The workload runs daily and must complete within a 4-hour window. The application reads input data from Cloud Storage, processes it, and writes output to another Cloud Storage bucket. The current implementation uses a single VM with a 500 GB persistent disk. Recently, the data volume has increased, and the job is now taking over 6 hours, exceeding the SLA. The team is tasked with redesigning the solution to be faster and more reliable. They want to minimize costs and operational overhead. The data is critical and must not be lost. Which approach should they take?

A company has deployed a critical application on Google Kubernetes Engine (GKE) with a Regional cluster (us-central1). The application uses a Cloud SQL for PostgreSQL database with a cross-region replica for disaster recovery. The SRE team needs to ensure that the application can survive a regional outage with minimal data loss. Which TWO actions should the team take to improve the reliability of the solution?

You are investigating a Vertex AI Workbench instance (instance-2) that is showing UNHEALTHY status. Based on the exhibit, what is the most likely cause of the issue?

Your company runs an e-commerce platform on Google Cloud. The application is deployed on Compute Engine instances in a managed instance group (MIG) with autoscaling based on CPU utilization. The database uses Cloud SQL for MySQL with a single instance. During a recent flash sale, traffic spiked and the application became slow, resulting in a poor user experience. After analyzing the incident, you discovered that the MIG scaled up but the Cloud SQL instance reached its maximum connections limit, causing some requests to fail. You need to recommend a solution to improve the reliability of the application for future traffic spikes. What should you do?

Which TWO actions should you take to improve the reliability of a stateful application deployed on Compute Engine with regional persistent disks?

You are running a Kubernetes cluster in GKE with the default node pool configuration shown in the exhibit. Your application requires high disk I/O performance. You notice that the application is experiencing high latency for disk operations. What is the most likely cause?

Your company runs a critical application on Google Kubernetes Engine (GKE) with 5 nodes. The application experiences intermittent high latency every Friday afternoon. The team has ruled out infrastructure issues and suspects the application logic. You need to instrument the application to identify the root cause. Which approach should you take?

Drag and drop the steps to set up a Cloud VPN tunnel between Google Cloud and an on-premises network into the correct order.

Match each GCP networking concept to its definition.

A company deploys a web application on Compute Engine behind an HTTP Load Balancer. They want to ensure only healthy instances receive traffic. What should they configure?

An application uses Cloud Pub/Sub for asynchronous processing. Subscribers occasionally fail to acknowledge messages within the ack deadline, causing redelivery. How can the team improve reliability and prevent message buildup?

A global application uses Cloud Spanner with a multi-region configuration. During a regional outage, some transactions are failing. What is the recommended approach to maintain write availability?

A startup uses Cloud Functions for event-driven processing. They notice some functions are timing out. How can they increase reliability without changing the business logic?

An organization wants to define an SLO for their API hosted on Cloud Endpoints. Which metric should they use as a Service Level Indicator (SLI) for availability?

A company uses Cloud Armor to protect their HTTP Load Balancer from DDoS attacks. During a traffic spike from a legitimate source, legitimate requests are being blocked. How should they tune the security policy to minimize false positives?

A developer wants to monitor a custom application metric from their application running on GKE. What should they use?

After a data corruption incident, a company needs to restore their Cloud SQL for PostgreSQL instance from a backup. What is the correct procedure to minimize downtime?

A company runs a stateful application on Compute Engine with persistent disks. They want to ensure data durability across a zone failure. What is the best approach?

A company wants to improve the reliability of their microservices architecture on Google Cloud. Which TWO practices should they implement? (Choose 2)

A team is designing a disaster recovery (DR) plan for a critical application. Which THREE components are essential for a robust DR plan? (Choose 3)

A company wants to monitor the health of their Cloud Run services. Which THREE metrics should they use to define a comprehensive health SLI? (Choose 3)

A company is deploying a critical application on Compute Engine with an HTTP load balancer. They want to ensure that if an instance health check fails, traffic is automatically rerouted to healthy instances. Which configuration should they implement?

A financial services company runs a stateful backend service on Google Kubernetes Engine (GKE) using StatefulSets with Persistent Volumes. They observe that after a node failure, the pod is rescheduled on a different node but the Persistent Volume cannot be attached because it is still "released" and not "available". What is the most likely cause and solution?

A startup runs a web application on App Engine standard environment. They want to ensure the application can handle sudden traffic spikes without manual intervention. Which App Engine feature should they configure?

A company uses Cloud Storage for backups of on-premises databases. They want to ensure that data is protected against accidental deletion or modification by users. Which combination of features should they enable?

A company uses Cloud Logging to monitor their application logs. They notice that some logs from their Compute Engine instances are missing. The instances have the required logging permission. What is the most likely cause?

A company uses Cloud NAT to allow private instances to access the internet. They notice intermittent connectivity issues. What should they check first?

A company needs to deploy a stateless web application that can handle variable traffic. Which compute option is the most cost-effective and scales automatically?

A company uses Cloud Interconnect to connect their on-premises network to GCP. They want to ensure that if one interconnect link fails, traffic is automatically rerouted to another link. Which configuration should they implement?

A company runs a batch process every night that loads data into BigQuery. They want to ensure that if the job fails, it is retried automatically up to 3 times. Which configuration should they use?

A company runs a critical application on a Compute Engine instance. They want to ensure that the application remains available even if the instance crashes. Which two GCP features should they use? (Choose two.)

An organization deploys a microservices application on Google Kubernetes Engine (GKE) with multiple Deployments. They want to ensure that the application remains available during a cluster-wide upgrade. Which three best practices should they follow? (Choose three.)

A company uses Cloud Storage to store user-uploaded content. They want to ensure that the data is highly durable and protected against accidental deletion. Which two features should they enable? (Choose two.)

A developer ran the above command to create a health check for a backend service. Which of the following should they do to resolve the error?

After deploying the above configuration, the application is not receiving traffic from the Kubernetes Service. The Service is correctly configured to target port 8080. What is the most likely issue?

A company uses the above IAM policy on a Cloud Storage bucket. They find that Bob can view objects in the bucket. Which statement explains this?

Your company runs a stateless web application on Compute Engine. You want to ensure that if a zone fails, the application continues to serve traffic with minimal manual intervention. What should you do?

You are using Cloud SQL for PostgreSQL. You want to ensure that data can be recovered to any point within the last 7 days. What should you enable?

A developer wants to monitor the CPU usage of a single Compute Engine VM and receive alerts when it exceeds 80%. What is the simplest way to achieve this?

Your company's global e-commerce platform uses a managed instance group (MIG) in us-central1 and a Cloud Load Balancer. Traffic has grown, and you want to improve availability by distributing load across multiple regions. What should you do?

Your organization uses Cloud Spanner for a customer database with a 99.999% availability SLA. You need a Disaster Recovery plan that ensures data consistency with zero RPO in case of a region failure. What should you do?

Your team manages a service with a 99.9% uptime SLO over a 30-day window. The error budget for this period is 43 minutes. In the first week, outages consumed 30 minutes of the budget. You are planning a new release. What should you do?

You are designing a Dataflow streaming pipeline for real-time event processing. The pipeline must be cost-effective while tolerating worker failures without data loss. Which configuration should you use?

Your company runs a critical multi-tier application: a global HTTP(S) load balancer, multiple regional managed instance groups (MIGs) for the web tier, and Cloud Spanner for the data tier. You need to design for zone-level and region-level failures. What architecture ensures the highest availability?

You are responsible for incident management for a production service. You want to reduce manual toil during the initial response to common issues like high latency. What is the best approach?

You are deploying a stateless web application on Compute Engine. Which TWO actions improve availability? (Choose 2)

Your organization is implementing a Disaster Recovery plan for a critical database. Which THREE components are essential for a robust DR strategy? (Choose 3)

Your service has a 99.99% uptime SLO (monthly error budget ~ 4 minutes). Which TWO monitoring practices best support this SLO? (Choose 2)

The exhibit shows the output of a 'gcloud compute instances describe' command for an instance. What is the most likely impact on reliability if the host machine needs maintenance?

The exhibit shows a Cloud Storage bucket configuration. What does this configuration ensure?

The exhibit shows a managed instance group configuration. What is the primary purpose of the 'autoHealingPolicies' section?

A company runs a global e-commerce site on GKE. They want to ensure disaster recovery with multi-region deployment. What is the best practice for configuring GKE clusters?

An application running on Compute Engine instances behind a load balancer experiences intermittent failures. Health checks show instances passing, but some users get errors. What should be the first troubleshooting step?

A company uses Cloud SQL for PostgreSQL. They want to minimize downtime during maintenance. Which feature should they enable?

A company has a microservices architecture on GKE. One service is failing due to resource exhaustion. How can they proactively prevent this?

A company wants to monitor their Cloud Run services for errors and latency. Which Google Cloud product should they use?

An organization needs to meet an RTO of 1 hour for a critical application running on GCE with persistent disks. What is the most cost-effective approach?

A company has a Spanner instance for global transactions. They need to ensure reliability during a regional outage. What is the best approach?

A team is using Cloud Functions and wants to ensure retries on failure. What is the best practice?

A company uses Cloud Storage for backup data. They want to protect against accidental deletion. Which option is best?

A company is designing a highly available application on GCE. Which TWO steps should they take to ensure reliability?

A company runs a stateful application on GKE using StatefulSets. Which THREE practices improve reliability?

A company is migrating a critical database to Cloud SQL for MySQL. Which TWO actions ensure high availability?

A user reports that an application running on instance-1 is unreliable and often restarts. What is the most likely cause?

Company A runs a containerized application on Google Kubernetes Engine (GKE) with 3 node pools: one for frontend, one for backend, and one for stateful databases. The backend services experience periodic latency spikes. After investigation, they found that the spikes correlate with the node pool autoscaler scaling down nodes. The backend services are deployed as Deployments with resource requests and limits set to 100m CPU and 200Mi memory each. The node pool uses n1-standard-2 machine types. The cluster autoscaler is enabled. What should they do to prevent the latency spikes?

Company B uses Cloud Endpoints to expose their API. Recently, they started seeing 503 errors during periods of high traffic. They have enabled Cloud Endpoints with a moderate quota. The backend is running on Cloud Run. The Cloud Run service is configured with min instances = 0 and max instances = 100. The container concurrency is set to 80. The average request latency is 200ms. What is the most likely cause and what should they do?

A company runs a critical application on Compute Engine instances in a managed instance group (MIG) with autoscaling. Users report intermittent 503 errors during traffic spikes. Which action should the company take to improve reliability?

A company uses Cloud Spanner for a global financial application. They need to ensure that a regional outage does not cause data loss. The application requires strong consistency and low latency reads and writes across multiple regions. Which configuration meets the reliability requirements?

A company runs a microservices-based application on Google Kubernetes Engine (GKE) with a Regional cluster. They want to improve reliability by implementing best practices for pod scheduling and resilience. Which TWO actions should they take? (Choose two.)

A company runs a stateful workload on Compute Engine with regional persistent disks (PD). They need to implement a disaster recovery (DR) plan with a Recovery Point Objective (RPO) of less than 1 hour and Recovery Time Objective (RTO) of less than 4 hours. Which THREE steps should they include in their DR plan? (Choose three.)

You are the lead cloud architect for a startup that runs a web application on Google Kubernetes Engine (GKE) with a standard (zonal) cluster. The application is deployed with 3 replicas of a stateless frontend service. During a recent incident, a zone outage caused all GKE nodes to become unavailable, leading to application downtime of 45 minutes. You need to redesign the cluster to tolerate a single zone failure with no more than 5 minutes of downtime. Your budget allows for at most a 20% increase in compute costs. Which approach should you take?

You manage a batch data processing workload on Compute Engine that runs daily on a single VM. The VM uses a standard persistent disk (pd-standard) for input data and output results. Recently, the VM crashed due to a hardware failure, and the job failed. You need to implement a solution that automatically recovers from VM failures with minimal data loss. The job is idempotent and can restart from the beginning if necessary. Which solution should you choose?

Your company runs a customer-facing API on Cloud Run with a concurrency setting of 80. The API calls a backend Cloud Function that performs a heavy computation (2–5 seconds). During peak hours, the API experiences increased latency and some requests time out after 60 seconds. Monitoring shows that the Cloud Run max instances is set to 100, and the Cloud Function max instances is set to 10. The timeout for Cloud Run is set to 300 seconds. The Cloud Function's timeout is set to 540 seconds. You need to reduce end-to-end latency and prevent timeouts while minimizing cost. Which action is most effective?

You are designing a high-availability architecture for a global e-commerce platform that uses Cloud SQL for MySQL as the primary database. The application writes to a single Cloud SQL instance in us-central1 and reads from read replicas in us-central1 and us-west1. During a recent regional outage in us-central1, the primary instance became unavailable, and the application experienced full downtime for 3 hours because the failover to a read replica was not automatic. The application can tolerate up to 10 minutes of data loss but needs to recover within 30 minutes. You need to automate failover to a geographically distant region with minimal manual intervention. The application's connection string must not change. Which solution meets these requirements?

Your company runs a data pipeline on Google Cloud using Cloud Dataflow for streaming processing from Pub/Sub to BigQuery. The pipeline writes to a BigQuery table partitioned by day. The data is used for real-time dashboards. Recently, a spike in traffic caused the Dataflow pipeline to fall behind, and the dashboard displayed stale data. You need to design the pipeline to handle traffic spikes without data loss or long delays. The pipeline must be cost-efficient and use defaults where possible. Which solution should you implement?

A company runs a containerized application on Cloud Run. Which TWO actions will most improve the reliability of the service?

A financial services company is migrating a monolithic Java application to Google Kubernetes Engine (GKE) for improved scalability and reliability. The application serves real-time trading data and has strict latency requirements. Post-migration, the team observes frequent pod restarts due to OutOfMemory (OOM) errors, increased latency during peak trading hours, and occasional database connection timeouts. The current setup uses a single GKE cluster with a node pool of n1-standard-4 machines, a stateless application deployed as a Deployment with resource requests and limits set to 512 Mi memory and 1 CPU. The database is a Cloud SQL PostgreSQL instance with 2 vCPUs and 7.5 GB memory, and applications connect using a hardcoded connection string. The team wants to ensure reliable operation under load and during node maintenance events. Which course of action best addresses the reliability issues?

Refer to the exhibit. The SLO for the payments-api service is 99.9% availability over 30 days. The current compliance is 99.89% and the error budget is exhausted. Which action should the SRE team take FIRST?

Refer to the exhibit. The exhibit shows logs and a metric from a GCE instance that was terminated. The instance was part of a managed instance group. Which diagnostic step should be taken FIRST to prevent recurrence?

Refer to the exhibit. A Deployment Manager template deploys a GKE cluster and a job that publishes to Pub/Sub. The job fails with a permission error. Which change would fix the issue?

Refer to the exhibit. The process-image function fails intermittently with a memory limit exceeded error. Which action will MOST effectively resolve the issue?

Refer to the exhibit. The HPA is configured to scale based on CPU, but it has not scaled up despite the CPU usage being above the target. Which is the most likely cause?
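Several of the questions above quote error budgets: 43 minutes for a 99.9% SLO over 30 days, and roughly 4 minutes for a 99.99% monthly SLO. The sketch below shows one way to reproduce that arithmetic. The function name `error_budget_minutes` is hypothetical, and the code assumes a time-based availability SLO and a 30-day month; a request-based SLO would be computed from request counts instead.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based availability SLO.

    Hypothetical helper for illustration only; assumes the SLO is measured
    as uptime over a fixed window of whole days.
    """
    total_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return total_minutes * (100.0 - slo_percent) / 100.0


# 99.9% over 30 days, as in the release-planning question.
budget_999 = error_budget_minutes(99.9)
# 99.99% over an assumed 30-day month.
budget_9999 = error_budget_minutes(99.99)
# Budget left after the 30 minutes of outages in the first week.
remaining = budget_999 - 30

print(round(budget_999, 2))   # 43.2
print(round(budget_9999, 2))  # 4.32
print(round(remaining, 1))    # 13.2
```

Under these assumptions, the team in the 99.9% scenario has used about 70% of its budget in the first week, which is the kind of figure an error budget policy is meant to act on.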


Other PCA exam domains

Design and plan a cloud solution architecture
Manage and provision cloud infrastructure
Design for security and compliance
Analyze and optimize technical and business processes
Manage implementation of cloud architecture

Frequently asked questions

What does the Ensure solution and operations reliability domain cover on the PCA exam?

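The Ensure solution and operations reliability domain covers the key concepts tested in this area of the PCA exam blueprint published by Google Cloud. The questions on this page span high availability, disaster recovery, autoscaling, monitoring and alerting, and SLOs and error budgets, across services such as Compute Engine, GKE, Cloud SQL, Cloud Spanner, and Cloud Run. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PCA domains, with no account required.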

How many Ensure solution and operations reliability questions are in the PCA question bank?

The Courseiva PCA question bank contains 99 questions in the Ensure solution and operations reliability domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Ensure solution and operations reliability for PCA?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Ensure solution and operations reliability questions for PCA?

Yes — the session launcher on this page draws questions exclusively from the Ensure solution and operations reliability domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
