Google PCA Ensure solution and operations reliability — All Questions With Answers

Question 1 (medium, multiple choice)

A company runs a critical application on Compute Engine instances in a managed instance group (MIG) with autoscaling. During a traffic spike, some instances become unhealthy but are not automatically replaced. What is the most likely cause?

Question 2 (hard, multiple choice)

A company is designing a disaster recovery plan for a Cloud SQL for PostgreSQL instance. They want to fail over to a different region with minimal data loss and a recovery time under 10 minutes. The database is 500 GB and experiences 2,000 write transactions per second. Which solution should they use?

Question 3 (easy, multiple choice)

A company uses Cloud Spanner for a global financial application. They experience increased latency and transaction aborts during peak hours. Which measure should they take first to improve reliability?

Question 4 (medium, multiple choice)

A company deploys a microservices application on Google Kubernetes Engine (GKE). Pods in one deployment are frequently OOMKilled. The team sets memory requests and limits, but pods still crash. What is the most likely remaining cause?

Question 5 (hard, multiple choice)

An organization uses Cloud Functions (2nd gen) for event-driven processing. They notice that some functions fail with 'memory limit exceeded' errors during peak load. The function processes messages from Pub/Sub and writes to Firestore. What should they do to improve reliability without sacrificing throughput?

Question 6 (easy, multiple choice)

A company deploys a stateful workload using StatefulSets on GKE. They want to ensure that if a pod is evicted, its persistent volume claim (PVC) is reattached to the replacement pod in the same zone. Which configuration achieves this?

Question 7 (medium, multiple choice)

A company monitors their application with Cloud Monitoring. They set up an alerting policy to notify the on-call team when the 99th percentile latency exceeds 500 ms for 5 minutes. However, they receive false positive alerts due to short bursts. How should they refine the policy?

Question 8 (medium, multi-select)

A company runs a web application on Compute Engine behind an HTTP load balancer. They want to improve reliability by implementing failover across two regions. Which TWO actions should they take?

Question 9 (hard, multi-select)

A company uses Cloud CDN to accelerate content delivery. They notice that some users receive stale content even after purging the cache. Which THREE factors could cause this?

Question 10 (easy, multi-select)

A company deploys a critical application on Google Kubernetes Engine (GKE) and wants to ensure high availability during cluster upgrades. Which TWO practices should they follow?

Question 11 (hard, multiple choice)

A company runs a multi-tier application on Google Cloud: a frontend on App Engine Standard, a backend on Cloud Run, and a Cloud SQL database. The application experiences intermittent 500 errors when users submit forms. The errors correlate with high CPU usage on the Cloud SQL instance (db-n1-standard-2, 7.5 GB memory). The Cloud Run service has a concurrency setting of 80 and a maximum of 10 instances. The App Engine service uses automatic scaling. The team has verified that the application code is not the issue. They suspect the database is hitting connection limits. Current max_connections on Cloud SQL is 250. The Cloud Run service uses a connection pool of 10 connections per instance. The App Engine service uses a connection pool of 5 connections per instance. They also have a few batch jobs that run occasionally, using up to 10 connections. The team wants to resolve the errors with minimal cost and complexity. Which course of action should they take?
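The connection figures in this scenario can be checked with a little arithmetic. The sketch below is a rough worst-case estimate that assumes every pool is fully open at once. The constant and function names are illustrative, and the App Engine instance count is left as a variable because the question does not state it.

```python
# Figures taken from the question text.
MAX_CONNECTIONS = 250          # Cloud SQL max_connections
CLOUD_RUN_MAX_INSTANCES = 10   # Cloud Run instance cap
CLOUD_RUN_POOL_SIZE = 10       # connections per Cloud Run instance
APP_ENGINE_POOL_SIZE = 5       # connections per App Engine instance
BATCH_CONNECTIONS = 10         # occasional batch jobs


def peak_connections(app_engine_instances: int) -> int:
    """Worst-case open connections for a given App Engine instance count."""
    cloud_run = CLOUD_RUN_MAX_INSTANCES * CLOUD_RUN_POOL_SIZE
    app_engine = app_engine_instances * APP_ENGINE_POOL_SIZE
    return cloud_run + app_engine + BATCH_CONNECTIONS


def app_engine_instance_ceiling() -> int:
    """Largest App Engine instance count that still fits under the limit."""
    fixed = CLOUD_RUN_MAX_INSTANCES * CLOUD_RUN_POOL_SIZE + BATCH_CONNECTIONS
    return (MAX_CONNECTIONS - fixed) // APP_ENGINE_POOL_SIZE


print(app_engine_instance_ceiling())  # 28
print(peak_connections(28))           # 250
print(peak_connections(40))           # 310, above the limit
```

Under these assumptions Cloud Run and the batch jobs account for at most 110 connections, so the unbounded App Engine tier is the only component that can push the total past 250.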

Question 12 (medium, multiple choice)

A company runs a web application on Google Kubernetes Engine (GKE) with Cluster Autoscaler enabled. During a traffic spike, the application becomes slow and some requests time out. The cluster has sufficient CPU and memory headroom. What is the most likely cause and solution?

Question 13 (hard, multiple choice)

An organization is migrating a legacy monolithic application to Google Cloud. The application currently runs on a single server with an on-premises database. The application is stateful and requires low-latency access to the database. The migration must minimize downtime and ensure high availability. Which architecture should the company adopt?

Question 14 (easy, multiple choice)

A company uses Cloud SQL for MySQL to host its production database. The database experiences high read traffic. The team wants to improve read performance without modifying the application. What should they do?

Question 15 (hard, multiple choice)

A company is running a critical application on Compute Engine. The application writes logs to a local persistent disk. The operations team wants to ensure logs are not lost if the VM fails. What should they do?

Question 16 (medium, multi-select)

Which TWO options are best practices for ensuring high availability of an application running on Google Kubernetes Engine (GKE)?

Question 17 (hard, multi-select)

Which THREE options are valid strategies for disaster recovery (DR) in Google Cloud?

Question 18 (hard, multiple choice)

A company runs a batch processing workload on Compute Engine that processes financial transactions. The workload runs daily and must complete within a 4-hour window. The application reads input data from Cloud Storage, processes it, and writes output to another Cloud Storage bucket. The current implementation uses a single VM with a 500 GB persistent disk. Recently, the data volume has increased, and the job is now taking over 6 hours, exceeding the SLA. The team is tasked with redesigning the solution to be faster and more reliable. They want to minimize costs and operational overhead. The data is critical and must not be lost. Which approach should they take?

Question 19 (medium, multi-select)

A company has deployed a critical application on Google Kubernetes Engine (GKE) with a Regional cluster (us-central1). The application uses a Cloud SQL for PostgreSQL database with a cross-region replica for disaster recovery. The SRE team needs to ensure that the application can survive a regional outage with minimal data loss. Which TWO actions should the team take to improve the reliability of the solution?

Question 20 (hard, multiple choice)

You are investigating a Vertex AI Workbench instance (instance-2) that is showing UNHEALTHY status. Based on the exhibit, what is the most likely cause of the issue?

[Exhibit: network topology diagram, not reproduced here]

Question 21 (easy, multiple choice)

Your company runs an e-commerce platform on Google Cloud. The application is deployed on Compute Engine instances in a managed instance group (MIG) with autoscaling based on CPU utilization. The database uses Cloud SQL for MySQL with a single instance. During a recent flash sale, traffic spiked and the application became slow, resulting in a poor user experience. After analyzing the incident, you discovered that the MIG scaled up but the Cloud SQL instance reached its maximum connections limit, causing some requests to fail. You need to recommend a solution to improve the reliability of the application for future traffic spikes. What should you do?

Question 22 (medium, multi-select)

Which TWO actions should you take to improve the reliability of a stateful application deployed on Compute Engine with regional persistent disks?

Question 23 (hard, multiple choice)

You are running a Kubernetes cluster in GKE with the default node pool configuration shown in the exhibit. Your application requires high disk I/O performance. You notice that the application is experiencing high latency for disk operations. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
$ gcloud container clusters describe my-cluster --region us-central1
...
nodePools:
- config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: n1-standard-2
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
  initialNodeCount: 3
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
...
```

Question 24 (easy, multiple choice)

Your company runs a critical application on Google Kubernetes Engine (GKE) with 5 nodes. The application experiences intermittent high latency every Friday afternoon. The team has ruled out infrastructure issues and suspects the application logic. You need to instrument the application to identify the root cause. Which approach should you take?

Question 25 (medium, drag order)

Drag and drop the steps to set up a Cloud VPN tunnel between Google Cloud and an on-premises network into the correct order.

Question 26 (medium, matching)

Match each GCP networking concept to its definition.

Virtual Private Cloud for isolated network

Regional IP address range within a VPC

Controls ingress/egress traffic

Dynamically exchange routes using BGP

Connect two VPCs privately

Question 27 (easy, multiple choice)

A company deploys a web application on Compute Engine behind an HTTP Load Balancer. They want to ensure only healthy instances receive traffic. What should they configure?

Question 28 (medium, multiple choice)

An application uses Cloud Pub/Sub for asynchronous processing. Subscribers occasionally fail to acknowledge messages within the ack deadline, causing redelivery. How can the team improve reliability and prevent message buildup?

Question 29 (hard, multiple choice)

A global application uses Cloud Spanner with a multi-region configuration. During a regional outage, some transactions are failing. What is the recommended approach to maintain write availability?

Question 30 (easy, multiple choice)

A startup uses Cloud Functions for event-driven processing. They notice some functions are timing out. How can they increase reliability without changing the business logic?

Question 31 (medium, multiple choice)

An organization wants to define an SLO for their API hosted on Cloud Endpoints. Which metric should they use as a Service Level Indicator (SLI) for availability?

Question 32 (hard, multiple choice)

A company uses Cloud Armor to protect their HTTP Load Balancer from DDoS attacks. During a traffic spike from a legitimate source, legitimate requests are being blocked. How should they tune the security policy to minimize false positives?

Question 33 (easy, multiple choice)

A developer wants to monitor a custom application metric from their application running on GKE. What should they use?

Question 34 (medium, multiple choice)

After a data corruption incident, a company needs to restore their Cloud SQL for PostgreSQL instance from a backup. What is the correct procedure to minimize downtime?

Question 35 (hard, multiple choice)

A company runs a stateful application on Compute Engine with persistent disks. They want to ensure data durability across a zone failure. What is the best approach?

Question 36 (medium, multi-select)

A company wants to improve the reliability of their microservices architecture on Google Cloud. Which TWO practices should they implement? (Choose 2)

Question 37 (hard, multi-select)

A team is designing a disaster recovery (DR) plan for a critical application. Which THREE components are essential for a robust DR plan? (Choose 3)

Question 38 (easy, multi-select)

A company wants to monitor the health of their Cloud Run services. Which THREE metrics should they use to define a comprehensive health SLI? (Choose 3)

Question 39 (medium, multiple choice)

A company is deploying a critical application on Compute Engine with an HTTP load balancer. They want to ensure that if an instance health check fails, traffic is automatically rerouted to healthy instances. Which configuration should they implement?

Question 40 (hard, multiple choice)

A financial services company runs a stateful backend service on Google Kubernetes Engine (GKE) using StatefulSets with Persistent Volumes. They observe that after a node failure, the pod is rescheduled on a different node but the Persistent Volume cannot be attached because it is still "released" and not "available". What is the most likely cause and solution?

Question 41 (easy, multiple choice)

A startup runs a web application on App Engine standard environment. They want to ensure the application can handle sudden traffic spikes without manual intervention. Which App Engine feature should they configure?

Question 42 (medium, multiple choice)

A company uses Cloud Storage for backups of on-premises databases. They want to ensure that data is protected against accidental deletion or modification by users. Which combination of features should they enable?

Question 43 (medium, multiple choice)

A company uses Cloud Logging to monitor their application logs. They notice that some logs from their Compute Engine instances are missing. The instances have the required logging permission. What is the most likely cause?

Question 44 (hard, multiple choice)

A company uses Cloud NAT to allow private instances to access the internet. They notice intermittent connectivity issues. What should they check first?

Question 45 (easy, multiple choice)

A company needs to deploy a stateless web application that can handle variable traffic. Which compute option is the most cost-effective and scales automatically?

Question 46 (medium, multiple choice)

A company uses Cloud Interconnect to connect its on-premises network to Google Cloud. They want to ensure that if one interconnect link fails, traffic is automatically rerouted to another link. Which configuration should they implement?

Question 47 (easy, multiple choice)

A company runs a batch process every night that loads data into BigQuery. They want to ensure that if the job fails, it is retried automatically up to 3 times. Which configuration should they use?

Question 48 (medium, multi-select)

A company runs a critical application on a Compute Engine instance. They want to ensure that the application remains available even if the instance crashes. Which two GCP features should they use? (Choose two.)

Question 49 (hard, multi-select)

An organization deploys a microservices application on Google Kubernetes Engine (GKE) with multiple Deployments. They want to ensure that the application remains available during a cluster-wide upgrade. Which three best practices should they follow? (Choose three.)

Question 50 (easy, multi-select)

A company uses Cloud Storage to store user-uploaded content. They want to ensure that the data is highly durable and protected against accidental deletion. Which two features should they enable? (Choose two.)

Question 51 (medium, multiple choice)

A developer ran the above command to create a health check for a backend service. Which of the following should they do to resolve the error?

[Exhibit: network topology diagram, not reproduced here]

Question 52 (medium, multiple choice)

After deploying the above configuration, the application is not receiving traffic from the Kubernetes Service. The Service is correctly configured to target port 8080. What is the most likely issue?

Exhibit

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: gcr.io/my-project/my-app:latest
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
```

Question 53 (medium, multiple choice)

A company uses the above IAM policy on a Cloud Storage bucket. They find that Bob can view objects in the bucket. Which statement explains this?

Exhibit

```
{
  "bindings": [
    {
      "role": "roles/storage.objectViewer",
      "members": [
        "user:alice@example.com",
        "group:viewers@example.com"
      ]
    },
    {
      "role": "roles/storage.objectCreator",
      "members": [
        "user:bob@example.com"
      ]
    }
  ],
  "etag": "BwXQ=="
}
```

Question 54 (easy, multiple choice)

Your company runs a stateless web application on Compute Engine. You want to ensure that if a zone fails, the application continues to serve traffic with minimal manual intervention. What should you do?

Question 55 (easy, multiple choice)

You are using Cloud SQL for PostgreSQL. You want to ensure that data can be recovered to any point within the last 7 days. What should you enable?

Question 56 (easy, multiple choice)

A developer wants to monitor the CPU usage of a single Compute Engine VM and receive alerts when it exceeds 80%. What is the simplest way to achieve this?

Question 57 (medium, multiple choice)

Your company's global e-commerce platform uses a managed instance group (MIG) in us-central1 and a Cloud Load Balancer. Traffic has grown, and you want to improve availability by distributing load across multiple regions. What should you do?

Question 58 (medium, multiple choice)

Your organization uses Cloud Spanner for a customer database with a 99.999% availability SLA. You need a Disaster Recovery plan that ensures data consistency with zero RPO in case of a region failure. What should you do?

Question 59 (medium, multiple choice)

Your team manages a service with a 99.9% uptime SLO over a 30-day window. The error budget for this period is 43 minutes. In the first week, outages consumed 30 minutes of the budget. You are planning a new release. What should you do?
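The 43-minute figure follows from the SLO definition: the error budget is the allowed-failure fraction multiplied by the window length. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


budget = error_budget_minutes(0.999, 30)   # about 43.2 minutes
consumed = 30.0                            # minutes of outage in week one
remaining = budget - consumed
consumed_fraction = consumed / budget

print(round(budget, 1))             # 43.2
print(round(remaining, 1))          # 13.2
print(round(consumed_fraction, 2))  # 0.69
```

In this scenario roughly 69% of the budget is gone after about a quarter of the window. The same helper gives about 4.3 minutes for a 99.99% SLO over 30 days.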

Question 60 (hard, multiple choice)

You are designing a Dataflow streaming pipeline for real-time event processing. The pipeline must be cost-effective while tolerating worker failures without data loss. Which configuration should you use?

Question 61 (hard, multiple choice)

Your company runs a critical multi-tier application: a global HTTP(S) load balancer, multiple regional managed instance groups (MIGs) for the web tier, and Cloud Spanner for the data tier. You need to design for zone-level and region-level failures. What architecture ensures the highest availability?

Question 62 (hard, multiple choice)

You are responsible for incident management for a production service. You want to reduce manual toil during the initial response to common issues like high latency. What is the best approach?

Question 63 (easy, multi-select)

You are deploying a stateless web application on Compute Engine. Which TWO actions improve availability? (Choose 2)

Question 64 (medium, multi-select)

Your organization is implementing a Disaster Recovery plan for a critical database. Which THREE components are essential for a robust DR strategy? (Choose 3)

Question 65 (hard, multi-select)

Your service has a 99.99% uptime SLO (monthly error budget ~ 4 minutes). Which TWO monitoring practices best support this SLO? (Choose 2)

Question 66 (medium, multiple choice)

The exhibit shows the output of a 'gcloud compute instances describe' command for an instance. What is the most likely impact on reliability if the host machine needs maintenance?

Exhibit

Refer to the exhibit.

```
gcloud compute instances describe example-instance --format=json

{
  ...
  "scheduling": {
    "onHostMaintenance": "TERMINATE",
    "automaticRestart": true
  },
  "status": "RUNNING"
}
```

Question 67 (medium, multiple choice)

The exhibit shows a Cloud Storage bucket configuration. What does this configuration ensure?

Exhibit

Refer to the exhibit.

```
{
  "kind": "storage#bucket",
  "id": "my-important-bucket",
  "name": "my-important-bucket",
  "retentionPolicy": {
    "retentionPeriod": "2592000",
    "effectiveTime": "2024-01-01T00:00:00Z",
    "isLocked": true
  },
  "versioning": {
    "enabled": true
  }
}
```
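The `retentionPeriod` in the exhibit is expressed in seconds, which is easy to misread. A small sketch of the unit conversion, using only the value shown in the exhibit:

```python
from datetime import timedelta

# retentionPeriod is a string of seconds in the bucket resource.
retention_period = "2592000"

retention = timedelta(seconds=int(retention_period))
print(retention.days)  # 30
```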

Question 68 (hard, multiple choice)

The exhibit shows a managed instance group configuration. What is the primary purpose of the 'autoHealingPolicies' section?

Exhibit

Refer to the exhibit.

```
# Sample managed instance group configuration (YAML)
resource:
  type: compute.beta.instanceGroupManager
  properties:
    zone: us-central1-a
    targetSize: 3
    baseInstanceName: my-app
    instanceTemplate: global/instanceTemplates/my-template
    autoHealingPolicies:
    - healthCheck: global/healthChecks/http-health-check
      initialDelaySec: 300
    autoScaler:
      minNumReplicas: 3
      maxNumReplicas: 10
      coolDownPeriodSec: 60
      cpuUtilization:
        utilizationTarget: 0.6
```

Question 69 (easy, multiple choice)

A company runs a global e-commerce site on GKE. They want to ensure disaster recovery with multi-region deployment. What is the best practice for configuring GKE clusters?

Question 70 (easy, multiple choice)

An application running on Compute Engine instances behind a load balancer experiences intermittent failures. Health checks show instances passing, but some users get errors. What should be the first troubleshooting step?

Question 71 (easy, multiple choice)

A company uses Cloud SQL for PostgreSQL. They want to minimize downtime during maintenance. Which feature should they enable?

Question 72 (medium, multiple choice)

A company has a microservices architecture on GKE. One service is failing due to resource exhaustion. How can they proactively prevent this?

Question 73 (easy, multiple choice)

A company wants to monitor their Cloud Run services for errors and latency. Which Google Cloud product should they use?

Question 74 (easy, multiple choice)

An organization needs to meet an RTO of 1 hour for a critical application running on Compute Engine with persistent disks. What is the most cost-effective approach?

Question 75 (hard, multiple choice)

A company has a Spanner instance for global transactions. They need to ensure reliability during a regional outage. What is the best approach?

Question 76 (medium, multiple choice)

A team is using Cloud Functions and wants to ensure retries on failure. What is the best practice?

Question 77 (easy, multiple choice)

A company uses Cloud Storage for backup data. They want to protect against accidental deletion. Which option is best?

Question 78 (medium, multi-select)

A company is designing a highly available application on GCE. Which TWO steps should they take to ensure reliability?

Question 79 (hard, multi-select)

A company runs a stateful application on GKE using StatefulSets. Which THREE practices improve reliability?

Question 80 (medium, multi-select)

A company is migrating a critical database to Cloud SQL for MySQL. Which TWO actions ensure high availability?

Question 81 (hard, multiple choice)

A user reports that an application running on instance-1 is unreliable and often restarts. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
$ gcloud compute instances list --limit=5
NAME         ZONE           MACHINE_TYPE   PREEMPTIBLE
instance-1   us-east1-b     n1-standard-1  true
instance-2   us-east1-c     n1-standard-1  false
instance-3   us-central1-a  e2-medium      false
instance-4   us-central1-b  n1-standard-1  false
instance-5   us-west1-a     n1-highcpu-4   false
```

Question 82 (medium, multiple choice)

Company A runs a containerized application on Google Kubernetes Engine (GKE) with 3 node pools: one for frontend, one for backend, and one for stateful databases. The backend services experience periodic latency spikes. After investigation, they found that the spikes correlate with the node pool autoscaler scaling down nodes. The backend services are deployed as Deployments with resource requests and limits set to 100m CPU and 200Mi memory each. The node pool uses n1-standard-2 machine types. The cluster autoscaler is enabled. What should they do to prevent the latency spikes?

Question 83 (hard, multiple choice)

Company B uses Cloud Endpoints to expose their API. Recently, they started seeing 503 errors during periods of high traffic. They have enabled Cloud Endpoints with a moderate quota. The backend is running on Cloud Run. The Cloud Run service is configured with min instances = 0 and max instances = 100. The container concurrency is set to 80. The average request latency is 200ms. What is the most likely cause and what should they do?
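The Cloud Run figures in this scenario imply a steady-state throughput ceiling that can be estimated with Little's law (in-flight requests = arrival rate × latency). The sketch below is an upper bound only: it assumes every instance is already warm, runs at full concurrency, and holds latency at 200 ms. The function name is illustrative.

```python
def steady_state_ceiling_rps(max_instances: int, concurrency: int,
                             latency_ms: int) -> float:
    """Upper bound on requests per second once all instances are warm."""
    in_flight = max_instances * concurrency
    return in_flight * 1000 / latency_ms


print(steady_state_ceiling_rps(100, 80, 200))  # 40000.0
```

With min instances set to 0, none of this capacity exists before traffic arrives, so the estimate says nothing about how quickly the service can reach it.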

Question 84 (easy, multiple choice)

A company runs a critical application on Compute Engine instances in a managed instance group (MIG) with autoscaling. Users report intermittent 503 errors during traffic spikes. Which action should the company take to improve reliability?

Question 85 (medium, multiple choice)

A company uses Cloud Spanner for a global financial application. They need to ensure that a regional outage does not cause data loss. The application requires strong consistency and low latency reads and writes across multiple regions. Which configuration meets the reliability requirements?

Question 86 (hard, multi-select)

A company runs a microservices-based application on Google Kubernetes Engine (GKE) with a Regional cluster. They want to improve reliability by implementing best practices for pod scheduling and resilience. Which TWO actions should they take? (Choose two.)

Question 87 (medium, multi-select)

A company runs a stateful workload on Compute Engine with regional persistent disks (PD). They need to implement a disaster recovery (DR) plan with a Recovery Point Objective (RPO) of less than 1 hour and Recovery Time Objective (RTO) of less than 4 hours. Which THREE steps should they include in their DR plan? (Choose three.)

Question 88 (easy, multiple choice)

You are the lead cloud architect for a startup that runs a web application on Google Kubernetes Engine (GKE) with a standard (zonal) cluster. The application is deployed with 3 replicas of a stateless frontend service. During a recent incident, a zone outage caused all GKE nodes to become unavailable, leading to application downtime of 45 minutes. You need to redesign the cluster to tolerate a single zone failure with no more than 5 minutes of downtime. Your budget allows for at most a 20% increase in compute costs. Which approach should you take?

Question 89 (easy, multiple choice)

You manage a batch data processing workload on Compute Engine that runs daily on a single VM. The VM uses a standard persistent disk (pd-standard) for input data and output results. Recently, the VM crashed due to a hardware failure, and the job failed. You need to implement a solution that automatically recovers from VM failures with minimal data loss. The job is idempotent and can restart from the beginning if necessary. Which solution should you choose?

Question 90 (medium, multiple choice)

Your company runs a customer-facing API on Cloud Run with a concurrency setting of 80. The API calls a backend Cloud Function that performs a heavy computation (2–5 seconds). During peak hours, the API experiences increased latency and some requests time out after 60 seconds. Monitoring shows that the Cloud Run max instances is set to 100, and the Cloud Function max instances is set to 10. The timeout for Cloud Run is set to 300 seconds. The Cloud Function's timeout is set to 540 seconds. You need to reduce end-to-end latency and prevent timeouts while minimizing cost. Which action is most effective?
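A rough capacity check makes the mismatch between the two tiers in this scenario concrete. The sketch below uses the figures from the question and assumes each Cloud Function instance handles one request at a time (an assumption; the question does not state the function's generation or concurrency setting). The helper name `max_throughput` is hypothetical.

```python
def max_throughput(instances: int, concurrency: int, service_time_s: float) -> float:
    """Upper bound on requests per second: in-flight capacity / service time."""
    return instances * concurrency / service_time_s

# Figures from the question text.
CLOUD_RUN_IN_FLIGHT = 100 * 80  # max instances x concurrency setting

# Assumption: one request per function instance at a time.
function_best = max_throughput(10, 1, 2.0)   # every computation takes 2 s
function_worst = max_throughput(10, 1, 5.0)  # every computation takes 5 s

print(f"Cloud Run can hold {CLOUD_RUN_IN_FLIGHT} requests in flight")
print(f"Backend sustains {function_worst:.0f} to {function_best:.0f} requests/s")
```

Under these assumptions the frontend can admit far more concurrent requests than the backend can drain, so requests queue behind the function's instance cap.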

Question 91 (hard, multiple choice)

You are designing a high-availability architecture for a global e-commerce platform that uses Cloud SQL for MySQL as the primary database. The application writes to a single Cloud SQL instance in us-central1 and reads from read replicas in us-central1 and us-west1. During a recent regional outage in us-central1, the primary instance became unavailable, and the application experienced full downtime for 3 hours because the failover to a read replica was not automatic. The application can tolerate up to 10 minutes of data loss but needs to recover within 30 minutes. You need to automate failover to a geographically distant region with minimal manual intervention. The application's connection string must not change. Which solution meets these requirements?

Question 92 (hard, multiple choice)

Your company runs a data pipeline on Google Cloud using Cloud Dataflow for streaming processing from Pub/Sub to BigQuery. The pipeline writes to a BigQuery table partitioned by day. The data is used for real-time dashboards. Recently, a spike in traffic caused the Dataflow pipeline to fall behind, and the dashboard displayed stale data. You need to design the pipeline to handle traffic spikes without data loss or long delays. The pipeline must be cost-efficient and use defaults where possible. Which solution should you implement?

Question 93 (easy, multi select)

A company runs a containerized application on Cloud Run. Which TWO actions will most improve the reliability of the service?

Question 94 (hard, multiple choice)

A financial services company is migrating a monolithic Java application to Google Kubernetes Engine (GKE) for improved scalability and reliability. The application serves real-time trading data and has strict latency requirements. Post-migration, the team observes frequent pod restarts due to OutOfMemory (OOM) errors, increased latency during peak trading hours, and occasional database connection timeouts. The current setup uses a single GKE cluster with a node pool of n1-standard-4 machines, a stateless application deployed as a Deployment with resource requests and limits set to 512 Mi memory and 1 CPU. The database is a Cloud SQL PostgreSQL instance with 2 vCPUs and 7.5 GB memory, and applications connect using a hardcoded connection string. The team wants to ensure reliable operation under load and during node maintenance events. Which course of action best addresses the reliability issues?

Question 95 (hard, multiple choice)

Refer to the exhibit. The SLO for the payments-api service is 99.9% availability over 30 days. The current compliance is 99.89% and the error budget is exhausted. Which action should the SRE team take FIRST?

Exhibit

SLO History for last 30 days:
Service: payments-api
SLO: 99.9% availability over 30 days
Compliance: 99.89%
Error budget remaining: 0%
Burn rate: 2.5 over last 1 hour
Alert: Error budget exhausted
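The figures in the exhibit can be checked with a little arithmetic. The sketch below treats availability as time-based (good minutes over total minutes), which is an assumption; the exhibit does not say whether the SLO is request-based or time-based. The function names are hypothetical.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day SLO window, in minutes

def bad_minutes(availability: float, window: int = WINDOW_MINUTES) -> float:
    """Minutes of unavailability implied by an availability ratio."""
    return (1 - availability) * window

budget = bad_minutes(0.999)     # allowed by the 99.9% SLO
consumed = bad_minutes(0.9989)  # implied by 99.89% compliance

# A burn rate of 1 spends exactly one full budget over the window,
# so a sustained burn rate of 2.5 spends a full budget in 30 / 2.5 days.
days_to_spend_full_budget = 30 / 2.5

print(f"Budget: {budget:.2f} min, consumed: {consumed:.2f} min")
print(f"Overspend: {consumed - budget:.2f} min")
print(f"At burn rate 2.5 a full budget lasts {days_to_spend_full_budget:.0f} days")
```

Under that assumption the budget is about 43.2 minutes and about 47.5 minutes have been consumed, which is consistent with the exhibit reporting 0% remaining.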

Question 96 (medium, multiple choice)

Refer to the exhibit. The exhibit shows logs and a metric from a GCE instance that was terminated. The instance was part of a managed instance group. Which diagnostic step should be taken FIRST to prevent recurrence?

Exhibit

Stackdriver Logging query results:
resource.type="gce_instance"
logName="projects/my-project/logs/syslog"

Timestamps:
2023-11-01 10:15:23 UTC - Out of memory: Killed process 1234 (java)
2023-11-01 10:15:24 UTC - Out of memory: Killed process 5678 (python)
2023-11-01 10:15:25 UTC - Out of memory: Killed process 9012 (node)

Monitoring metric: instance/disk/bytes_used (gce_instance)
Value: 95% at 10:15:20 UTC

Question 97 (hard, multiple choice)

Refer to the exhibit. A Deployment Manager template deploys a GKE cluster and a job that publishes to Pub/Sub. The job fails with a permission error. Which change would fix the issue?

Exhibit

Deployment Manager manifest:
resources:
- name: my-cluster
  type: container.v1.cluster
  properties:
    zone: us-central1-a
    initialNodeCount: 3
    nodeConfig:
      machineType: n1-standard-1
      oauthScopes:
      - https://www.googleapis.com/auth/pubsub
- name: my-job
  type: container.v1.job
  properties:
    cluster: $(ref.my-cluster.name)
    template:
      spec:
        containers:
        - image: gcr.io/my-project/publisher
          env:
          - name: TOPIC
            value: my-topic
        restartPolicy: Never
    dependsOn: [my-cluster]

After deployment, the job fails with "Permission denied" when publishing to Pub/Sub topic my-topic.

Question 98 (medium, multiple choice)

Refer to the exhibit. The process-image function fails intermittently with a memory limit exceeded error. Which action will MOST effectively resolve the issue?

Exhibit

Error message from Cloud Functions log:
"Function: process-image. Execution ID: abc123. Error: memory limit exceeded. Function invocation was interrupted. Consider increasing the memory allocation."

Function configuration:
- Runtime: Node.js 16
- Memory: 128MB
- Timeout: 60s
- Trigger: Cloud Storage (finalize event on bucket images-bucket)

Metrics:
- Average execution time: 45s
- Max concurrent executions: 10

Question 99 (hard, multiple choice)

Refer to the exhibit. The HPA is configured to scale based on CPU, but it has not scaled up despite the CPU usage being above the target. Which is the most likely cause?

Exhibit

gcloud container clusters describe my-cluster --zone us-central1-a
...
nodePools:
- name: default-pool
  config:
    machineType: n1-standard-4
    diskSizeGb: 100
    imageType: COS_CONTAINERD
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
  initialNodeCount: 3
  autoscaling:
    enabled: true
    minNodeCount: 1
    maxNodeCount: 10

Horizontal Pod Autoscaler:
  Name: my-hpa
  Min replicas: 3
  Max replicas: 20
  CPU target: 80%

Current state: The deployment has 5 pods. CPU usage is at 90%. The HPA has not scaled up.
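For reference, the replica count the HPA would be expected to request can be estimated from the scaling formula in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with the default tolerance of 0.1. The sketch below applies it to the exhibit's numbers; it assumes the HPA can actually read CPU utilization, which in turn requires CPU requests on the pods and a working metrics pipeline. The function name is hypothetical.

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a utilization target."""
    ratio = current_util / target_util
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # The result is bounded by the HPA's min and max replicas.
    return max(min_replicas, min(max_replicas, wanted))

# Values from the exhibit: 5 pods, 90% usage, 80% target, 3 to 20 replicas.
replicas = desired_replicas(5, 90, 80, min_replicas=3, max_replicas=20)
print(replicas)
```

With these inputs the ratio is 1.125, which is outside the tolerance, so a healthy HPA would ask for more than the current 5 replicas. An HPA that stays at 5 therefore points to a problem with its inputs rather than with the target itself.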

