PCDOE Exam Questions and Answers

Domain 1: Bootstrapping a Google Cloud organization for DevOps

A company is setting up a new Google Cloud organization for DevOps. They want to enforce that all projects have a specific set of VPC Service Controls perimeters. Which approach should they use to ensure these perimeters are automatically applied to all new projects?

Configure Cloud Shell to run a script that creates a perimeter when a new project is created.

Define an organization policy with a constraint that requires all projects to be within a perimeter.

Organization policies are inherited by every project under the node where they are set.

Use Deployment Manager to deploy a configuration that creates a perimeter for each new project.

Create a VPC Service Controls perimeter and add the organization node as a member.

Why: Option B is correct because Google Cloud organization policies let you define and enforce constraints at the organization, folder, or project level. A policy set at the organization node is inherited by every new project created beneath it, so enforcement is automatic and needs no per-project script or manual intervention.

You are bootstrapping a Google Cloud organization for a DevOps team. You need to set up a shared VPC host project that will be used by multiple service projects. What is the minimal set of roles required for the DevOps team to create and manage service projects in the host project?

Project Creator and Service Project Admin

Compute Network Admin and Service Project Admin

Compute Network Admin manages networks; Service Project Admin attaches service projects.

Compute Shared VPC Admin

Owner and Service Project Admin

Why: Option B is correct because the minimal set of roles required for a DevOps team to create and manage service projects in a Shared VPC host project is Compute Network Admin (`roles/compute.networkAdmin`) on the host project and Service Project Admin at the organization or folder level. Compute Network Admin grants permissions to manage networking resources, while Service Project Admin allows attaching service projects to the host project. Without both, the team cannot configure Shared VPC networking or associate service projects.

During the bootstrapping of a Google Cloud organization, the DevOps team wants to implement a policy that prevents the deletion of certain resources, such as Cloud Storage buckets or Cloud SQL instances, unless a specific approval process is followed. Which approach best achieves this goal?

Configure Cloud Source Repositories to require code review for any changes to Terraform configurations that delete resources.

Implement Binary Authorization to require approvals for any delete commands.

Use Resource Manager locks on projects and set up a Cloud Function that triggers on audit logs to require approval before removing the lock.

Locks prevent deletion; Cloud Functions can automate approval workflows.

Use VPC Service Controls to block delete operations on specific services.

Why: Option C is correct because Resource Manager locks prevent accidental deletion of critical resources by placing a deletion prevention lock on the project or resource hierarchy. By combining this with a Cloud Function that monitors audit logs for lock removal attempts and requires an approval workflow before the lock is removed, the team enforces a controlled approval process for any deletion, meeting the policy requirement precisely.

A DevOps team is bootstrapping a new organization. They want to ensure that all projects created within the organization have a specific set of APIs enabled, such as Compute Engine, Cloud Storage, and Cloud Resource Manager. What is the most efficient way to achieve this?

Create a Cloud Function that triggers on project creation events and enables the required APIs.

Define an organization policy with a constraint that requires the APIs to be enabled.

Organization policies can enforce API enablement via constraints.

Use Cloud Foundation Toolkit to deploy a project template that includes API enablement.

Create a shared VPC and enable the APIs in the host project only.

Why: Option B is correct because organization policies with constraints allow you to enforce API enablement across all projects in the organization. This is the most efficient approach as it is declarative, centrally managed, and automatically applies to new projects without any additional infrastructure or manual intervention.

You are bootstrapping a Google Cloud organization. You need to set up a hierarchical structure that allows you to apply policies to groups of projects based on their environment (e.g., development, staging, production). What is the recommended way to organize resources?

Use resource tags to label projects by environment and apply policies via tag-based conditions.

Create folders under the organization for each environment and place projects in the appropriate folder.

Folders allow hierarchical policy inheritance and grouping.

Create separate organizations for each environment.

Use labels on projects to identify environments and then use Cloud Asset Inventory to enforce policies.

Why: Option B is correct because Google Cloud's resource hierarchy (Organization -> Folders -> Projects) is specifically designed to group projects by environment and apply consistent policies (e.g., IAM, organization policies) at the folder level. By creating folders for development, staging, and production, you can enforce environment-specific controls (like VPC Service Controls or resource location restrictions) without duplicating policies per project.

A company is bootstrapping their Google Cloud organization for DevOps. They want to implement a least-privilege model for service accounts used by CI/CD pipelines. The pipelines need to deploy resources in multiple projects. What is the best practice for managing service account keys?

Use a user account for the CI/CD pipeline and assign it the necessary roles.

Store service account keys in Secret Manager and have the pipeline retrieve them at runtime.

Generate a single service account key and securely distribute it to the CI/CD system.

Use workload identity federation to allow the CI/CD system to impersonate a service account without keys.

Eliminates the need for keys and follows least privilege.

Why: Option D is correct because workload identity federation allows an external CI/CD system (e.g., Jenkins, GitHub Actions) to impersonate a Google Cloud service account without managing or storing any long-lived keys. This eliminates the security risk of key leakage and aligns with the least-privilege principle by enabling short-lived, scoped credentials via the Security Token Service (STS) and OAuth 2.0 token exchange.
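
As a rough illustration of the keyless setup, the sketch below builds the kind of `external_account` credential configuration that Google's auth libraries read for workload identity federation. The project number, pool ID, provider ID, service account email, and token file path are hypothetical placeholders, and the field layout is an assumption to verify against a file generated by `gcloud iam workload-identity-pools create-cred-config`.

```python
import json


def build_wif_config(project_number: str, pool_id: str, provider_id: str,
                     service_account_email: str, token_file: str) -> dict:
    """Return an external_account credential config. It holds no private key."""
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "type": "external_account",
        "audience": audience,
        # The CI system supplies an OIDC token (a JWT) as the subject token.
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        # STS exchanges the external token for a short-lived federated token.
        "token_url": "https://sts.googleapis.com/v1/token",
        # The federated token is then used to impersonate the service account.
        "service_account_impersonation_url": (
            "https://iamcredentials.googleapis.com/v1/projects/-/"
            f"serviceAccounts/{service_account_email}:generateAccessToken"
        ),
        "credential_source": {"file": token_file},
    }


# All values below are hypothetical placeholders.
config = build_wif_config(
    project_number="123456789012",
    pool_id="ci-pool",
    provider_id="github",
    service_account_email="deployer@example-project.iam.gserviceaccount.com",
    token_file="/var/run/ci/oidc-token",
)
print(json.dumps(config, indent=2))
```

The point of the sketch is what is absent: the file references a token location and two exchange endpoints, but never a service account key.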

Domain 2: Managing service incidents

A team uses Google Kubernetes Engine (GKE) with cluster telemetry enabled. During an incident, they notice that a deployment's pods are repeatedly crashing with Exit Code 137. The team wants to investigate the root cause. Which two Google Cloud services should they use together to correlate resource usage and logs?

Cloud Monitoring and Cloud Logging

Monitoring shows resource usage; Logging shows container logs and OOM events.

Security Command Center and Cloud Logging

Cloud Trace and Cloud Monitoring

Trace is for request latency, not resource usage or crash logs.

Cloud Error Reporting and Cloud Logging

Why: Exit Code 137 indicates that a container was killed by SIGKILL (signal 9), typically due to an out-of-memory (OOM) condition. Cloud Monitoring provides metrics such as memory usage and OOM kill counts, while Cloud Logging captures the container's termination logs and system events. By correlating these two services, the team can identify when memory usage spiked and confirm that the pod was OOM-killed, enabling root cause analysis.
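
The arithmetic behind "137 means SIGKILL" can be sketched as follows: exit codes above 128 conventionally encode 128 plus the signal number. The helper name and the small signal table are illustrative only, and the table is hardcoded with the usual Linux numbers so the sketch does not depend on the host platform.

```python
# Usual Linux signal numbers, hardcoded for portability of the sketch.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable description."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # 128 + N means "terminated by signal N"
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited with application error {code}"


print(describe_exit_code(137))
```

SIGKILL alone does not prove an out-of-memory kill, which is why the explanation pairs it with logs and metrics; in Kubernetes the container's last terminated state carries the reason `OOMKilled` when memory was the cause.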

A DevOps engineer receives an alert that the error budget for a critical service has been exhausted. The service runs on Compute Engine behind an HTTP(S) load balancer. The team wants to reduce the impact on users while investigating. What should the engineer do first?

Roll back the most recent deployment

Rolling back quickly restores the previous stable version.

Begin a detailed postmortem analysis

Disable the alerting policy to reduce noise

Increase the number of instances in the managed instance group

Why: Rolling back the most recent deployment is the correct first action because it immediately restores the service to a known stable state, stopping further consumption of the error budget. This aligns with the incident management principle of 'mitigate first, investigate later' — reducing user impact takes priority over root cause analysis. The HTTP(S) load balancer will automatically route traffic to the previous healthy version once the rollback is complete.

A company uses Cloud Run for a stateless API service with concurrency set to 80. During a traffic spike, some requests return HTTP 500 errors and latency spikes. Cloud Monitoring shows container CPU utilization at 100% and memory usage at 70%. What is the most likely cause and the best first step?

Concurrency per container is too high; reduce concurrency to 10

Lowering concurrency reduces CPU contention, preventing timeouts and 500s.

Maximum instances limit is too low; increase from 10 to 100

Min idle instances is too low; set min idle to 5 to reduce cold starts

Memory limit is too low; increase memory from 256 MiB to 512 MiB

Why: The correct answer is A because with CPU at 100% and memory at only 70%, the bottleneck is CPU, not memory. Cloud Run containers handle requests concurrently; setting concurrency to 80 means each container processes up to 80 requests simultaneously. When CPU is saturated, requests queue up, causing latency spikes and eventual HTTP 500 errors as the container becomes unresponsive. Reducing concurrency to 10 lowers the per-container request load, allowing each request to complete before CPU saturation occurs.
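
A back-of-the-envelope model can make the effect of the concurrency setting concrete. The sketch below uses Little's law (requests in flight = arrival rate x average latency) and is only an approximation; the traffic figures are hypothetical and it assumes the maximum-instances limit allows the extra instances.

```python
import math


def instances_needed(requests_per_second: float, avg_latency_seconds: float,
                     concurrency: int) -> int:
    """Estimate how many container instances are needed for a given load."""
    in_flight = requests_per_second * avg_latency_seconds  # Little's law
    return math.ceil(in_flight / concurrency)


# Hypothetical spike: 400 requests/s at 0.5 s average latency.
for concurrency in (80, 10):
    n = instances_needed(400, 0.5, concurrency)
    print(f"concurrency={concurrency}: about {n} instances, "
          f"up to {concurrency} requests sharing each instance's CPU")
```

With these numbers the same load is spread over roughly 20 instances instead of 3, so far fewer requests compete for each instance's CPU.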

A team uses Cloud SQL for PostgreSQL. They receive an alert that the database's CPU utilization is above 95% for the past 30 minutes. Queries are taking longer than usual. They want to investigate without causing further impact. What should they do first?

Increase the number of vCPUs of the Cloud SQL instance

Restart the Cloud SQL instance to clear the cache

Migrate the database to Cloud Spanner

Use Cloud SQL Query Insights to find the most time-consuming queries

Query Insights shows top queries by CPU and latency.

Why: Cloud SQL Query Insights is a managed monitoring tool that automatically captures and analyzes query performance metrics, including CPU consumption, latency, and execution plans. In this scenario, it allows the team to identify the specific queries causing high CPU utilization without making any changes to the instance, thus avoiding further impact. This is the first and safest diagnostic step before any remediation.

A company's SRE team is designing an incident management process. They want to ensure that alerts are actionable and that on-call engineers are not overwhelmed by false positives. Which approach should they take?

Use only critical severity alerts and rely on manual dashboard review for lower severity

Create alerting policies for every available metric to ensure nothing is missed

Set all alert thresholds to 50% above the average value to avoid false positives

Define SLOs and set alert thresholds based on historical error budget consumption

SLO-based alerting focuses on user-facing impact and reduces noise.

Why: Option D is correct because defining SLOs and setting alert thresholds based on historical error budget consumption ensures alerts are directly tied to user-facing reliability. This approach prevents false positives by only triggering when the error budget is being consumed faster than expected, making alerts actionable and reducing noise for on-call engineers.
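
One common way to express "consumed faster than expected" is a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below is a minimal illustration; the function names, the 99.9% SLO, and the paging threshold of 2.0 are all hypothetical choices, not values from the question.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget


def should_page(error_ratio: float, slo_target: float, threshold: float) -> bool:
    """Page only when the budget is burning at or above the chosen threshold."""
    return burn_rate(error_ratio, slo_target) >= threshold


# 0.5% errors against a 99.9% SLO burns the budget about 5x too fast.
print(round(burn_rate(0.005, 0.999), 2))
print(should_page(0.005, 0.999, threshold=2.0))
print(should_page(0.0005, 0.999, threshold=2.0))
```

A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO window, so alerting on values well above 1.0 filters out small blips that do not threaten the objective.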

An incident is declared for a production service running on GKE. The on-call engineer suspects a recent code change may have introduced a memory leak. Which THREE actions should the engineer take to investigate and mitigate?

Increase the memory limit for the container as a temporary mitigation

Temporary increase buys time for a permanent fix.

Scale down the number of replicas to reduce memory pressure

Roll back the deployment immediately without further investigation

Check container logs for Out of Memory (OOM) killed messages

OOM messages confirm memory exhaustion.

Compare memory usage metrics before and after the deployment using Cloud Monitoring

Identifies if memory usage increased after the change.

Why: Option A is correct because increasing the memory limit for the container provides a temporary mitigation to prevent the service from being killed by the Out of Memory (OOM) killer while the root cause is investigated. In GKE, the container's memory limit is defined in the pod spec under `resources.limits.memory`, and raising it gives the application more headroom to continue serving requests without immediate termination. This is a standard incident response practice to buy time for deeper analysis, such as reviewing logs and metrics, before applying a permanent fix.

Domain 3: Managing Google Cloud costs

A company is using Google Kubernetes Engine (GKE) with multiple node pools. They notice that their monthly costs are higher than expected. Upon review, they find that several preemptible VMs are being recreated frequently, leading to sustained usage costs. What is the most cost-effective solution to reduce costs?

Purchase committed use discounts for the preemptible VMs.

Increase the number of preemptible VMs to spread the workload.

Enable sustained use discounts for the existing VMs.

Migrate to Spot VMs, which have a lower price and no maximum runtime.

Spot VMs are the recommended replacement for preemptible VMs and offer lower costs without the 24-hour limit.

Why: Spot VMs are the recommended replacement for preemptible VMs, offering the same low price but without the 24-hour maximum runtime limit. This eliminates the frequent recreation and sustained usage costs caused by preemptible VMs being terminated and restarted, directly reducing monthly expenses.

A company runs a batch processing workload on Compute Engine that runs for 3 hours every night. They want to minimize costs while ensuring the job completes reliably. Which recommendation should they follow?

Use sole-tenant nodes to isolate the workload.

Use standard (on-demand) VMs and enable sustained use discounts.

Use preemptible VMs and design the job to handle interruptions gracefully.

Preemptible VMs cost roughly 60-80% less than standard VMs and suit fault-tolerant batch jobs.

Purchase 1-year committed use discounts for the VMs.

Why: Preemptible VMs cost about 60-80% less than standard VMs and are ideal for batch workloads that can tolerate interruptions. Since the job runs for only 3 hours nightly, it can be designed to checkpoint progress and restart from the last checkpoint if a preemptible VM is terminated (Compute Engine can preempt it at any time and always stops it after 24 hours). This minimizes cost while ensuring reliability through graceful interruption handling.
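
The checkpoint-and-restart design can be sketched as below. This is a minimal, assumption-laden example: the function names and the JSON checkpoint layout are invented for illustration, `stop_after` exists only to simulate a preemption, and a real job would keep the checkpoint on storage that survives the VM (for example a Cloud Storage bucket) rather than a local file.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next unprocessed item (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_item"]


def save_checkpoint(path: str, next_item: int) -> None:
    """Write the checkpoint to a temp file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_item": next_item}, f)
    os.replace(tmp, path)  # atomic on the same filesystem


def run_batch(items, path, process, stop_after=None) -> bool:
    """Process items from the last checkpoint; return True when all are done."""
    processed_this_run = 0
    for i in range(load_checkpoint(path), len(items)):
        if stop_after is not None and processed_this_run >= stop_after:
            return False  # simulated preemption
        process(items[i])
        save_checkpoint(path, i + 1)
        processed_this_run += 1
    return True


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
handled = []
run_batch(list(range(5)), checkpoint, handled.append, stop_after=2)  # interrupted
run_batch(list(range(5)), checkpoint, handled.append)                # resumes
print(handled)
```

Because the checkpoint is saved after each item, a restart repeats at most the item that was in progress, so `process` should be safe to run twice on the same input.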

A company uses Cloud Storage to store archival data. They want to minimize storage costs while maintaining availability. Which storage class should they use?

Nearline storage class.

Standard storage class.

Archive storage class.

Archive is the lowest-cost storage class for long-term archival data.

Coldline storage class.

Why: The Archive storage class is the correct choice because it offers the lowest storage cost for archival data that is accessed less than once per year, with a 365-day minimum storage duration and retrieval costs that are higher than other classes. This aligns with the requirement to minimize storage costs while maintaining availability: unlike tape-style cold storage, Archive objects remain retrievable immediately, with millisecond-level access like the other classes, and are stored with the same durability.

A company is using BigQuery for analytics and wants to control costs. They have many queries that scan large amounts of data. Which approach is most effective in reducing query costs?

Switch to flat-rate pricing to cap costs.

Partition tables by date and use partition pruning in queries.

Partitioning limits the data scanned, reducing query costs.

Reserve BigQuery slots for dedicated capacity.

Use clustering to organize data within partitions.

Why: Partitioning tables by date and using partition pruning in queries directly reduces the amount of data scanned by BigQuery, which is the primary driver of on-demand query costs. By filtering on the partition column, BigQuery can skip entire partitions that do not match the query criteria, minimizing the bytes processed. This is the most effective cost-control measure because it addresses the root cause of high costs—excessive data scanning—without requiring a pricing model change or additional resource commitments.
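
The saving from partition pruning is simple proportional arithmetic, sketched below. The table size, partition counts, and especially the price per TiB are hypothetical placeholders (check current BigQuery on-demand pricing), and the model assumes evenly sized daily partitions.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def pruned_bytes(table_bytes: float, total_partitions: int,
                 partitions_read: int) -> float:
    """Bytes scanned when only some equally sized partitions are read."""
    return table_bytes * partitions_read / total_partitions


def on_demand_cost(bytes_scanned: float, price_per_tib: float) -> float:
    """On-demand query cost for a given number of scanned bytes."""
    return bytes_scanned / TIB * price_per_tib


PRICE_PER_TIB = 5.0      # hypothetical placeholder price
table_bytes = 10 * TIB   # hypothetical 10 TiB table, 365 daily partitions

full_scan = on_demand_cost(table_bytes, PRICE_PER_TIB)
last_week = on_demand_cost(pruned_bytes(table_bytes, 365, 7), PRICE_PER_TIB)
print(f"full scan: {full_scan:.2f}, last 7 days only: {last_week:.2f}")
```

The pruning only happens when the query filters directly on the partitioning column, which is why the answer pairs partitioning with partition pruning in queries.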

A company uses Cloud CDN to deliver content globally. They notice increasing egress costs. Which change will most effectively reduce egress costs?

Switch to a premium tier network for lower egress rates.

Enable gzip compression for all responses.

Use Cloud Armor to block malicious traffic.

Configure Cloud CDN to cache more content and increase cache hit ratio.

Higher cache hit ratio reduces the amount of data fetched from the origin, lowering egress costs.

Why: Increasing the cache hit ratio reduces the number of requests that reach the origin server, which directly lowers the volume of data transferred from the origin to the CDN edge (cache fill). Every response still travels from the edge to the user, but a cache hit avoids the extra origin-to-edge transfer and its charges, so serving more content from cache reduces overall egress traffic and the associated costs.

A company is using Google Cloud and wants to monitor and control costs. Which TWO actions should they take? (Choose two.)

Set up budget alerts to notify when spending exceeds thresholds.

Budget alerts help monitor and control costs by providing notifications.

Use labels to categorize resources and track costs by team.

Labels enable cost allocation and reporting, helping control costs.

Disable all unnecessary APIs at the organization level.

Export billing data to BigQuery for detailed analysis.

Grant billing account access to all project owners.

Why: Setting up budget alerts (option A) is a core cost control action because it allows you to define a budget amount and receive notifications when actual or forecasted spending exceeds a specified threshold (e.g., 50%, 90%, 100%). This enables proactive intervention before costs spiral out of control. Using labels (option B) to categorize resources (e.g., by team, project, or environment) allows you to track and allocate costs accurately in billing reports, making it easier to identify cost drivers and enforce accountability.
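
The threshold logic described above can be sketched in a few lines. The function name and the sample amounts are hypothetical; the 50%, 90%, and 100% thresholds are the ones mentioned in the explanation.

```python
def crossed_thresholds(budget_amount: float, spend: float,
                       thresholds=(0.5, 0.9, 1.0)) -> list:
    """Return the budget thresholds (as fractions) that spend has reached."""
    return [t for t in thresholds if spend >= budget_amount * t]


# Hypothetical budget of 1000 with 920 spent so far this month.
print(crossed_thresholds(1000, 920))
```

Budget alerts only notify; they do not cap spending on their own, so any automatic reaction has to be built on top of the notifications.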

Domain 4: Building and implementing CI/CD pipelines

A development team wants to automatically run unit tests and static code analysis on every push to a Cloud Source Repository, but only run integration tests on merges to the main branch. Which Cloud Build trigger configuration should they use?

Use a single trigger with a substitution variable like '_BRANCH' and set it to 'main' for integration tests.

Create one trigger with a build config that uses the 'branchName' substitution to conditionally skip integration test steps.

Create two triggers: one with a branch filter for '^main$' that runs integration tests, and another with a branch filter for '^.*$' that runs unit tests.

Correct: separate triggers with branch filters allow different pipelines per branch.

Configure one trigger with no branch filter and rely on developers to manually trigger integration tests.

Why: Option C is correct because Cloud Build triggers allow you to define separate triggers with branch filters to execute different build configurations based on the branch. By creating one trigger with a branch filter of '^main$' for integration tests and another with '^.*$' for unit tests, you ensure unit tests run on every push to any branch, while integration tests run only on merges to main. This approach directly maps the desired behavior without requiring conditional logic or manual intervention.
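
To see how the two branch filters divide the work, the sketch below applies them to a few branch names with Python's `re` module. The trigger names are hypothetical, and the sketch assumes these two simple patterns behave the same in Python as in the regex engine Cloud Build uses.

```python
import re

# Hypothetical trigger names mapped to the branch filters from the answer.
TRIGGERS = {
    "unit-tests": r"^.*$",
    "integration-tests": r"^main$",
}


def triggers_for(branch: str) -> list:
    """Return the names of the triggers whose branch filter matches."""
    return sorted(
        name for name, pattern in TRIGGERS.items() if re.search(pattern, branch)
    )


for branch in ("main", "feature/login", "main-backup"):
    print(branch, "->", triggers_for(branch))
```

The anchors matter: without `^` and `$`, a filter of `main` would also fire for branches such as `main-backup`.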

A team uses Cloud Build with a Kaniko builder to containerize their application. The build fails with the error: 'failed to push to destination: failed to get credentials: failed to get credential from metadata service: failed to fetch metadata...' What is the most likely cause?

Kaniko requires a running Docker daemon in the build step.

The base image specified in the Dockerfile is not accessible from the build environment.

The Dockerfile has an invalid instruction causing Kaniko to fail.

The Cloud Build service account does not have the storage.objectAdmin role on the Container Registry bucket.

Missing push permissions cause credential failures.

Why: The error indicates that Kaniko cannot authenticate to push the built image to Container Registry. Kaniko uses the Cloud Build service account's credentials to authenticate with the registry. By default, the Cloud Build service account has the storage.objectViewer role on the Container Registry bucket, which allows pulling images but not pushing. To push, the service account needs the storage.objectAdmin or storage.objectCreator role on the bucket. Option D correctly identifies this missing permission as the most likely cause.

A company uses Spinnaker for continuous delivery across multiple GKE clusters. After a recent infrastructure change, the 'Canary' deployment strategy fails during the 'disable' phase of the old version. The error log shows: 'Unable to disable server group: Not authorized to perform compute.instanceGroups.update.' What is the most likely root cause?

The GKE cluster has reached its maximum node quota.

The Cloud Deploy pipeline is missing the required IAM role for the Spinnaker service account.

The Spinnaker service account lacks the compute.instanceGroups.update permission on the project.

Correct: Spinnaker uses this permission to disable old server groups.

The Kayenta canary analysis service is not configured correctly.

Why: The error 'Unable to disable server group: Not authorized to perform compute.instanceGroups.update' directly indicates an IAM permissions issue. In Spinnaker, the service account used to interact with GCP must have the compute.instanceGroups.update permission to manage instance groups during the disable phase of a canary deployment. Option C correctly identifies that the Spinnaker service account lacks this specific permission on the project.

A team uses Cloud Build to deploy a Cloud Run service. The build fails with: 'ERROR: (gcloud.run.services.update) PERMISSION_DENIED: Permission 'run.services.update' denied on resource.' The Cloud Build service account has the Cloud Run Admin role. What is missing?

The build config must use the Cloud Run deployer step instead of the gcloud command.

The Cloud Build service account should have the Owner role on the project.

The Cloud Run service must be deployed in the same region as the build.

The Cloud Build service account needs the 'run.services.update' permission or the Cloud Run Admin role.

The error indicates missing permissions; Cloud Run Admin includes it.

Why: Option D is correct because the error message explicitly states that the 'run.services.update' permission is denied, which means the Cloud Build service account lacks this specific permission. Although the Cloud Run Admin role includes 'run.services.update', the error indicates the role is not properly assigned or the service account is not using it. Reassigning the Cloud Run Admin role or directly granting the 'run.services.update' permission resolves the issue.

An organization uses Cloud Build with a private pool to build container images that require access to on-premises Artifactory. After moving to a new VPC, builds fail with 'Connection refused' when fetching dependencies. What is the best step to troubleshoot?

Verify that VPC Network Peering is established between the Cloud Build private pool's service producer VPC and the customer VPC, and that routes to on-premises are present.

Private pools require peering; missing peering stops traffic.

Verify that the Cloud Build service account has the dns.networks.bindPrivateZone permission.

Check that the Cloud Build service account has the storage.objectViewer role on the Artifactory bucket.

Ensure that Cloud NAT is configured in the private pool's VPC.

Why: The error 'Connection refused' indicates that the Cloud Build private pool's worker VMs cannot reach the on-premises Artifactory server. Private pools are deployed in a Google-managed service producer VPC that must be connected to the customer VPC via VPC Network Peering. Without this peering and the correct routes to the on-premises network (e.g., via Cloud VPN or Dedicated Interconnect), traffic from the private pool is dropped, causing the connection refusal.

A team uses Cloud Build with a cloudbuild.yaml that deploys to multiple environments. They want to ensure that the production deployment step only runs when the build is triggered by a tag matching 'v*.*.*'. Which TWO configurations achieve this? (Choose two.)

In the cloudbuild.yaml, use a 'waitFor' condition that only runs the production step when the substitution variable $TAG_NAME matches 'v*.*.*'.

Conditional step execution based on tag substitution.

Create a Cloud Build trigger with a tag filter '^v[0-9]+\.[0-9]+\.[0-9]+$' and use that trigger for production deployments.

Tag filter restricts trigger to matching tags.

In the cloudbuild.yaml, add a condition that checks if the branch name matches 'v*.*.*'.

Create a separate cloudbuild.yaml for production and use a branch filter '^main$' to trigger it.

Configure a manual approval step in Cloud Build that requires a production manager to approve before running the production deployment.

Why: Option A is correct because Cloud Build supports substitution variables like $TAG_NAME, which the production step can test before it deploys. Note that `waitFor` by itself only declares which earlier steps a step depends on, so in practice the check lives in the step's entrypoint script, which exits early when $TAG_NAME does not match the pattern 'v*.*.*'. This keeps environment-specific control within a single cloudbuild.yaml. Option B reaches the same outcome at the trigger level: a tag filter of '^v[0-9]+\.[0-9]+\.[0-9]+$' means the production build is only started for matching tags.
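
The glob 'v*.*.*' and the regex '^v[0-9]+\.[0-9]+\.[0-9]+$' are not equivalent, which is worth checking before choosing one. The sketch below compares them on a few sample tags using Python's `fnmatch` and `re`; the helper names and sample tags are illustrative only.

```python
import fnmatch
import re

TAG_GLOB = "v*.*.*"
TAG_REGEX = re.compile(r"^v[0-9]+\.[0-9]+\.[0-9]+$")


def matches_glob(tag: str) -> bool:
    """Shell-style match: '*' accepts any characters, including none."""
    return fnmatch.fnmatchcase(tag, TAG_GLOB)


def matches_regex(tag: str) -> bool:
    """Strict match: 'v' followed by three dot-separated numbers."""
    return TAG_REGEX.match(tag) is not None


for tag in ("v1.2.3", "v1.2", "v1.2.3-beta", "vnext.rc.final"):
    print(f"{tag}: glob={matches_glob(tag)} regex={matches_regex(tag)}")
```

The glob accepts tags such as `v1.2.3-beta` and `vnext.rc.final`, while the regex admits only plain three-part numeric versions, so the trigger filter is the stricter gate.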

Domain 5: Implementing service monitoring strategies

A team is monitoring a production service on Google Kubernetes Engine (GKE) and notices that a deployment is occasionally returning HTTP 503 errors. The team has set up a ServiceMonitor in Prometheus to scrape metrics from the pods. What is the most likely cause of the intermittent 503 errors?

The pods are crashing and restarting frequently.

The Prometheus scrape interval is too long, causing missed metrics.

The readiness probes are failing, causing the pods to be removed from the service endpoints.

Readiness probe failures remove pods from service endpoints, causing 503s if all replicas fail.

The container resource limits are set too low, causing out-of-memory errors.

Why: Intermittent HTTP 503 errors in a GKE deployment typically indicate that the service's endpoints are temporarily unavailable. When a readiness probe fails, Kubernetes removes the pod from the Service's endpoints, causing traffic to be routed to remaining healthy pods. If multiple pods fail their readiness probes simultaneously or in quick succession, the Service may have no available endpoints, resulting in 503 errors for incoming requests.

A cloud operations team is implementing monitoring for a microservices application deployed on Compute Engine. They want to create a custom dashboard in Cloud Monitoring that shows the 99th percentile latency of a specific service over the last hour. Which combination of Cloud Monitoring features should they use?

Use a gauge metric with the max alignment function in a Metrics Explorer chart.

Use a distribution metric with the 99th percentile alignment function in a Metrics Explorer chart.

Distribution metrics support percentile alignments like 99th percentile.

Use an uptime check metric and configure the latency percentile in the chart.

Create a logs-based metric from application logs and use the count alignment.

Why: Option B is correct because Cloud Monitoring's distribution metrics inherently store a histogram of values, allowing percentile calculations like the 99th percentile. By selecting the 99th percentile alignment function in a Metrics Explorer chart, the dashboard directly computes and displays the desired latency threshold from the distribution data over the specified time window.
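
To show why a histogram is enough to estimate a percentile, the sketch below walks the cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank. It is a generic illustration, not Cloud Monitoring's actual algorithm, and the bucket bounds and counts are hypothetical.

```python
def percentile_from_histogram(bounds, counts, percentile):
    """Estimate a percentile from bucket counts.

    bounds has len(counts) + 1 entries; bucket i covers
    [bounds[i], bounds[i + 1]).
    """
    if len(bounds) != len(counts) + 1:
        raise ValueError("bounds must have one more entry than counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    target = total * percentile / 100.0  # rank of the requested percentile
    cumulative = 0
    for i, count in enumerate(counts):
        if count and cumulative + count >= target:
            # Assume values are spread evenly inside the bucket.
            fraction = (target - cumulative) / count
            return bounds[i] + fraction * (bounds[i + 1] - bounds[i])
        cumulative += count
    return bounds[-1]


# Hypothetical latency histogram in milliseconds.
bounds = [0, 100, 200, 400]
counts = [50, 30, 20]
print(percentile_from_histogram(bounds, counts, 99))
```

The estimate is only as fine as the bucket layout: here the 99th percentile falls somewhere in the 200-400 ms bucket, and interpolation places it near the top of that range.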

An e-commerce platform is using Cloud Load Balancing with a backend service that has a custom health check. The health check is failing intermittently, causing traffic to be routed away from healthy instances. The team has enabled Cloud Logging and wants to diagnose the issue. Which log view should they examine to see the health check probe results?

VPC flow logs

Cloud Audit Logs (Admin Activity)

Instance serial port output logs

Load balancer logs (type: 'loadbalancing.googleapis.com')

Load balancer logs contain health check probe results.

Why: Load balancer logs (type: 'loadbalancing.googleapis.com') contain detailed records of health check probes, including the probe source IP, target instance, response code, and latency. This is the exact log view that captures health check probe results, enabling the team to identify intermittent failures by correlating probe timestamps with instance health status changes.

A DevOps engineer is setting up alerting policies for a critical API service. They want to receive an alert if the error rate exceeds 5% for at least 5 minutes, but only during business hours (9 AM to 5 PM). Which approach should they use?

Create a log-based metric for errors and use a condition with a threshold, then set the alert policy to only run during business hours using the 'condition' schedule.

Create an alerting policy with a condition that triggers when the error rate is above 5% for 5 minutes, and configure the notification channel to only send notifications during business hours using a webhook receiver that checks time.

This approach uses a custom notification channel to filter by time.

Create two separate alert policies, one for business hours and one for off-hours, each with different thresholds.

Use Cloud Scheduler to enable and disable the alerting policy at the start and end of business hours.

Why: Option B is correct because it uses a single alerting policy with a condition that triggers when the error rate exceeds 5% for 5 minutes, and then controls notification delivery via a webhook receiver that checks the current time. This approach ensures the alert is evaluated continuously (so the 5-minute window is respected) but only notifications are suppressed outside business hours, which is the most reliable way to meet the requirement without missing alert evaluations or relying on external scheduling.
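
The time check inside such a webhook receiver could look like the sketch below. It covers only the decision logic: the function name is hypothetical, parsing the incoming notification and forwarding it are not shown, and the caller is assumed to pass a timestamp already converted to the team's local time zone.

```python
from datetime import datetime


def should_notify(alert_time: datetime, start_hour: int = 9,
                  end_hour: int = 17) -> bool:
    """Forward the alert only between 9 AM (inclusive) and 5 PM (exclusive)."""
    return start_hour <= alert_time.hour < end_hour


print(should_notify(datetime(2024, 1, 15, 10, 30)))  # mid-morning
print(should_notify(datetime(2024, 1, 15, 17, 0)))   # exactly 5 PM
```

Weekends and holidays are not handled here; the question defines business hours only by time of day, so any calendar rules would be an extension.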

A company is running a stateful workload on Compute Engine and has configured a TCP health check on port 8080. The health check is failing, but the application is running and responding on port 8080 when tested manually from within the instance. What is the most likely cause of the health check failure?

The health check is configured to use port 80 instead of port 8080.

The firewall rules are not allowing traffic from the health check probe IP ranges.

Health check probes use specific IP ranges that must be allowed.

The instance's DNS resolution is failing, causing the health check to use the wrong IP.

The health check response timeout is set too low (e.g., 1 second).

Why: The health check probes originate from Google's health check systems, which use specific IP ranges (e.g., 35.191.0.0/16, 130.211.0.0/22). If firewall rules on the instance or VPC do not explicitly allow inbound traffic from these probe IP ranges on port 8080, the health check will fail even though the application is running and responding to manual tests from within the instance. This is the most common cause of health check failures when the application itself is healthy.
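As a rough aid for auditing firewall rules, the check below tests whether a set of allowed source ranges covers the probe ranges named in the explanation. It is a sketch using only the Python standard library; the two CIDR blocks are the examples quoted above, and the function names are hypothetical.

```python
import ipaddress

# Example probe ranges quoted in the explanation above.
PROBE_RANGES = [
    ipaddress.ip_network("35.191.0.0/16"),
    ipaddress.ip_network("130.211.0.0/22"),
]


def is_probe_source(ip: str) -> bool:
    """Return True if `ip` falls inside one of the probe ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PROBE_RANGES)


def probes_allowed(allowed_source_ranges: list[str]) -> bool:
    """Return True if every probe range is covered by some allowed source range."""
    allowed = [ipaddress.ip_network(r) for r in allowed_source_ranges]
    return all(any(p.subnet_of(a) for a in allowed) for p in PROBE_RANGES)


if __name__ == "__main__":
    # A rule that only admits internal traffic does not cover the probes.
    print(probes_allowed(["10.0.0.0/8"]))
    print(probes_allowed(["35.191.0.0/16", "130.211.0.0/22"]))
```

A manual test from inside the instance never crosses the firewall, which is why it succeeds while the probes fail.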

Which TWO of the following are best practices for implementing service monitoring in Google Cloud? (Choose 2)

Set static alert thresholds without considering historical baselines.

Use Cloud Monitoring uptime checks to verify that services are reachable from external locations.

Uptime checks verify external accessibility.

Use the USE method (Utilization, Saturation, Errors) for service-level monitoring.

Define service level indicators (SLIs) using the RED method (Rate, Errors, Duration).

RED metrics are a best practice for service monitoring.

Alert on cause-based metrics (e.g., CPU utilization) rather than symptom-based metrics (e.g., latency).

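Why: Options B and D are correct. Cloud Monitoring uptime checks verify that a service is reachable from external locations by sending HTTP, HTTPS, or TCP requests from Google Cloud's global vantage points, which validates external availability and helps detect regional outages or DNS resolution issues. The RED method (Rate, Errors, Duration) describes a service by its request behavior, which maps directly onto service level indicators. The other options are anti-patterns: static thresholds ignore historical baselines, the USE method targets resources rather than services, and cause-based alerts on metrics such as CPU utilization miss the user-facing symptoms.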
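To make the RED method concrete, the sketch below computes rate, error ratio, and a duration figure from a batch of request records. The `Request` type, the `red_summary` function, and the choice of HTTP 5xx as "error" and median as the duration statistic are all assumptions made for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize a window of requests as Rate, Errors, Duration."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx counts as an error
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "median_ms": statistics.median(r.duration_ms for r in requests) if total else 0.0,
    }


if __name__ == "__main__":
    sample = [Request(200, 100), Request(200, 200), Request(500, 300), Request(503, 400)]
    print(red_summary(sample, window_seconds=2))
```

Each of the three values is a candidate SLI, since all of them describe what users of the service experience.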

Domain 6: Optimizing service performance

Your team has deployed a microservices application on Google Kubernetes Engine (GKE). You notice that one service has high latency during peak hours. The service is CPU-bound and uses a HorizontalPodAutoscaler (HPA) based on CPU utilization. What is the most likely cause of the latency?

The GKE cluster uses preemptible nodes that are frequently reclaimed.

The HPA's target CPU utilization is set too high, causing the autoscaler to react slowly.

A high target CPU threshold delays scaling, leading to latency.

The service uses a global external HTTP(S) load balancer with session affinity.

The application does not implement request autoscaling at the application layer.

Why: Option B is correct because the HPA adds pods only once average CPU utilization exceeds the target. With a high target, existing pods run close to saturation before any scale-up begins, and new pods need time to start, so latency rises during peak hours while the autoscaler catches up.
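The effect of the target can be seen in the scaling rule documented for the Kubernetes HorizontalPodAutoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance (0.1 by default). The sketch below applies that rule; the function name and the sample numbers are illustrative only.

```python
import math


def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float,
                     tolerance: float = 0.1) -> int:
    """Apply the HPA scaling rule for a CPU utilization target."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Same load (85% average CPU on 4 pods), two different targets.
    print(desired_replicas(4, 85, 90))  # high target: no scale-up yet
    print(desired_replicas(4, 85, 50))  # lower target: scales out
```

With a 90% target, pods already at 85% CPU trigger no scaling, whereas a 50% target would have added pods well before saturation.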

A Cloud Run service is experiencing increased cold start latency. The service is written in Python and uses several large dependencies. Which action would most effectively reduce cold start latency?

Set concurrency to 1 to ensure each request gets a dedicated container.

Increase the CPU allocation to 4 vCPUs.

Set a minimum number of instances to keep containers warm.

Min instances reduce cold starts by keeping containers ready.

Increase memory to 2 GiB.

Why: Option C is correct because setting a minimum number of instances keeps a pool of warm containers ready to serve requests, so requests handled by those instances skip the cold start penalty. Cold starts in Python are particularly severe because of the time required to import large dependencies (e.g., NumPy, TensorFlow) and initialize the runtime. Keeping containers alive bypasses that initialization phase and addresses the root cause of the increased latency; only requests that force scale-out beyond the minimum still incur a cold start.

You are designing a globally distributed application using Cloud Spanner. The application has a write-heavy workload. You notice that write latency increases as the number of nodes increases. What is the most likely cause?

The instance is using a multi-region configuration with too many read-only replicas.

The workload has many cross-node transactions due to split rows.

Cross-split transactions require coordination, increasing latency.

The application is using stale reads for write transactions.

The number of splits is too low, causing hotspots.

Why: Option B is correct because in Cloud Spanner, write-heavy workloads with many cross-node transactions cause increased write latency as nodes are added. This occurs because Spanner splits rows across nodes, and transactions that span multiple splits require two-phase commit (2PC) coordination between nodes, which adds network overhead and latency. Adding more nodes increases the likelihood that a transaction touches multiple splits, exacerbating the coordination cost.

A company runs a stateful workload on Compute Engine VMs with persistent disks. They observe that disk I/O latency spikes periodically. The workload is sensitive to latency. What should they do to improve performance?

Increase the size of the persistent disk.

Migrate to local SSDs for better performance.

Use SSD persistent disks instead of standard persistent disks.

SSD offers lower latency and higher IOPS.

Configure a snapshot schedule to offload I/O.

Why: Option C is correct because SSD persistent disks provide consistent, low-latency I/O performance compared to standard persistent disks, which use spinning media and can exhibit periodic latency spikes under sustained load. For latency-sensitive stateful workloads, SSD persistent disks offer predictable IOPS and throughput, directly addressing the periodic spikes observed.

Your GKE cluster runs a batch job that processes large files from Cloud Storage. The job uses CPUs inefficiently, with low utilization. You want to reduce cost while maintaining throughput. Which approach should you take?

Use Cloud Storage FUSE to stream files directly into containers, avoiding local storage.

Streaming reduces latency and cost by eliminating disk.

Configure the node pool to use spot VMs.

Use local SSDs for faster file access.

Increase the CPU request for the job pods.

Why: Option A is correct because Cloud Storage FUSE allows containers to stream files directly from Cloud Storage without first downloading them to a local disk. This eliminates the I/O bottleneck of writing to local storage and reduces CPU overhead from disk operations, enabling the batch job to process files more efficiently and maintain throughput while using fewer CPU resources.

You are using Cloud CDN with an external HTTPS load balancer. Users in Asia report slow load times for static assets. The origin is in us-central1. What should you do to improve performance?

Switch the load balancer to an internal HTTPS load balancer with gRPC.

Use premium tier networking for the load balancer.

Enable Cloud CDN and configure cache modes for static content.

CDN caches content at edge locations, reducing latency.

Configure a serverless NEG to route traffic to Cloud Functions.

Why: Cloud CDN caches static assets at Google's global edge locations, reducing latency for users in Asia by serving content from a nearby point of presence instead of the us-central1 origin. Enabling Cloud CDN and configuring cache modes for static content (e.g., setting Cache-Control headers or using origin cache policies) ensures that frequently requested assets are served from cache, dramatically improving load times for geographically distant users.

Frequently asked questions

How many questions are on the PCDOE exam?

The PCDOE exam has 60 questions and must be completed in 120 minutes. The passing score is 720/1000.

What types of questions appear on the PCDOE exam?

Scenario-based questions covering exam objectives with detailed answer explanations.

How are PCDOE questions organised by domain?

The exam covers 6 domains: Bootstrapping a Google Cloud organization for DevOps, Managing service incidents, Managing Google Cloud costs, Building and implementing CI/CD pipelines, Implementing service monitoring strategies, Optimizing service performance. Questions are weighted by domain — higher-weight domains appear more on your actual exam.

Are these the actual PCDOE exam questions?

No. These are original exam-style practice questions written against the official Google Cloud PCDOE exam objectives. They are not copied from the real exam. Courseiva focuses on genuine understanding, not memorisation of braindumps.

Google Cloud · Free Practice Questions · Last reviewed May 2026

36 real exam-style questions organised by domain, each with the correct answer highlighted and a plain-English explanation of why it's right, and why the others are wrong.

60 exam questions

120 min time limit

Pass: 720/1000

6 exam domains

Domain 1: Bootstrapping a Google Cloud organization for DevOps

Configure Cloud Source Repositories to require code review for any changes to Terraform configurations that delete resources.

Implement Binary Authorization to require approvals for any delete commands.

Use Resource Manager locks on projects and set up a Cloud Function that triggers on audit logs to require approval before removing the lock.

Locks prevent deletion; Cloud Functions can automate approval workflows.

Use VPC Service Controls to block delete operations on specific services.

Create a Cloud Function that triggers on project creation events and enables the required APIs.

Define an organization policy with a constraint that requires the APIs to be enabled.

Organization policies can enforce API enablement via constraints.

Use Cloud Foundation Toolkit to deploy a project template that includes API enablement.

Create a shared VPC and enable the APIs in the host project only.

Use resource tags to label projects by environment and apply policies via tag-based conditions.

Create folders under the organization for each environment and place projects in the appropriate folder.

Folders allow hierarchical policy inheritance and grouping.

Create separate organizations for each environment.

Use labels on projects to identify environments and then use Cloud Asset Inventory to enforce policies.

Use a user account for the CI/CD pipeline and assign it the necessary roles.

Store service account keys in Secret Manager and have the pipeline retrieve them at runtime.

Generate a single service account key and securely distribute it to the CI/CD system.

Use workload identity federation to allow the CI/CD system to impersonate a service account without keys.

Eliminates the need for keys and follows least privilege.

Domain 2: Managing service incidents

Cloud Monitoring and Cloud Logging

Monitoring shows resource usage; Logging shows container logs and OOM events.

Security Command Center and Cloud Logging

Cloud Trace and Cloud Monitoring

Trace is for request latency, not resource usage or crash logs.

Cloud Error Reporting and Cloud Logging

Roll back the most recent deployment

Rolling back quickly restores the previous stable version.

Begin a detailed postmortem analysis

Disable the alerting policy to reduce noise

Increase the number of instances in the managed instance group

Concurrency per container is too high; reduce concurrency to 10

Lowering concurrency reduces CPU contention, preventing timeouts and 500s.

Maximum instances limit is too low; increase from 10 to 100

Min idle instances is too low; set min idle to 5 to reduce cold starts

Memory limit is too low; increase memory from 256 MiB to 512 MiB

Increase the number of vCPUs of the Cloud SQL instance

Restart the Cloud SQL instance to clear the cache

Migrate the database to Cloud Spanner

Use Cloud SQL Query Insights to find the most time-consuming queries

Query Insights shows top queries by CPU and latency.

Use only critical severity alerts and rely on manual dashboard review for lower severity

Create alerting policies for every available metric to ensure nothing is missed

Set all alert thresholds to 50% above the average value to avoid false positives

Define SLOs and set alert thresholds based on historical error budget consumption

SLO-based alerting focuses on user-facing impact and reduces noise.

Increase the memory limit for the container as a temporary mitigation

Temporary increase buys time for a permanent fix.

Scale down the number of replicas to reduce memory pressure

Roll back the deployment immediately without further investigation

Check container logs for Out of Memory (OOM) killed messages

OOM messages confirm memory exhaustion.

Compare memory usage metrics before and after the deployment using Cloud Monitoring

Identifies if memory usage increased after the change.

Domain 3: Managing Google Cloud costs

Purchase committed use discounts for the preemptible VMs.

Increase the number of preemptible VMs to spread the workload.

Enable sustained use discounts for the existing VMs.

Migrate to Spot VMs, which have a lower price and no maximum runtime.

Spot VMs are the recommended replacement for preemptible VMs and offer lower costs without the 24-hour limit.

Use sole-tenant nodes to isolate the workload.

Use standard (on-demand) VMs and enable sustained use discounts.

Use preemptible VMs and design the job to handle interruptions gracefully.

Preemptible VMs are up to 60% cheaper and suitable for fault-tolerant batch jobs.

Purchase 1-year committed use discounts for the VMs.

A company uses Cloud Storage to store archival data. They want to minimize storage costs while maintaining availability. Which storage class should they use?

Nearline storage class.

Standard storage class.

Archive storage class.

Archive is the lowest-cost storage class for long-term archival data.

Coldline storage class.

A company is using BigQuery for analytics and wants to control costs. They have many queries that scan large amounts of data. Which approach is most effective in reducing query costs?

Switch to flat-rate pricing to cap costs.

Partition tables by date and use partition pruning in queries.

Partitioning limits the data scanned, reducing query costs.

Reserve BigQuery slots for dedicated capacity.

Use clustering to organize data within partitions.

A company uses Cloud CDN to deliver content globally. They notice increasing egress costs. Which change will most effectively reduce egress costs?

Switch to a premium tier network for lower egress rates.

Enable gzip compression for all responses.

Use Cloud Armor to block malicious traffic.

Configure Cloud CDN to cache more content and increase cache hit ratio.

Higher cache hit ratio reduces the amount of data fetched from the origin, lowering egress costs.

A company is using Google Cloud and wants to monitor and control costs. Which TWO actions should they take? (Choose two.)

Set up budget alerts to notify when spending exceeds thresholds.

Budget alerts help monitor and control costs by providing notifications.

Use labels to categorize resources and track costs by team.

Labels enable cost allocation and reporting, helping control costs.

Disable all unnecessary APIs at the organization level.

Export billing data to BigQuery for detailed analysis.

Grant billing account access to all project owners.

Domain 4: Building and implementing CI/CD pipelines

Use a single trigger with a substitution variable like '_BRANCH' and set it to 'main' for integration tests.

Create one trigger with a build config that uses the 'branchName' substitution to conditionally skip integration test steps.

Create two triggers: one with a branch filter for '^main$' that runs integration tests, and another with a branch filter for '^.*$' that runs unit tests.

Correct: separate triggers with branch filters allow different pipelines per branch.

Configure one trigger with no branch filter and rely on developers to manually trigger integration tests.

Kaniko requires a running Docker daemon in the build step.

The base image specified in the Dockerfile is not accessible from the build environment.

The Dockerfile has an invalid instruction causing Kaniko to fail.

The Cloud Build service account does not have the storage.objectAdmin role on the Container Registry bucket.

Missing push permissions cause credential failures.

The GKE cluster has reached its maximum node quota.

The Cloud Deploy pipeline is missing the required IAM role for the Spinnaker service account.

The Spinnaker service account lacks the compute.instanceGroups.update permission on the project.

Correct: Spinnaker uses this permission to disable old server groups.

The Kayenta canary analysis service is not configured correctly.

The build config must use the Cloud Run deployer step instead of the gcloud command.

The Cloud Build service account should have the Owner role on the project.

The Cloud Run service must be deployed in the same region as the build.

The Cloud Build service account needs the 'run.services.update' permission or the Cloud Run Admin role.

The error indicates missing permissions; Cloud Run Admin includes it.

Verify that VPC Network Peering is established between the Cloud Build private pool's service producer VPC and the customer VPC, and that routes to on-premises are present.

Private pools require peering; missing peering stops traffic.

Verify that the Cloud Build service account has the dns.networks.bindPrivateZone permission.

Check that the Cloud Build service account has the storage.objectViewer role on the Artifactory bucket.

Ensure that Cloud NAT is configured in the private pool's VPC.

In the cloudbuild.yaml, use a 'waitFor' condition that only runs the production step when the substitution variable $TAG_NAME matches 'v*.*.*'.

Conditional step execution based on tag substitution.

Create a Cloud Build trigger with a tag filter '^v[0-9]+\.[0-9]+\.[0-9]+$' and use that trigger for production deployments.

Tag filter restricts trigger to matching tags.

In the cloudbuild.yaml, add a condition that checks if the branch name matches 'v*.*.*'.

Create a separate cloudbuild.yaml for production and use a branch filter '^main$' to trigger it.

Configure a manual approval step in Cloud Build that requires a production manager to approve before running the production deployment.

Domain 5: Implementing service monitoring strategies

The pods are crashing and restarting frequently.

The Prometheus scrape interval is too long, causing missed metrics.

The readiness probes are failing, causing the pods to be removed from the service endpoints.

Readiness probe failures remove pods from service endpoints, causing 503s if all replicas fail.

The container resource limits are set too low, causing out-of-memory errors.

Use a gauge metric with the max alignment function in a Metrics Explorer chart.

Use a distribution metric with the 99th percentile alignment function in a Metrics Explorer chart.

Distribution metrics support percentile alignments like 99th percentile.

Use an uptime check metric and configure the latency percentile in the chart.

Create a logs-based metric from application logs and use the count alignment.

VPC flow logs

Cloud Audit Logs (Admin Activity)

Instance serial port output logs

Load balancer logs (type: 'loadbalancing.googleapis.com')

Load balancer logs contain health check probe results.
