GCDL Scaling with Google Cloud operations Questions

28 of 103 questions · Page 2/2 · Scaling with Google Cloud operations · Answers revealed

76
MCQ · easy

An operations team wants to receive an automated alert when their web application's HTTP error rate exceeds 5% for more than 5 minutes. Which Google Cloud product is used to configure this type of metric-based alert?

A. Cloud Logging, by configuring a log-based metric and email notification
B. Cloud Monitoring, by creating an alerting policy on the HTTP error rate metric with a 5-minute evaluation window and notification channel
C. Cloud Trace, by setting a trace sampling threshold for error requests
D. Security Command Center, by configuring a finding for high error rates
Answer: B

Cloud Monitoring is the correct service. An alerting policy specifies: the metric to watch (HTTP error rate), the threshold (5%), the evaluation window (5 minutes), and the notification channel (email, PagerDuty, Slack, etc.). This is a core Cloud Monitoring capability.

Why this answer

Cloud Monitoring is the correct service because it is purpose-built for creating alerting policies based on metrics like HTTP error rates. You can define a condition that triggers when the error rate exceeds 5% for a specified evaluation window (e.g., 5 minutes) and route the alert through a notification channel (e.g., email, Slack). This directly matches the requirement for a metric-based alert with a time-based threshold.
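The condition described above (error rate above 5% sustained over a 5-minute window) can be sketched in plain Python. This is only an illustrative model of the evaluation logic, not the Cloud Monitoring API; `MinuteSample`, `error_rate`, and `should_alert` are hypothetical names, and one sample per minute is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MinuteSample:
    """Request counts for one minute (hypothetical input shape)."""
    total: int
    errors: int


def error_rate(sample: MinuteSample) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return sample.errors / sample.total if sample.total else 0.0


def should_alert(samples: Sequence[MinuteSample],
                 threshold: float = 0.05,
                 window_minutes: int = 5) -> bool:
    """True only if every one of the last `window_minutes` samples exceeds `threshold`."""
    if len(samples) < window_minutes:
        return False  # not enough data to cover the evaluation window
    recent = samples[-window_minutes:]
    return all(error_rate(s) > threshold for s in recent)


if __name__ == "__main__":
    sustained = [MinuteSample(total=1000, errors=80)] * 5
    brief = [MinuteSample(total=1000, errors=80)] * 4 + [MinuteSample(total=1000, errors=10)]
    print(should_alert(sustained), should_alert(brief))
```

A single healthy minute inside the window is enough to keep the condition from firing, which is what separates a sustained problem from a brief blip.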

Exam trap

Google Cloud often tests the misconception that Cloud Logging can directly send alerts, but in reality, Cloud Logging only stores logs and log-based metrics; the alerting policy must always be configured in Cloud Monitoring.

How to eliminate wrong answers

Option A is wrong because Cloud Logging is used for storing and querying log data, not for creating metric-based alerts on HTTP error rates; while log-based metrics can be created, the alert itself must be configured in Cloud Monitoring, and Cloud Logging does not natively support email notification channels for alerts. Option C is wrong because Cloud Trace is a distributed tracing tool for analyzing request latency and performance, not for monitoring error rates or triggering alerts based on percentage thresholds. Option D is wrong because Security Command Center is a security and risk management service that provides findings for vulnerabilities and threats, not for operational metric-based alerting on web application error rates.

77
MCQ · hard

A company's SRE team sets an SLO of 99.5% monthly availability for a non-critical internal tool. A business stakeholder argues the target should be 99.99%. The SRE team pushes back. Which SRE argument best supports keeping the 99.5% target?

A. Higher SLOs are always more expensive to achieve and the company cannot afford cloud infrastructure that provides 99.99% availability
B. For a non-critical internal tool, 99.99% reliability requires disproportionate engineering investment (redundancy, 24/7 on-call, chaos testing) compared to its business value; 99.5% matches the actual reliability need while preserving engineering capacity for higher-value work
C. Google Cloud cannot provide 99.99% availability for any service, so the SLO must be kept lower
D. The team should set 99.5% now and plan to increase it to 99.99% next quarter when the tool becomes more popular
Answer: B

This is the SRE argument. Reliability is not free — achieving 99.99% requires architectural complexity, 24/7 on-call readiness, and ongoing reliability engineering. For an internal tool, this investment would consume engineering time that could build features users value more. The SLO should match what the business actually needs, not maximize reliability for its own sake.

Why this answer

Option B correctly applies the SRE principle of aligning SLOs with business value. For a non-critical internal tool, the cost of achieving 99.99% availability—including redundant infrastructure, 24/7 on-call rotations, and chaos engineering—far exceeds the marginal benefit over 99.5%. This preserves engineering capacity for higher-value work, which is a core tenet of Google's SRE approach to error budgets and cost-benefit analysis.

Exam trap

Google Cloud often tests the misconception that higher SLOs are always better or that cloud providers universally guarantee high availability, when the correct SRE approach is to set SLOs based on the actual user experience and business impact, not arbitrary targets.

How to eliminate wrong answers

Option A is wrong because it incorrectly assumes higher SLOs are always more expensive; the real issue is disproportionate cost relative to business value, not absolute affordability. Option C is wrong because Google Cloud does offer services with 99.99% availability (e.g., Cloud Spanner multi-region configurations), so the statement is factually incorrect. Option D is wrong because it suggests a planned future increase without justification; SLOs should be set based on current reliability needs and error budget policy, not arbitrary future popularity.

78
Multi-Select · medium

Which TWO statements correctly describe Cloud Run scaling behavior?

Select 2 answers
A. The maximum number of instances can be set to 'default' which is unlimited.
B. You can set a minimum number of instances to ensure zero cold starts.
C. You can define a target concurrency to control how many requests each container instance handles.
D. The number of container instances can be scaled to zero when there is no traffic.
E. Autoscaling uses CPU and memory utilization to make decisions.
Answers: C, D

Container concurrency setting controls the maximum number of concurrent requests per instance.

Why this answer

Option C is correct because Cloud Run allows you to set a target concurrency (the number of simultaneous requests a single container instance can handle). This is a key scaling parameter that controls how many requests are routed to each instance before Cloud Run spins up additional instances. By default, concurrency is set to 80, but you can adjust it up to 1000 or set it to 1 for sequential processing. Option D is correct because Cloud Run scales the number of instances down to zero when a service receives no traffic, unless a minimum number of instances has been configured.
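As a rough illustration of how per-instance concurrency drives instance count, the sketch below estimates how many instances a given number of in-flight requests would need. It is a simplified model, not Cloud Run's actual autoscaling algorithm; `estimated_instances` and the default `max_instances` cap of 100 are assumptions made for the example.

```python
import math


def estimated_instances(concurrent_requests: int,
                        concurrency: int = 80,
                        min_instances: int = 0,
                        max_instances: int = 100) -> int:
    """Estimate instances needed so no instance exceeds `concurrency` in-flight requests."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    needed = math.ceil(concurrent_requests / concurrency)
    # Clamp to the configured bounds; with min_instances=0 the service can scale to zero.
    return max(min_instances, min(needed, max_instances))


if __name__ == "__main__":
    for load in (0, 80, 400, 401):
        print(load, estimated_instances(load))
```

With no traffic and no minimum configured, the estimate is zero instances, which matches the scale-to-zero behavior in option D.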

Exam trap

Google Cloud often tests the misconception that Cloud Run uses CPU or memory utilization for autoscaling, when in fact it uses request concurrency as the primary metric, and candidates may incorrectly select Option E because they associate autoscaling with resource metrics from other services.

79
MCQ · hard

A company is evaluating whether to adopt a multi-cloud strategy (using two or more cloud providers for different workloads). An engineer lists the following arguments: (1) resilience against a single cloud provider outage, (2) negotiating leverage on pricing, (3) using best-of-breed services from each provider. A cloud architect cautions that multi-cloud also introduces significant challenges. What is the most significant operational challenge of a multi-cloud approach?

A. Multi-cloud requires purchasing separate hardware for each cloud provider's environment
B. Significantly increased operational complexity: teams need expertise in multiple providers' tools, security models, and APIs, while governance, monitoring, and cost management must span inconsistent environments
C. Cloud providers refuse to allow customers to use competing providers simultaneously
D. Multi-cloud makes it impossible to use any managed services because applications must be portable across providers
Answer: B

This is the primary challenge. Every cloud provider has different services, CLIs, IAM systems, networking models, pricing, and monitoring tools. Maintaining expertise and governance across multiple providers dramatically increases the operational burden and requires larger, more specialized teams. The benefits must be weighed against this real cost.

Why this answer

Option B is correct because multi-cloud environments inherently increase operational complexity. Teams must master distinct APIs, security models (e.g., IAM policies differ between AWS and GCP), monitoring tools (e.g., CloudWatch vs. Cloud Monitoring), and cost management consoles.

Governance and compliance must be enforced consistently across heterogeneous platforms, which often requires custom tooling or third-party solutions, making day-to-day operations significantly more challenging than a single-cloud approach.

Exam trap

The trap here is that candidates may underestimate operational complexity and instead focus on perceived hardware or vendor lock-in issues, but the GCDL exam emphasizes that managing multiple distinct cloud environments is the primary operational challenge.

How to eliminate wrong answers

Option A is wrong because multi-cloud does not require purchasing separate hardware; cloud providers abstract the underlying infrastructure, and customers interact via APIs and virtualized resources. Option C is wrong because cloud providers do not prohibit customers from using competing providers; multi-cloud is a common and supported architecture. Option D is wrong because multi-cloud does not make managed services impossible; applications can use provider-specific managed services (e.g., GCP Cloud SQL, AWS RDS) while abstracting portability via containers or service meshes, though portability is not a strict requirement.

80
MCQ · easy

A cloud architect is reviewing logs from a production incident. She wants to search all log entries across multiple Google Cloud projects for error messages containing a specific string. Which Google Cloud product enables centralized log searching and analysis across an entire organization?

A. Cloud Monitoring, which provides metric dashboards and alerting
B. Cloud Logging, which centralizes logs from all Google Cloud services and projects and supports powerful filtering and search queries across an organization
C. BigQuery, by exporting logs to a dataset and running SQL queries to find matching error entries
D. Cloud Trace, which provides distributed request tracing for latency analysis
Answer: B

Cloud Logging is the correct answer. It aggregates logs from all sources (Compute Engine, Cloud Run, GKE, App Engine, etc.) across all projects into a centralized store. Its query language allows searching for specific text strings, error levels, time ranges, and resource attributes across the entire organization.

Why this answer

Cloud Logging (formerly Stackdriver Logging) is the Google Cloud service designed to ingest, store, and analyze log data from all Google Cloud services and projects. It supports centralized log aggregation across an entire organization via aggregated sinks and the Logs Explorer, enabling powerful filtering and search queries (e.g., using the `textPayload` or `jsonPayload` fields) to find specific error strings across multiple projects without needing to export data elsewhere.
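To make the filtering idea concrete, here is a small sketch that builds a filter string in the style of the Logging query language, combining a severity comparison with a substring match on `textPayload`. The helper name `build_error_filter` is hypothetical, and the filter shape reflects a common pattern rather than the only valid query; structured logs would need `jsonPayload` fields instead.

```python
def build_error_filter(search_text: str, min_severity: str = "ERROR") -> str:
    """Build a log filter matching entries at or above a severity that contain a string."""
    # Escape backslashes and double quotes so the search text stays inside the quoted literal.
    escaped = search_text.replace("\\", "\\\\").replace('"', '\\"')
    return f'severity>={min_severity} AND textPayload:"{escaped}"'


if __name__ == "__main__":
    print(build_error_filter("connection refused"))
```

A filter like this could then be pasted into the Logs Explorer with the scope set to the relevant projects or organization.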

Exam trap

Google Cloud often tests the distinction between native log search (Cloud Logging) and log export/analysis (BigQuery), tempting candidates to choose BigQuery because they know SQL, but the question specifically asks for a product that enables centralized searching without requiring an export step.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring focuses on metrics, dashboards, and alerting based on time-series data, not on searching raw log entries for specific error strings. Option C is wrong because while BigQuery can query exported logs via SQL, it is not a native centralized log search tool; it requires an additional export step and does not provide real-time log searching across the organization without manual setup. Option D is wrong because Cloud Trace is designed for distributed request tracing and latency analysis, not for searching log entries for error messages.

81
MCQ · easy

A company's web service has a Service Level Objective (SLO) of 99.9% monthly availability. In a 30-day month, how many minutes of downtime are allowed before the SLO is violated?

A. ~4.3 minutes
B. ~43.2 minutes
C. ~7.2 hours
D. ~8.6 hours
Answer: B

99.9% availability = 0.1% downtime. In a 30-day month (43,200 minutes), 0.1% = 43.2 minutes of allowed downtime — the classic 'three nines' error budget.

Why this answer

The SLO of 99.9% monthly availability means the service can be unavailable for 0.1% of the total monthly time. In a 30-day month, total minutes are 30 × 24 × 60 = 43,200 minutes. 0.1% of 43,200 minutes is 43.2 minutes, so option B is correct.
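The arithmetic above generalizes to any availability target. The short sketch below computes the allowed downtime for a few common targets; `allowed_downtime_minutes` is a hypothetical helper name, and the 30-day month is taken from the question.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLO over `days` days."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = days * 24 * 60  # 43,200 for a 30-day month
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes")
```

Each additional nine divides the budget by ten, which is why options A and C in this question correspond to 99.99% and 99% respectively.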

Exam trap

The trap here is that candidates often confuse 99.9% with 99.99% (four nines) and incorrectly calculate ~4.3 minutes, or they compute 0.1% of 30 days in hours (0.72 hours) and then misread it as 7.2 hours.

How to eliminate wrong answers

Option A is wrong because ~4.3 minutes corresponds to 99.99% availability (0.01% of 43,200 minutes), not 99.9%. Option C is wrong because ~7.2 hours (432 minutes) corresponds to 99% availability (1% of 43,200 minutes). Option D is wrong because ~8.6 hours (516 minutes) is not a standard SLO calculation; it might arise from miscomputing 0.1% of 30 days in hours (0.1% of 720 hours = 0.72 hours, not 8.6 hours).

82
MCQ · medium

A company's application experiences traffic spikes every weekday morning when employees log in at 9 AM. The team wants their infrastructure to automatically handle these spikes without manual intervention and without over-provisioning resources all day. Which Google Cloud capability addresses this?

A. Purchase reserved capacity for peak load and configure it to be active only on weekdays.
B. Configure autoscaling on the application's infrastructure to automatically scale up for load and scale down during off-peak hours.
C. Deploy additional VMs manually each weekday morning and terminate them at night.
D. Use Cloud Monitoring to send an email alert when CPU exceeds 80% so the team can manually scale.
Answer: B

Autoscaling monitors metrics (CPU, requests, custom) and automatically adds instances during the morning spike. Scheduled autoscaling can proactively scale before 9 AM. Resources scale down when load decreases.

Why this answer

Option B is correct because Google Cloud's managed instance groups (MIGs) with autoscaling can automatically adjust the number of VM instances based on load metrics (e.g., CPU utilization, requests per second). This handles the 9 AM traffic spike without manual intervention and avoids over-provisioning during off-peak hours by scaling down when demand decreases.
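One way to picture utilization-based autoscaling is as a proportional rule: size the group so that average utilization lands near a target. The sketch below is a simplified model under that assumption, not the exact algorithm used by managed instance groups; `recommended_instances` and its bounds are hypothetical, and utilization is expressed in whole percent to keep the arithmetic exact.

```python
def recommended_instances(current_instances: int,
                          current_utilization_pct: int,
                          target_utilization_pct: int,
                          min_instances: int = 1,
                          max_instances: int = 10) -> int:
    """Instances needed to bring average utilization back to the target."""
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")
    load = current_instances * current_utilization_pct
    desired = -(-load // target_utilization_pct)  # integer ceiling division
    return max(min_instances, min(desired, max_instances))


if __name__ == "__main__":
    # 9 AM spike: 4 instances running at 90% against a 60% target.
    print(recommended_instances(4, 90, 60))
    # Off-peak: 6 instances idling at 20% against the same target.
    print(recommended_instances(6, 20, 60))
```

The same rule that adds instances during the morning spike removes them once utilization falls, which is how autoscaling avoids paying for peak capacity all day.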

Exam trap

The trap here is that candidates confuse 'reserved capacity' (a billing commitment) with 'autoscaling' (an operational scaling mechanism), or they think manual or alert-based actions satisfy the 'automatic' requirement, but the exam specifically tests the distinction between automated scaling policies and manual or notification-driven processes.

How to eliminate wrong answers

Option A is wrong because reserved capacity (committed use discounts) is a pricing model for consistent, long-term usage, not a mechanism to dynamically activate resources only on weekdays; it does not automatically handle spikes. Option C is wrong because manually deploying and terminating VMs each weekday contradicts the requirement for 'automatic' handling without manual intervention. Option D is wrong because Cloud Monitoring alerts require human action to scale, which is not automatic and introduces delay, failing the 'without manual intervention' requirement.

83
MCQ · hard

Refer to the exhibit. A DevOps engineer notices that the alert fires even when there is only a single 5-second spike of errors within a one-minute period. What is the most likely cause?

A. The trigger count is set to 1, so a single minute of high rate fires the alert
B. The alignment period is too short (60s)
C. The threshold value (5) is too low
D. The trigger count is set to 2, so two consecutive periods are needed
Answer: A

With trigger count 1, any single period above threshold triggers the alert.

Why this answer

Option A is correct because the trigger count is 1, meaning the alert fires after just one alignment period (1 minute) with a rate above threshold. A single 5-second spike can cause a high rate during that minute, triggering the alert. Option B is incorrect because the alignment period is 60s, which is appropriate.

Option C is incorrect because the threshold value is set, but the question is about the trigger firing on a single spike. Option D is incorrect because the trigger count is 1, not 2.
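The effect of the trigger count can be illustrated with a small simulation that treats it the way this question does: as the number of consecutive aligned periods that must exceed the threshold. This is a simplified model and not a description of Cloud Monitoring's internals; `alert_fires` is a hypothetical name, and the sample values are invented because the exhibit is not reproduced here.

```python
from typing import Sequence


def alert_fires(aligned_values: Sequence[float],
                threshold: float,
                trigger_count: int) -> bool:
    """True if `trigger_count` consecutive aligned periods exceed `threshold`."""
    if trigger_count < 1:
        raise ValueError("trigger_count must be at least 1")
    streak = 0
    for value in aligned_values:
        # Extend the streak on a violating period, reset it otherwise.
        streak = streak + 1 if value > threshold else 0
        if streak >= trigger_count:
            return True
    return False


if __name__ == "__main__":
    one_spike = [0.0, 12.0, 0.0, 0.0]  # one noisy minute, then quiet
    print(alert_fires(one_spike, threshold=5, trigger_count=1))
    print(alert_fires(one_spike, threshold=5, trigger_count=2))
```

Under this model, requiring two consecutive violating periods suppresses the alert for an isolated spike while still catching a sustained problem.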

84
MCQ · medium

A DevOps team wants to adopt GitOps practices for managing their Google Cloud infrastructure. Which combination of tools and practices defines a GitOps approach to cloud infrastructure management?

A. Manually applying Terraform changes from engineers' local machines and documenting changes in a shared wiki
B. Storing all infrastructure as code (Terraform or Config Connector) in a Git repository, using pull requests for all changes, and automated CI/CD pipelines that apply changes and detect drift from the declared state
C. Using the Google Cloud Console to make infrastructure changes and exporting the configuration to Git after each change
D. GitOps only applies to application code deployment, not to cloud infrastructure management
Answer: B

This is GitOps. Git repo as truth: ✓. Pull request process for changes: ✓ (provides review, approval, audit trail). Automated reconciliation: ✓ (CI/CD applies changes and detects drift). This pattern makes infrastructure management reproducible, auditable, and collaborative.

Why this answer

Option B is correct because GitOps is defined by using a Git repository as the single source of truth for declarative infrastructure, with pull requests driving changes and automated CI/CD pipelines reconciling the actual state with the declared state. This approach enforces version control, auditability, and drift detection, which are core to managing Google Cloud infrastructure at scale with tools like Terraform or Config Connector.
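Drift detection, the reconciliation step mentioned above, amounts to comparing the state declared in Git with the state observed in the cloud. The sketch below shows that comparison on plain dictionaries; it is a toy model and not how Terraform or Config Connector work internally, and `detect_drift` plus the resource names are hypothetical.

```python
from typing import Any, Dict

ResourceMap = Dict[str, Dict[str, Any]]


def detect_drift(declared: ResourceMap, actual: ResourceMap) -> Dict[str, str]:
    """Classify every resource whose live state differs from the declared state."""
    drift: Dict[str, str] = {}
    for name in sorted(declared.keys() | actual.keys()):
        if name not in actual:
            drift[name] = "missing"      # declared in Git but not deployed
        elif name not in declared:
            drift[name] = "unmanaged"    # created outside the Git workflow
        elif declared[name] != actual[name]:
            drift[name] = "modified"     # changed out of band
    return drift


if __name__ == "__main__":
    declared = {"bucket-logs": {"storage_class": "STANDARD"},
                "vm-web": {"machine_type": "e2-medium"}}
    actual = {"bucket-logs": {"storage_class": "NEARLINE"},
              "vm-debug": {"machine_type": "e2-small"}}
    print(detect_drift(declared, actual))
```

A pipeline would typically run a check like this on a schedule and either report the differences or re-apply the declared state.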

Exam trap

The trap here is that candidates may confuse GitOps with simply storing code in Git (Option A) or think it only applies to applications (Option D), when in fact GitOps requires automated reconciliation and pull-request-driven workflows for infrastructure as code.

How to eliminate wrong answers

Option A is wrong because manually applying Terraform changes from local machines bypasses version control and automation, violating the GitOps principle of using Git as the single source of truth and eliminating audit trails and drift detection. Option C is wrong because making changes via the Google Cloud Console and exporting to Git afterward is a reactive, post-hoc approach that does not enforce declarative state management or prevent configuration drift, and it lacks the pull-request-based change workflow central to GitOps. Option D is wrong because GitOps is explicitly applicable to cloud infrastructure management, not just application code deployment; tools like Terraform and Config Connector are designed to manage infrastructure declaratively via Git-driven workflows.

85
MCQ · medium

An SRE team has a monthly error budget of 43 minutes (99.9% SLO). In the first week of the month, a deployment causes a 50-minute outage. What should the SRE team do for the remainder of the month, and why?

A. Immediately deploy a hotfix to restore features that were rolled back during the outage.
B. Freeze feature deployments for the rest of the month, focus on reliability improvements, and investigate the deployment process that caused the outage.
C. Negotiate with stakeholders to lower the SLO to 99.5% to get more error budget.
D. Continue deploying features normally; the outage was a one-time event and won't happen again.
Answer: B

Budget exhausted = feature freeze. SRE teams use budget exhaustion as a signal to pause new features and focus on root cause analysis and reliability improvements before resuming velocity.

Why this answer

The team has already consumed more than the entire monthly error budget (50 minutes used vs. 43 minutes allowed), so the 99.9% SLO has already been missed for this month. To avoid adding further downtime, they should freeze feature deployments and focus on reliability improvements. This is a core SRE practice: when the error budget is exhausted, the team shifts from feature velocity to stability, investigating the root cause and hardening the deployment process.
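The budget accounting behind this decision is simple enough to sketch. The example below tracks outage minutes against the monthly budget and reports whether feature deployments should continue; `remaining_budget_minutes` and `feature_deploys_allowed` are hypothetical helpers, and the freeze rule is the policy described in this question rather than a universal standard.

```python
from typing import Iterable


def remaining_budget_minutes(budget_minutes: int, outages: Iterable[int]) -> int:
    """Error budget left after subtracting each outage's duration in minutes."""
    return budget_minutes - sum(outages)


def feature_deploys_allowed(budget_minutes: int, outages: Iterable[int]) -> bool:
    """Feature work continues only while some error budget remains."""
    return remaining_budget_minutes(budget_minutes, outages) > 0


if __name__ == "__main__":
    budget = 43          # minutes per month, from the question
    outages = [50]       # the first-week deployment outage
    print(remaining_budget_minutes(budget, outages))
    print(feature_deploys_allowed(budget, outages))
```

A negative remainder means the budget is overspent, so the policy calls for a freeze until the next budget window.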

Exam trap

The exam often tests the misconception that you can renegotiate the SLO to fix an error budget deficit. Relaxing the target after a miss only hides the breach; the correct response is to halt feature deployments until the next budget window.

How to eliminate wrong answers

Option A is wrong because deploying a hotfix to restore rolled-back features would introduce further change risk when the error budget is already negative, potentially causing additional downtime and SLO violations. Option C is wrong because changing the SLO from 99.9% to 99.5% is a relaxation: it would enlarge the monthly error budget to about 216 minutes (0.5% of 43,200), but loosening the target after a miss only masks the breach. An SLO should reflect the reliability users need, not be adjusted to absorb an outage. Option D is wrong because continuing normal deployments ignores the fact that the error budget is exhausted; treating a 50-minute outage as a one-time event is a common fallacy that skips the root cause investigation and risks further downtime.

86
MCQ · medium

A company runs a web application on Compute Engine instances behind a managed instance group with autoscaling based on CPU utilization. After a marketing campaign, traffic spikes and the autoscaler adds instances quickly, but the application becomes slow. What is the most likely cause?

A. Autoscaler uses CPU utilization but the application is memory-bound
B. Instances are in different zones causing inter-zone latency
C. Autoscaling cooldown period is too short
D. Health check interval is too long
Answer: A

If the application is memory-bound, adding instances based on CPU does not help; the bottleneck remains memory.

Why this answer

The autoscaler adds instances based on CPU utilization, but if the application is memory-bound, adding more instances does not alleviate memory pressure. Each new instance still runs the same memory-intensive workload, so CPU may remain low while memory is exhausted, causing slowdowns. The autoscaler fails to address the actual bottleneck, leading to poor performance despite scaling out.

Exam trap

The trap here is that candidates assume CPU utilization is always the correct metric for scaling, but the question tests the understanding that autoscaling only works well when the chosen metric matches the actual bottleneck of the application.

How to eliminate wrong answers

Option B is wrong because managed instance groups with autoscaling can span multiple zones, but inter-zone latency within the same region is negligible (typically <1ms) and would not cause significant slowdowns. Option C is wrong because a cooldown period that is too short would cause the autoscaler to add instances too aggressively, not make the application slow; it might lead to over-provisioning or thrashing, but not directly to performance degradation. Option D is wrong because a health check interval that is too long delays detection of unhealthy instances, but does not cause the application to become slow; it affects availability, not performance under load.

87
MCQ · easy

A large online retailer operates a microservices-based e-commerce platform on Google Kubernetes Engine (GKE) across multiple zones. The application consists of several stateless services that handle customer traffic, inventory, and order processing. Recently, the company migrated its relational database to Cloud Spanner to achieve global scalability and strong consistency. After the migration, during peak shopping periods (e.g., Black Friday), the application experiences significant performance degradation. The operations team monitors CPU utilization of the pods and finds it consistently below 60% even under heavy load. However, Cloud Spanner metrics show high query latency and increased number of transactions waiting for lock conflicts. The team suspects that the bottleneck is now the database, not the compute. The application is designed to scale horizontally by adding more pod replicas. The team wants to ensure that scaling decisions are based on the actual performance bottleneck. What should they do?

A. Scale the GKE cluster to use larger node instances.
B. Increase the CPU request limit for the pods to allow higher CPU usage.
C. Reduce the number of pods to decrease Spanner load.
D. Modify the Horizontal Pod Autoscaler (HPA) to scale based on a custom metric that reflects Cloud Spanner query latency.
Answer: D

This aligns scaling with the actual bottleneck, increasing pods when Spanner latency rises.

Why this answer

Option D is correct because the Horizontal Pod Autoscaler (HPA) can be configured to scale based on custom metrics, such as Cloud Spanner query latency. Since the bottleneck is the database, scaling pods based on CPU utilization (which remains low) would not resolve the issue; instead, scaling based on Spanner latency ensures that the application adds replicas only when the database can handle more connections, reducing lock contention and improving overall performance.

Exam trap

Google Cloud often tests the misconception that CPU utilization is always the correct metric for scaling, but in this scenario, the bottleneck is the database, so candidates must recognize that custom metrics (like Spanner latency) are needed to scale the application appropriately.

How to eliminate wrong answers

Option A is wrong because scaling the GKE cluster to use larger node instances increases compute resources, but the bottleneck is the database (Cloud Spanner), not CPU or memory; larger nodes would not reduce Spanner query latency or lock conflicts. Option B is wrong because increasing the CPU request limit for pods does not address the database bottleneck; it would allow pods to consume more CPU, but CPU utilization is already below 60%, so this change would not improve Spanner performance and could waste resources. Option C is wrong because reducing the number of pods would decrease the load on Spanner, but it would also reduce the application's ability to handle customer traffic, potentially causing service degradation; the goal is to scale based on the actual bottleneck, not to arbitrarily reduce capacity.

88
MCQ · easy

A company has a stateful application running on Compute Engine. They want to scale horizontally while preserving state. Which configuration should they use?

A. Use Cloud Run with volumes.
B. Unmanaged instance group.
C. Managed instance group with stateful configuration.
D. Managed instance group with autoscaling and no stateful configuration.
Answer: C

Stateful MIGs preserve instance names, disks, and metadata, allowing horizontal scaling while maintaining state.

Why this answer

Option C is correct because a managed instance group (MIG) with stateful configuration preserves instance-specific state (such as disks, hostnames, and metadata) across autohealing and rolling updates. This allows the stateful application to scale horizontally while maintaining its persistent data, as each instance retains its unique state even when the group is resized or instances are recreated.

Exam trap

The trap here is that candidates often assume all managed instance groups automatically preserve state, but without explicit stateful configuration, MIGs treat instances as ephemeral and will delete persistent disks on instance deletion or during rolling updates.

How to eliminate wrong answers

Option A is wrong because Cloud Run is a serverless platform designed for stateless containers; while it supports volume mounts (for example, in-memory, Cloud Storage, or NFS volumes), it does not preserve per-instance state such as instance names or attached disks across scaling events or container restarts, and the application in question runs on Compute Engine. Option B is wrong because an unmanaged instance group does not provide autohealing, autoscaling, or stateful configuration; it requires manual management and cannot automatically preserve state during horizontal scaling. Option D is wrong because a managed instance group with autoscaling and no stateful configuration treats all instances as stateless; when instances are terminated or recreated, any local state (e.g., data on persistent disks) is lost, making it unsuitable for stateful applications.

89
MCQ · easy

A company wants to optimize Cloud Storage costs for a bucket containing 100 TB of access logs. The logs from the last 7 days are frequently analyzed; logs from 8–90 days are occasionally reviewed; logs older than 90 days are archived for compliance but rarely accessed. What is the most cost-effective storage class configuration?

A. Store all 100 TB in Standard storage for consistent access performance.
B. Configure lifecycle rules: Standard (0-7 days) → Nearline (8-90 days) → Archive (90+ days).
C. Delete all logs older than 7 days to minimize storage costs.
D. Store all logs in Archive storage since most are rarely accessed.
Answer: B

Lifecycle Management automatically transitions objects between storage classes as they age. Standard for active logs, Nearline for occasional review, Archive for compliance retention — each class priced for its access pattern.

Why this answer

Option B is correct because it aligns the storage class with the access patterns of the logs: Standard for frequently accessed recent data, Nearline for occasional access, and Archive for rarely accessed compliance data. This minimizes costs by using cheaper storage for older data while maintaining performance for active analysis. Lifecycle rules automate the transition, ensuring no manual intervention is needed.
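A lifecycle configuration for the tiering described above might look like the following sketch, which builds the rules as a Python dictionary and prints them as JSON. The structure follows the general shape of Cloud Storage lifecycle rules (`SetStorageClass` actions with `age` conditions); the helper name `build_lifecycle_config` and the exact day thresholds are assumptions for illustration and should be checked against the bucket's real requirements.

```python
import json


def build_lifecycle_config(nearline_after_days: int = 7,
                           archive_after_days: int = 90) -> dict:
    """Lifecycle rules that move aging objects from Standard to Nearline to Archive."""
    return {
        "rule": [
            {
                # Recent logs stay in Standard until they pass the first threshold.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days,
                              "matchesStorageClass": ["STANDARD"]},
            },
            {
                # Compliance-only logs move to the cheapest class.
                "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
                "condition": {"age": archive_after_days,
                              "matchesStorageClass": ["NEARLINE"]},
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_lifecycle_config(), indent=2))
```

Once rules like these are attached to the bucket, objects change class as they age without any scheduled job or manual step.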

Exam trap

Google Cloud often tests the misconception that Archive storage is always the cheapest option, ignoring the retrieval costs and latency for frequently accessed data, leading candidates to choose Option D.

How to eliminate wrong answers

Option A is wrong because storing all 100 TB in Standard storage is unnecessarily expensive for logs older than 7 days that are rarely accessed. Option C is wrong because deleting logs older than 7 days violates compliance requirements and loses data that may be needed for audits or occasional review. Option D is wrong because storing all logs in Archive storage would cause high retrieval costs and latency for the frequently accessed last 7 days of logs, making it impractical for active analysis.

90
MCQ · easy

A company's cloud environment has grown rapidly and the team is struggling to understand what cloud resources exist across dozens of projects. Which Google Cloud product provides a unified inventory of all cloud assets across an organization's projects and folders?

A. Cloud Billing console, which lists all resources that have incurred charges
B. Cloud Asset Inventory, which provides a searchable, unified inventory of all resources and IAM policies across an organization's projects and folders
C. Google Cloud Console project dashboard, which shows resources within a single project
D. Security Command Center, which lists security vulnerabilities in cloud resources
Answer: B

Cloud Asset Inventory is the correct service. It maintains a complete, searchable catalog of all resources (and their configurations) across the entire organization, supports historical queries, and integrates with policy analysis tools. This is the purpose-built service for organizational resource visibility.

Why this answer

Cloud Asset Inventory is the correct answer because it is the Google Cloud service specifically designed to provide a unified, searchable inventory of all cloud assets (resources and IAM policies) across an organization's projects, folders, and organization nodes. It supports real-time and historical snapshots, enabling teams to discover and track resources as the environment scales. This directly addresses the need to understand what resources exist across dozens of projects.

Exam trap

The exam often tests the distinction between a unified inventory service (Cloud Asset Inventory) and a security-focused tool (Security Command Center), leading candidates to mistakenly choose the latter because they associate 'inventory' with security asset management.

How to eliminate wrong answers

Option A is wrong because the Cloud Billing console only lists resources that have incurred charges, not a comprehensive inventory of all assets (including free-tier or non-billable resources), and it does not provide a unified view across projects and folders. Option C is wrong because the Google Cloud Console project dashboard shows resources only within a single project, not across dozens of projects and folders as required. Option D is wrong because Security Command Center focuses on security vulnerabilities and threats, not on providing a unified inventory of all cloud assets.

91
MCQhard

A company's application is composed of 15 microservices. When a performance issue occurs, the team struggles to determine which service is causing latency since request traces span multiple services. Which Google Cloud service helps identify which specific service in a microservices chain is causing slowdowns?

A.Cloud Logging — search logs for error messages across all 15 services.
B.Cloud Trace — captures distributed request traces showing end-to-end latency across all microservices.
C.Cloud Monitoring dashboards — create per-service CPU utilization graphs.
D.Security Command Center — scan for misconfigurations causing performance issues.
AnswerB

Cloud Trace shows the complete request journey: which service was called, in what order, and how long each call took. The Gantt-chart view immediately reveals the latency culprit service.

Why this answer

Cloud Trace is designed specifically for distributed tracing in microservices architectures. It captures end-to-end latency data for each request as it traverses multiple services, allowing you to pinpoint which service in the chain is introducing the most delay. This directly addresses the problem of identifying the specific service causing slowdowns in a 15-service application.

Exam trap

The trap here is that candidates confuse Cloud Logging (which shows error messages) with Cloud Trace (which shows latency timing), or assume CPU utilization graphs (Cloud Monitoring) can pinpoint request-level slowdowns, when only distributed tracing can reveal the exact service in the chain causing the delay.

How to eliminate wrong answers

Option A is wrong because Cloud Logging is for aggregating and searching log entries, not for tracing request latency across services; it cannot show the per-service timing breakdown needed to identify the slowest service. Option C is wrong because Cloud Monitoring dashboards showing per-service CPU utilization can indicate resource pressure but do not trace individual requests across services, so they cannot reveal which service in a specific request chain is causing latency. Option D is wrong because Security Command Center focuses on security misconfigurations and vulnerabilities, not on performance latency or distributed tracing.
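
To make the per-service timing breakdown concrete, here is a minimal sketch of how span data from a single trace could be reduced to "which service spent the most time itself". The span fields (`id`, `parent_id`, `service`, `duration_ms`) and the sample services are hypothetical and do not reflect Cloud Trace's actual data model. The sketch also assumes child spans run sequentially, which real traces do not guarantee.

```python
def self_time_by_service(spans):
    """Sum each service's own time: span duration minus time spent in child spans."""
    child_time = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_time[parent] = child_time.get(parent, 0) + span["duration_ms"]

    totals = {}
    for span in spans:
        own = span["duration_ms"] - child_time.get(span["id"], 0)
        totals[span["service"]] = totals.get(span["service"], 0) + own
    return totals


def slowest_service(spans):
    """Return the service with the largest self time in the trace."""
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


# Hypothetical trace: frontend calls cart and pricing; pricing calls inventory-db.
trace = [
    {"id": "a", "parent_id": None, "service": "frontend", "duration_ms": 500},
    {"id": "b", "parent_id": "a", "service": "cart", "duration_ms": 60},
    {"id": "c", "parent_id": "a", "service": "pricing", "duration_ms": 400},
    {"id": "d", "parent_id": "c", "service": "inventory-db", "duration_ms": 350},
]
print(slowest_service(trace))  # inventory-db
```

Logs and CPU graphs have no parent/child relationship between calls, so they cannot produce this per-request attribution.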

92
MCQmedium

A company uses Google Cloud and wants to understand their monthly cloud spend before the invoice arrives, track spending trends, and identify the top cost drivers across all services. Which built-in Google Cloud tool provides this visibility?

A.Cloud Monitoring dashboards with cost metrics.
B.Cloud Billing reports and cost breakdown in the Billing console.
C.Cloud Asset Inventory — it lists all resources and their costs.
D.Google Cloud pricing calculator — it shows estimated costs.
AnswerB

The Cloud Billing console provides pre-built reports: cost by service, cost by project, cost over time, and spend forecasts. Billing export to BigQuery enables deeper custom analysis.

Why this answer

Cloud Billing reports and cost breakdown in the Billing console provide built-in, out-of-the-box visibility into monthly spend before the invoice arrives, spending trends, and top cost drivers across all services. This tool aggregates billing data from all projects and services, allowing you to filter by time range, project, service, or SKU, and view cost trends and breakdowns without additional configuration.

Exam trap

Google Cloud often tests the misconception that Cloud Monitoring can natively show cost metrics, but in reality, cost metrics require billing export to BigQuery and custom dashboard setup, whereas Cloud Billing reports provide this visibility immediately without additional configuration.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring dashboards with cost metrics require you to export billing data to BigQuery and then create custom dashboards, which is not a built-in, out-of-the-box solution for immediate cost visibility. Option C is wrong because Cloud Asset Inventory lists all resources and their metadata, but it does not provide cost data or spending trends; it is designed for asset discovery and governance, not cost analysis. Option D is wrong because the Google Cloud pricing calculator is a planning tool used to estimate costs before deployment, not a tool for viewing actual incurred spend or tracking trends.

93
MCQhard

A platform engineering team is designing a self-service cloud environment for development teams. They want developers to be able to provision approved cloud resources quickly without waiting for central IT approval for every request, while still ensuring compliance with security and cost policies. Which architectural approach best balances developer agility with governance?

A.Require all resource provisioning requests to be submitted as tickets to the central IT team for manual review and approval before any resources are created
B.Give all developers Owner access to all Google Cloud projects so they can provision any resources without delays
C.Provide a self-service catalog of pre-approved, policy-compliant infrastructure templates with automated provisioning, budget alerts, and org policy guardrails — enabling developer agility while enforcing compliance automatically
D.Allow developers to provision resources freely in a shared sandbox project only, keeping production entirely controlled by central IT
AnswerC

This is the platform engineering approach: build the rails, not the roads. Pre-approved templates (Terraform modules, Config Connector blueprints) let developers self-serve within defined boundaries. Org policies prevent non-compliant configurations. Budget alerts enforce cost controls. Developers move fast; governance is automated, not manual.

Why this answer

Option C is correct because it uses a self-service catalog with pre-approved, policy-compliant templates (e.g., Deployment Manager or Terraform configurations) combined with Organization Policy Service guardrails and automated budget alerts. This approach allows developers to provision resources on demand while enforcing security and cost policies automatically, balancing agility with governance without manual bottlenecks.

Exam trap

Google Cloud often tests the misconception that giving developers full access (Option B) or restricting them to a sandbox (Option D) are acceptable trade-offs, when in fact the correct answer requires a policy-as-code approach that enforces guardrails automatically without manual intervention.

How to eliminate wrong answers

Option A is wrong because requiring manual ticket-based approval for every request creates a central IT bottleneck that destroys developer agility, contradicting the goal of self-service provisioning. Option B is wrong because giving all developers Owner access to all projects violates the principle of least privilege, bypasses all governance controls, and creates severe security and compliance risks. Option D is wrong because restricting developers to a shared sandbox project only does not address their need to provision approved resources in production-like environments; it still forces central IT control for production, failing to balance agility with governance across the full lifecycle.

94
MCQeasy

A company wants to automatically scale their Compute Engine managed instance group based on the number of requests per second. Which metric should they use?

A.CPU utilization
B.HTTP load balancing serving capacity
C.Instance group size
D.custom metric from Cloud Monitoring
AnswerD

Custom metrics allow you to export application-level request rates.

Why this answer

Option D is correct because the company needs to scale based on requests per second, which is a custom application-level metric. Cloud Monitoring allows you to create custom metrics from your application, and managed instance groups can use these custom metrics for autoscaling, enabling precise scaling based on actual request throughput rather than proxy indicators.

Exam trap

The trap here is that candidates often confuse 'HTTP load balancing serving capacity' with request rate, but that metric measures the load balancer's backend capacity utilization (a ratio), not the raw number of requests per second, which requires a custom application metric.

How to eliminate wrong answers

Option A is wrong because CPU utilization is a system-level metric that does not directly correlate with requests per second; an instance could be CPU-bound for other reasons, leading to inaccurate scaling. Option B is wrong because HTTP load balancing serving capacity is a metric related to the load balancer's capacity, not the number of requests per second hitting the application; it measures backend capacity utilization, not request rate. Option C is wrong because instance group size is a static count of instances, not a metric that drives scaling decisions; using it as a scaling metric would create a circular dependency.
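
As a rough illustration of scaling on request rate, the sketch below estimates how many instances a group would need so that each instance stays near a per-instance target. The function name, the target of 100 requests per second, and the minimum and maximum bounds are assumptions for illustration. This is not the autoscaler's actual algorithm.

```python
import math


def recommended_instances(total_rps, target_rps_per_instance,
                          min_instances=1, max_instances=10):
    """Estimate the instance count that keeps each instance near its target rate."""
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    needed = math.ceil(total_rps / target_rps_per_instance)
    # Clamp to the group's configured bounds (hypothetical values here).
    return max(min_instances, min(max_instances, needed))


print(recommended_instances(450, 100))  # 5
```

The same request rate could leave CPU low or high depending on the workload, which is why a request-based metric tracks this requirement more directly than CPU utilization.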

95
MCQhard

A cloud operations engineer notices that the managed instance group 'my-mig' has been scaling up frequently, but the application performance is still degraded. The CPU utilization metric shows high values. What is most likely the issue?

A.The target size is set to 10, which is lower than the current needed capacity.
B.The instance group is using preemptible VMs which are being reclaimed frequently.
C.The autoscaler is using a cooldown period that is too long, preventing it from scaling down.
D.The scaling metric is not appropriate; consider using a custom metric that better reflects application load.
AnswerD

CPU utilization is not always the best indicator; a custom metric like request latency or queue depth might be better.

Why this answer

Option D is correct because the autoscaler is using CPU utilization as the scaling metric, but high CPU does not necessarily correlate with application performance degradation. If the application is bottlenecked on memory, I/O, or request queuing, CPU may remain high while throughput suffers. A custom metric (e.g., requests per second, latency, or queue depth) would better reflect actual application load and enable more accurate scaling decisions.

Exam trap

The trap here is that candidates assume high CPU utilization always means the application needs more compute capacity, but the question tests the understanding that the scaling metric must be aligned with the actual performance bottleneck, not just a generic system metric.

How to eliminate wrong answers

Option A is wrong because the target size being lower than needed capacity would prevent scaling up sufficiently, but the question states the instance group is scaling up frequently, so the autoscaler is actively adding instances; the issue is that scaling up is not fixing the performance problem. Option B is wrong because preemptible VMs being reclaimed would cause instance churn and potential performance degradation, but the question does not mention preemptible VMs, and the symptom of frequent scaling up with high CPU is not directly caused by preemption. Option C is wrong because a cooldown period that is too long would delay scaling down, not prevent scaling up; the issue here is that scaling up is happening but not resolving the degradation, so the cooldown period is not the root cause.

96
MCQmedium

Refer to the exhibit. A DevOps engineer wants to create a chart showing the rate of items sold per second over time. What is a limitation of this metric for that purpose?

A.The metric kind is GAUGE, so it cannot be used to calculate rate
B.The interval should include a startTime
C.The metric has no labels to filter
D.The value should be DOUBLE instead of INT64
AnswerA

GAUGE metrics are snapshots; rate requires DELTA or CUMULATIVE.

Why this answer

Option A is correct because a GAUGE metric represents a point-in-time value (e.g., the current number of items), not a count accumulated over an interval. To calculate a rate (items per second), you need a DELTA metric (the change over each interval) or a CUMULATIVE metric (a monotonically increasing counter), so that Cloud Monitoring can divide the change in value by the elapsed time. A GAUGE carries no such interval semantics, so it cannot be used to derive a meaningful rate of change.

Exam trap

Google Cloud often tests the misconception that any numeric metric can be used to compute a rate, when in fact rate calculations in Cloud Monitoring apply to DELTA and CUMULATIVE metrics, not to GAUGE snapshots.

How to eliminate wrong answers

Option B is wrong because including a startTime in the interval is not a limitation of the metric itself; it is a standard parameter for time-series queries and does not prevent rate calculation. Option C is wrong because the absence of labels does not prevent rate calculation; labels are for filtering and aggregation, not for the fundamental ability to compute a rate. Option D is wrong because the data type (INT64 vs DOUBLE) does not affect the ability to calculate a rate; Cloud Monitoring can compute rates on integer values, and the limitation is the metric kind (GAUGE rather than DELTA or CUMULATIVE), not the value type.
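
The difference in metric kind can be seen in a small sketch. Given samples of a monotonically increasing counter, a per-second rate is the change in value divided by the elapsed time. The sample data and function name below are hypothetical, and this is plain arithmetic rather than a Cloud Monitoring API call.

```python
def rate_per_second(samples):
    """Compute per-second rates between consecutive (timestamp_s, counter) samples.

    Assumes a monotonically increasing counter (CUMULATIVE semantics).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Hypothetical cumulative "items sold" counter sampled every 60 seconds.
cumulative = [(0, 0), (60, 120), (120, 300)]
print(rate_per_second(cumulative))  # [2.0, 3.0]
```

Applying the same subtraction to gauge samples, such as the current number of items in stock, would only give the net change between two snapshots, not the number of items sold in between.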

97
MCQmedium

A company's cloud operations team is implementing a tagging strategy for cost allocation. They want to ensure that the 'cost-center' label is present on every Compute Engine VM and Cloud Storage bucket created in their Google Cloud organization. Currently, some resources are created without this label. Which combination of controls best enforces and remediates this requirement?

A.Organization Policy custom constraint to prevent creation of resources without the 'cost-center' label (preventive), plus Cloud Asset Inventory to identify existing unlabeled resources for remediation (detective)
B.Only organization policy — once new resources are blocked, existing unlabeled resources don't matter
C.Only Cloud Asset Inventory monitoring — alerting on unlabeled resources is sufficient without preventing their creation
D.Grant all engineers the 'Labels Admin' role to encourage them to add labels voluntarily
AnswerA

This is the complete two-layer approach: prevention (org policy blocks future non-compliant resources at creation time) and detection/remediation (Cloud Asset Inventory finds existing unlabeled resources so they can be labeled retroactively). Together they address both the future and existing state.

Why this answer

A preventive control (org policy custom constraint requiring the label) stops future non-compliant resources. A detective/corrective control (Cloud Asset Inventory + Cloud Functions or Security Command Center) finds and remediates existing unlabeled resources. Both are needed for comprehensive enforcement.
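
The detective half of this approach amounts to filtering an inventory for resources that lack the label. The sketch below shows that check on a hypothetical, simplified list of resources. Real Cloud Asset Inventory exports use a richer schema, and the resource names here are invented for illustration.

```python
REQUIRED_LABEL = "cost-center"


def find_unlabeled(resources, required_label=REQUIRED_LABEL):
    """Return names of resources that have no non-empty value for the required label."""
    return [
        resource["name"]
        for resource in resources
        if not resource.get("labels", {}).get(required_label)
    ]


# Hypothetical inventory export, simplified for illustration.
inventory = [
    {"name": "vm-web-1", "labels": {"cost-center": "cc-100"}},
    {"name": "vm-batch-7", "labels": {"env": "dev"}},
    {"name": "bucket-logs"},
]
print(find_unlabeled(inventory))  # ['vm-batch-7', 'bucket-logs']
```

The resulting list is what a remediation step would work through, while the organization policy constraint keeps new entries from being added to it.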

98
MCQmedium

After a major production outage, the engineering team conducts a review of what happened, why it happened, and how to prevent it in the future. This document is shared with all engineering teams. What is this practice called, and why does Google's SRE culture emphasize it?

A.Performance review — identifying which engineers caused the outage for disciplinary action.
B.Blameless postmortem — documenting the incident, root causes, and preventive actions to drive systemic learning without individual blame.
C.Capacity planning review — ensuring enough servers are provisioned to prevent future outages.
D.Change advisory board (CAB) review — approving that the outage fix is safe to deploy.
AnswerB

Blameless postmortems build organizational knowledge from failures. By avoiding blame, teams can honestly analyze contributing factors, including cultural and process issues, to make permanent improvements.

Why this answer

Option B is correct because a blameless postmortem is a core SRE practice that focuses on documenting incidents, root causes, and preventive actions without assigning individual blame. Google's SRE culture emphasizes this to foster psychological safety, enabling teams to openly share failures and drive systemic improvements, which is essential for maintaining high reliability in large-scale distributed systems.

Exam trap

The trap here is that candidates may confuse a blameless postmortem with a performance review or a change management process, failing to recognize that the key differentiator is the absence of blame and the focus on systemic learning rather than individual accountability.

How to eliminate wrong answers

Option A is wrong because a performance review is an HR process for evaluating employee contributions, not a post-incident analysis; blaming individuals contradicts the blameless culture that encourages honest incident reporting. Option C is wrong because capacity planning review is a proactive process to ensure sufficient resources (e.g., servers, network bandwidth) are provisioned to meet demand, not a reactive review of a specific outage's causes and fixes. Option D is wrong because a change advisory board (CAB) review is an ITIL process for approving changes before deployment, not a retrospective analysis of an incident that has already occurred.

99
MCQhard

A company wants to implement SLOs for their API service. They need to measure the proportion of successful requests over a 30-day window. Which metric should they use?

A.availability (uptime)
B.latency at 99th percentile
C.requests/success
D.SLI = good events / total events
AnswerD

SLI directly measures the proportion of successful requests.

Why this answer

Option D is correct because an SLI (Service Level Indicator) is defined as the ratio of good events to total events, which directly measures the proportion of successful requests over a 30-day window. This aligns with the requirement to track request success rate, not just system uptime. In Google Cloud operations, SLOs are built on SLIs that count discrete events like HTTP 200 responses versus all requests.

Exam trap

The trap here is that candidates confuse availability (uptime) with request success rate, not realizing that a service can be 'up' 100% of the time yet fail a large proportion of requests due to application errors.

How to eliminate wrong answers

Option A is wrong because availability (uptime) measures the percentage of time the service is reachable, not the proportion of individual request successes; a service can be up but still return errors for many requests. Option B is wrong because latency at the 99th percentile measures response time distribution, not success rate; it addresses performance, not correctness or error rate. Option C is wrong because requests/success is an inverted ratio that would decrease as success increases, and it is not a standard SLI formula; the correct SLI is good events divided by total events.
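
The ratio in option D can be written out directly. The sketch below computes the SLI for a window and compares it with an SLO target. The event counts and the 99.9% target are hypothetical values chosen for illustration.

```python
def sli(good_events, total_events):
    """Return good events / total events; an empty window is treated as fully good."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def meets_slo(good_events, total_events, slo_target):
    """Check whether the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo_target


# Hypothetical 30-day window: 1,000,000 requests, 999,500 of them successful.
print(sli(999_500, 1_000_000))               # 0.9995
print(meets_slo(999_500, 1_000_000, 0.999))  # True
```

Treating an empty window as fully good is a design choice in this sketch. A real SLO implementation might instead report such a window as having no data.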

100
MCQmedium

A cloud team wants to automatically enforce that all new Compute Engine VMs are created with a specific label (environment: production) and that no VMs are created with external IP addresses in the production project. Which Google Cloud capability enforces these organizational policies at resource creation time?

A.Cloud Monitoring alerting policies that detect and notify when non-compliant VMs are created
B.Organization Policy Service constraints that enforce no external IPs and required labels at resource creation time, blocking non-compliant VMs before they are created
C.Cloud IAM roles that prevent developers from creating VMs without the proper labels
D.Cloud Billing budget alerts that detect when VM spending exceeds expected amounts for labeled resources
AnswerB

Organization Policy Service is the correct answer. The 'compute.vmExternalIpAccess' constraint prevents external IP assignment at creation. Custom org policy constraints can enforce required labels. Both are evaluated before resource creation — if the policy would be violated, the API call is rejected.

Why this answer

Organization Policy Service constraints, specifically the `compute.vmExternalIpAccess` constraint for external IPs and custom constraints for required labels, are evaluated at resource creation time. They reject the API call for a non-compliant VM before the resource is created, enforcing policies like 'no external IPs' and 'required labels' without relying on post-creation detection or IAM permissions.

Exam trap

Google Cloud often tests the distinction between reactive monitoring (Cloud Monitoring alerts) and proactive enforcement (Organization Policy Service), leading candidates to pick the monitoring option because they confuse detection with prevention.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring alerting policies are reactive, not preventive; they detect non-compliant VMs after creation but do not block them. Option C is wrong because Cloud IAM roles control who can create VMs but cannot enforce specific label values or external IP restrictions at resource creation time; IAM lacks the granularity to validate resource configuration. Option D is wrong because Cloud Billing budget alerts monitor spending, not resource compliance; they cannot prevent VM creation or enforce labels or external IP policies.

101
MCQhard

An operations team tracks the following metrics for their customer portal: request latency p99, error rate, and requests per second. In Site Reliability Engineering terminology, what are these metrics called, and what do they collectively define?

A.Key Performance Indicators (KPIs) that define the overall health of the business
B.Service Level Agreements (SLAs), defining the contractual commitments made to customers
C.Service Level Indicators (SLIs), which measure specific dimensions of service behavior from the user's perspective and collectively define how reliability is quantified
D.Operational metrics that are only relevant to the infrastructure team and not to business stakeholders
AnswerC

SLIs are the specific measurable quantities that capture how users experience the service. Latency (is it fast enough?), error rate (is it working?), and throughput (is it keeping up?) are the canonical SLI types. Together they provide a quantitative picture of reliability that can be used to set SLO targets.

Why this answer

In Site Reliability Engineering (SRE), the metrics p99 latency, error rate, and requests per second are classified as Service Level Indicators (SLIs). SLIs are carefully chosen quantitative measures of specific aspects of the service's behavior, such as availability, latency, or throughput, as experienced by the end user. Collectively, these SLIs define how reliability is quantified and are used to set and monitor Service Level Objectives (SLOs).

Exam trap

The trap here is that candidates confuse SLIs with SLAs or KPIs, not realizing that SLIs are the raw measurements that feed into SLOs, which then underpin SLAs, and that they are specifically defined from the user's perspective to quantify reliability.

How to eliminate wrong answers

Option A is wrong because while these metrics can be part of business KPIs, the SRE terminology specifically calls them Service Level Indicators (SLIs), not generic KPIs, and they define reliability quantification, not overall business health. Option B is wrong because SLAs are contractual commitments based on SLOs, which are in turn derived from SLIs; the metrics themselves are not the agreements. Option D is wrong because SLIs are explicitly defined from the user's perspective and are critical for business stakeholders to understand service reliability, not just for the infrastructure team.
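
As a small worked example, the three SLIs named in the question can be derived from the same set of request records. The record format (latency in milliseconds plus HTTP status), the nearest-rank percentile method, and the rule that any 5xx status counts as an error are all assumptions made for this sketch.

```python
def summarize(requests, window_seconds):
    """Compute p99 latency, error rate, and requests per second for one window.

    requests: list of (latency_ms, status_code) tuples.
    """
    if not requests:
        raise ValueError("requests must not be empty")
    count = len(requests)
    latencies = sorted(latency for latency, _ in requests)
    # Nearest-rank p99 using integer arithmetic: rank = ceil(0.99 * count).
    rank = (99 * count + 99) // 100
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "p99_latency_ms": latencies[rank - 1],
        "error_rate": errors / count,
        "requests_per_second": count / window_seconds,
    }


# Hypothetical window: 100 requests over 50 seconds, two of which returned 500.
sample = [(i, 500 if i % 50 == 0 else 200) for i in range(1, 101)]
print(summarize(sample, 50))
# {'p99_latency_ms': 99, 'error_rate': 0.02, 'requests_per_second': 2.0}
```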

102
MCQeasy

A company's production database is running on a Compute Engine VM with a 500 GB Persistent Disk. The operations team wants to create a backup they can restore from in case of data corruption or accidental deletion. Which Google Cloud capability provides point-in-time backup for Persistent Disks?

A.Cloud Storage bucket replication, by continuously copying the database files to a storage bucket
B.Persistent Disk Snapshots, which capture the disk state at a point in time and enable restoration or creation of new disks from that snapshot
C.Cloud SQL automated backups, which protect databases running on Compute Engine VMs
D.VM live migration, which moves the running VM between physical hosts, automatically creating a backup in the process
AnswerB

Persistent Disk Snapshots are the correct mechanism. They capture a consistent point-in-time image of the disk (application-consistent when taken with a snapshot agent or after flushing I/O). Snapshots are stored in Cloud Storage, are incremental after the first snapshot, and can be used to create a new disk or restore data.

Why this answer

Persistent Disk Snapshots are the correct Google Cloud feature for creating point-in-time backups of Persistent Disks. They capture the disk's data and configuration at a specific moment, allowing you to restore the disk or create new disks from that snapshot. This is the native, recommended method for backup and disaster recovery of Compute Engine VM disks.

Exam trap

The trap here is that candidates confuse Cloud SQL backups (which are for managed databases) with the need to back up a database running on a Compute Engine VM, leading them to select option C instead of the correct Persistent Disk Snapshots.

How to eliminate wrong answers

Option A is wrong because Cloud Storage bucket replication is a feature for objects in buckets, not for Persistent Disks; continuously copying database files to a bucket would require custom scripting and does not provide crash-consistent point-in-time backups of the entire disk. Option C is wrong because Cloud SQL automated backups protect Cloud SQL managed databases, not databases running on Compute Engine VMs; Cloud SQL is a separate managed service, not a feature for Compute Engine disks. Option D is wrong because VM live migration moves a running VM between physical hosts for maintenance without downtime, but it does not create a backup or capture a point-in-time state of the disk.

103
MCQmedium

A company wants to set up automated checks that continuously verify their website's homepage, login page, and API endpoints are accessible from multiple global locations. If any endpoint becomes unreachable for more than 2 minutes, the on-call engineer should be alerted. Which Cloud Monitoring feature provides this?

A.Cloud Logging log-based alerts that detect 5xx errors in application logs.
B.Cloud Monitoring uptime checks that probe endpoints from global locations with alerting on failure.
C.Cloud Trace that records response times for each user request.
D.Custom scripts on Compute Engine VMs that ping endpoints every minute.
AnswerB

Uptime checks send probe requests from multiple global PoPs at configurable intervals. Failures across multiple locations trigger alerting policies — the managed solution for external availability monitoring.

Why this answer

Cloud Monitoring uptime checks are specifically designed to probe HTTP, HTTPS, or TCP endpoints from multiple global locations at configurable intervals (e.g., every 1 minute). They can trigger alerting policies when a check fails for a specified duration (e.g., 2 minutes), directly matching the requirement for continuous, multi-location endpoint accessibility verification with alerting on sustained failure.

Exam trap

The trap here is that candidates confuse log-based alerts (which detect errors in logs) with proactive uptime checks (which test connectivity), leading them to choose Option A because they think 5xx errors are the only way to detect unreachability, ignoring that a completely down endpoint may not generate logs at all.

How to eliminate wrong answers

Option A is wrong because Cloud Logging log-based alerts analyze log entries (e.g., 5xx errors) but do not actively probe endpoints from global locations; they react to logs already generated, not to connectivity failures that may not produce logs. Option C is wrong because Cloud Trace is a distributed tracing tool that captures latency and request paths for individual user requests, not a monitoring feature for endpoint availability from multiple locations. Option D is wrong because custom scripts on Compute Engine VMs would require manual setup, lack native multi-location probing, and do not integrate with Cloud Monitoring's alerting policies; they are an ad-hoc solution, not a managed service.
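
The "unreachable for more than 2 minutes" condition is a duration test over consecutive check results. The sketch below expresses that logic in plain Python. The 60-second interval, the boolean result list, and the rule that each failed check covers one full interval are assumptions for illustration, not how Cloud Monitoring evaluates an alerting policy internally.

```python
def should_alert(check_results, interval_s=60, threshold_s=120):
    """Decide whether the latest run of failed checks exceeds the threshold.

    check_results: list of booleans, oldest first; True means the check passed.
    Each failed check is counted as covering one full interval.
    """
    consecutive_failures = 0
    for passed in reversed(check_results):
        if passed:
            break
        consecutive_failures += 1
    return consecutive_failures * interval_s > threshold_s


print(should_alert([True, False, False, False]))  # True: 180 s of failures
print(should_alert([True, False, False]))         # False: 120 s is not more than 120 s
```

Requiring a sustained failure rather than a single failed probe avoids paging the on-call engineer for a brief network blip.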
