Knowledge + Practice

CCNA Aio Ai Infrastructure Questions

25 of 100 questions · Page 2/2 · Aio Ai Infrastructure topic · Answers revealed

76

MCQ · hard

An organization must ensure that an AI model deployed on an IoT device meets stringent latency requirements. The model is currently in FP32 and runs at 200ms per inference on the device; the target is 50ms. Which technique will provide the greatest latency reduction with the least accuracy loss?

A. Quantize the model to INT8

B. Apply weight pruning to remove 50% of parameters

C. Switch from TensorFlow Lite to Core ML

D. Distill the model into a smaller architecture

Answer: A

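INT8 quantization cuts the bit width of weights and activations from 32 to 8, which reduces memory traffic fourfold and lets the hardware use faster integer arithmetic; on devices with INT8 support this commonly yields a 2-4x latency reduction.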

Why this answer

Quantizing the model from FP32 to INT8 lowers the precision of weights and activations, which directly reduces memory bandwidth and computational load. On resource-constrained IoT devices this typically yields a 2-4x speedup, which can bring the 200ms inference time close to the 50ms target. With calibration on representative data (or quantization-aware training), the accuracy loss is usually small.
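The mapping from FP32 to INT8 can be illustrated with a minimal sketch of symmetric per-tensor quantization. The weight values and the helper names quantize_int8 and dequantize are hypothetical and chosen for illustration; real toolchains such as TensorFlow Lite apply calibration and per-channel scales that are not shown here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.40, -0.63]  # hypothetical FP32 weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q_weights)   # [82, -127, 5, 40, -63]
print(max_error <= scale / 2 + 1e-9)  # True
```

Each weight now occupies one byte instead of four, and the worst-case rounding error is half of one quantization step, which is why the accuracy impact is usually modest.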

Exam trap

Cisco often tests the misconception that any optimization technique (like pruning or framework switching) can achieve the same latency reduction as quantization, but only INT8 quantization directly addresses the computational precision bottleneck to deliver the required 4x speedup with minimal accuracy loss.

How to eliminate wrong answers

Option B is wrong because weight pruning removes parameters but does not reduce the precision of the remaining values. The model still runs in FP32, and unstructured sparsity speeds up inference only on runtimes with sparsity-aware kernels, so the latency reduction is limited (often 20-30%) and falls short of the 4x speedup needed, while aggressive pruning can cause significant accuracy loss. Option C is wrong because switching from TensorFlow Lite to Core ML is a framework change, not a model optimization: Core ML runs only on Apple platforms, and a runtime swap by itself does not reduce computational precision or model size, so any gain is marginal (10-20%) rather than the required 4x. Option D is wrong because knowledge distillation requires designing and training a new, smaller student model, which is time-consuming and does not guarantee the 50ms target; the latency reduction depends on the student model's size and hardware compatibility, and distillation often needs extensive retuning to avoid accuracy degradation.

77

MCQ · medium

A company uses Amazon SageMaker to train a large language model. The training job fails with an out-of-memory error. The team is already using the largest available GPU instance. Which step should the team take to resolve the issue without modifying the model architecture?

A. Increase the learning rate to converge faster

B. Enable gradient accumulation in the training script

C. Switch to a CPU-based instance

D. Reduce the number of attention heads in the model

Answer: B

Gradient accumulation reduces per-step memory by splitting the batch into micro-batches, allowing training on limited GPU memory.

Why this answer

Gradient accumulation simulates a large batch by running several forward/backward passes on small micro-batches and summing their gradients before a single optimizer step. Peak memory drops because only one micro-batch's activations are held in GPU memory at a time, while the effective batch size (micro-batch size times accumulation steps) stays the same. It does not shrink the memory taken by the weights, gradients, or optimizer state, so it helps when activation memory is the bottleneck. Since the team is already on the largest GPU instance and cannot change the model architecture, enabling gradient accumulation with a smaller micro-batch is the correct approach to resolve the out-of-memory error.
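The equivalence between one large batch and several accumulated micro-batches can be checked with a small sketch. It uses a one-parameter linear model in plain Python rather than a real training script; the data, the model y = w * x, and the helper name mse_grad are assumptions made for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical training data for a single-weight model.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w = 0.5

# Reference: gradient computed on the whole batch at once.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulation: process equal-sized micro-batches one at a time.
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size
accumulated = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    # Only this slice has to be resident in memory during the pass.
    micro_grad = mse_grad(w, xs[start:end], ys[start:end])
    # Dividing by accum_steps averages the micro-batch gradients.
    accumulated += micro_grad / accum_steps

# A single optimizer step uses the accumulated gradient.
learning_rate = 0.01
w_updated = w - learning_rate * accumulated

print(abs(full_grad - accumulated) < 1e-9)  # True
```

The update is the same as with the full batch, but each pass touches only two samples, which is the memory saving the answer relies on.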

Exam trap

Cisco often tests the misconception that memory issues can be solved by adjusting hyperparameters like learning rate or by switching to a less powerful instance, rather than recognizing that gradient accumulation is a standard technique to fit large models into limited GPU memory without altering the architecture.

How to eliminate wrong answers

Option A is wrong because increasing the learning rate does not reduce memory consumption; it only changes the step size during optimization and can lead to training instability or divergence. Option C is wrong because switching to a CPU-based instance would drastically reduce computational throughput and memory bandwidth, likely making the training infeasible for a large language model, and does not address the root cause of memory exhaustion. Option D is wrong because reducing the number of attention heads modifies the model architecture, which the question explicitly states should not be done.

78

MCQ · hard

A team has trained a large transformer model that achieves 95% accuracy but requires 8 GB of GPU memory for inference. They need to deploy it on edge devices with only 2 GB of memory and minimal accuracy loss. Which combination of techniques should they apply?

A. Use model distillation to create a smaller student model

B. Apply INT8 quantization and weight pruning only

C. Apply INT8 quantization, pruning, and model distillation

D. Use FP16 precision and increase batch size

Answer: C

Combining all three techniques can achieve the necessary memory reduction while preserving accuracy.

Why this answer

Option C is correct because the team needs to reduce the model's memory footprint from 8 GB to under 2 GB while preserving accuracy. INT8 quantization reduces weight memory by 4x (from 32-bit floats to 8-bit integers), which takes an 8 GB FP32 model to roughly 2 GB and leaves no headroom for activations on a 2 GB device. Weight pruning removes redundant connections, and model distillation trains a smaller student model to mimic the teacher; together the three techniques achieve the required compression with minimal accuracy loss.
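The arithmetic behind these figures can be sketched as follows. The sketch assumes that the whole 8 GB is FP32 weights (1 GB taken as 10^9 bytes), ignores activation memory, and uses a hypothetical 50% pruning ratio whose savings only materialize if the pruned weights are actually removed or stored sparsely.

```python
def footprint_gb(num_params, bits_per_param):
    """Weight storage in GB for a given parameter count and precision."""
    return num_params * bits_per_param / 8 / 1e9


fp32_gb = 8.0                      # figure from the question
num_params = fp32_gb * 1e9 / 4     # assumption: all 8 GB are 4-byte FP32 weights

fp16_gb = footprint_gb(num_params, 16)
int8_gb = footprint_gb(num_params, 8)

# Hypothetical: pruning removes half of the parameters before quantization.
int8_pruned_gb = footprint_gb(num_params * 0.5, 8)

print(fp16_gb)         # 4.0 -> still above the 2 GB limit
print(int8_gb)         # 2.0 -> at the limit, no room for activations
print(int8_pruned_gb)  # 1.0 -> fits with headroom
```

Under these assumptions quantization alone only reaches the limit, which is why the answer combines it with techniques that also reduce the parameter count.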

Exam trap

Cisco often tests the misconception that a single technique (like quantization alone) is sufficient for extreme memory reduction, when in reality the combination of distillation, quantization, and pruning is required to meet aggressive edge deployment constraints without unacceptable accuracy loss.

How to eliminate wrong answers

Option A is wrong because model distillation alone reduces model size but typically yields a student model that still requires more than 2 GB if the original is 8 GB; without quantization or pruning, the memory reduction is insufficient for the 2 GB target. Option B is wrong because applying only INT8 quantization and weight pruning can reduce memory but often causes significant accuracy degradation on complex transformer models without the knowledge transfer provided by distillation; the combination of all three techniques is needed to balance compression and accuracy. Option D is wrong because using FP16 precision only halves memory (to ~4 GB), which still exceeds the 2 GB limit, and increasing batch size increases memory usage, making the problem worse.

79

MCQ · medium

An organization uses Azure Machine Learning to manage the ML lifecycle. They want to automatically retrain a model when new data arrives in Azure Blob Storage. Which Azure service should they integrate with Azure ML to trigger retraining?

A. Azure Event Grid

B. Azure Logic Apps

C. Azure Data Factory

D. Azure Functions

Answer: A

Event Grid provides reliable event delivery from Blob Storage to Azure ML, triggering retraining pipelines.

Why this answer

Azure Event Grid is the correct service because it provides native event-driven routing: Blob Storage publishes events such as BlobCreated (Microsoft.Storage.BlobCreated) to Event Grid, and an event subscription delivers each event to a handler, such as a webhook, that starts the Azure Machine Learning retraining pipeline. This allows retraining to be triggered as soon as new data lands in the storage container, without polling.

Exam trap

Cisco often tests the distinction between event-driven (Event Grid) and compute-driven (Functions, Logic Apps) services, and the trap here is that candidates confuse Azure Functions' ability to run code on events with the native, lower-latency integration that Event Grid provides for Azure ML retraining triggers.

How to eliminate wrong answers

Option B (Azure Logic Apps) is wrong because while Logic Apps can trigger on Blob Storage events, they are designed for workflow orchestration and integration; for a simple retraining trigger they add overhead and cost on top of the event routing that Event Grid already provides. Option C (Azure Data Factory) is wrong because it is a data integration and ETL service rather than an event-routing service; its storage event triggers are themselves built on Event Grid, and its schedule and tumbling-window triggers do not react to new data as it arrives. Option D (Azure Functions) is wrong because although Functions can respond to Blob events, they are a general-purpose compute service; using Functions means writing and maintaining custom code to call the ML pipeline, whereas Event Grid is the service that detects and routes the storage event.

80

MCQ · medium

A company wants to build an AI pipeline that processes streaming data from IoT sensors, performs feature engineering, trains a model incrementally, and deploys the updated model. Which data pipeline technology is BEST suited for the streaming ingestion step?

A. Amazon S3

B. Apache Spark

C. Apache Airflow

D. Apache Kafka

Answer: D

Kafka is purpose-built for ingesting and storing high-volume streaming data with low latency.

Why this answer

Apache Kafka is the best choice for the streaming ingestion step because it is a distributed event streaming platform designed for high-throughput, fault-tolerant ingestion of real-time data streams. It acts as a durable message broker that can ingest IoT sensor data in real time and make it available for downstream processing, which aligns perfectly with the requirement for streaming data ingestion.

Exam trap

Cisco often tests the distinction between data ingestion (Kafka), data processing (Spark), and data storage (S3), so the trap here is confusing Apache Spark's streaming capability with a dedicated ingestion tool, leading candidates to choose Spark instead of Kafka.

How to eliminate wrong answers

Option A is wrong because Amazon S3 is an object storage service designed for batch storage of static files, not for real-time streaming ingestion; it lacks the low-latency publish-subscribe mechanism needed for streaming data. Option B is wrong because Apache Spark is a distributed processing engine that can handle streaming data via Spark Streaming, but it is not a data ingestion technology—it consumes data from sources like Kafka rather than ingesting it directly. Option C is wrong because Apache Airflow is a workflow orchestration tool for scheduling and managing batch pipelines, not a real-time streaming ingestion platform; it cannot handle continuous, low-latency data streams.

81

Multi-Select · hard

A machine learning engineer is designing a pipeline to train a computer vision model using PyTorch on a large dataset stored in an S3 data lake. They need to preprocess images (resize, normalize) and stream them efficiently to GPUs. Which THREE components are essential in this pipeline? (Select THREE.)

Select 3 answers

A. GPU-accelerated training with CUDA

B. CPU-only inference pipeline

C. Apache Airflow to orchestrate the training job

D. PyTorch DataLoader with multi-processing for batching and shuffling

E. Distributed data parallel (DDP) training across multiple GPUs

Answers: A, D, E

GPU acceleration is essential for fast training of deep neural networks.

Why this answer

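Option A is correct because GPU-accelerated training with CUDA is essential for efficiently training computer vision models on large datasets. PyTorch uses CUDA to parallelize tensor operations on NVIDIA GPUs, which is critical for keeping training time practical when processing high-resolution images. Option D is correct because a PyTorch DataLoader with multiple worker processes performs loading, resizing, normalization, batching, and shuffling in parallel on the CPU so the GPUs are not left waiting for data. Option E is correct because distributed data parallel (DDP) training replicates the model on each GPU, feeds each replica a different shard of every batch, and synchronizes gradients, which lets the pipeline scale to a large dataset.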

Exam trap

Cisco often tests the distinction between essential pipeline components (like data loading and GPU acceleration) versus optional orchestration tools (like Airflow) that are not required for the core training loop.

82

Multi-Select · medium

A team is evaluating MLOps platforms to manage experiments, track model versions, and deploy models to production. Which THREE platforms provide end-to-end capabilities including experiment tracking and model deployment?

Select 3 answers

A. SageMaker Pipelines

B. MLflow

C. Kubeflow

D. Weights & Biases

E. Vertex AI Pipelines

Answers: A, B, E

SageMaker Pipelines is AWS's managed MLOps service that includes experiment tracking and deployment to SageMaker endpoints.

Why this answer

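SageMaker Pipelines is correct because it provides a fully managed MLOps service that integrates experiment tracking, model versioning, and automated deployment pipelines within the AWS ecosystem. It allows teams to define, visualize, and execute end-to-end workflows, including model training, evaluation, and deployment to production endpoints, all with native integration to SageMaker's experiment management and model registry. MLflow is correct because it combines experiment tracking with a model registry and model packaging for deployment. Vertex AI Pipelines is correct because it plays the same role on Google Cloud, working with Vertex AI's experiment tracking, model registry, and endpoints.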

Exam trap

Cisco often tests the distinction between a specialized tool (like Weights & Biases for experiment tracking) and a full MLOps platform that also handles deployment, leading candidates to select options that only cover part of the workflow.

83

Multi-Select · medium

A startup is building a recommendation system that requires low-latency similarity search over millions of product embeddings. They need a vector database that offers high performance and has a managed cloud option. Which TWO databases are best suited for this requirement?

Select 2 answers

A. Chroma

B. Weaviate

C. pgvector (PostgreSQL extension)

D. Amazon DynamoDB

E. Pinecone

Answers: B, E

Weaviate offers a managed cloud service with vector search.

Why this answer

Weaviate and Pinecone are both purpose-built vector databases that natively support high-performance approximate nearest neighbor (ANN) search using algorithms like HNSW (Weaviate) or proprietary indexing (Pinecone). They offer managed cloud services with automatic scaling, making them ideal for low-latency similarity search over millions of product embeddings without requiring manual infrastructure management.

Exam trap

Cisco often tests the distinction between general-purpose databases with vector extensions (like pgvector) and purpose-built vector databases (like Weaviate and Pinecone), where candidates mistakenly assume any database with vector support is suitable for production-scale low-latency workloads.

84

MCQ · medium

An AI team uses SageMaker Pipelines to orchestrate their ML workflow. They need to version the pipeline and track experiments across runs. Which complementary MLflow feature should they integrate?

A. MLflow Tracking

B. MLflow Models

C. MLflow Projects

D. MLflow Model Registry

Answer: A

MLflow Tracking logs parameters, metrics, and artifacts per run, enabling experiment comparison and reproducibility.

Why this answer

MLflow Tracking is the correct complementary feature because it provides a centralized API and UI for logging parameters, metrics, and artifacts (e.g., model checkpoints, datasets) from each SageMaker Pipeline run. This enables the team to version their pipeline executions and compare experiments across different runs, directly addressing the requirement for tracking and versioning.

Exam trap

Cisco often tests the distinction between tracking (logging run metadata) and registry (managing model versions), so the trap here is that candidates confuse MLflow Model Registry's versioning of models with the pipeline versioning and experiment tracking requirement, leading them to select D instead of A.

How to eliminate wrong answers

Option B (MLflow Models) is wrong because it focuses on packaging ML models in a standardized format (e.g., MLflow Model flavor) for deployment, not on logging run metadata or versioning pipeline executions. Option C (MLflow Projects) is wrong because it is a packaging format for code and dependencies to enable reproducible runs, not a tool for tracking experiments or pipeline versions. Option D (MLflow Model Registry) is wrong because it manages model lifecycle stages (e.g., staging, production) and versioning of registered models, not the tracking of pipeline runs or experiment parameters.

85

MCQ · easy

An AI team wants to version control datasets, track experiments, and log model parameters across multiple projects. Which MLOps platform is specifically designed for experiment tracking and model management?

A. MLflow

B. SageMaker Pipelines

C. Vertex AI Pipelines

D. Kubeflow

Answer: A

MLflow is the correct answer; it provides experiment tracking, model registry, and project packaging.

Why this answer

MLflow is an open-source MLOps platform specifically designed for experiment tracking, model management, and reproducibility. It provides a unified API to log parameters, metrics, and artifacts across multiple projects, and datasets can be logged alongside a run to record which data version produced which model, making it the correct choice for the requirements in the question.

Exam trap

Cisco often tests the distinction between general-purpose pipeline orchestration tools (like SageMaker Pipelines, Vertex AI Pipelines, and Kubeflow) and purpose-built experiment tracking platforms (like MLflow), so the trap is assuming any pipeline tool inherently includes experiment tracking and model management capabilities.

How to eliminate wrong answers

Option B (SageMaker Pipelines) is wrong because it is a fully managed CI/CD service for building, training, and deploying ML pipelines on AWS, but it is not specifically designed for experiment tracking and model management; it focuses on workflow orchestration. Option C (Vertex AI Pipelines) is wrong because it is a serverless ML pipeline service on Google Cloud that orchestrates training and deployment workflows; experiment tracking and the model registry are separate Vertex AI services rather than the purpose of the pipeline service itself. Option D (Kubeflow) is wrong because it is a Kubernetes-native platform for deploying and managing ML workflows, but its primary focus is on orchestration and portability across clusters, not on experiment tracking and model management as a core feature.

86

MCQ · easy

Which of the following is a key advantage of using ONNX (Open Neural Network Exchange) format for model deployment?

A. It automatically quantizes models to INT8

B. It enables framework interoperability for model inference

C. It compresses model size by 90%

D. It reduces training time

Answer: B

ONNX provides a standard format that can be used across different frameworks and runtimes.

Why this answer

ONNX provides a standardized, open format for representing machine learning models, enabling interoperability between different frameworks (e.g., PyTorch, TensorFlow, scikit-learn). Once a model trained in one framework has been exported to ONNX, it can be deployed for inference on a different runtime or hardware accelerator without retraining or rewriting it for the target framework, which is a key advantage in heterogeneous production environments.

Exam trap

Cisco often tests the misconception that ONNX provides built-in performance optimizations like quantization or compression, when in fact its primary value is framework interoperability, and any performance gains come from the runtime or additional tools, not the format itself.

How to eliminate wrong answers

Option A is wrong because ONNX does not automatically quantize models to INT8; quantization is a separate optimization step that can be applied to ONNX models using tools like ONNX Runtime or Intel Neural Compressor, but it is not an inherent feature of the format itself. Option C is wrong because ONNX does not inherently compress model size by 90%; the format stores the same weights as the source model, so significant compression requires techniques like pruning or quantization. Option D is wrong because ONNX is a model representation format for inference and interoperability, not a training framework; it does not reduce training time, which depends on the training framework, hardware, and algorithm used.

87

MCQ · easy

Which open-source framework is commonly used for building, training, and deploying machine learning models and provides high-level APIs like Keras?

A. TensorFlow

B. Hugging Face Transformers

C. scikit-learn

D. PyTorch

Answer: A

TensorFlow provides Keras and is widely used for production ML.

Why this answer

TensorFlow is the correct answer because it is the open-source framework that provides high-level APIs like Keras for building, training, and deploying machine learning models. Keras, now integrated as tf.keras, offers a user-friendly interface for rapid prototyping while TensorFlow handles the underlying computation graph, distributed training, and model serving via TensorFlow Serving.

Exam trap

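Cisco often tests the misconception that PyTorch is the only framework with dynamic computation graphs and high-level APIs, but the question specifically asks for the framework that provides Keras, which ships with TensorFlow as tf.keras (Keras 3 can also target other backends such as JAX and PyTorch, but TensorFlow is the framework that bundles it).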

How to eliminate wrong answers

Option B (Hugging Face Transformers) is wrong because it is a specialized library for natural language processing (NLP) models like BERT and GPT, not a general-purpose framework for building and deploying any ML model, and it does not natively include Keras as its high-level API. Option C (scikit-learn) is wrong because it is designed for traditional machine learning algorithms (e.g., decision trees, SVMs) and lacks deep learning capabilities, GPU acceleration, and a high-level API like Keras for neural networks. Option D (PyTorch) is wrong because, although it is a popular deep learning framework, it does not provide Keras as its high-level API; instead, it uses torch.nn and higher-level wrappers like Lightning or Fastai, and Keras is specifically integrated with TensorFlow.

88

MCQ · hard

A team is deploying a machine learning model on a Kubernetes cluster. They need to ensure low-latency inference and efficient resource utilization. Which approach should they use to dynamically scale inference pods based on request volume?

A. Use a Job resource to process requests in batch

B. Deploy a single large pod on a powerful node

C. Use a Horizontal Pod Autoscaler (HPA) with target CPU utilization

D. Set a fixed number of pod replicas equal to the maximum expected load

Answer: C

HPA dynamically adjusts replicas based on real-time metrics, optimizing resource usage and latency.

Why this answer

The Horizontal Pod Autoscaler (HPA) is the correct choice because it automatically scales the number of inference pods based on observed CPU utilization or custom metrics, ensuring low-latency inference by adding replicas during traffic spikes and reducing waste during idle periods. This dynamic scaling aligns with the need for efficient resource utilization in a Kubernetes cluster, as it adjusts pod count in real-time to match request volume without manual intervention.
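The scaling decision described above follows the rule documented for the HPA: desired replicas = ceil(current replicas x current metric / target metric). The sketch below reproduces that rule in plain Python; the function name, the replica bounds, and the sample utilization figures are assumptions for illustration, and the 10% tolerance is the commonly cited default rather than something stated in the question.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA scaling rule."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical CPU utilization readings against a 60% target.
print(desired_replicas(4, 90, 60))  # 6 -> scale out during a traffic spike
print(desired_replicas(4, 20, 60))  # 2 -> scale in when pods are idle
print(desired_replicas(4, 63, 60))  # 4 -> within tolerance, no change
```

The same rule applies to custom metrics such as requests per second, which track request volume more directly than CPU utilization.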

Exam trap

Cisco often tests the misconception that batch processing (Jobs) or static scaling is suitable for real-time inference, when in fact dynamic scaling with HPA is required to balance latency and resource efficiency in Kubernetes.

How to eliminate wrong answers

Option A is wrong because a Job resource is designed for batch processing and runs pods to completion, not for serving continuous inference requests that require low-latency responses; it cannot dynamically scale based on request volume. Option B is wrong because deploying a single large pod on a powerful node creates a single point of failure and cannot handle variable request loads efficiently, leading to either over-provisioning or under-provisioning and increased latency during spikes. Option D is wrong because setting a fixed number of pod replicas equal to the maximum expected load wastes resources during low-traffic periods and fails to adapt to actual request volume, contradicting the goal of efficient resource utilization.

89

MCQ · medium

A company needs to store large volumes of unstructured data (PDFs, images, logs) for future AI model training. The data must be easily accessible by data scientists using Spark and must support cost-effective storage. Which data infrastructure is MOST appropriate?

A. Snowflake data warehouse

B. Relational database like Amazon RDS

C. Pinecone vector database

D. Amazon S3 data lake

Answer: D

S3 is a scalable, low-cost object store for unstructured data; it integrates with Spark and is ideal for a data lake.

Why this answer

A data lake stores raw, unstructured data at low cost and integrates with Spark. Data warehouses are for structured, processed data; vector databases are for embeddings.

90

MCQ · easy

Which hardware accelerator is specifically designed by Google for training and inference of machine learning models, particularly with its TensorFlow framework?

A. NPU

B. FPGA

C. GPU

D. TPU

Answer: D

TPU is Google's custom chip for ML, optimized for TensorFlow.

Why this answer

TPU (Tensor Processing Unit) is Google's custom ASIC designed to accelerate ML workloads, especially with TensorFlow.

91

MCQ · hard

An ML team uses Kubeflow to orchestrate a pipeline that includes data preprocessing, model training, and evaluation. The pipeline runs on a Kubernetes cluster. After a cluster upgrade, the pipeline fails at the training step with an 'OOMKilled' error. What is the MOST likely cause?

A. The training code has a memory leak

B. The pipeline definition is missing a step dependency

C. The Kubernetes node's memory resources were not correctly allocated to the pod's resource requests or limits

D. The training data is corrupted

Answer: C

After the upgrade, default resource limits may have changed or the node's allocatable memory may have decreased, so the training container exceeded the memory available to it and was OOMKilled.

Why this answer

OOMKilled indicates the container exceeded its memory limit. The resource requests/limits likely were not adjusted for the new cluster configuration, or the node's allocatable memory decreased after upgrade.

92

MCQ · medium

A data engineer is building a pipeline to process streaming clickstream data and feed it into a real-time ML feature store. Which tool is BEST suited for the streaming ingestion?

A. Amazon S3

B. Apache Airflow

C. Apache Spark (batch mode)

D. Apache Kafka

Answer: D

Kafka provides low-latency, durable streaming, ideal for real-time clickstream ingestion into feature stores.

Why this answer

Apache Kafka is the industry standard for high-throughput, fault-tolerant streaming data ingestion. It can handle real-time clickstream data and integrate with feature stores.

93

Multi-Select · hard

A company is building a secure AI system that must comply with GDPR. They want to allow users to request deletion of their personal data from training sets and model outputs. Which THREE techniques should they implement?

Select 3 answers

A. Model ensembling

B. Differential privacy

C. Data retention and deletion policies

D. Machine unlearning

E. Federated learning

Answers: B, C, D

Differential privacy limits how much the model can memorize about any individual data point.

Why this answer

Differential privacy (B) is correct because it adds calibrated noise during training or to model outputs, so that the inclusion or exclusion of any individual's data does not significantly affect the model's behavior. This provides a mathematical bound on information leakage, which supports GDPR compliance when handling personal data. Data retention and deletion policies (C) are correct because they define how long personal data is kept and ensure it is actually removed from training sets when a user asks. Machine unlearning (D) is correct because it removes the influence of a deleted user's data from an already trained model without full retraining from scratch.
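The idea of calibrated noise can be illustrated with a minimal sketch of the Laplace mechanism applied to a counting query. This shows the principle on a query result rather than on model training (where noise is typically added to gradients); the data, the epsilon value, and the helper names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Count matching records, then add noise calibrated to epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)                      # seeded only for repeatability
ages = [23, 35, 41, 29, 52, 38, 47, 31]     # hypothetical personal data

noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(round(noisy, 2))  # close to the true count of 3, but not exact
```

A smaller epsilon means more noise and stronger privacy; the released value stays useful in aggregate while hiding whether any single person is in the data.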

Exam trap

Cisco often tests the misconception that federated learning alone satisfies GDPR deletion requirements, when in fact it only addresses data locality, not the ability to remove a specific user's influence from a trained model.

94

MCQ · medium

A team is building a retrieval-augmented generation (RAG) pipeline. They need to store embeddings of company documents and perform fast similarity searches. Which data store is BEST suited for this task?

A. Snowflake

B. Pinecone

C. Apache Kafka

D. Amazon S3

Answer: B

Pinecone is a vector database designed for high-dimensional embeddings and fast similarity search.

Why this answer

Pinecone is a purpose-built vector database designed for storing and querying high-dimensional embeddings with fast approximate nearest neighbor (ANN) search. In a RAG pipeline, embeddings of company documents must be retrieved quickly to feed relevant context to the LLM, and Pinecone's managed ANN indexing and serverless scaling make it the ideal choice for this task.
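What a similarity search computes can be sketched with an exact, brute-force version in plain Python. This is not Pinecone's API; the document IDs, the three-dimensional embeddings, and the helper names are hypothetical. A vector database returns approximately the same ranking but uses an ANN index so it does not have to score every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical document embeddings (real ones have hundreds of dimensions).
index = {
    "vpn-policy": [0.9, 0.1, 0.0],
    "expense-guide": [0.1, 0.9, 0.2],
    "remote-access-faq": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # hypothetical embedding of the user's question

results = top_k(query, index)
print(results)  # ['vpn-policy', 'remote-access-faq']
```

The brute-force scan costs one comparison per stored vector, which is the cost that ANN indexing avoids at the scale of millions of embeddings.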

Exam trap

The trap here is that candidates may confuse general-purpose storage (like S3 or Snowflake) with specialized vector databases, assuming any database can handle embeddings efficiently, but Cisco tests the understanding that only purpose-built vector stores provide the required ANN search performance for RAG.

How to eliminate wrong answers

Option A is wrong because Snowflake is a cloud data warehouse optimized for SQL-based analytical queries on structured data, not for low-latency vector similarity searches on embeddings. Option C is wrong because Apache Kafka is a distributed event streaming platform for real-time data pipelines and message brokering, not a storage and retrieval system for vector embeddings. Option D is wrong because Amazon S3 is an object storage service for static files and does not natively support vector indexing or similarity search operations.

95

MCQ · easy

A developer is building a mobile app that uses a pre-trained image classification model on-device. Which framework should they use to run the model on iOS devices?

A. Hugging Face Transformers

B. TensorFlow Lite

C. PyTorch Mobile

D. Core ML

Answer: D

Core ML is Apple's native framework for on-device ML inference on iOS devices.

Why this answer

Core ML is Apple's framework for on-device machine learning inference on iOS. TensorFlow Lite also targets mobile and embedded devices, including iOS, but Core ML is native to the platform and is optimized to use the device's CPU, GPU, and Neural Engine.

96

MCQ · easy

Which AI accelerator is specifically designed by Google to accelerate the training and inference of large neural networks, especially in its cloud environment?

A. GPU

B. NPU

C. TPU

D. FPGA

Answer: C

TPUs are Google's custom chips for ML workloads.

Why this answer

The Tensor Processing Unit (TPU) is Google's custom-designed ASIC specifically built to accelerate the training and inference of large neural networks. Unlike general-purpose hardware, TPUs are optimized for TensorFlow workloads and are a core component of Google Cloud's AI infrastructure, offering high throughput for matrix operations common in deep learning.

Exam trap

Cisco often tests the distinction between custom-designed accelerators (like TPU) and general-purpose or reconfigurable hardware (like GPU, NPU, FPGA), expecting candidates to know that TPU is Google's proprietary solution for neural network acceleration in their cloud.

How to eliminate wrong answers

Option A is wrong because GPUs (Graphics Processing Units) are general-purpose parallel processors designed for graphics and compute, not specifically by Google for neural network acceleration in their cloud; they are widely used but not Google's custom accelerator. Option B is wrong because NPU (Neural Processing Unit) is a generic term for processors designed to accelerate neural networks, but it is not a specific Google-designed chip; Google's custom accelerator is the TPU. Option D is wrong because FPGAs (Field-Programmable Gate Arrays) are reconfigurable hardware that can be programmed for various tasks, but they are not specifically designed by Google for neural network training and inference in their cloud environment; Google uses TPUs for that purpose.

97

Multi-Select · hard

An organisation is deploying a fine-tuned LLM for internal use. They need to ensure the API endpoint is secure and cost-effective. Which TWO measures should they implement? (Choose 2)

Select 2 answers

A. Implement API key authentication

B. Enable content filtering

C. Disable logging to reduce storage costs

D. Apply rate limiting per user

E. Use gRPC instead of REST

Answers: A, D

API keys restrict access to authorised clients, and per-user rate limiting caps request volume, which keeps inference costs predictable.

Why this answer

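API key authentication (Option A) is a fundamental security measure that ensures only authorized clients can access the LLM endpoint. It provides a simple, lightweight mechanism to validate requests without the overhead of full OAuth, making it both secure and cost-effective for internal deployments. Per-user rate limiting (Option D) caps how many requests each client can send in a given window, which bounds the inference compute the endpoint consumes and limits the damage from a leaked key or a runaway script.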
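The two measures in this answer, API key authentication and per-user rate limiting, can be sketched together in a few lines. This is a minimal illustration only, not a production gateway: the key store `API_KEYS`, the `handle_request` function, and the bucket settings are hypothetical names and values chosen for the example, and a token bucket is just one common way to implement rate limiting.

```python
import hmac
import time

# Hypothetical key store: API key -> user id (illustrative values only).
API_KEYS = {"key-alice-123": "alice", "key-bob-456": "bob"}


def authenticate(presented_key):
    """Return the user id for a valid key, or None."""
    for known_key, user in API_KEYS.items():
        # compare_digest avoids leaking key contents through timing.
        if hmac.compare_digest(presented_key.encode(), known_key.encode()):
            return user
    return None


class TokenBucket:
    """Allows up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, now=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.now = now
        self.tokens = float(capacity)
        self.last = now()

    def allow(self):
        current = self.now()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # one bucket per authenticated user


def handle_request(api_key, capacity=3, rate=1.0, now=time.monotonic):
    """Return an HTTP-style status code for one request."""
    user = authenticate(api_key)
    if user is None:
        return 401  # unknown key: reject before any model compute is spent
    if user not in buckets:
        buckets[user] = TokenBucket(capacity, rate, now)
    return 200 if buckets[user].allow() else 429  # 429 Too Many Requests
```

With the assumed defaults (a burst of 3 requests, refilled at 1 request per second), a fourth back-to-back request from the same user is rejected with 429 while other users are unaffected.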

Exam trap

Cisco often tests the distinction between security measures (authentication, rate limiting) and non-security features (content filtering, protocol choice), leading candidates to mistakenly select content filtering or gRPC as security controls.

98

MCQ · medium

A data scientist needs to train a deep learning model on a large image dataset. Which hardware is most suitable for parallel matrix operations and faster training compared to a CPU?

A. GPU with thousands of CUDA cores

B. TPU designed for TensorFlow

C. CPU with high clock speed

D. FPGA for reconfigurable logic

Answer: A

GPUs excel at parallel matrix multiplications, drastically reducing training time for deep learning models.

Why this answer

A GPU with thousands of CUDA cores is the most suitable hardware for parallel matrix operations because deep learning training involves massive matrix multiplications and tensor operations that can be decomposed into thousands of independent threads. CUDA cores execute these threads in a massively parallel SIMT (Single Instruction, Multiple Thread) fashion, achieving significantly higher throughput than a CPU for such workloads, which leads to faster training times.
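The claim that matrix multiplication decomposes into independent threads can be made concrete with a small sketch. Each output cell depends only on one row of the first matrix and one column of the second, so every cell is a separate task. The function names here are hypothetical, and a Python thread pool stands in for GPU threads purely to show the independence of the tasks; it does not reproduce GPU performance.

```python
from concurrent.futures import ThreadPoolExecutor


def output_element(a, b, i, j):
    """Compute one cell C[i][j]; it reads only row i of A and column j of B."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_parallel(a, b, workers=4):
    """Multiply matrices a and b by treating every output cell as its own task."""
    rows, cols = len(a), len(b[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    # No cell depends on another, so the tasks can run in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(lambda ij: output_element(a, b, *ij), cells))
    # Reassemble the flat list of results into rows.
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = matmul_parallel(a, b)  # [[19, 22], [43, 50]]
```

A GPU applies the same decomposition in hardware, assigning output cells (or tiles of cells) to thousands of threads at once, whereas a CPU has far fewer cores to spread them over.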

Exam trap

Cisco often tests the misconception that a TPU is always the best choice for deep learning. The question specifies 'parallel matrix operations' and 'faster training compared to a CPU' without limiting the framework to TensorFlow, which makes the GPU the most universally suitable answer.

How to eliminate wrong answers

Option B is wrong because a TPU is a custom ASIC that this option ties to TensorFlow, whereas the question asks for the most suitable hardware for parallel matrix operations in general; GPUs are more widely supported across deep learning frameworks (PyTorch, TensorFlow, etc.) and offer greater flexibility for various model architectures. Option C is wrong because a CPU with high clock speed excels at sequential, latency-sensitive tasks but has a limited number of cores (typically 8–64) compared to a GPU's thousands of cores, making it inefficient for the massive parallelism required in deep learning training. Option D is wrong because an FPGA offers reconfigurable logic for custom hardware acceleration but requires significant development effort and has lower floating-point throughput per watt compared to a GPU for standard deep learning operations, making it less practical for general-purpose training.

99

MCQ · hard

A data science team is deploying a real-time fraud detection model on edge devices in retail stores. The model must infer under 10ms and fit within 50MB memory. Which combination of techniques should the team apply?

A. Model parallelism and distributed inference

B. Increase batch size and use FP16 precision

C. Train a larger model and use distillation to transfer knowledge

D. Model quantization to INT8 and pruning of low-weight connections

Answer: D

INT8 quantization reduces model size and latency; pruning eliminates unnecessary weights, meeting both memory and speed constraints.

Why this answer

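Quantization lowers numeric precision (e.g., FP32 to INT8), which cuts weight storage by roughly 4x and speeds up inference, while pruning removes low-magnitude connections that contribute little to the output. Together they address both the 50MB memory limit and the 10ms latency target, and both are standard techniques for edge deployment. Distillation can compress the model further if needed.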
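A small sketch can show what the two techniques do to a weight vector and to the memory footprint. It assumes magnitude-based pruning and symmetric per-tensor INT8 quantization, which are common choices but only one of several possible schemes; the function names, the sample weights, and the 40-million-parameter model size are hypothetical values for illustration.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune weights closest to zero.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric quantization: each w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def weight_megabytes(n_params, bytes_per_param):
    """Storage for the weights alone, in MB (10**6 bytes)."""
    return n_params * bytes_per_param / 1_000_000


weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.03]
pruned = prune_by_magnitude(weights, sparsity=0.5)  # three smallest become 0.0
q, scale = quantize_int8(pruned)                     # q == [0, -127, 50, 0, 90, 0]

# Hypothetical 40M-parameter model: 160 MB in FP32, 40 MB in INT8.
fp32_mb = weight_megabytes(40_000_000, 4)
int8_mb = weight_megabytes(40_000_000, 1)
```

Under these assumed numbers, quantization alone brings the weights from 160 MB to 40 MB, inside a 50MB budget, and the zeros introduced by pruning can be stored or skipped cheaply by runtimes that support sparsity.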

100

MCQ · hard

An MLOps team observes that their production inference API experiences increasing latency as more concurrent requests arrive. They need to scale horizontally while maintaining session state of preprocessing steps. Which deployment strategy should they implement?

A. Deploy stateless containers without session persistence

B. Use a single larger GPU instance to handle all requests

C. Deploy multiple instances behind a round-robin load balancer with sticky sessions

D. Implement a message queue (e.g., Kafka) to buffer requests

Answer: C

Sticky sessions ensure that all requests from a user session are routed to the same instance, preserving session state during horizontal scaling.

Why this answer

Sticky sessions (session affinity) ensure that all requests from a given client are routed to the same backend instance, preserving the in-memory session state of the preprocessing steps. Behind a round-robin load balancer, new sessions are spread across instances, so adding instances relieves the latency caused by concurrent load while each session keeps its state.
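The routing behaviour described here can be sketched as a tiny in-memory balancer: new sessions are assigned round-robin, and known sessions always return to the instance that holds their state. The class name, backend names, and session ids are hypothetical, and real load balancers typically implement affinity with cookies or hashing rather than a Python dictionary.

```python
from itertools import cycle


class StickyRoundRobinBalancer:
    """Round-robin for new sessions; a pinned backend for known sessions."""

    def __init__(self, backends):
        self._next_backend = cycle(backends)
        self._affinity = {}  # session id -> backend holding its state

    def route(self, session_id):
        if session_id not in self._affinity:
            # First request of a session: take the next backend in rotation.
            self._affinity[session_id] = next(self._next_backend)
        return self._affinity[session_id]


lb = StickyRoundRobinBalancer(["inference-1", "inference-2", "inference-3"])
first = lb.route("session-a")    # "inference-1"
second = lb.route("session-b")   # "inference-2"
repeat = lb.route("session-a")   # "inference-1" again: state is preserved
```

One trade-off of this design is that a pinned instance going down loses its sessions' in-memory state, which is why affinity is often paired with health checks or externalized state.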

Exam trap

Cisco often tests the distinction between stateless and stateful scaling, where candidates mistakenly choose message queues (Option D) thinking they solve concurrency, but they fail to address the synchronous session state requirement.

How to eliminate wrong answers

Option A is wrong because stateless containers without session persistence would lose the preprocessing session state between requests, breaking the required stateful behavior. Option B is wrong because scaling vertically with a single larger GPU instance does not address horizontal scaling needs and creates a single point of failure, while also not solving the latency increase under concurrent requests. Option D is wrong because a message queue like Kafka buffers requests asynchronously, which introduces decoupling and potential ordering issues, but does not directly provide horizontal scaling with session state preservation for synchronous inference requests.
