CCNA Aio Ai Infrastructure Questions


1
MCQ, easy

A developer is using Hugging Face Transformers to fine-tune a BERT model for sentiment analysis. They want to track experiments, log metrics, and compare runs. Which MLOps tool should they integrate?

A. Apache Airflow
B. Docker
C. Kubeflow
D. MLflow
Answer: D

MLflow's Tracking API is simple to integrate and supports logging parameters, metrics, and artifacts.

Why this answer

MLflow is the correct choice because it is purpose-built for experiment tracking, metric logging, and run comparison in machine learning workflows. It provides an API to log parameters, metrics, and artifacts, and its UI allows easy comparison of different fine-tuning runs, which directly matches the developer's need to track experiments and compare runs for a BERT sentiment analysis model.

Exam trap

Cisco often tests the distinction between infrastructure tools (Airflow, Docker, Kubeflow) and ML-specific experiment tracking tools (MLflow), trapping candidates who confuse orchestration or containerization with MLOps tracking capabilities.

How to eliminate wrong answers

Option A is wrong because Apache Airflow is a workflow orchestration tool for scheduling and managing DAGs (Directed Acyclic Graphs) of tasks, not for experiment tracking or metric logging; it lacks native ML run comparison capabilities. Option B is wrong because Docker is a containerization platform for packaging applications and dependencies, not an MLOps tool for logging metrics or comparing experiments; it provides environment consistency but no tracking or logging features. Option C is wrong because Kubeflow is a Kubernetes-native platform for deploying and managing ML pipelines at scale, but it is overkill for simple experiment tracking and does not offer the lightweight, focused metric logging and run comparison that MLflow provides out of the box.

2
MCQ, hard

An ML team deploys a model on edge devices using INT8 quantization. They notice a significant drop in accuracy on a subset of classes. Which technique should they apply to recover accuracy without increasing model size?

A. Use pruning to remove less important weights
B. Increase the model architecture size
C. Switch to FP16 quantization
D. Apply quantization-aware training (QAT)
Answer: D

QAT simulates quantization during training, allowing the model to learn to compensate for the lower precision, often restoring accuracy.

Why this answer

Quantization-aware training (QAT) simulates INT8 quantization effects during the forward pass of training, allowing the model to learn weights and activations that are more robust to the lower precision. This recovers accuracy lost during post-training quantization without increasing the model's size, as the architecture and number of parameters remain unchanged.

Exam trap

Cisco often tests the misconception that post-training quantization is always lossless, leading candidates to overlook the need for QAT when accuracy drops on specific classes due to uneven weight distributions.

How to eliminate wrong answers

Option A is wrong because pruning reduces model size by removing less important weights, which does not directly address the accuracy drop caused by INT8 quantization and may further degrade performance. Option B is wrong because increasing the model architecture size would increase the model's memory footprint and latency, contradicting the requirement to not increase model size. Option C is wrong because switching to FP16 quantization uses 16-bit floating point, which increases the model size compared to INT8 and does not meet the constraint of maintaining the same model size.
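To make the "simulates quantization during the forward pass" idea concrete, here is a minimal pure-Python sketch of the quantize-then-dequantize step (often called fake quantization). It assumes symmetric, per-tensor INT8 scaling and hypothetical weight values; real QAT implementations in training frameworks also handle gradients through the rounding step, which this sketch does not attempt.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to signed integers, then dequantize (symmetric, per-tensor scaling)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each value to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    # Map the integers back to floats so the rest of the network sees the rounding error.
    return [q * scale for q in quantized]


# Hypothetical weights: one large value stretches the scale, so tiny weights lose precision.
weights = [0.02, -0.5, 0.013, 1.27, -0.0004]
simulated = fake_quantize(weights)
errors = [abs(w - s) for w, s in zip(weights, simulated)]

for w, s, e in zip(weights, simulated, errors):
    print(f"original={w:+.4f} simulated={s:+.4f} error={e:.4f}")
```

In this toy tensor the smallest weight collapses to zero because a single large weight sets the scale. Training with this rounding in the loop is what lets the model adapt to it.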

3
MCQ, easy

A company wants to build a real-time anomaly detection system for IoT sensor data using edge AI. The model must run on resource-constrained devices with minimal power consumption. Which model optimization technique is MOST important?

A. Use FP32 precision
B. Model quantization (INT8)
C. Increase the number of layers
D. Use a larger batch size
Answer: B

INT8 quantization dramatically reduces model size and inference latency with minimal accuracy loss, ideal for edge devices.

Why this answer

Quantization reduces model precision (e.g., FP32 to INT8), decreasing model size and computation, which is critical for resource-constrained edge devices.

4
MCQ, easy

A data scientist is choosing a hardware accelerator for training a large transformer model. Which of the following is specifically designed for deep learning workloads and offers the highest throughput for matrix multiplications?

A. TPU
B. GPU
C. NPU
D. CPU
Answer: A

TPUs are Google's custom ASICs built specifically for tensor computations, delivering the highest throughput for matrix multiplications in deep learning.

Why this answer

The TPU (Tensor Processing Unit) is an application-specific integrated circuit (ASIC) designed by Google specifically to accelerate deep learning workloads. Its systolic array architecture is optimized for the matrix multiplications and convolutions that dominate transformer model training, delivering the highest throughput among the listed options for these operations.

Exam trap

Cisco often tests the distinction between hardware designed for training versus inference, and the trap here is that candidates may choose GPU because it is the most common deep learning accelerator, overlooking that TPU is purpose-built for the highest matrix multiplication throughput in training workloads.

How to eliminate wrong answers

Option B (GPU) is wrong because while GPUs are widely used for deep learning and offer high parallelism, they are general-purpose processors originally designed for graphics rendering, not specifically optimized for the dense matrix operations in transformer training. Option C (NPU) is wrong because Neural Processing Units are typically designed for low-power inference on edge devices, not for high-throughput training of large models. Option D (CPU) is wrong because CPUs are general-purpose processors optimized for sequential tasks and low-latency operations, lacking the massive parallel compute units and specialized matrix multiplication hardware needed for efficient transformer training.

5
MCQ, easy

An ML engineer wants to deploy a model as a REST API that can scale to handle thousands of inference requests per second. Which serving approach is most appropriate?

A. Export the model to ONNX format and use a batch processing pipeline
B. Use gRPC streaming for all inference requests
C. Run the model directly on the client device
D. Deploy the model as a REST API endpoint using a containerized inference server
Answer: D

REST APIs are stateless and easily scalable with load balancers and container orchestration.

Why this answer

Option D is correct because deploying the model as a REST API endpoint using a containerized inference server (e.g., TensorFlow Serving, TorchServe, or NVIDIA Triton Inference Server) is the most appropriate approach for handling thousands of inference requests per second. These servers are designed for high-throughput, low-latency serving, support horizontal scaling via load balancers, and provide built-in batching and model versioning. REST APIs are stateless and can be easily integrated with existing web infrastructure, making them ideal for production-scale inference.

Exam trap

Cisco often tests the distinction between serving infrastructure (REST API with containerized server) and data processing pipelines (batch) or communication protocols (gRPC), leading candidates to confuse a transport mechanism or batch method with a scalable serving architecture.

How to eliminate wrong answers

Option A is wrong because exporting to ONNX and using a batch processing pipeline is designed for offline/batch inference, not for real-time REST API serving with thousands of requests per second; batch pipelines introduce latency and are not suitable for synchronous, low-latency inference. Option B is wrong because gRPC streaming is a communication protocol that can be used for inference, but it is not a serving approach itself; moreover, gRPC streaming is typically used for bidirectional or long-lived streams, not for high-volume stateless REST API requests, and it adds complexity without inherent scalability benefits over REST for this use case. Option C is wrong because running the model directly on the client device (edge inference) offloads computation from the server but does not provide a centralized REST API; it also introduces challenges with model updates, device heterogeneity, and security, and is not a server-side serving approach.

6
MCQ, medium

A machine learning team is training a large transformer model on a text corpus. They need to reduce training time while maintaining model accuracy. Which hardware configuration would be MOST effective for this task?

A. Use a high-core-count CPU with large RAM
B. Use a cluster of GPUs with data parallelism
C. Use a single GPU with model parallelism
D. Use a single TPU with model parallelism
Answer: B

GPUs accelerate parallel tensor operations, and data parallelism distributes batches across multiple GPUs, significantly reducing training time.

Why this answer

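GPUs are optimized for the parallel tensor computations that dominate deep learning training, offering significant speedups over CPUs. Data parallelism replicates the model on every GPU in the cluster, splits each batch across the replicas, and averages the resulting gradients, so training time falls while the model itself is unchanged. Options C and D each name a single device, and model parallelism only helps when a model is partitioned across multiple devices, so neither shortens training. TPUs are also effective for training but are less accessible and more specialized, and Option A's CPU lacks the matrix-multiplication throughput of an accelerator.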
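The data-parallel idea in Option B can be sketched in a few lines of plain Python. The model (`y_hat = w * x`), the data, and the worker count below are hypothetical, and the final average stands in for the all-reduce step that a real multi-GPU framework would perform. The sketch assumes the batch divides evenly across workers.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error with respect to w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_grad(w, batch, num_workers):
    """Split the batch into equal shards, compute one gradient per worker, then average."""
    shard_size = len(batch) // num_workers      # assumes len(batch) is divisible by num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # On real hardware each shard's gradient would be computed concurrently on its own GPU.
    local_grads = [grad_mse(w, shard) for shard in shards]
    # Averaging the local gradients stands in for an all-reduce across devices.
    return sum(local_grads) / num_workers


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, batch)
parallel = data_parallel_grad(0.5, batch, num_workers=2)
print(full, parallel)
```

With equal shard sizes the averaged gradient matches the full-batch gradient, which is why data parallelism can cut training time without changing what the model learns from each batch.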

7
Multi-Select, easy

A machine learning engineer needs to containerize a PyTorch model for deployment on Kubernetes. Which THREE tools or formats should they use?

Select 3 answers
A. MLflow
B. Docker
C. Kubeflow
D. Kubernetes
E. ONNX
Answers: B, D, E

Docker is the standard for containerizing applications, including ML models and their dependencies.

Why this answer

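Docker is correct because it is the standard tool for creating container images that package the PyTorch model along with its dependencies, runtime, and environment into a portable artifact. Kubernetes requires container images (typically built with Docker) to deploy and orchestrate workloads, making Docker essential for containerization before deployment. Kubernetes is correct because it is the orchestration platform named as the deployment target, and ONNX is correct because it is an open model interchange format that lets the PyTorch model be exported and executed by an inference runtime inside the container.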

Exam trap

Cisco often tests the distinction between containerization tools (Docker) and orchestration or ML lifecycle tools (Kubeflow, MLflow), leading candidates to select tools that manage containers rather than build them.

8
Multi-Select, medium

A team is selecting a vector database for a RAG application that requires low-latency similarity search on millions of embeddings. They prioritize ease of use and fully managed cloud service. Which TWO options meet these requirements?

Select 2 answers
A. Pinecone
B. pgvector
C. Chroma
D. Weaviate
E. Milvus
Answers: A, D

Pinecone is a fully managed, cloud-native vector database with low-latency similarity search, ideal for production RAG.

Why this answer

Pinecone is a fully managed vector database designed for production-scale RAG applications, offering low-latency similarity search on millions of embeddings without requiring users to manage infrastructure. Its serverless architecture and simple API align directly with the team's priorities of ease of use and a fully managed cloud service. Weaviate also qualifies because, alongside its open-source distribution, it is available as a managed cloud service.

Exam trap

Cisco often tests the distinction between open-source, self-managed tools and fully managed cloud services, where candidates may incorrectly assume that any popular vector database (like Milvus or pgvector) inherently provides a managed cloud experience without checking the deployment model.

9
MCQ, hard

An engineer is deploying a model on edge devices with limited compute. The model was trained in PyTorch. They need to convert it to a format optimized for mobile CPUs. Which framework should they use?

A. OpenVINO
B. TensorFlow Lite
C. ONNX Runtime Mobile
D. Core ML
Answer: B

TensorFlow Lite is the standard for deploying models on mobile, embedded, and IoT devices with hardware acceleration.

Why this answer

TensorFlow Lite (TFLite) is specifically designed for on-device inference on mobile CPUs with limited compute, offering quantization and reduced model size. Because the model was trained in PyTorch, it must be converted first, typically by exporting to ONNX and then converting through TensorFlow into the TFLite format. TFLite's optimized kernels for ARM CPUs make it the best choice for mobile CPU deployment.

Exam trap

Cisco often tests the misconception that ONNX Runtime Mobile is the universal solution for all edge devices, but it lacks the mobile-specific CPU optimizations and ecosystem maturity of TensorFlow Lite for ARM-based mobile CPUs.

How to eliminate wrong answers

Option A is wrong because OpenVINO is optimized for Intel CPUs, GPUs, and VPUs, not for general mobile CPUs (e.g., ARM-based), and lacks native mobile runtime support. Option C is wrong because ONNX Runtime Mobile is designed for cross-platform inference but does not provide the same level of CPU kernel optimization for mobile devices as TFLite, and its mobile support is less mature. Option D is wrong because Core ML is Apple's framework for iOS devices only, not for cross-platform mobile CPUs, and requires conversion to a proprietary format that may not support all PyTorch operators.

10
Multi-Select, hard

A company is building a multi-modal AI application that processes text, images, and audio. They need a unified platform to store embeddings for all modalities, perform hybrid search (vector + metadata filtering), and scale to millions of vectors. Which THREE services are suitable for this purpose? (Choose THREE.)

Select 3 answers
A. Weaviate
B. Amazon S3
C. Snowflake
D. pgvector (PostgreSQL extension)
E. Pinecone
Answers: A, D, E

Weaviate is a vector database with hybrid search and multi-modal support.

Why this answer

Weaviate is a purpose-built vector database that natively supports multi-modal embeddings (text, images, audio) through its vectorizer modules and provides hybrid search combining vector similarity with metadata filtering (e.g., using GraphQL or REST APIs). It is designed to scale to millions of vectors with built-in sharding and replication, making it a strong fit for the described unified platform.

Exam trap

Cisco often tests the distinction between general-purpose storage (S3) or analytics platforms (Snowflake) and purpose-built vector databases, leading candidates to mistakenly choose services that store data but lack native vector search and hybrid filtering capabilities.
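The hybrid search described in this question (vector similarity plus metadata filtering) can be illustrated with a small in-memory sketch. The record layout, field names, and vectors are hypothetical and do not reflect any particular product's API. Production vector databases use approximate nearest-neighbour indexes rather than the exhaustive scan shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(records, query_vector, metadata_filter, top_k=2):
    """Keep records whose metadata matches every filter key, then rank by cosine similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r["vector"], query_vector), reverse=True)
    return [r["id"] for r in ranked[:top_k]]


# Hypothetical embeddings for three modalities stored side by side.
records = [
    {"id": "doc-text-1", "vector": [0.9, 0.1, 0.0], "metadata": {"modality": "text"}},
    {"id": "img-1", "vector": [0.8, 0.2, 0.1], "metadata": {"modality": "image"}},
    {"id": "img-2", "vector": [0.1, 0.9, 0.2], "metadata": {"modality": "image"}},
    {"id": "audio-1", "vector": [0.7, 0.3, 0.0], "metadata": {"modality": "audio"}},
]

result = hybrid_search(records, [1.0, 0.0, 0.0], {"modality": "image"})
print(result)
```

General-purpose object storage such as S3 can hold these records, but it offers neither the filter step nor the similarity ranking, which is the distinction the question is testing.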

11
MCQ, hard

A team uses a retrieval-augmented generation (RAG) system to answer questions from a large enterprise document repository. They observe that the generated answers sometimes contain information not present in the retrieved documents. What is the MOST likely cause?

A. The vector database has low recall
B. The temperature setting is too high
C. The embedding model is not accurately representing document semantics
D. The LLM's context window is too small to include all retrieved chunks
Answer: D

When the context window truncates retrieved documents, the LLM lacks necessary information and may generate unsupported content.

Why this answer

When the LLM's context window is too small to include all retrieved chunks, the model may generate information not present in the provided context to fill gaps, a phenomenon known as hallucination. This occurs because the model relies on its internal knowledge when relevant retrieved content is truncated, leading to fabricated details. Option D directly addresses this mismatch between retrieval capacity and generation constraints.

Exam trap

Cisco often tests the distinction between retrieval-side failures (low recall, poor embeddings) and generation-side failures (context window limits, hallucination), trapping candidates who confuse missing information with fabricated information.

How to eliminate wrong answers

Option A is wrong because low recall in the vector database means relevant documents are missed, which would cause incomplete or missing answers, not the addition of extra information not in the retrieved set. Option B is wrong because a high temperature setting increases randomness in token selection, potentially causing less coherent or more creative outputs, but it does not directly cause the model to fabricate facts absent from the provided context. Option C is wrong because an inaccurate embedding model leads to poor semantic matching and retrieval of irrelevant documents, but the generated answer would still be based on whatever documents were retrieved, not on information outside those documents.
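The truncation effect behind Option D can be sketched as follows. The function name, the sample chunks, and the word-count "tokenizer" are all assumptions for illustration; a real pipeline would count tokens with the model's own tokenizer.

```python
def truncate_to_budget(chunks, max_tokens):
    """Add ranked chunks in order and stop at the first one that would exceed the budget."""
    kept, used = [], 0
    for i, chunk in enumerate(chunks):
        cost = len(chunk.split())          # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            return kept, chunks[i:]        # everything from here on never reaches the LLM
        kept.append(chunk)
        used += cost
    return kept, []


# Hypothetical retrieved chunks, already ranked by relevance.
retrieved = [
    "Refund window is 30 days from delivery.",
    "Refunds are issued to the original payment method.",
    "Gift cards are non-refundable.",
]

kept, dropped = truncate_to_budget(retrieved, max_tokens=16)
print("sent to model:", kept)
print("silently dropped:", dropped)
```

Retrieval succeeded for all three chunks here, yet the model never sees the last one. A question about gift cards would then be answered from the model's internal knowledge rather than from the retrieved text.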

12
MCQ, hard

A team wants to deploy a large language model on edge devices with limited memory and compute. They need to reduce model size by at least 50% while preserving accuracy. Which combination of techniques is most effective?

A. Apply INT8 quantization and weight pruning
B. Distill the model into a smaller architecture without quantization or pruning
C. Use FP32 precision and increase batch size
D. Use FP16 quantization and add more layers
Answer: A

INT8 reduces storage and computation; pruning removes redundant weights, yielding >50% size reduction with minor accuracy impact.

Why this answer

INT8 quantization reduces the precision of weights and activations from 32-bit to 8-bit, cutting memory usage by approximately 75% for those tensors, while weight pruning removes redundant connections, often achieving over 50% size reduction with minimal accuracy loss when combined. Together, they directly address the constraints of edge devices by shrinking the model footprint and computational requirements without requiring a complete architecture redesign.

Exam trap

Cisco often tests the misconception that a single technique (like distillation or FP16) is sufficient for aggressive size reduction, when in reality, combining complementary compression methods (quantization and pruning) is necessary to meet both the 50% size reduction and accuracy preservation requirements on edge devices.

How to eliminate wrong answers

Option B is wrong because knowledge distillation alone reduces model size by training a smaller student network, but without quantization or pruning, the student model may still exceed the 50% reduction target or suffer significant accuracy loss if the architecture is not aggressively compressed. Option C is wrong because using FP32 precision and increasing batch size actually increases memory and compute demands, making it unsuitable for resource-constrained edge devices. Option D is wrong because FP16 quantization provides only a 50% memory reduction, which is not guaranteed to meet the target once layers are added: the extra layers increase model size and complexity, often negating the quantization benefit, and accuracy can degrade on edge hardware without native FP16 support.
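The size arithmetic behind these options can be checked with a short calculation. The parameter count and the 30% sparsity level are hypothetical, and the sketch assumes pruned weights take no storage at all, ignoring the index overhead of real sparse formats as well as activation memory.

```python
def model_size_mb(num_params, bits_per_weight, sparsity=0.0):
    """Approximate weight storage in MB, assuming pruned weights are not stored."""
    remaining = num_params * (1.0 - sparsity)
    return remaining * bits_per_weight / 8 / 1e6


def reduction_pct(baseline_mb, new_mb):
    """Percentage size reduction relative to the baseline."""
    return 100.0 * (1.0 - new_mb / baseline_mb)


params = 100_000_000                                   # hypothetical 100M-parameter model
fp32 = model_size_mb(params, 32)
fp16 = model_size_mb(params, 16)
int8 = model_size_mb(params, 8)
int8_pruned = model_size_mb(params, 8, sparsity=0.3)   # hypothetical 30% of weights pruned

for name, size in [("FP16", fp16), ("INT8", int8), ("INT8 + pruning", int8_pruned)]:
    print(f"{name}: {size:.1f} MB, {reduction_pct(fp32, size):.1f}% smaller than FP32")
```

Under these assumptions FP16 lands exactly on the 50% line with no margin, while INT8 alone already clears it and pruning widens the gap.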

13
Multi-Select, easy

A data scientist wants to build a proof-of-concept chatbot using a large language model. They need to choose a cloud AI platform that provides easy access to pre-trained models via API, with built-in safety filters and prompt engineering tools. Which TWO platforms are best suited?

Select 2 answers
A. Amazon Bedrock
B. Azure OpenAI Service
C. Amazon SageMaker
D. Hugging Face Hub
E. Google Vertex AI
Answers: B, E

Azure OpenAI provides managed access to GPT models with safety filters and prompt engineering.

Why this answer

Azure OpenAI Service (B) is correct because it provides direct API access to pre-trained models like GPT-4 with built-in content safety filters (e.g., Azure AI Content Safety) and integrated prompt engineering tools (e.g., Prompt Flow in Azure Machine Learning). This makes it ideal for quickly building a proof-of-concept chatbot with safety guardrails.

Exam trap

Cisco often tests the distinction between a managed API service for pre-trained models (Azure OpenAI, Vertex AI) and a full ML platform (SageMaker) or a model hub (Hugging Face) that lacks integrated safety and prompt engineering tools.

14
MCQ, medium

A data science team is deploying a deep learning model for real-time inference on edge devices with limited power and memory. Which model optimisation technique would be MOST effective for reducing latency and memory footprint while maintaining acceptable accuracy?

A. Use a larger batch size during inference
B. Train the model for more epochs to improve convergence
C. Apply quantisation to convert weights from FP32 to INT8
D. Increase the number of layers to improve feature extraction
Answer: C

Quantisation reduces model size and speeds up inference, making it ideal for edge devices.

Why this answer

Quantisation reduces the precision of model weights from 32-bit floating point (FP32) to 8-bit integer (INT8), which cuts weight storage by 75% and accelerates inference on edge devices by using integer arithmetic. The technique targets resource-constrained environments where power and memory are limited, and it typically preserves accuracy within 1-2% of the original model.

Exam trap

Cisco often tests the misconception that increasing model complexity (more layers or epochs) improves deployment performance, when in fact the opposite is true for edge inference; candidates may confuse training optimization with inference optimization.

How to eliminate wrong answers

Option A is wrong because increasing batch size during inference increases memory consumption and latency on edge devices, as it requires processing multiple inputs simultaneously, which is counterproductive for real-time, low-latency requirements. Option B is wrong because training for more epochs improves convergence and accuracy but does not reduce model size or inference latency; it may even lead to overfitting without any benefit to deployment efficiency. Option D is wrong because adding more layers increases the model's parameter count, memory footprint, and computational latency, directly opposing the goal of reducing resource usage on edge devices.

15
MCQ, medium

A data scientist needs to deploy a PyTorch model to production with low-latency inference. The model must be served as a REST API and should support GPU acceleration. Which combination of tools is MOST suitable for this task?

A. ONNX Runtime with a gRPC endpoint on a CPU-only node
B. Apache Spark with MLlib to serve the model in batch mode
C. Docker container with a FastAPI application and NVIDIA GPU support
D. Kubeflow Pipelines to deploy the model as a scheduled job
Answer: C

Docker encapsulates the environment, FastAPI provides REST API capabilities, and Nvidia GPU support enables GPU acceleration for inference.

Why this answer

Option C is correct because it combines Docker containerization with FastAPI for a lightweight REST API and NVIDIA GPU support (via nvidia-docker or NVIDIA Container Toolkit) to enable low-latency GPU-accelerated inference. This stack directly meets the requirements of low-latency inference, REST API serving, and GPU acceleration without unnecessary overhead.

Exam trap

Cisco often tests the distinction between batch/offline processing tools (like Spark or Kubeflow Pipelines) and real-time serving frameworks, leading candidates to confuse orchestration or batch tools with low-latency inference solutions.

How to eliminate wrong answers

Option A is wrong because ONNX Runtime with a gRPC endpoint on a CPU-only node cannot provide GPU acceleration, which is explicitly required. Option B is wrong because Apache Spark with MLlib is designed for distributed batch processing and large-scale data pipelines, not for low-latency real-time REST API serving of a single PyTorch model. Option D is wrong because Kubeflow Pipelines is a workflow orchestration tool for scheduling and managing ML pipelines, not a real-time inference serving solution; it lacks native REST API endpoints for low-latency inference.

16
MCQ, hard

A company uses Azure OpenAI to generate customer support responses. The team notices that repeated queries with similar context incur high costs due to token usage. They want to reduce costs without affecting response quality. Which strategy is MOST effective?

A. Use a larger model to improve efficiency
B. Increase the frequency penalty
C. Reduce the max_tokens parameter
D. Implement prompt caching
Answer: D

Prompt caching avoids recomputing common prefixes, reducing token usage.

Why this answer

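Prompt caching lets the service reuse the already-processed prefix of a prompt (for example, a long system message or shared context) when a later request begins with the same tokens. The cached portion does not have to be recomputed and is typically billed at a reduced rate, so similar requests cost less while the model and the quality of its responses are unchanged. Reducing max_tokens or raising the frequency penalty would alter the responses, and a larger model would raise cost.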
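A toy model of prefix reuse shows why repeated context is the part that gets cheaper. The `PrefixCache` class and the word-level "tokens" are hypothetical. Real services apply caching on the provider side under their own eligibility rules and pricing, which this sketch does not attempt to reproduce.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Hypothetical cache that remembers earlier prompts and reports reusable prefix tokens."""

    def __init__(self):
        self.seen = []   # token lists of earlier prompts

    def submit(self, prompt):
        tokens = prompt.split()   # words stand in for real tokens
        cached = max((shared_prefix_len(tokens, old) for old in self.seen), default=0)
        self.seen.append(tokens)
        return {"cached_tokens": cached, "new_tokens": len(tokens) - cached}


system = "You are a support agent for Acme. Follow the refund policy."
cache = PrefixCache()
first = cache.submit(system + " Question: where is my order?")
second = cache.submit(system + " Question: how do I reset my password?")
print(first)
print(second)
```

The first request has nothing to reuse. The second shares the whole system message, so only the differing question at the end counts as new work.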

17
MCQ, easy

An AI developer needs to store large amounts of unstructured data (e.g., images, logs) for training datasets. Which cloud storage solution is purpose-built for data lakes?

A. Amazon DynamoDB
B. Amazon RDS
C. Amazon S3
D. Amazon Redshift
Answer: C

S3 is an object store that is the foundation of many data lake architectures.

Why this answer

Amazon S3 is purpose-built for data lakes because it provides virtually unlimited scalability, high durability (99.999999999% or 11 nines), and supports any type of unstructured data (images, logs, videos) with a flat object storage architecture. Its integration with AWS Glue, Athena, and Lake Formation enables schema-on-read analytics, making it the foundational service for building a data lake on AWS.

Exam trap

Cisco often tests the distinction between storage for analytics (S3 data lake) vs. storage for transactions (DynamoDB) or structured querying (Redshift), so candidates mistakenly choose Redshift because they associate 'data' with 'warehouse' rather than recognizing that data lakes require raw object storage.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB is a NoSQL key-value and document database optimized for low-latency transactional workloads, not for storing large volumes of unstructured data for analytics. Option B is wrong because Amazon RDS is a relational database service for structured data with fixed schemas, and it cannot scale to petabyte-scale unstructured data storage. Option D is wrong because Amazon Redshift is a petabyte-scale data warehouse designed for structured, columnar data and SQL-based analytics, not for storing raw unstructured data like images or logs.

18
MCQ, medium

A data scientist is building a recommendation system using Apache Spark for feature engineering. They need to process streaming user click data in real-time before feeding into the model. Which tool should they use for the streaming data ingestion?

A. Amazon S3
B. Apache Kafka
C. Airflow
D. Snowflake
Answer: B

Kafka supports high-throughput, real-time data streams that can be processed by Spark.

Why this answer

Apache Kafka is the correct choice because it is a distributed streaming platform designed for high-throughput, fault-tolerant, real-time data ingestion. It acts as a durable message broker that can ingest streaming click data and make it available for Spark Structured Streaming to process in micro-batches or continuous processing mode, which is essential for real-time feature engineering in a recommendation system.

Exam trap

Cisco often tests the distinction between storage, orchestration, and streaming tools, and the trap here is that candidates confuse batch-oriented tools like S3 or Airflow with real-time streaming ingestion, overlooking Kafka's role as a dedicated event streaming platform.

How to eliminate wrong answers

Option A is wrong because Amazon S3 is an object storage service, not a streaming ingestion tool; it lacks the low-latency, pub-sub messaging capabilities required for real-time data streaming. Option C is wrong because Airflow is a workflow orchestration tool for scheduling batch jobs, not a real-time streaming ingestion platform; it cannot handle continuous, event-driven data streams. Option D is wrong because Snowflake is a cloud-based data warehouse optimized for analytical queries on structured data, not for real-time streaming ingestion; it does not provide a pub-sub or message queue interface for live click data.

19
MCQ, medium

A team is using Hugging Face Transformers to serve an LLM via a REST API. They notice high latency during inference. The model is deployed on a single GPU. Which optimisation would reduce inference latency WITHOUT changing the model architecture?

A. Use model quantisation to FP16
B. Add more GPUs and distribute the model
C. Increase the batch size to process multiple requests simultaneously
D. Switch from GPU to CPU
Answer: A

FP16 half-precision reduces memory and compute, lowering latency on compatible hardware.

Why this answer

FP16 quantization reduces the memory footprint and computational load of the model by using half-precision floating-point numbers, which allows the GPU to process more operations per second and reduces memory bandwidth usage. This directly lowers inference latency without altering the model's architecture, making it the correct choice for a single-GPU deployment.

Exam trap

Cisco often tests the distinction between latency and throughput, and the trap here is that candidates confuse increasing batch size (which improves throughput) with reducing per-request latency, leading them to incorrectly select Option C.

How to eliminate wrong answers

Option B is wrong because adding more GPUs and distributing the model (model parallelism) does not reduce latency for a single request; it primarily increases throughput for multiple requests and can actually increase per-request latency due to inter-GPU communication overhead. Option C is wrong because increasing batch size improves throughput (requests per second) but increases per-request latency, as the GPU must process a larger batch before returning results. Option D is wrong because switching from GPU to CPU would dramatically increase inference latency, as CPUs are not optimized for the parallel matrix operations required by LLMs.
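The latency-versus-throughput distinction in Option C can be seen with a simple cost model. The fixed overhead and per-request cost below are hypothetical numbers chosen for illustration, not measurements of any GPU.

```python
def batch_latency_ms(batch_size, fixed_ms=20.0, per_item_ms=5.0):
    """Hypothetical cost model: a fixed launch overhead plus a per-request compute cost."""
    return fixed_ms + per_item_ms * batch_size


def throughput_rps(batch_size, fixed_ms=20.0, per_item_ms=5.0):
    """Requests completed per second when requests are processed in batches of this size."""
    return batch_size / (batch_latency_ms(batch_size, fixed_ms, per_item_ms) / 1000.0)


for size in (1, 8, 32):
    print(
        f"batch={size:>2}  latency={batch_latency_ms(size):6.1f} ms  "
        f"throughput={throughput_rps(size):6.1f} req/s"
    )
```

Every request in a batch waits for the whole batch to finish, so per-request latency rises with batch size even as throughput improves. That is the confusion the exam trap describes.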

20
MCQ, medium

A company wants to build a customer service chatbot that answers questions about their internal policy documents. The documents are updated monthly, and the team cannot afford to retrain a model each time. Which approach is MOST appropriate?

A. Use Retrieval-Augmented Generation (RAG) with the policy documents indexed in a vector store
B. Train a custom model from scratch on the policy documents each month
C. Use a larger foundation model with a longer context window and paste all documents into each prompt
D. Fine-tune a base LLM on the policy documents monthly
Answer: A

RAG retrieves relevant document chunks at query time, ensuring the chatbot always answers from the latest uploaded documents without any model retraining.

Why this answer

Retrieval-Augmented Generation (RAG) is the most appropriate approach because it allows the chatbot to answer questions by retrieving relevant chunks from the policy documents stored in a vector store, without requiring model retraining. When documents are updated monthly, RAG simply re-indexes the new content, keeping the system current while avoiding the cost and complexity of fine-tuning or retraining a model each cycle.

Exam trap

Cisco often tests the distinction between retrieval-based approaches (RAG) and fine-tuning, where candidates mistakenly choose fine-tuning because they think it 'customizes' the model, but the key constraint here is avoiding monthly retraining, which RAG uniquely satisfies.

How to eliminate wrong answers

Option B is wrong because training a custom model from scratch each month is prohibitively expensive and time-consuming, requiring large datasets and GPU resources, and contradicts the requirement to avoid retraining. Option C is wrong because pasting all policy documents into each prompt exceeds typical context window limits (e.g., 4K–128K tokens for most models), leading to truncation, high latency, and increased cost per query. Option D is wrong because fine-tuning a base LLM monthly still requires retraining, which the team cannot afford, and fine-tuning may cause catastrophic forgetting of previous policies unless carefully managed with multi-epoch training on all historical data.
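The "re-index instead of retrain" point can be illustrated with a toy retrieval index. The `PolicyIndex` class, the bag-of-words embedding, and the sample policies are hypothetical stand-ins. A real RAG system would use a learned embedding model and a vector store, then pass the retrieved text to an LLM.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class PolicyIndex:
    """Hypothetical in-memory vector store keyed by document id."""

    def __init__(self):
        self.docs = {}   # doc_id -> (text, vector)

    def upsert(self, doc_id, text):
        # Updating a policy only replaces its stored text and vector; no model is retrained.
        self.docs[doc_id] = (text, embed(text))

    def retrieve(self, question):
        q = embed(question)
        best_id = max(self.docs, key=lambda d: cosine(q, self.docs[d][1]))
        return self.docs[best_id][0]


index = PolicyIndex()
index.upsert("travel", "travel expenses are reimbursed up to 100 dollars per day")
index.upsert("leave", "employees receive 20 days of annual leave")
before = index.retrieve("how many days of annual leave")

index.upsert("leave", "employees receive 25 days of annual leave")   # monthly policy update
after = index.retrieve("how many days of annual leave")
print(before)
print(after)
```

The same question returns the updated policy immediately after the upsert, which is the property that lets RAG track monthly document changes.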

21
MCQeasy

Which AWS service would a developer use to integrate a pre-built foundation model into an application via API, without managing underlying infrastructure?

A.Amazon Bedrock
B.Amazon EC2
C.Amazon SageMaker
D.AWS Lambda
AnswerA

Bedrock offers a managed API to foundation models from various providers.

Why this answer

Amazon Bedrock is a fully managed service that provides access to pre-built foundation models (FMs) from providers like AI21 Labs, Anthropic, Cohere, Meta, and Stability AI via a unified API. It eliminates the need to manage underlying infrastructure such as servers, GPUs, or scaling, allowing developers to integrate FMs into applications with minimal operational overhead.

Exam trap

Cisco often tests the distinction between managed AI services (Bedrock) and infrastructure-heavy services (EC2, SageMaker) by framing the question around 'pre-built models' and 'no infrastructure management,' leading candidates to mistakenly choose SageMaker because it is associated with AI/ML, even though it requires model hosting and endpoint management.

How to eliminate wrong answers

Option B (Amazon EC2) is wrong because EC2 requires you to manually provision, configure, and manage virtual servers and GPU instances, including patching, scaling, and infrastructure maintenance, which contradicts the 'without managing underlying infrastructure' requirement. Option C (Amazon SageMaker) is wrong because SageMaker is a machine learning platform focused on building, training, and deploying custom models; although SageMaker JumpStart offers pre-trained foundation models, using them means deploying and managing endpoints, instances, and model artifacts rather than calling a fully managed API. Option D (AWS Lambda) is wrong because Lambda is a serverless compute service for running code in response to events, not a model service; it hosts no foundation models itself, so your function code would still have to call a model endpoint hosted elsewhere.

22
MCQhard

A team is deploying a model on Kubernetes using Kubeflow. They want to automatically scale the number of inference pods based on request latency. Which Kubernetes-native feature should they configure?

A.Horizontal Pod Autoscaler (HPA) with custom metrics
B.Kubeflow Pipelines component
C.Cluster Autoscaler
D.Vertical Pod Autoscaler (VPA)
AnswerA

HPA scales the number of pods based on metrics like latency, which is what the team needs.

Why this answer

The Horizontal Pod Autoscaler (HPA) with custom metrics is the correct choice because it allows scaling based on application-level metrics like request latency, not just CPU or memory. By configuring HPA to use a custom metric (e.g., from Prometheus or a metrics adapter), the team can automatically adjust the number of inference pods to maintain target latency thresholds, which is essential for responsive inference serving.

Exam trap

Cisco often tests the distinction between pod-level scaling (HPA) and node-level scaling (Cluster Autoscaler), and the trap here is that candidates may confuse Cluster Autoscaler with pod autoscaling, or assume VPA can handle latency-based scaling when it only adjusts the CPU and memory requests of existing pods.

How to eliminate wrong answers

Option B is wrong because Kubeflow Pipelines is a workflow orchestration component for building and managing ML pipelines, not a scaling mechanism; it cannot directly scale pods based on latency. Option C is wrong because Cluster Autoscaler adjusts the number of nodes in the Kubernetes cluster, not the number of pods, and does not respond to request latency metrics. Option D is wrong because Vertical Pod Autoscaler (VPA) adjusts CPU/memory resource requests for existing pods, not the number of pods, and is not designed for latency-based scaling.
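
The scaling decision HPA makes can be sketched with the replica formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name, the replica bounds, and the latency figures below are illustrative assumptions; the 0.1 tolerance mirrors the documented default, and a real HPA also applies stabilization windows that this sketch omits.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single custom metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band, HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))


# Average request latency per pod is 300 ms against a 200 ms target: scale out.
scale_out = desired_replicas(4, 300.0, 200.0)
# Latency is only 5% over target: inside the tolerance band, no change.
no_change = desired_replicas(4, 210.0, 200.0)
# Latency is half the target: scale in.
scale_in = desired_replicas(4, 100.0, 200.0)
```

Because the metric here is latency rather than CPU, the cluster needs a metrics adapter that exposes it to the autoscaler, as the explanation above notes.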

23
Multi-Selecteasy

A developer is choosing a vector database for a RAG application that requires real-time updates and millisecond query latency. Which TWO vector databases are best suited for this requirement?

Select 2 answers
A.MongoDB Atlas Vector Search
B.Redis with vector similarity module
C.Pinecone
D.Weaviate
E.PostgreSQL with pgvector
AnswersC, D

Pinecone is a managed vector database with fast indexing and query performance.

Why this answer

Pinecone is a fully managed vector database designed for production-scale RAG applications, offering sub-10ms query latency and real-time index updates without requiring manual infrastructure tuning. Weaviate similarly provides native vector search with millisecond latency and supports real-time data ingestion through its auto-schema and incremental indexing, making both ideal for latency-sensitive, frequently updated RAG workloads.

Exam trap

Cisco often tests the assumption that any in-memory or NoSQL database with an added vector feature (such as Redis or MongoDB) is equivalent to a purpose-built vector database. Those products can run approximate nearest neighbor (ANN) search, but the expected answer favors Pinecone and Weaviate because ANN indexing (for example, HNSW) and real-time upserts are their core workload rather than an add-on.

24
Multi-Selectmedium

An organization is building a recommendation system that requires low-latency vector similarity search. They need to store and query millions of embeddings. Which THREE technologies are appropriate for this task?

Select 3 answers
A.Snowflake
B.Amazon S3
C.Weaviate
D.pgvector
E.Pinecone
AnswersC, D, E

Weaviate is an open-source vector database with built-in similarity search.

Why this answer

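Pinecone and Weaviate are purpose-built vector databases, and pgvector is a PostgreSQL extension that adds vector storage and similarity search; all three can store and query millions of embeddings. Snowflake (a data warehouse) and Amazon S3 (object storage) are not optimized for low-latency vector search.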

25
MCQmedium

A financial institution requires that all AI model predictions be explainable and auditable for regulatory compliance. Which model serving approach should be used to meet these requirements?

A.Use gRPC streaming for lower latency
B.Export the model to ONNX format and run on a dedicated inference server
C.Deploy the model on edge devices to avoid centralised logging
D.Deploy the model as a containerised microservice with REST API and log all request/response pairs
AnswerD

REST APIs with logging provide a clear audit trail and can integrate with explainability tools.

Why this answer

Option D is correct because logging all request/response pairs provides a complete audit trail, which is essential for regulatory compliance in financial institutions. Containerized microservices with REST APIs are stateless and can be easily integrated with centralized logging systems (e.g., ELK stack) to capture every prediction for explainability and review. This approach ensures that model decisions are transparent and can be traced back to specific inputs, satisfying both explainability and auditability requirements.

Exam trap

Cisco often tests the misconception that performance optimizations (like gRPC or ONNX) inherently solve compliance requirements, when in fact auditability and explainability depend on explicit logging and traceability mechanisms, not just model format or transport protocol.

How to eliminate wrong answers

Option A is wrong because gRPC streaming focuses on low-latency communication, not on logging or auditability; it does not inherently provide request/response capture for compliance. Option B is wrong because exporting to ONNX and running on a dedicated inference server improves portability and performance but does not automatically log predictions or provide an audit trail; additional logging infrastructure would be required. Option C is wrong because deploying on edge devices avoids centralized logging, which directly contradicts the need for auditable records; edge deployments often lack persistent, centralized storage of prediction history, making regulatory review difficult.
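
The audit-trail idea behind option D can be sketched as a thin wrapper that records every request/response pair as one JSON line. Everything named here is hypothetical: `predict` is a fixed linear stand-in for the real model, `credit-v1` is an invented version label, and an in-memory stream replaces the centralized log system a real deployment would write to.

```python
import io
import json
import time
import uuid


def predict(features):
    """Hypothetical stand-in for the real model: a fixed linear score."""
    return round(0.4 * features["income"] / 1000 - 0.2 * features["open_loans"], 4)


def audited_predict(features, log_stream, model_version="credit-v1"):
    """Run a prediction and append an audit record for it."""
    prediction = predict(features)
    record = {
        "request_id": str(uuid.uuid4()),   # lets an auditor reference one decision
        "timestamp": time.time(),
        "model_version": model_version,    # ties the output to a specific model
        "request": features,
        "response": prediction,
    }
    log_stream.write(json.dumps(record, sort_keys=True) + "\n")
    return prediction


audit_log = io.StringIO()  # in production this would be a durable, centralized sink
score = audited_predict({"income": 52000, "open_loans": 2}, audit_log)
entries = [json.loads(line) for line in audit_log.getvalue().splitlines()]
```

Logging the model version alongside the inputs is a design choice worth keeping: without it, a reviewer cannot tell which model produced a given prediction.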

26
MCQmedium

A company uses AWS SageMaker to train a model and wants to deploy it for real-time inference. They also need to monitor the endpoint for data drift and retrain automatically. Which SageMaker feature enables this automated retraining pipeline?

A.SageMaker Pipelines
B.SageMaker Debugger
C.SageMaker Ground Truth
D.SageMaker Model Monitor
AnswerA

SageMaker Pipelines can orchestrate the retraining process when triggered by drift detection or schedule.

Why this answer

SageMaker Pipelines is the correct answer because it provides a fully managed CI/CD service for creating, automating, and managing end-to-end machine learning workflows. It allows you to define a pipeline that includes steps for monitoring data drift (via Model Monitor), triggering retraining jobs, and deploying updated models, enabling the automated retraining pipeline described in the question.

Exam trap

The trap here is that candidates often confuse SageMaker Model Monitor's detection capability with the full orchestration needed for automated retraining, assuming that monitoring alone can trigger retraining without a pipeline service.

How to eliminate wrong answers

Option B is wrong because SageMaker Debugger is designed for debugging training jobs by monitoring system metrics, profiling, and detecting anomalies like vanishing gradients, not for orchestrating automated retraining pipelines. Option C is wrong because SageMaker Ground Truth is a data labeling service that creates high-quality training datasets using human annotators, not for automating model retraining or deployment. Option D is wrong because SageMaker Model Monitor only detects data drift and quality issues by analyzing inference data, but it does not include the orchestration logic to automatically trigger retraining or redeployment; that requires a pipeline service like SageMaker Pipelines.

27
MCQhard

A company is implementing a retrieval-augmented generation (RAG) pipeline using a vector database. They notice that the retrieved documents often lack relevance to the query. Which adjustment would MOST improve retrieval quality?

A.Use a better embedding model fine-tuned on domain-specific data
B.Increase the chunk size of documents
C.Switch from cosine similarity to Euclidean distance
D.Reduce the number of retrieved documents from 5 to 3
AnswerA

Domain-specific embeddings capture semantic nuances better, improving retrieval relevance.

Why this answer

Retrieval quality in a RAG pipeline is fundamentally determined by the semantic alignment between query embeddings and document embeddings. A domain-specific fine-tuned embedding model captures the unique terminology, context, and relationships within the company's data, producing vector representations that are far more relevant than those from a generic model. This directly improves the similarity search results in the vector database, leading to higher-quality retrieved documents.

Exam trap

Cisco often tests the misconception that retrieval quality can be improved by tuning retrieval parameters (chunk size, distance metric, or k-value) rather than addressing the foundational quality of the embedding model, which is the primary driver of semantic relevance.

How to eliminate wrong answers

Option B is wrong because increasing chunk size can reduce granularity and introduce noise, potentially lowering retrieval precision by mixing irrelevant content with relevant passages. Option C is wrong because cosine similarity and Euclidean distance are both valid distance metrics; switching between them does not inherently improve relevance unless the embedding space is normalized, and cosine similarity is typically preferred for high-dimensional semantic embeddings. Option D is wrong because reducing the number of retrieved documents from 5 to 3 may increase precision but at the cost of recall, and does not address the root cause of poor relevance—the quality of the embeddings themselves.

28
MCQeasy

An organization wants to centralize experiment tracking, model versioning, and deployment management across its data science team. Which MLOps platform is specifically designed for experiment tracking and model registry?

A.Apache Airflow
B.Weights & Biases
C.MLflow
D.Kubeflow
AnswerC

MLflow offers experiment tracking, model registry, and deployment management, making it a comprehensive tool for MLOps.

Why this answer

MLflow is an open-source MLOps platform that provides a centralized experiment tracking API (MLflow Tracking) and a model registry (MLflow Model Registry) for versioning, staging, and deploying machine learning models. It is specifically designed to address the need for experiment tracking and model lifecycle management, making it the correct choice for this scenario.

Exam trap

Cisco often tests the distinction between tools that handle only one part of the MLOps lifecycle (like W&B for tracking or Kubeflow for deployment) versus a unified platform like MLflow that combines experiment tracking and model registry.

How to eliminate wrong answers

Option A is wrong because Apache Airflow is a workflow orchestration tool for scheduling and managing data pipelines, not a platform for experiment tracking or model registry. Option B is wrong because Weights & Biases (W&B) is a commercial platform centered on experiment tracking and visualization; it does offer a model registry, but the unified tracking, registry, and deployment management described in the question is the core of MLflow's design. Option D is wrong because Kubeflow is a Kubernetes-native platform for deploying and managing ML workflows; its strengths are pipelines and serving, with Katib for hyperparameter tuning, and it is not primarily an experiment tracking and model registry tool, so teams often pair it with MLflow for those functions.

29
MCQeasy

A machine learning engineer wants to track experiment parameters, metrics, and model artifacts across multiple runs. Which MLOps tool is specifically designed for experiment tracking?

A.Apache Airflow
B.Weights & Biases
C.MLflow
D.Kubeflow
AnswerB

Weights & Biases is a dedicated experiment tracking and visualization tool.

Why this answer

Weights & Biases (W&B) is purpose-built for experiment tracking, offering a centralized dashboard to log hyperparameters, metrics, and model artifacts across runs. It provides automatic logging for popular frameworks (e.g., PyTorch, TensorFlow) and supports rich visualizations like loss curves and parallel coordinate plots, making it the correct choice for this specific requirement.

Exam trap

Cisco often tests the distinction between general-purpose MLOps tools (like MLflow or Kubeflow) and specialized experiment tracking tools (like Weights & Biases), trapping candidates who assume any MLOps platform automatically excels at experiment tracking.

How to eliminate wrong answers

Option A is wrong because Apache Airflow is a workflow orchestration tool for scheduling and managing DAGs (Directed Acyclic Graphs) of tasks, not a dedicated experiment tracking platform. Option C is wrong because MLflow is an open-source platform for the full ML lifecycle (including experiment tracking, packaging, and deployment), but the question asks for a tool 'specifically designed for experiment tracking' — while MLflow does offer tracking, it is broader in scope and not as specialized as W&B. Option D is wrong because Kubeflow is a Kubernetes-native platform for deploying and managing ML pipelines, focusing on orchestration and scaling rather than detailed experiment logging and comparison.

30
MCQmedium

An MLOps team wants to deploy a trained PyTorch model to production with low latency inference. The model must be interoperable across different frameworks and runtimes. Which approach is BEST?

A.Deploy the native PyTorch model using TorchServe
B.Quantize the model to INT8 and deploy as a TensorFlow Lite model
C.Convert the model to TensorFlow SavedModel and deploy using TensorFlow Serving
D.Export the model to ONNX format and deploy using ONNX Runtime
AnswerD

ONNX is a framework-agnostic format; ONNX Runtime is optimized for low-latency inference and supports hardware acceleration.

Why this answer

Option D is correct because ONNX (Open Neural Network Exchange) provides a standardized, framework-agnostic format that ensures interoperability across different runtimes and hardware accelerators. By exporting the PyTorch model to ONNX and deploying with ONNX Runtime, the team achieves low-latency inference through graph optimizations and hardware-specific execution providers, while avoiding vendor lock-in.

Exam trap

Cisco often tests the misconception that framework-native serving (TorchServe, TensorFlow Serving) is the best path for low latency, ignoring the explicit requirement for cross-framework interoperability that ONNX uniquely satisfies.

How to eliminate wrong answers

Option A is wrong because deploying a native PyTorch model with TorchServe locks the inference into the PyTorch ecosystem, violating the requirement for interoperability across different frameworks and runtimes. Option B is wrong because quantizing to INT8 and deploying as a TensorFlow Lite model introduces unnecessary precision loss and framework conversion overhead, and TensorFlow Lite is primarily designed for mobile/edge devices, not general production low-latency serving. Option C is wrong because converting to TensorFlow SavedModel and using TensorFlow Serving ties the deployment to the TensorFlow stack, which does not satisfy the interoperability requirement and adds conversion complexity without the broad runtime support that ONNX provides.

31
MCQmedium

A company uses a vector database to store embeddings for a RAG application. Users report that some queries return irrelevant results. Which adjustment is most likely to improve relevance?

A.Reduce the chunk size of documents
B.Switch from an HNSW index to a flat index
C.Increase the top-k retrieval count
D.Change the similarity metric from cosine to dot product and use a different embedding model
AnswerD

The similarity metric and embedding quality are primary drivers of retrieval relevance.

Why this answer

Changing the similarity metric and the embedding model together can improve relevance because the metric used at query time must match the one the embedding model was trained with. For embeddings normalized to unit length (as with text-embedding-ada-002), cosine similarity and dot product produce the same ranking; for unnormalized embeddings, dot product also rewards vector magnitude, so using the wrong metric can reorder results. A different embedding model may also represent the domain's semantic relationships more faithfully, which addresses irrelevant results at the source.

Exam trap

Cisco often tests the misconception that changing the index type or retrieval count directly improves relevance, when the root cause is usually a mismatch between the similarity metric and the embedding model's training objective.

How to eliminate wrong answers

Option A is wrong because reducing chunk size can fragment context and lose semantic meaning, potentially worsening relevance rather than improving it. Option B is wrong because a flat index performs exact (brute-force) search, so it increases latency while recovering only the small amount of recall that HNSW (Hierarchical Navigable Small World) approximation gives up; it cannot fix results that are irrelevant because of the embeddings or the metric. Option C is wrong because increasing the top-k retrieval count returns more results but does not improve the relevance of the top results; it may actually introduce more noise if the embedding quality or similarity metric is suboptimal.
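
How the metric interacts with the embeddings can be shown with a small, self-contained sketch. The vectors are made up for illustration; the point is that raw dot product and cosine similarity can rank the same documents differently, while dot product over unit-normalized vectors ranks them exactly as cosine does.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


def rank(query, docs, score):
    """Return document indices ordered from most to least similar."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)


query = [1.0, 1.0]
docs = [
    [10.0, 1.0],  # long vector pointing away from the query
    [1.0, 1.2],   # short vector almost parallel to the query
    [0.2, 1.0],
]

by_dot = rank(query, docs, dot)        # magnitude dominates
by_cosine = rank(query, docs, cosine)  # direction only
by_normalized_dot = rank(normalize(query), [normalize(d) for d in docs], dot)
```

Here `by_dot` puts the long vector first, whereas `by_cosine` and `by_normalized_dot` agree and put the nearly parallel vector first.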

32
MCQhard

A data scientist is training a large language model on a custom dataset using PyTorch on AWS. The training is taking too long due to GPU memory constraints. The team wants to use multiple GPUs across instances with minimal code changes. Which AWS service should they use?

A.AWS Elastic Fabric Adapter (EFA)
B.Amazon SageMaker with distributed training libraries
C.AWS Batch with GPU instances
D.AWS ParallelCluster with Slurm
AnswerB

SageMaker's distributed libraries (e.g., SageMaker Data Parallelism) enable multi-GPU training with minimal code changes.

Why this answer

SageMaker distributed training libraries support data parallelism and model parallelism with minimal code changes, enabling multi-GPU training across instances efficiently.

33
Multi-Selectmedium

A healthcare startup needs to deploy an AI model for real-time patient monitoring on IoT devices with limited battery and compute. The model must run locally with minimal latency. Which TWO strategies are most appropriate?

Select 2 answers
A.Apply model distillation to create a smaller student model
B.Deploy the model on a cloud server and stream data
C.Use TensorFlow Lite to convert and run the model on the device
D.Quantize the model to INT8 precision
E.Use ONNX Runtime with a GPU backend
AnswersC, D

TensorFlow Lite is optimized for on-device machine learning, providing low-latency inference on resource-constrained devices.

Why this answer

Option C is correct because TensorFlow Lite is specifically designed to run TensorFlow models on resource-constrained edge devices like IoT sensors. It optimizes the model for low latency inference by using a specialized interpreter and hardware acceleration delegates (e.g., NNAPI, GPU), enabling real-time patient monitoring without cloud dependency.

Exam trap

Cisco often tests the misconception that model distillation alone is sufficient for edge deployment, when in fact it must be combined with a framework like TensorFlow Lite and quantization to meet hardware constraints.

34
MCQmedium

An organization is deploying a large language model on-premises for compliance reasons. They need to serve inference requests with low latency. Which architecture should they use?

A.Use a batch processing system like Apache Spark
B.Containerize the model and deploy it on a Kubernetes cluster with autoscaling
C.Use a serverless function like AWS Lambda
D.Deploy the model as a REST API on a single powerful server
AnswerB

Kubernetes enables container orchestration, autoscaling, and load balancing, meeting low-latency and compliance requirements.

Why this answer

Containerizing the model and deploying it on a Kubernetes cluster with autoscaling is the correct architecture because it provides horizontal scaling, low-latency inference through load-balanced pods, and supports on-premises deployment for compliance. Kubernetes can automatically scale replicas based on CPU/memory utilization or custom metrics (e.g., request queue depth), ensuring consistent response times under varying load.

Exam trap

Cisco often tests the misconception that a single powerful server is sufficient for low-latency inference, but the trap is that it ignores the need for horizontal scalability and fault tolerance, which are critical for production workloads.

How to eliminate wrong answers

Option A is wrong because batch processing systems like Apache Spark are designed for large-scale data processing jobs, not real-time inference; they introduce high latency due to job scheduling and data shuffling, making them unsuitable for serving low-latency requests. Option C is wrong because serverless functions like AWS Lambda are typically cloud-only and may not support on-premises deployment; they also have cold-start latency and execution time limits that conflict with low-latency inference requirements. Option D is wrong because deploying on a single powerful server creates a single point of failure and cannot scale horizontally to handle traffic spikes, leading to increased latency under load.

35
MCQmedium

A machine learning engineer needs to deploy a PyTorch model for real-time inference with low latency. The model uses custom operators that are not supported by standard ONNX conversion. Which deployment approach is MOST appropriate?

A.Use TensorFlow Serving with a saved model format
B.Wrap the model in a Flask app and deploy on a VM
C.Deploy the model using TorchServe with a custom handler
D.Convert the model to ONNX and serve with ONNX Runtime
AnswerC

TorchServe handles custom operators natively and provides optimized inference.

Why this answer

TorchServe is the native serving solution for PyTorch models, so the model executes in the PyTorch runtime and its custom operators work without conversion. A custom handler lets you implement arbitrary preprocessing, inference, and postprocessing logic in Python, including loading any extension that registers those operators. This avoids ONNX conversion entirely, which is critical when the operators are not supported by the ONNX standard. TorchServe also provides built-in model versioning, batching, and metrics for low-latency real-time inference.

Exam trap

Cisco often tests the misconception that ONNX is a universal solution for all model deployment scenarios, but the trap here is that custom operators break ONNX compatibility, so candidates must recognize when native serving frameworks like TorchServe are required instead of conversion-based approaches.

How to eliminate wrong answers

Option A is wrong because TensorFlow Serving expects a TensorFlow SavedModel format and cannot directly serve PyTorch models; converting a PyTorch model with custom operators to TensorFlow would require ONNX or another intermediate format, which is precisely the unsupported path. Option B is wrong because wrapping the model in a Flask app on a VM is a manual, non-scalable approach that lacks production-grade features like automatic batching, model versioning, and health checks, and it would require the engineer to build all serving infrastructure from scratch. Option D is wrong because the model uses custom operators not supported by standard ONNX conversion, so converting to ONNX would either fail or require custom ONNX operators, which defeats the purpose of using a standard runtime like ONNX Runtime.

36
MCQmedium

A team is using an API from a cloud AI service to generate text. They notice that repeated requests with the same prompt return different outputs. They want consistent responses for testing. Which parameter should they adjust?

A.Increase the top_p parameter to 1.0
B.Set the frequency_penalty to 0
C.Increase the max_tokens parameter
D.Set the temperature to 0
AnswerD

Temperature controls randomness; a value of 0 makes decoding greedy (the most likely token is always chosen), so the same prompt yields the same output in nearly all cases.

Why this answer

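Setting the temperature to 0 makes the model select the highest-probability token at every step, so the same input produces effectively the same output, which is ideal for testing. Some hosted APIs can still show small run-to-run differences, so a fixed seed, where the API offers one, tightens reproducibility further.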
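
The effect of temperature can be sketched with a toy sampler over three token logits. This is a conceptual illustration, not any provider's implementation: the logits are invented, and treating temperature 0 as an explicit argmax is an assumption about how the limit is usually handled.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick a token index; temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.5]
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(50)}      # always the same token
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}  # several different tokens
near_zero = softmax_with_temperature(logits, 0.01)              # almost all mass on token 0
```

At temperature 1.0 the sampler visits more than one token across repeated calls, which is the variation the team observed; at 0 it always returns the top-scoring token.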

37
MCQhard

A company is deploying a real-time object detection model on a fleet of IoT cameras. The model must run at 30 FPS on a device with limited memory and no internet connectivity. Which combination of techniques is MOST suitable?

A.Use FP16 inference and deploy via Docker containers
B.Use model distillation to create a smaller model and deploy via ONNX Runtime
C.Deploy on a GPU-based edge server with a full PyTorch model
D.Apply INT8 quantization and pruning, then deploy using TensorFlow Lite
AnswerD

INT8 quantization reduces memory footprint and accelerates inference; pruning removes redundant parameters. TensorFlow Lite is optimized for edge devices.

Why this answer

Option D is correct because INT8 quantization reduces model size and latency, while pruning removes redundant weights, making the model suitable for memory-constrained edge devices. TensorFlow Lite is optimized for on-device inference with no internet dependency, supporting real-time 30 FPS object detection on IoT cameras.

Exam trap

Cisco often tests the misconception that any lightweight deployment framework (like ONNX Runtime) is sufficient for edge devices, ignoring the need for hardware-specific quantization and pruning to meet strict memory and FPS constraints.

How to eliminate wrong answers

Option A is wrong because FP16 inference halves the precision but still requires significant memory and compute resources; Docker containers add overhead and are not designed for ultra-low-memory IoT cameras. Option B is wrong because distillation alone produces a smaller model but does not apply the quantization and pruning needed to fit the memory budget and reach 30 FPS; ONNX Runtime is a capable cross-platform engine, but the option does not pair it with those optimizations, whereas TensorFlow Lite targets constrained devices directly. Option C is wrong because a GPU-based edge server running a full PyTorch model does not satisfy the constraint that inference runs on the memory-limited camera itself without connectivity; GPUs are power-hungry and expensive, and PyTorch's runtime overhead is too high for a memory-constrained IoT camera.
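
Why INT8 quantization shrinks a model can be sketched with symmetric per-tensor quantization of a handful of weights. This is a simplified illustration, not TensorFlow Lite's converter: the weights are made up, and real toolchains add calibration data, per-channel scales, and zero points. Pruning is not shown.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.45, -0.9]  # illustrative FP32 values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight now needs 1 byte instead of 4; the price is a bounded rounding error.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is at most half the scale, which is why accuracy usually degrades only slightly while storage drops by roughly a factor of four.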

38
MCQmedium

A data engineering team needs to orchestrate a complex ML pipeline that involves data extraction, transformation, model training, and deployment. They require scheduling, monitoring, and retry logic. Which MLOps tool is BEST suited for this task?

A.Weights & Biases
B.Kubeflow
C.MLflow
D.Apache Airflow
AnswerD

Airflow is a mature, flexible orchestrator for scheduling and monitoring complex pipelines.

Why this answer

Apache Airflow is a workflow orchestration tool that supports complex DAGs, scheduling, monitoring, and retries, making it ideal for ML pipelines.

39
MCQhard

A team is deploying a model on AWS SageMaker and needs to handle variable traffic patterns with automatic scaling based on request latency. They want to minimize costs during low traffic. Which endpoint configuration should they use?

A.Use AWS Lambda with SageMaker
B.Provision a fixed number of instances with Multi-Model Endpoints
C.Use a single large instance to handle peak traffic
D.Configure automatic scaling with a target tracking metric based on latency
AnswerD

Target tracking scaling adjusts instance count based on a metric like latency, optimizing cost and performance.

Why this answer

Option D is correct because SageMaker's automatic scaling with a target tracking policy based on latency allows the endpoint to dynamically adjust the number of instances in response to real-time request latency, ensuring cost efficiency during low traffic while maintaining performance during spikes. The predefined metric SageMakerVariantInvocationsPerInstance tracks invocation volume, so latency-based scaling uses a customized metric specification (for example, the CloudWatch ModelLatency metric) to drive the policy, minimizing over-provisioning and idle costs.

Exam trap

Cisco often tests the misconception that Multi-Model Endpoints (Option B) provide automatic scaling, but they only optimize model hosting density, not dynamic instance scaling based on latency.

How to eliminate wrong answers

Option A is wrong because AWS Lambda is typically used in front of a SageMaker endpoint for request handling or preprocessing; it does not configure how the endpoint's instances scale. Option B is wrong because a fixed number of instances cannot follow variable traffic; Multi-Model Endpoints reduce cost by hosting multiple models on shared instances, but with a fixed fleet the team still pays for idle capacity and has no latency-driven scaling. Option C is wrong because using a single large instance to handle peak traffic leads to high costs during low traffic and risks performance degradation or throttling during unexpected spikes, as it lacks elasticity.
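
A latency-based target tracking policy can be sketched as the configuration below. It only builds dictionaries in the shape Application Auto Scaling expects for a customized metric; the endpoint name, variant name, target value, and cooldowns are hypothetical, and the choice of the CloudWatch `ModelLatency` metric (reported in microseconds) is an assumption about which latency signal the team wants to track. No AWS call is made.

```python
def variant_resource_id(endpoint_name, variant_name):
    """Resource ID format used when registering a SageMaker variant as a scalable target."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def latency_target_tracking_policy(endpoint_name, variant_name, target_latency_us):
    """Build a target tracking configuration keyed on average model latency."""
    return {
        "TargetValue": float(target_latency_us),
        "CustomizedMetricSpecification": {
            "MetricName": "ModelLatency",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
            "Unit": "Microseconds",
        },
        "ScaleOutCooldown": 60,   # react quickly to latency spikes
        "ScaleInCooldown": 300,   # remove capacity more cautiously
    }


resource_id = variant_resource_id("fraud-endpoint", "AllTraffic")
policy = latency_target_tracking_policy("fraud-endpoint", "AllTraffic", 150000)
# In a real setup, this dictionary would be passed as the target tracking
# configuration when creating the scaling policy for `resource_id`.
```

The asymmetric cooldowns are a common design choice: scaling out fast protects latency, while scaling in slowly avoids oscillation when traffic is bursty.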

40
MCQeasy

An organization wants to integrate an AI-powered summarization feature into their existing web application. The AI service will be called via API. Which factor is MOST important to consider for cost management?

A.Token pricing of the AI model
B.Authentication method (API key vs. OAuth)
C.Rate limits per minute
D.Network latency to the API endpoint
AnswerA

Token pricing is the primary cost driver; optimizing prompt length and output tokens directly reduces expenses.

Why this answer

Token pricing directly impacts cost because API calls are billed based on the number of tokens (input + output). Understanding token usage helps estimate and control expenses.
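
The arithmetic behind token-based billing can be sketched as follows. The per-1K-token prices and request volumes are hypothetical placeholders, not any provider's actual rates.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one API call when input and output tokens are priced separately."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


PRICE_IN_PER_1K = 0.0005    # hypothetical price per 1K input tokens (USD)
PRICE_OUT_PER_1K = 0.0015   # hypothetical price per 1K output tokens (USD)
REQUESTS_PER_MONTH = 100_000

# Summarizing a 3,000-token document into a 200-token summary.
per_request = estimate_cost(3000, 200, PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
monthly = per_request * REQUESTS_PER_MONTH

# Trimming the input to 1,500 tokens (for example, by removing boilerplate).
per_request_trimmed = estimate_cost(1500, 200, PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
monthly_trimmed = per_request_trimmed * REQUESTS_PER_MONTH
```

Under these assumed prices, halving the input length cuts the monthly bill from about 180 to about 105 USD, which is why prompt length is the first lever for cost control.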

41
Multi-Selectmedium

A team is deploying a model that must comply with GDPR. Users can request deletion of their data. Which TWO practices should be implemented to support this compliance? (Select TWO.)

Select 2 answers
A.Enable output caching for frequently requested predictions
B.Validate inputs to prevent prompt injection attacks
C.Use a vector database to store user embeddings
D.Maintain the ability to delete a user's data from training sets and derived features
E.Implement data versioning and lineage tracking
AnswersD, E

Directly supports the right to erasure by allowing removal of user data and any features based on it.

Why this answer

Option D is correct because GDPR's 'right to erasure' requires that upon user request, the organization must delete not only the user's raw data but also any derived features or embeddings that were generated from that data. Without this capability, the model could still indirectly retain user information through trained parameters or feature stores, violating compliance.
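A minimal in-memory sketch of how deletion and lineage tracking work together follows. The data structures (`training_rows`, `feature_store`, `lineage`) and the function `erase_user` are hypothetical stand-ins for whatever storage and metadata systems a team actually uses.

```python
def erase_user(user_id, training_rows, feature_store, lineage):
    """Remove a user's rows and derived features; report affected models.

    lineage maps a model version to the set of user ids whose data was
    present in the training snapshot for that version.
    """
    remaining_rows = [r for r in training_rows if r["user_id"] != user_id]
    remaining_features = {u: f for u, f in feature_store.items()
                          if u != user_id}
    models_to_review = sorted(m for m, users in lineage.items()
                              if user_id in users)
    return remaining_rows, remaining_features, models_to_review


training_rows = [{"user_id": "u1", "text": "hello"},
                 {"user_id": "u2", "text": "world"}]
feature_store = {"u1": [0.1, 0.2], "u2": [0.3, 0.4]}
lineage = {"model-v1": {"u1", "u2"}, "model-v2": {"u2"}}

rows, features, affected = erase_user("u1", training_rows,
                                      feature_store, lineage)
print(affected)
```

The lineage lookup is what option E contributes: without a record of which snapshot trained which model, there is no way to tell which models were built on the deleted user's data.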

Exam trap

Cisco often tests the misconception that simply using a vector database or caching mechanism satisfies GDPR deletion requirements, when in fact the critical practice is maintaining the ability to delete user data from all derived artifacts, including training sets and feature stores.

42
MCQmedium

A data scientist is using PyTorch to train a custom NLP model. The training is slow on a single GPU. They want to speed up training by using multiple GPUs on a single machine. Which PyTorch feature should they use?

A.TorchScript tracing
B.torch.nn.DataParallel
C.torch.optim.SGD
D.PyTorch Lightning's zero_grad function
AnswerB

DataParallel automatically splits input across GPUs and aggregates gradients; it's the simplest multi-GPU approach.

Why this answer

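torch.nn.DataParallel is PyTorch's built-in wrapper for single-machine multi-GPU training: it replicates the model on each GPU, splits every input batch across them, and gathers the outputs, so it needs only a one-line change to the model. torch.nn.parallel.DistributedDataParallel also works on a single machine and is the option the PyTorch documentation recommends for better performance, but it is not among the choices here.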
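The idea behind data parallelism can be shown without PyTorch or GPUs. The pure-Python sketch below is a stand-in, not PyTorch's implementation: it splits a batch into equal shards, computes one gradient per shard, and averages them, which gives the same gradient as processing the full batch at once. The one-parameter model and the function names are assumptions made for the example.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_devices):
    """Split the batch into equal shards (one per simulated device),
    compute a gradient on each shard, then average the shard gradients."""
    shard = len(xs) // num_devices
    assert shard * num_devices == len(xs), "batch must divide evenly"
    grads = [
        mse_gradient(w,
                     xs[i * shard:(i + 1) * shard],
                     ys[i * shard:(i + 1) * shard])
        for i in range(num_devices)
    ]
    return sum(grads) / num_devices


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_gradient(0.5, xs, ys)
split = data_parallel_gradient(0.5, xs, ys, num_devices=2)
print(full, split)
```

Because the combined gradient matches the full-batch gradient, the model update is unchanged; the speed-up comes from each device handling only its share of the batch.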

43
Multi-Selectmedium

A data science team wants to implement a feature store to serve pre-computed features for both training and inference with low latency. Which TWO tools are commonly used for building a feature store?

Select 2 answers
A.Kubeflow
B.Apache Hive
C.Feast
D.Tecton
E.MLflow
AnswersC, D

Feast is an open-source feature store that manages and serves features.

Why this answer

Feast (Feature Store) is an open-source operational data system that manages and serves machine learning features to both training and inference pipelines with low latency. It provides a consistent feature definition API, offline serving for training, and online serving via a low-latency store like Redis or DynamoDB, making it a standard choice for feature store implementations.

Exam trap

Cisco often tests the distinction between ML orchestration tools (Kubeflow, MLflow) and dedicated feature stores (Feast, Tecton), trapping candidates who confuse lifecycle management with feature serving infrastructure.

44
Multi-Selecthard

A financial services company needs to deploy an ML model for loan approval that must be explainable to regulators. The model is a gradient boosting ensemble. They need to track experiments, log model parameters, and serve the model with explanations. Which THREE tools from the MLOps ecosystem should they use?

Select 3 answers
A.Apache Kafka
B.Docker Compose
C.Weights & Biases
D.Kubeflow
E.MLflow
AnswersC, D, E

W&B provides experiment logging, hyperparameter tracking, and model visualization.

Why this answer

Weights & Biases (W&B) is correct because it provides experiment tracking, hyperparameter logging, and model versioning, which are essential for the regulatory requirement of explainability and auditability. It integrates directly with gradient boosting frameworks like XGBoost and LightGBM to log parameters and metrics, enabling reproducible ML pipelines.

Exam trap

Cisco often tests the distinction between general infrastructure tools (like Kafka or Docker Compose) and purpose-built MLOps tools (like W&B, Kubeflow, and MLflow) that directly address experiment tracking, model serving, and explainability.

45
MCQmedium

A data scientist is using a Hugging Face transformer model for a sentiment analysis task. They want to optimize inference latency for a mobile app. Which model format and framework combination is BEST suited for on-device deployment?

A.Convert to TensorFlow Lite (TFLite) and run on the device
B.Use the full PyTorch model with JIT scripting
C.Deploy the model on a cloud endpoint and call via REST API
D.Export to ONNX and use ONNX Runtime with GPU
AnswerA

TFLite is optimized for mobile devices, providing low latency and small binary size.

Why this answer

TensorFlow Lite (TFLite) is specifically designed for on-device machine learning inference on mobile and edge devices. It provides a lightweight runtime, hardware acceleration via delegates (e.g., GPU, NNAPI), and reduced model size through quantization, making it the best choice for optimizing inference latency in a mobile app. Converting a Hugging Face transformer model to TFLite allows the model to run locally without network latency, which is critical for real-time sentiment analysis on a smartphone.

Exam trap

Cisco often tests the misconception that any export format (ONNX, JIT) is equally suitable for mobile deployment, but the trap here is that TFLite is the only option purpose-built for on-device inference with quantization and hardware acceleration, while ONNX and PyTorch JIT are primarily optimized for server-side or desktop inference.

How to eliminate wrong answers

Option B is wrong because using a full PyTorch model with JIT scripting does not produce a mobile-optimized runtime; PyTorch Mobile exists but JIT scripting alone lacks the quantization and delegate support that TFLite offers for low-latency on-device inference. Option C is wrong because deploying the model on a cloud endpoint and calling via REST API introduces network latency and dependency on connectivity, which defeats the purpose of on-device deployment for a mobile app. Option D is wrong because exporting to ONNX and using ONNX Runtime with GPU is typically designed for server or desktop environments with dedicated GPUs, not for mobile devices where GPU support is limited and ONNX Runtime Mobile is less mature than TFLite for transformer models.

46
MCQmedium

A data engineering team is building a pipeline to ingest streaming user activity data, process it in real-time, and store features in a feature store for ML models. Which streaming technology is BEST suited for this real-time data ingestion and processing?

A.Apache Kafka
B.Apache Spark SQL
C.Apache Airflow
D.Apache Hadoop MapReduce
AnswerA

Kafka provides high-throughput, fault-tolerant streaming for real-time data pipelines.

Why this answer

Apache Kafka is the best choice because it is a distributed streaming platform designed for high-throughput, fault-tolerant, real-time data ingestion and processing. It provides publish-subscribe messaging, durable log storage, and stream processing capabilities, making it ideal for ingesting streaming user activity data and feeding it into a feature store for ML models.

Exam trap

Cisco often tests the distinction between batch and stream processing technologies, and the trap here is that candidates confuse Apache Spark SQL (a batch-oriented SQL engine) with Spark Streaming, or mistake Airflow's scheduling capabilities for real-time ingestion.

How to eliminate wrong answers

Option B (Apache Spark SQL) is wrong because Spark SQL is a module for structured data processing using SQL queries, not a streaming ingestion technology; while Spark Streaming exists, Spark SQL itself is not designed for real-time data ingestion. Option C (Apache Airflow) is wrong because Airflow is a workflow orchestration tool for batch scheduling and DAG management, not a real-time streaming ingestion or processing system. Option D (Apache Hadoop MapReduce) is wrong because MapReduce is a batch processing framework that processes data in large, static batches with high latency, making it unsuitable for real-time streaming ingestion.

47
Multi-Selectmedium

A data scientist needs to store large volumes of unstructured log data for future AI model training. They also need to run SQL-based analytics on the data. Which THREE services are appropriate for this requirement? (Choose 3)

Select 3 answers
A.Pinecone
B.Snowflake
C.BigQuery
D.Amazon S3
E.pgvector
AnswersB, C, D

Snowflake is a data warehouse that supports SQL analytics on structured/semi-structured data.

Why this answer

Snowflake is correct because it is a cloud-native data warehouse that supports both structured and semi-structured data (like JSON, Avro, Parquet) via its VARIANT data type, enabling SQL-based analytics on unstructured log data. It also integrates with cloud storage (e.g., Amazon S3) for storing large volumes of raw logs, making it suitable for AI model training pipelines.

Exam trap

Cisco often tests the distinction between purpose-built databases (vector databases like Pinecone and pgvector) and general-purpose analytics platforms (Snowflake, BigQuery, S3), leading candidates to mistakenly select vector databases for log storage and SQL analytics.

48
MCQmedium

A security team needs to ensure that all data used for AI model training in the cloud is encrypted at rest and in transit. Which set of measures meets this requirement on AWS?

A.Use Security Groups and Network ACLs
B.Use client-side encryption and store keys in AWS Secrets Manager
C.Enable S3 default encryption with SSE-S3 and use HTTPS for API calls
D.Enable VPC peering and use VPN connections
AnswerC

SSE-S3 encrypts data at rest in S3; HTTPS encrypts data in transit. This covers both requirements.

Why this answer

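S3 default encryption with SSE-S3 encrypts every object at rest using S3-managed keys (SSE-KMS is the alternative when keys must be managed in AWS KMS), and HTTPS (TLS) encrypts API traffic in transit. Together they cover both requirements across the AI pipeline.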

49
MCQmedium

A company wants to use a pre-trained model from Azure OpenAI but must ensure that customer data is not used to improve the service. Which configuration should they choose?

A.Set data retention to 30 days
B.Enable content filtering
C.Use the 'No Training' data policy option
D.Enable rate limiting
AnswerC

This option ensures customer data is not used to improve Azure OpenAI models.

Why this answer

Option C is correct because the 'No Training' data policy option explicitly prevents Azure OpenAI from using customer prompts and completions to retrain or improve the underlying models. This configuration is essential for compliance with data privacy requirements, ensuring that customer data remains isolated from model improvement pipelines.

Exam trap

The trap here is that candidates often confuse data retention settings (which control storage duration) with data usage policies (which control whether data is used for training), leading them to select Option A instead of the correct 'No Training' policy.

How to eliminate wrong answers

Option A is wrong because setting data retention to 30 days controls how long input and output data is stored for monitoring or debugging, but it does not prevent that data from being used for model training during that period. Option B is wrong because enabling content filtering only blocks harmful or policy-violating content from being generated; it has no effect on whether customer data is used to improve the service. Option D is wrong because rate limiting controls the number of API requests per time unit to manage load and cost, but it does not address data usage for training purposes.

50
MCQmedium

A team uses Apache Kafka to stream real-time sensor data for ML inference. They need to process the stream, perform feature engineering, and store results in a data lake. Which tool is best suited for this streaming ML pipeline?

A.Apache Spark with Structured Streaming
B.Apache Airflow
C.TensorFlow Data Validation
D.SageMaker Processing jobs
AnswerA

Spark's structured streaming reliably processes Kafka streams with exactly-once semantics and writes to data lakes.

Why this answer

Apache Spark with Structured Streaming is best suited because it provides a unified, scalable engine for both stream processing and batch processing, enabling real-time feature engineering on Kafka streams and direct writing to a data lake (e.g., Parquet format in Amazon S3). Its micro-batch or continuous processing model integrates natively with Kafka, allowing exactly-once semantics and low-latency transformations for ML inference pipelines.
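To make "real-time feature engineering" concrete, here is a plain-Python sketch of a rolling-window feature computed one reading at a time. It is not Spark code; Spark expresses windowed aggregations through its own DataFrame API. The function name, window size, and sample readings are assumptions for the example.

```python
from collections import deque


def rolling_features(readings, window):
    """Yield (value, rolling_mean, delta_from_mean) for each reading,
    using only the last `window` values seen so far."""
    recent = deque(maxlen=window)  # old values fall out automatically
    for value in readings:
        recent.append(value)
        mean = sum(recent) / len(recent)
        yield value, mean, value - mean


features = list(rolling_features([10.0, 12.0, 11.0, 30.0], window=3))
for row in features:
    print(row)
```

Each reading is processed as it arrives and only a bounded window is kept in memory, which is the property that separates stream processing from a batch job over a static dataset.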

Exam trap

Cisco often tests the distinction between stream processing engines (like Spark Structured Streaming) and orchestration or batch tools (like Airflow or SageMaker Processing), trapping candidates who confuse workflow scheduling with real-time data processing.

How to eliminate wrong answers

Option B (Apache Airflow) is wrong because it is a workflow orchestration tool for scheduling and managing DAGs, not a stream processing engine; it cannot perform real-time feature engineering on Kafka streams. Option C (TensorFlow Data Validation) is wrong because it is designed for data validation and schema inference in static datasets or batch pipelines, not for continuous stream processing or feature engineering on live sensor data. Option D (SageMaker Processing jobs) is wrong because it is a batch processing service for data preprocessing and model evaluation on static datasets, lacking native support for streaming ingestion from Kafka or real-time feature computation.

51
MCQmedium

A company is deploying a computer vision model to smartphones for offline object detection. The model was trained in PyTorch. Which format should they use for deployment on iOS devices?

A.TorchScript
B.ONNX
C.Core ML
D.TensorFlow Lite
AnswerC

Core ML is Apple's native format for iOS, providing optimized inference.

Why this answer

Core ML is Apple's framework for on-device machine learning on iOS. PyTorch models can be converted to the Core ML format (for example with Apple's coremltools package), after which inference runs offline on the device.

52
Multi-Selecteasy

A data scientist wants to develop a computer vision model using transfer learning. They need a framework that provides pre-trained models and easy-to-use APIs for data augmentation and training. Which TWO frameworks are best suited for this task?

Select 2 answers
A.Hugging Face Transformers
B.PyTorch
C.scikit-learn
D.TensorFlow
E.Keras
AnswersB, D

PyTorch provides torchvision with pre-trained models and torchvision.transforms for data augmentation, making it ideal for transfer learning in computer vision.

Why this answer

PyTorch (option B) is correct because it offers a rich ecosystem of pre-trained models via `torchvision.models`, along with built-in data augmentation transforms in `torchvision.transforms` and a flexible training loop that is ideal for transfer learning. Its dynamic computation graph makes it easy to modify model architectures for fine-tuning, which is a core requirement for the task.

Exam trap

Cisco often tests the distinction between a high-level wrapper (Keras) and the underlying framework (TensorFlow) that actually provides the pre-trained models and training infrastructure, leading candidates to incorrectly select Keras as a standalone framework.

53
Multi-Selectmedium

A data scientist is building a RAG (Retrieval-Augmented Generation) system. They need to store document embeddings and retrieve relevant chunks efficiently. Which TWO technologies are most appropriate for this task? (Select TWO.)

Select 2 answers
A.A vector database such as Pinecone or Weaviate
B.An embedding model from Hugging Face Transformers
C.A relational database with BLOB storage
D.A GPU cluster for model serving
E.A data warehouse like Snowflake or BigQuery
AnswersA, B

Vector databases are optimized for storing and querying embeddings with approximate nearest neighbor search.

Why this answer

A vector database such as Pinecone or Weaviate is specifically designed to store and index high-dimensional vector embeddings, enabling efficient approximate nearest neighbor (ANN) search. This is essential for RAG systems to quickly retrieve the most semantically relevant document chunks based on the query embedding.

Exam trap

Cisco often tests the distinction between the component that generates embeddings (the model) and the component that stores/retrieves them (the vector database), leading candidates to mistakenly select only one or to confuse a data warehouse with a vector store.

54
MCQeasy

A developer wants to integrate an AI-powered text summarization API into their application. They need to authenticate securely and manage usage limits. What is the standard mechanism for authenticating with cloud-based AI services?

A.Provide a username and password in the request body
B.Embed the API key in the URL query string
C.Use a digital certificate for each request
D.Include an API key in the HTTP request header
AnswerD

API keys are the standard authentication method for cloud AI services, sent in the header (e.g., 'Authorization: Bearer <key>').

Why this answer

Option D is correct because cloud-based AI services, including text summarization APIs, standardize authentication via API keys passed in the HTTP header (e.g., `Authorization: Bearer <key>` or `x-api-key: <key>`). This method keeps credentials out of URLs and request bodies, preventing exposure in logs or caches, and aligns with RESTful API best practices and OWASP guidelines for secure API access.
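A short sketch of the header-based approach using only Python's standard library follows. The endpoint URL and JSON payload shape are hypothetical, and the request is built but never sent.

```python
import json
import urllib.request


def build_summarize_request(api_key, text):
    """Build (but do not send) a POST request with the key in a header."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://api.example.com/v1/summarize",  # hypothetical endpoint
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_summarize_request("test-key", "Some long article text.")
print(request.full_url)  # the key does not appear in the URL
```

In a real application the key would come from an environment variable or a secrets manager instead of being written in the source.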

Exam trap

Cisco often tests the misconception that embedding credentials in a URL or request body is acceptable for simplicity, but the trap here is that API keys must never appear in URLs or bodies due to security risks like exposure in server logs and referrer headers, making the HTTP header the only standard and secure option.

How to eliminate wrong answers

Option A is wrong because sending a username and password in the request body violates security best practices—credentials would be exposed in plaintext in logs, monitoring tools, and intermediate proxies, and it does not support stateless, token-based authentication used by modern AI APIs. Option B is wrong because embedding an API key in the URL query string exposes the key in server logs, browser history, and referrer headers, making it vulnerable to interception and violating RFC 3986 recommendations against sensitive data in URIs. Option C is wrong because digital certificates (e.g., mTLS) are typically used for machine-to-machine authentication in high-security enterprise environments, not as the standard mechanism for cloud AI services, which rely on simpler API key or OAuth 2.0 token flows for scalability and ease of integration.

55
MCQeasy

A data engineer needs to process streaming clickstream data for real-time feature engineering in an ML pipeline. Which data pipeline technology is BEST suited for this task?

A.Apache Spark in batch mode
B.Snowflake
C.Apache Kafka
D.Apache Airflow
AnswerC

Kafka is purpose-built for real-time data streaming and can feed into ML pipelines.

Why this answer

Apache Kafka is the best choice because it is a distributed streaming platform designed for high-throughput, fault-tolerant, real-time data ingestion and processing. It can capture clickstream events as they occur and make them immediately available for feature engineering in an ML pipeline, supporting exactly-once semantics and low-latency delivery.

Exam trap

Cisco often tests the distinction between data ingestion/messaging systems (Kafka) and batch processing or storage systems, leading candidates to confuse Airflow's orchestration role with actual stream processing capabilities.

How to eliminate wrong answers

Option A is wrong because Apache Spark in batch mode processes data in static, finite batches with high latency, making it unsuitable for real-time streaming clickstream data. Option B is wrong because Snowflake is a cloud-based data warehouse optimized for analytical queries on structured, stored data, not for real-time stream ingestion or processing. Option D is wrong because Apache Airflow is a workflow orchestration tool for scheduling and monitoring batch jobs, not a stream processing or messaging system capable of handling real-time data streams.

56
Multi-Selecthard

A company is deploying a large language model via a REST API using a cloud AI service. They expect high traffic and need to minimize latency while controlling costs. Which THREE strategies should they implement?

Select 3 answers
A.Enable prompt caching
B.Use batching to send multiple requests in one API call
C.Auto-scale the number of API endpoints
D.Quantize the model to FP16
E.Implement rate limiting for API requests
AnswersA, B, E

Prompt caching allows the API to reuse cached results for common prompt prefixes, reducing latency and cost for repeated queries.

Why this answer

Option A is correct because prompt caching stores the intermediate key-value (KV) cache computed for a prompt prefix during an earlier inference run. When a later request begins with the same prefix, the model skips recomputing the attention keys and values for the cached portion, significantly reducing time-to-first-token (TTFT) latency and lowering compute cost per request. This is especially effective for high-traffic scenarios where many requests share the same leading content, such as a common system prompt or context document.

Exam trap

Cisco often tests the distinction between infrastructure-level optimizations (like auto-scaling) and model-level optimizations (like quantization), expecting candidates to recognize that only API-layer strategies (caching, batching, rate limiting) directly control latency and cost at the REST endpoint.

57
MCQmedium

A company has a TensorFlow model trained on-premises and wants to deploy it on AWS SageMaker for scalable inference. What is the BEST way to package the model for deployment?

A.Convert the model to ONNX and upload to SageMaker
B.Upload the .h5 file to S3 and create a SageMaker endpoint directly
C.Package the model in a Docker container with a TensorFlow serving script and push to Amazon ECR
D.Use SageMaker Studio to train the model again from scratch
AnswerC

This creates an inference container that SageMaker can deploy; it includes the model and serving logic.

Why this answer

SageMaker expects models in a container format; the inference container should include the model artifacts and the serving code, allowing SageMaker to host it on scalable endpoints.

58
MCQhard

During inference, a model served via a REST API occasionally returns high latency due to cold starts. The team uses a containerized service on Kubernetes with horizontal pod autoscaling. Which solution minimizes cold start impact while controlling cost?

A.Configure the autoscaler based on request count with a shorter cooldown period
B.Increase CPU and memory requests for the inference container
C.Switch to vertical pod autoscaling
D.Use a sidecar container that pre-warms the model and set a minimum replica count
AnswerD

Pre-warming ensures the model is loaded; minimum replicas keep pods ready, reducing cold starts.

Why this answer

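A sidecar that pre-warms the model, combined with a minimum replica count, keeps loaded pods ready to serve. Increasing CPU and memory requests does not remove the model-loading delay on new pods; request-count autoscaling with a shorter cooldown still reacts only after traffic arrives; vertical pod autoscaling resizes pods rather than keeping warm replicas available.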

59
Multi-Selectmedium

A data engineering team is designing a data pipeline to process streaming sensor data and feed it into an ML model for anomaly detection. Which THREE components are essential for this pipeline?

Select 3 answers
A.Apache Airflow for scheduling recurring batch jobs
B.Amazon S3 as a data lake for storing raw sensor data
C.Snowflake as a real-time streaming destination
D.Apache Kafka for ingesting streaming sensor data
E.Apache Spark Structured Streaming for real-time processing
AnswersB, D, E

S3 is a scalable object store that can serve as a data lake for raw sensor data, accessible for both streaming and batch processing.

Why this answer

Amazon S3 is essential as a data lake for storing raw sensor data because it provides durable, scalable, and cost-effective object storage that can serve as a central repository for streaming data before and after processing. In a streaming pipeline, raw data must be persisted for reprocessing, historical analysis, and compliance, and S3's integration with Apache Spark and Kafka makes it a natural landing zone for sensor data.

Exam trap

Cisco often tests the distinction between batch and streaming technologies, and the trap here is that candidates confuse Airflow's scheduling capability with real-time streaming orchestration, or assume Snowflake can act as a streaming sink when it is fundamentally a batch-oriented warehouse.

60
Multi-Selectmedium

A machine learning engineer wants to track hyperparameter experiments and compare results across runs. Which TWO tools are best suited for this purpose? (Choose 2)

Select 2 answers
A.MLflow
B.Weights & Biases
C.Apache Airflow
D.Docker
E.Kubeflow
AnswersA, B

MLflow provides experiment tracking, logging, and comparison UI.

Why this answer

MLflow is correct because it provides a centralized tracking server and API to log hyperparameters, metrics, and artifacts for each run, enabling easy comparison across experiments. Weights & Biases is correct because it offers a cloud-hosted dashboard with real-time logging, hyperparameter sweeps, and collaborative comparison features, making it ideal for tracking and comparing runs.

Exam trap

Cisco often tests the distinction between infrastructure tools (orchestration, containerization) and purpose-built experiment tracking tools; the trap here is that candidates may confuse Kubeflow’s pipeline capabilities with dedicated experiment tracking, or assume Docker/Airflow can serve as tracking solutions because they are used in ML workflows.

61
MCQeasy

A developer wants to deploy a scikit-learn model as a REST API endpoint with minimal infrastructure management. Which cloud service is MOST appropriate?

A.Use AWS Lambda with a custom runtime
B.Deploy on an EC2 instance manually
C.Use AWS SageMaker to create a real-time endpoint
D.Use Amazon ECS with manual Docker setup
AnswerC

SageMaker offers managed inference endpoints with automatic scaling, reducing operational overhead.

Why this answer

AWS SageMaker provides a fully managed service for deploying machine learning models as real-time endpoints with built-in scaling, monitoring, and automatic infrastructure management. It directly supports scikit-learn models via pre-built containers, eliminating the need for custom runtime setup or manual server configuration. This makes it the most appropriate choice for a developer seeking minimal infrastructure management.

Exam trap

Cisco often tests the misconception that serverless compute like AWS Lambda is the best choice for any API deployment, but the trap here is that Lambda's execution environment and constraints (timeout, payload size, cold starts) make it inappropriate for ML model inference, whereas SageMaker is purpose-built for this workload.

How to eliminate wrong answers

Option A is wrong because AWS Lambda with a custom runtime requires manual packaging of the scikit-learn model and dependencies, and Lambda has a 15-minute timeout and limited memory, making it unsuitable for real-time inference with larger models or payloads. Option B is wrong because deploying on an EC2 instance manually involves provisioning, patching, scaling, and managing the underlying server, which contradicts the requirement for minimal infrastructure management. Option D is wrong because Amazon ECS with manual Docker setup still requires managing the cluster, task definitions, and scaling policies, adding operational overhead compared to SageMaker's fully managed endpoint service.

62
Multi-Selecthard

A data science team uses Vertex AI for model training and deployment. They want to implement CI/CD for ML pipelines. Which THREE Google Cloud services should they integrate?

Select 3 answers
A.Vertex AI Pipelines
B.Cloud Deploy
C.BigQuery
D.Cloud Build
E.Google Kubernetes Engine (GKE)
AnswersA, B, D

Vertex AI Pipelines is the workflow orchestrator for ML CI/CD.

Why this answer

Vertex AI Pipelines orchestrates ML workflows; Cloud Build automates builds; Cloud Deploy manages deployments. BigQuery is for analytics; GKE is for containers but not CI/CD specific.

63
MCQmedium

A team uses Kubeflow to manage ML workflows on Kubernetes. They want to automate hyperparameter tuning for a training job. Which Kubeflow component should they use?

A.KFServing
B.Kubeflow Notebooks
C.Kubeflow Pipelines
D.Kubeflow Katib
AnswerD

Katib provides automated hyperparameter tuning with various algorithms.

Why this answer

Katib is the hyperparameter tuning component in Kubeflow. Pipelines orchestrate workflows; KFServing is for inference; Notebooks are for development.

64
MCQmedium

A company is building a recommendation system that uses user embeddings stored in a vector database. The system must retrieve the top 10 most similar items for a given user query. Which vector database feature is MOST critical for this task?

A.Built-in data versioning
B.ACID transaction support
C.Approximate nearest neighbor (ANN) search
D.SQL query interface
AnswerC

ANN search is designed to efficiently find the closest vectors to a query vector, which is exactly what the recommendation system requires.

Why this answer

Approximate nearest neighbor (ANN) search is the most critical feature because it enables the vector database to efficiently find the top-10 most similar items to a user query embedding without scanning the entire dataset. Unlike exact nearest neighbor search, ANN algorithms (e.g., HNSW, IVF) trade a small amount of accuracy for massive performance gains, which is essential for real-time recommendation systems handling millions of high-dimensional vectors.
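For contrast, the exact search that ANN approximates can be sketched in a few lines. This brute-force version scores every stored vector against the query, which is the full scan that ANN indexes such as HNSW or IVF are built to avoid. The item ids and two-dimensional vectors are toy values for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_exact(query, items, k=10):
    """Exact search: score every stored vector, keep the k best ids."""
    scored = ((cosine(query, vec), item_id) for item_id, vec in items.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


items = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
nearest = top_k_exact([1.0, 0.0], items, k=2)
print(nearest)
```

The cost of this scan grows linearly with the number of stored vectors, so at millions of items a real-time system relies on an ANN index to return nearly the same top-10 in a fraction of the time.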

Exam trap

Cisco often tests the misconception that SQL or ACID features are needed for all database tasks, but in vector databases, the critical differentiator is the ANN search algorithm, not traditional relational or transactional capabilities.

How to eliminate wrong answers

Option A is wrong because built-in data versioning manages historical changes to data but does not directly impact the speed or accuracy of similarity search; it is irrelevant to the core retrieval task. Option B is wrong because ACID transaction support ensures data consistency and reliability during writes but does not optimize or accelerate vector similarity queries; it addresses transactional integrity, not search performance. Option D is wrong because a SQL query interface is designed for structured relational queries and lacks native support for high-dimensional vector similarity operations; using SQL for nearest neighbor search would require inefficient full-table scans or custom extensions, defeating the purpose of a vector database.

65
MCQmedium

A company wants to store unstructured text data for AI model training while enabling SQL-based queries for analytics. Which storage solution should they use as the primary data source?

A.A vector database like Pinecone
B.A data lake like Amazon S3
C.A data warehouse like Snowflake
D.A NoSQL database like DynamoDB
AnswerB

Data lakes store unstructured data in native format; SQL queries can be run on top via services like Athena or Presto.

Why this answer

Amazon S3 is a highly scalable object storage service that can store unstructured text data in its native format (e.g., CSV, JSON, Parquet) and supports SQL-based queries via services like Amazon Athena or S3 Select. This makes it ideal as a primary data source for AI model training while enabling analytics without requiring data transformation or loading into a separate system.

Exam trap

The trap here is that candidates often confuse a data warehouse (Snowflake) with a data lake (S3) for storing unstructured data, forgetting that data warehouses require structured schemas and are not designed for raw, schema-on-read storage.

How to eliminate wrong answers

Option A is wrong because vector databases like Pinecone are optimized for similarity search and embedding storage, not for SQL-based analytics or general unstructured text storage for training. Option C is wrong because data warehouses like Snowflake are optimized for structured or semi-structured data loaded into tables and are not designed to hold raw unstructured text files as the primary source. Option D is wrong because NoSQL databases like DynamoDB are key-value/document stores built for low-latency lookups by key on small items, and they are not optimized for SQL queries on large volumes of unstructured text data.

66
MCQmedium

A team is using a cloud AI service with a pay-per-token pricing model. They want to minimize costs while maintaining response quality. Which strategy is MOST effective?

A.Switch to a smaller, less capable model
B.Increase the batch size for API calls
C.Use prompt caching for repeated query patterns
D.Reduce the model's max_tokens to a very low value
AnswerC

Caching avoids reprocessing identical prompts, saving token costs and reducing latency while preserving quality.

Why this answer

Prompt caching reduces costs by avoiding redundant token processing for repeated query patterns. The cloud AI service charges per token, so caching the prefix of frequent requests (e.g., system prompts or common context) means the repeated tokens are typically billed at a reduced rate or not reprocessed at all, while only the new, unique tokens incur full cost. This lowers expenditure without sacrificing response quality.

Exam trap

Cisco often tests the misconception that reducing model size or output length is the only way to cut costs, but the correct strategy leverages architectural features like prompt caching to reduce token consumption without affecting quality.

How to eliminate wrong answers

Option A is wrong because switching to a smaller, less capable model typically reduces response quality, which contradicts the requirement to maintain quality. Option B is wrong because increasing batch size for API calls does not reduce per-token cost; it may improve throughput but still charges for all tokens processed. Option D is wrong because reducing max_tokens to a very low value can truncate responses, degrading quality, and does not address the cost of input tokens or repeated patterns.
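
The saving from caching a shared prompt prefix can be estimated with simple arithmetic. The sketch below is not any provider's billing formula: the function name `estimate_input_cost`, the price, and the `cached_price_factor` parameter are assumptions chosen for illustration, and real services publish their own rates and caching rules.

```python
def estimate_input_cost(prefix_tokens, unique_tokens, requests,
                        price_per_token, cached_price_factor=1.0):
    """Estimate input-token cost for `requests` calls that share one prefix.

    cached_price_factor is the (hypothetical) fraction of the normal price
    charged for prefix tokens served from cache; 1.0 means no caching.
    """
    if requests <= 0:
        return 0.0
    # The first request pays full price and populates the cache.
    first = (prefix_tokens + unique_tokens) * price_per_token
    # Later requests pay the reduced rate for the cached prefix only.
    later_each = (prefix_tokens * cached_price_factor + unique_tokens) * price_per_token
    return first + (requests - 1) * later_each


# Hypothetical workload: 1000-token system prompt, 100 unique tokens, 10 calls.
no_cache = estimate_input_cost(1000, 100, 10, 0.001)
with_cache = estimate_input_cost(1000, 100, 10, 0.001, cached_price_factor=0.1)
print(round(no_cache, 2), round(with_cache, 2))
```

The longer the shared prefix is relative to the unique part of each request, the larger the gap between the two estimates.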

67
MCQhard

A healthcare AI startup must store and query high-dimensional embeddings of medical records for a RAG system. They need low-latency similarity search at scale. Which database should they choose?

A.Amazon S3
B.pgvector
C.Pinecone
D.BigQuery
AnswerC

Pinecone is a managed vector database purpose-built for high-performance similarity search.

Why this answer

Pinecone is a fully managed vector database optimized for high-dimensional embeddings and low-latency similarity search at scale. It provides built-in approximate nearest neighbor (ANN) indexing, automatic sharding, and serverless scaling, making it well suited to RAG systems that require fast ANN queries on medical record embeddings.

Exam trap

Cisco often tests the distinction between general-purpose storage or analytical databases and purpose-built vector databases; the trap here is that candidates may choose pgvector for its familiarity with SQL or S3 for its scalability, overlooking the specific low-latency and high-dimensional requirements of a production RAG system.

How to eliminate wrong answers

Option A is wrong because Amazon S3 is an object storage service, not a database; it lacks native vector indexing and query capabilities, requiring external compute to perform similarity search, which introduces latency and complexity. Option B is wrong because pgvector, while capable of storing and querying vectors in PostgreSQL and offering IVFFlat and HNSW indexes, runs inside a PostgreSQL instance that the team must size, tune, and scale themselves; without an index it falls back to an exact sequential scan, and it is not a managed service built for ultra-low-latency search at massive scale. Option D is wrong because BigQuery is a data warehouse optimized for analytical SQL queries on structured data, not for real-time vector similarity search; its query latency is too high for interactive RAG systems.
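
To make "similarity search" concrete, here is a minimal exact (brute-force) cosine-similarity search in plain Python. It is the baseline that ANN indexes approximate, not how Pinecone or any other product is implemented; the `records` data and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return ids of the k stored vectors most similar to the query (exact scan)."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
records = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], records, k=2))
```

Every query here touches every stored vector, so cost grows linearly with the collection size; ANN indexes trade a little recall to avoid that full scan.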

68
Multi-Selectmedium

A company uses Azure OpenAI to generate marketing copy. They need to manage costs and ensure consistent response quality. Which TWO actions should they take?

Select 2 answers
A.Fine-tune the model on previous marketing copy
B.Use prompt caching to avoid reprocessing identical inputs
C.Switch to a cheaper, less capable model
D.Implement rate limiting and token-based throttling
E.Increase max tokens per response to ensure completeness
AnswersB, D

Caching reduces token usage and latency for repeated prompts; rate limiting and token-based throttling keep consumption within budget.

Why this answer

Implementing rate limits and token-based throttling keeps usage within the token budget, and prompt caching avoids reprocessing identical inputs. Fine-tuning adds training and hosting cost; increasing max tokens raises output cost; switching to a less capable model may harm quality.
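
Token-based throttling is often built on a token-bucket limiter that counts model tokens rather than requests. The sketch below is one possible shape, not an Azure OpenAI feature or API; the `TokenBudget` class, its parameters, and the explicit `now` timestamps (passed in so the behavior is deterministic) are assumptions for illustration.

```python
class TokenBudget:
    """Token-bucket limiter measured in model tokens rather than requests."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.available = capacity
        self.last_refill = now

    def allow(self, tokens_requested, now):
        """Return True and deduct the tokens if the budget covers the request."""
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.available = min(self.capacity,
                             self.available + elapsed * self.refill_per_second)
        self.last_refill = max(self.last_refill, now)
        if tokens_requested <= self.available:
            self.available -= tokens_requested
            return True
        return False


budget = TokenBudget(capacity=1000, refill_per_second=100)
first = budget.allow(800, now=0.0)    # fits within the initial budget
second = budget.allow(300, now=0.0)   # only 200 tokens remain, so rejected
third = budget.allow(300, now=1.0)    # one second of refill adds 100 tokens
print(first, second, third)
```

A caller that receives False can queue the request, retry later, or return an error, depending on how strict the budget needs to be.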

69
MCQeasy

A data scientist needs to train a deep learning model on a large image dataset. Which hardware component is specifically designed to accelerate deep learning training workloads?

A.TPU
B.GPU
C.NPU
D.CPU
AnswerB

GPUs contain thousands of cores that can perform parallel matrix operations, greatly accelerating training of deep learning models.

Why this answer

B (GPU) is correct because GPUs contain thousands of parallel cores designed for matrix operations, which are fundamental to deep learning training. They significantly accelerate the forward and backward passes of neural networks compared to CPUs, making them the standard choice for training large image datasets.

Exam trap

Cisco often tests the distinction between training accelerators and inference accelerators. Candidates mistakenly choose NPU or TPU because they associate 'AI' with any specialized hardware, but NPUs target low-power on-device inference, and TPUs, although they can train models, are specific to Google's platform; the question asks about 'deep learning training' in general, which is GPU-dominated.

How to eliminate wrong answers

Option A (TPU) is wrong because, while TPUs are custom ASICs designed by Google to accelerate machine learning workloads on its own platform, they are not the general-purpose choice for deep learning training across all frameworks and vendors; GPUs are the industry-standard accelerator. Option C (NPU) is wrong because NPUs are specialized for on-device inference and low-power neural network execution, not for large-scale training workloads. Option D (CPU) is wrong because CPUs have far fewer parallel processing cores and are inefficient for the massive matrix multiplications required in deep learning training, leading to significantly slower training times.

70
MCQmedium

An organisation needs to deploy PyTorch models on mobile devices with minimal latency. Which framework or tool should they use to convert and optimise the model for on-device inference?

A.TensorFlow Lite
B.Keras for mobile
C.ONNX Runtime with Core ML conversion
D.TorchScript
AnswerD

TorchScript is PyTorch's own tool for model serialisation and optimisation for mobile deployment.

Why this answer

TorchScript is the correct choice because it is PyTorch's native model serialization and optimization format, designed for deploying PyTorch models on mobile devices with minimal latency. It allows you to trace or script a PyTorch model into a static graph that can be run efficiently on iOS and Android via the PyTorch Mobile runtime, without the overhead of a Python interpreter.

Exam trap

Cisco often tests the misconception that any model can be easily converted to any mobile framework, but the trap here is that TorchScript is the only native, optimized path for PyTorch models, while options like TensorFlow Lite or ONNX Runtime require non-trivial cross-framework conversions that increase latency and complexity.

How to eliminate wrong answers

Option A is wrong because TensorFlow Lite is designed for TensorFlow models, not PyTorch; converting a PyTorch model to TensorFlow Lite requires an intermediate conversion step (e.g., ONNX) and adds complexity and potential performance loss. Option B is wrong because Keras for mobile does not exist as a standalone framework; Keras is a high-level API for TensorFlow, and mobile deployment would still rely on TensorFlow Lite, inheriting the same conversion issues. Option C is wrong because ONNX Runtime with Core ML conversion introduces an extra conversion step (PyTorch → ONNX → Core ML) that can increase latency and compatibility issues, and Core ML is specific to Apple devices, not a cross-platform mobile solution like TorchScript.

71
MCQmedium

A company is using Google Cloud Vertex AI for model training. They want to automate the retraining pipeline when new data arrives in BigQuery. Which Vertex AI feature should they use?

A.Vertex AI Prediction
B.Vertex AI Pipelines
C.Vertex AI Model Registry
D.Vertex AI Feature Store
AnswerB

Pipelines can be scheduled or triggered by events to automate ML workflows.

Why this answer

Vertex AI Pipelines is the correct choice because it enables you to define, automate, and orchestrate end-to-end ML workflows, including retraining models when new data arrives. By pairing it with a schedule or an event-driven trigger (for example, Cloud Scheduler or a function that fires when new BigQuery data lands), you can set up a pipeline that automatically ingests new data, preprocesses it, retrains the model, and deploys the updated version, all without manual intervention.

Exam trap

Cisco often tests the distinction between operational tools (like Prediction or Model Registry) and orchestration tools (like Pipelines), so the trap here is confusing a component that manages models or features with the service that actually automates the end-to-end retraining workflow.

How to eliminate wrong answers

Option A is wrong because Vertex AI Prediction is a serving endpoint for deploying models to make predictions, not a tool for automating retraining pipelines. Option C is wrong because Vertex AI Model Registry is a central repository for managing model versions and metadata, but it does not orchestrate the retraining workflow itself. Option D is wrong because Vertex AI Feature Store is designed for managing and serving feature data consistently across training and serving, not for automating pipeline execution.
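
The orchestration idea (run a fixed chain of steps only when new data has arrived) can be sketched with the standard library. This is not the Vertex AI or Kubeflow Pipelines SDK; the step names, the `PIPELINE` dictionary, and the row-count trigger are hypothetical stand-ins.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps that must finish before it.
PIPELINE = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(dag):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def run_if_new_data(row_count, last_trained_row_count, dag=PIPELINE):
    """Return the steps to run, or an empty list if no new rows arrived."""
    if row_count <= last_trained_row_count:
        return []
    return execution_order(dag)


print(run_if_new_data(120, 100))  # new rows arrived, so the whole chain runs
print(run_if_new_data(100, 100))  # nothing new, so nothing runs
```

A managed pipeline service adds what this sketch leaves out: containerized step execution, retries, artifact tracking, and the trigger itself.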

72
Multi-Selectmedium

A data scientist is deploying a model on edge devices using TensorFlow Lite. The model currently uses FP32 precision. Which TWO techniques can reduce the model size and improve inference speed without significant accuracy loss? (Choose TWO.)

Select 2 answers
A.Use a larger batch size during inference
B.Increase the number of layers
C.Post-training quantization to INT8
D.Convert to FP16 precision
E.Model pruning
AnswersC, E

INT8 quantization reduces model size by ~4x and accelerates inference; pruning removes low-importance weights to shrink the model further.

Why this answer

Post-training quantization to INT8 reduces model size by converting FP32 weights and activations to 8-bit integers, which also speeds up inference on edge devices by leveraging integer-optimized hardware. This technique typically preserves accuracy within 1–2% of the original FP32 model, making it suitable for deployment on resource-constrained devices. Model pruning complements it by removing weights that contribute little to the output, which shrinks the model and can speed up inference on runtimes that exploit sparsity.

Exam trap

Cisco often tests the misconception that FP16 conversion is universally beneficial for edge devices, but the trap is that many edge platforms lack native FP16 support, making INT8 quantization the more practical and widely compatible choice.
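
The arithmetic behind INT8 quantization can be shown on a handful of weights. The sketch below uses a simplified symmetric per-tensor scheme, which is an assumption made for illustration and not the TensorFlow Lite converter's actual algorithm; the sample weights and function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# FP32 stores 4 bytes per weight, INT8 stores 1 byte per weight.
fp32_bytes = len(weights) * 4
int8_bytes = len(weights) * 1
print(q, fp32_bytes // int8_bytes)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is why small weights such as 0.003 collapse to zero while larger ones survive nearly intact.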

73
MCQeasy

A machine learning engineer needs to train a deep neural network on a large image dataset. Which hardware component is specifically optimized for this task due to its high parallel processing capability and is commonly used in AI training?

A.Central Processing Unit (CPU)
B.Neural Processing Unit (NPU)
C.Graphics Processing Unit (GPU)
D.Tensor Processing Unit (TPU)
AnswerC

GPUs have thousands of cores that excel at parallel processing, making them the industry standard for training deep neural networks.

Why this answer

Option C is correct because Graphics Processing Units (GPUs) are specifically optimized for the parallel processing required in deep neural network training. Their architecture contains thousands of smaller cores designed to handle multiple matrix operations simultaneously, which is the core computation in backpropagation and forward passes of neural networks. This makes GPUs the standard choice for training large image datasets in AI.

Exam trap

Cisco often tests the distinction between training and inference hardware, where candidates may confuse NPUs (optimized for inference) with GPUs (optimized for training), or assume TPUs are the most common due to their specialization, when GPUs remain the industry standard for deep learning training.

How to eliminate wrong answers

Option A is wrong because CPUs are optimized for sequential, low-latency processing with a small number of powerful cores, not the massive parallelism needed for deep learning matrix operations. Option B is wrong because Neural Processing Units (NPUs) are specialized for inference (running trained models) with lower power consumption, not for the heavy parallel training workloads that GPUs handle. Option D is wrong because Tensor Processing Units (TPUs) are custom ASICs designed by Google for machine learning workloads on its own platform; they are less commonly used in general AI training than GPUs, and the question asks for the hardware 'commonly used' in AI training, which is the GPU.

74
MCQhard

A team is deploying a BERT-based question-answering model using a REST API endpoint with gRPC for internal microservices. They notice high latency for small payloads. Which optimization is MOST likely to reduce latency?

A.Enable batching of multiple queries into a single request
B.Convert the model to ONNX and use ONNX Runtime
C.Switch from gRPC to REST with HTTP/2
D.Use a larger instance type with more CPU
AnswerA

Batching amortizes fixed per-request overhead across several queries, improving throughput and average per-query latency.

Why this answer

Batching multiple queries into a single request reduces the fixed per-call overhead (request framing, serialization, network round trips, and model invocation) that dominates for small payloads. This amortizes the fixed cost of each inference call across several queries, lowering average per-query latency under load.

Exam trap

Cisco often tests the misconception that model optimization (ONNX) or hardware upgrades are the default fix for latency, when the real bottleneck for small payloads is network and serialization overhead, which batching directly mitigates.

How to eliminate wrong answers

Option B is wrong because converting to ONNX and using ONNX Runtime primarily improves inference speed through model optimization and hardware acceleration, but it does not address the network and serialization overhead that dominates latency for small payloads. Option C is wrong because switching from gRPC to REST with HTTP/2 would likely increase latency, as gRPC already uses HTTP/2 and provides more efficient binary serialization (Protobuf) compared to REST's text-based JSON. Option D is wrong because using a larger instance type with more CPU addresses compute-bound bottlenecks, but the high latency here is due to network and protocol overhead, not CPU capacity.
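
The amortization argument can be checked with a small cost model. The numbers and the `amortized_ms_per_query` helper below are hypothetical; the model assumes a fixed per-request overhead plus a compute cost that grows linearly with batch size, which real systems only approximate.

```python
def amortized_ms_per_query(batch_size, fixed_overhead_ms, compute_ms_per_query):
    """Average time spent per query when one request carries `batch_size` queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # One fixed overhead per request, plus compute for every query in it.
    total_ms = fixed_overhead_ms + batch_size * compute_ms_per_query
    return total_ms / batch_size


# Hypothetical figures: 8 ms of per-request overhead, 2 ms of compute per query.
single = amortized_ms_per_query(1, 8.0, 2.0)
batched = amortized_ms_per_query(8, 8.0, 2.0)
print(single, batched)
```

Note that this is an average: every query in a batch still waits for the whole batch to finish, so very large batches can raise the latency seen by an individual caller.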

75
Multi-Selectmedium

A team is using Kubeflow to orchestrate ML workflows on Kubernetes. They need to ensure reproducibility, track experiments, and share models across the organization. Which THREE components or tools should they integrate? (Choose THREE.)

Select 3 answers
A.Apache Airflow
B.Weights & Biases
C.Kubeflow Pipelines
D.MLflow Tracking
E.MLflow Model Registry
AnswersC, D, E

Pipelines define and manage the ML workflow DAGs; MLflow Tracking logs experiments, and MLflow Model Registry shares versioned models.

Why this answer

Kubeflow Pipelines is a core component of Kubeflow that enables the definition, deployment, and management of end-to-end ML workflows on Kubernetes. It provides a platform for building reproducible pipelines by capturing the entire workflow as a directed acyclic graph (DAG) of containerized steps, ensuring that each run can be exactly recreated. This directly addresses the team's need for reproducibility and orchestration within their existing Kubernetes environment.

Exam trap

Cisco often tests the distinction between experiment tracking (MLflow Tracking) and model sharing/versioning (MLflow Model Registry). Candidates are tempted by Weights & Biases, but it overlaps with MLflow Tracking rather than adding a distinct capability, so the expected answer pairs MLflow Tracking with MLflow Model Registry.
