Hardware practice questions

A team is deploying a deep learning model that uses a convolutional neural network (CNN) for image recognition. The model achieves high accuracy, but inference is very slow on edge devices. Which THREE optimization techniques should the team consider to speed up inference without significant accuracy loss? (Select three.)

Trap 1: Use larger convolutional filters (e.g., 7x7 instead of 3x3) to…

Larger filters increase computation and slow down inference.

Trap 2: Increase the number of convolutional layers to improve feature…

More layers increase computational cost and latency.

A
Use larger convolutional filters (e.g., 7x7 instead of 3x3) to capture more context.
Why wrong: Larger filters increase computation and slow down inference.
B
Use weight pruning to remove unnecessary connections in the network.
Pruning reduces computation and memory footprint.
C
Implement knowledge distillation by training a smaller model to mimic the larger one.
Knowledge distillation creates a compact model that retains much of the original accuracy.
D
Increase the number of convolutional layers to improve feature extraction.
Why wrong: More layers increase computational cost and latency.
E
Apply model quantization to reduce weight precision.
Quantization reduces model size and speeds up inference, often with minimal accuracy loss.
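As a rough illustration of option B, magnitude-based pruning can be sketched as below. This is a minimal sketch on a flat list of weights, not any framework's pruning API; the function name `magnitude_prune`, the example weights, and the 50% sparsity target are all assumptions made for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Hypothetical weights for one small layer.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Zeroed weights shrink the stored model once it is saved in a sparse format, but they usually speed up inference only when the runtime or hardware can skip them, so pruning is often followed by fine-tuning and paired with sparse-aware kernels.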

Question 2 (easy, multiple choice)

A team deploys a real-time fraud detection model on a streaming platform. The model must produce predictions within 100 milliseconds per event. Initial latency is 150 ms. Which optimization is most likely to meet the latency requirement?

Trap 1: Increase the batch size to process more events simultaneously.

Larger batches improve throughput rather than per-event latency; in real-time streaming, waiting to fill a batch adds delay.

Trap 2: Add more feature engineering steps to improve model accuracy.

Additional feature engineering increases pre-processing time, worsening latency.

Trap 3: Migrate from a decision tree ensemble to a deep neural network.

DNNs are typically slower than tree ensembles for tabular data.

A
Apply model quantization to reduce precision from FP32 to INT8.
Quantization reduces model size and speeds up computation, lowering latency.
B
Increase the batch size to process more events simultaneously.
Why wrong: Larger batches improve throughput rather than per-event latency; in real-time streaming, waiting to fill a batch adds delay.
C
Add more feature engineering steps to improve model accuracy.
Why wrong: Additional feature engineering increases pre-processing time, worsening latency.
D
Migrate from a decision tree ensemble to a deep neural network.
Why wrong: DNNs are typically slower than tree ensembles for tabular data.
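The FP32-to-INT8 step in option A can be sketched with symmetric per-tensor quantization. This is an illustrative sketch in plain Python, not a specific framework's quantizer; the helper names `quantize_int8` and `dequantize` and the example weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical FP32 weights.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. The actual latency gain depends on whether the serving hardware has fast INT8 kernels.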

Question 3 (easy, multiple choice)

A company wants to deploy an AI model for real-time inference on edge devices with limited computational resources. Which model architecture would be MOST suitable?

Trap 1: YOLOv4

YOLOv4 is efficient but may still be too heavy for very limited edge devices; MobileNet is designed specifically for low-compute targets.

Trap 2: ResNet-152

ResNet-152 is a deep network with many parameters, too large for edge devices.

Trap 3: BERT

BERT is a large transformer model for NLP, not suitable for resource-constrained edge devices.

A
YOLOv4
Why wrong: YOLOv4 is efficient but may still be too heavy for very limited edge devices; MobileNet is designed specifically for low-compute targets.
B
MobileNet
MobileNet uses depthwise separable convolutions to reduce computation, ideal for edge deployment.
C
ResNet-152
Why wrong: ResNet-152 is a deep network with many parameters, too large for edge devices.
D
BERT
Why wrong: BERT is a large transformer model for NLP, not suitable for resource-constrained edge devices.
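To see why depthwise separable convolutions (option B) cut computation, the multiply-accumulate (MAC) counts of the two layer types can be compared. The sketch below assumes stride 1 and "same" padding so the output keeps the input's spatial size; the layer dimensions are hypothetical.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution over an h x w output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 feature map, 128 -> 128 channels, 3 x 3 kernel.
h, w, c_in, c_out, k = 56, 56, 128, 128, 3
std = standard_conv_macs(h, w, c_in, c_out, k)
sep = depthwise_separable_macs(h, w, c_in, c_out, k)
ratio = sep / std  # equals 1/c_out + 1/k**2
print(std, sep, round(ratio, 4))
```

For this layer the separable version needs roughly 12% of the MACs of the standard convolution, which is the kind of saving that makes the architecture attractive on constrained devices.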

Question 4 (medium, multiple choice)

A team deploying an AI model for real-time fraud detection notices that inference latency is too high. The model is a deep neural network with 50 layers, deployed on a cloud GPU. Which of the following is the BEST approach to reduce latency while maintaining acceptable accuracy?

Trap 1: Deploy the model on a more powerful GPU.

This may not reduce latency enough and increases cost.

Trap 2: Reduce the batch size for inference.

Smaller batches raise per-request overhead and lower throughput; they do not reduce the per-sample compute of a 50-layer network.

Trap 3: Replace the DNN with a logistic regression model.

This would likely cause a significant drop in accuracy.

A
Deploy the model on a more powerful GPU.
Why wrong: This may not reduce latency enough and increases cost.
B
Reduce the batch size for inference.
Why wrong: Smaller batches raise per-request overhead and lower throughput; they do not reduce the per-sample compute of a 50-layer network.
C
Replace the DNN with a logistic regression model.
Why wrong: This would likely cause a significant drop in accuracy.
D
Apply knowledge distillation to create a smaller model.
Knowledge distillation trains a smaller, faster model to mimic the large one, retaining most of its accuracy.
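The core of option D is a loss that pushes the small student model toward the large teacher's softened output distribution. The following is a minimal sketch of that soft-target term only, assuming a temperature of 2.0 and made-up two-class logits; in practice it is usually combined with the ordinary loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return temperature ** 2 * kl


# Hypothetical logits for the classes (legitimate, fraud).
teacher = [2.5, -1.0]
student = [1.0, 0.2]
loss = distillation_loss(student, teacher)
print(loss)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge, which is what lets a much shallower network inherit the behaviour of the 50-layer model.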

Question 5 (medium, multiple choice)

A financial services company has a real-time fraud detection system that uses Apache Kafka to stream transaction events, a TensorFlow Serving model for scoring, and a Redis cache for lookup of historical fraud patterns. The system processes 10,000 transactions per second with an SLA of 100ms latency per transaction. Recently, after a model update, the latency for some transactions spiked to over 500ms, causing timeouts. The model uses a deep neural network with 10 million parameters. The engineering team suspects the issue is due to increased model inference time. Which action should be taken to reduce latency without significant loss in accuracy?

Trap 1: Add more Redis nodes to the cache cluster

Redis caching does not affect model inference latency.

Trap 2: Increase the number of Kafka partitions and consumer threads

This improves parallelism for stream processing but does not reduce model inference time.

Trap 3: Decrease the inference batch size from 32 to 1

A batch size of 1 uses the hardware less efficiently, so at 10,000 transactions per second requests can queue and latency can increase.

A
Add more Redis nodes to the cache cluster
Why wrong: Redis caching does not affect model inference latency.
B
Increase the number of Kafka partitions and consumer threads
Why wrong: This improves parallelism for stream processing but does not reduce model inference time.
C
Decrease the inference batch size from 32 to 1
Why wrong: A batch size of 1 uses the hardware less efficiently, so at 10,000 transactions per second requests can queue and latency can increase.
D
Quantize the model weights from FP32 to FP16
FP16 quantization reduces model size and speeds up inference, typically with minimal accuracy impact.
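The size and precision trade-off behind option D can be checked with a quick calculation. The sketch below uses the 10 million parameters stated in the question and Python's `struct` half-precision format (`"e"`) to round-trip values through FP16; the sample weights are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


NUM_PARAMS = 10_000_000  # parameter count given in the question
fp32_bytes = NUM_PARAMS * 4  # 4 bytes per FP32 weight
fp16_bytes = NUM_PARAMS * 2  # 2 bytes per FP16 weight

# Hypothetical weights, all within FP16's normal range.
weights = [0.1, -0.75, 3.14159, 0.0004]
rounded = [to_fp16(w) for w in weights]
max_rel_error = max(abs(w - r) / abs(w) for w, r in zip(weights, rounded))
print(fp32_bytes, fp16_bytes, max_rel_error)
```

Halving the bytes per weight halves the memory traffic for the model, and for values in FP16's normal range the relative rounding error stays below about 0.05%. Whether this translates into lower latency depends on the accelerator's half-precision support.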

Question 6 (easy, multiple choice)

A hospital's radiology department uses an AI model to detect lung nodules in CT scans. The model was trained on data from a specific brand of scanners and patient demographics common in Europe. Recently, the hospital acquired new scanners from a different manufacturer and started serving a more diverse patient population. Over the past month, the model's false-positive rate has increased by 15% and false-negative rate by 8%. The radiologists are losing confidence and are considering abandoning the AI tool altogether. The IT team has verified that the model inference is running correctly and the hardware is performing as expected. The data science team suspects the problem is related to the change in input data distribution. The hospital's AI operations policy requires that any model update must be validated on at least 500 recent cases before deployment. What is the BEST course of action for the AI operations team?

Trap 1: Roll back to the previous model version and restrict use of the AI…

Rolling back does not solve the problem for the new patient population and scanner; restricting use reduces clinical value.

Trap 2: Retrain the model using the original training data but with…

Retraining on old data alone will not capture the new distribution; regularization does not adapt to domain shift.

Trap 3: Adjust the model's decision threshold to reduce false positives and…

Threshold adjustment can reduce false positives but may increase false negatives and does not address the underlying data shift.

A
Roll back to the previous model version and restrict use of the AI tool to only European patients.
Why wrong: Rolling back does not solve the problem for the new patient population and scanner; restricting use reduces clinical value.
B
Collect 500 recent CT scans from the new scanners, retrain the model on a combined old and new dataset, and validate before deployment.
Retraining on data that includes the new scanners and patient population addresses the distribution shift, and validating on at least 500 recent cases satisfies the policy.
C
Retrain the model using the original training data but with increased regularization to avoid overfitting.
Why wrong: Retraining on old data alone will not capture the new distribution; regularization does not adapt to domain shift.
D
Adjust the model's decision threshold to reduce false positives and then monitor for two weeks.
Why wrong: Threshold adjustment can reduce false positives but may increase false negatives and does not address the underlying data shift.
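Before retraining, the suspected distribution shift can be checked by comparing an input statistic between old and new scans. The sketch below computes a two-sample Kolmogorov-Smirnov statistic in plain Python and adds a gate for the 500-case validation policy from the question. The choice of per-scan mean intensity as the feature, the sample values, and the helper names are assumptions made for illustration.

```python
from bisect import bisect_right

MIN_VALIDATION_CASES = 500  # from the hospital policy stated in the question


def ks_statistic(reference, recent):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref_sorted, rec_sorted = sorted(reference), sorted(recent)
    d = 0.0
    for x in ref_sorted + rec_sorted:
        f_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        f_rec = bisect_right(rec_sorted, x) / len(rec_sorted)
        d = max(d, abs(f_ref - f_rec))
    return d


def can_deploy(n_validated_cases):
    """Policy gate: the update needs validation on at least 500 recent cases."""
    return n_validated_cases >= MIN_VALIDATION_CASES


# Hypothetical per-scan mean intensities (normalized) from each scanner brand.
old_scanner = [0.41, 0.43, 0.45, 0.47, 0.49]
new_scanner = [0.52, 0.55, 0.57, 0.60, 0.62]
drift = ks_statistic(old_scanner, new_scanner)
print(drift, can_deploy(500))
```

A statistic near 1 means the two samples barely overlap, which would support the data science team's suspicion; a value near 0 would point the investigation elsewhere.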