An organization must ensure that an AI model deployed on an IoT device meets stringent latency requirements. The model is currently in FP32 and runs at 200ms per inference on the device; the target is 50ms. Which technique will provide the greatest latency reduction with the least accuracy loss?
INT8 quantization reduces the bit width of weights and activations from 32 to 8, which speeds up arithmetic and cuts memory traffic; on hardware with native INT8 support this can approach a 4x latency reduction.
Why this answer
Quantizing the model from FP32 to INT8 lowers the precision of weights and activations, so each inference moves a quarter of the data and can use cheaper integer arithmetic. On resource-constrained IoT devices this typically yields a 2-4x speedup, which at the upper end brings the 200ms inference time down to the 50ms target. Accuracy loss is usually small: with calibration on representative data, an INT8 model often retains over 90% of the original accuracy.
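To make the calibration step concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not how any particular framework implements it: the function names and the sample weights are hypothetical, and real toolchains typically calibrate activations as well and may use per-channel scales.

```python
def calibrate_scale(values, num_bits=8):
    """Pick a symmetric per-tensor scale from the largest magnitude seen."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to integers in [-qmax, qmax] by rounding to the nearest step."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in q_values]


# Hypothetical FP32 weights used as the calibration sample.
weights = [0.82, -1.27, 0.05, 0.40, -0.63]

scale = calibrate_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)

# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Each value now needs 1 byte instead of 4.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)

print("quantized:", q)
print("max round-trip error:", max_error)
print("storage bytes (FP32 -> INT8):", fp32_bytes, "->", int8_bytes)
```

The scale is chosen so that the largest weight maps to the edge of the INT8 range. This is why calibration data matters: an unrepresentative sample gives a scale that either clips real values or wastes resolution.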
Exam trap
Cisco often tests the misconception that any optimization technique (such as pruning or framework switching) can match the latency reduction of quantization. Of the options listed, only INT8 quantization directly addresses the computational precision bottleneck, which is what makes a speedup near the required 4x achievable with minimal accuracy loss.
How to eliminate wrong answers
Option B is wrong because weight pruning removes parameters but does not reduce the precision of the remaining values; the model still operates in FP32, so the latency reduction is limited (often 20-30%) and may not achieve the 4x speedup needed, while aggressive pruning can cause significant accuracy loss.
Option C is wrong because switching from TensorFlow Lite to Core ML is a framework change that may optimize for Apple hardware but does not inherently reduce computational precision or model size; it typically provides marginal latency improvements (10-20%) and is platform-specific, not a general solution for the required 4x reduction.
Option D is wrong because knowledge distillation creates a smaller student model, but training a new architecture from scratch is time-consuming and may not guarantee the exact 50ms target; the latency reduction depends on the student model's size and hardware compatibility, and distillation often requires extensive retuning to avoid accuracy degradation.
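The elimination logic can be checked with simple arithmetic. The sketch below plugs in the rough ranges quoted in this explanation (2-4x for quantization, 20-30% for pruning, 10-20% for a framework switch) and compares the resulting latency against the 50ms target. These are the explanation's ballpark figures, not measurements, and the helper names are hypothetical. Distillation is left out because no range is given for it.

```python
BASELINE_MS = 200.0
TARGET_MS = 50.0

# Speedup needed to go from the baseline to the target.
required_speedup = BASELINE_MS / TARGET_MS


def latency_after_speedup(baseline_ms, speedup):
    """Latency when a technique is quoted as an N-times speedup."""
    return baseline_ms / speedup


def latency_after_reduction(baseline_ms, reduction_fraction):
    """Latency when a technique is quoted as a percentage reduction."""
    return baseline_ms * (1.0 - reduction_fraction)


# (best case, worst case) latency in milliseconds for each technique.
estimates = {
    "INT8 quantization (2-4x speedup)": (
        latency_after_speedup(BASELINE_MS, 4.0),
        latency_after_speedup(BASELINE_MS, 2.0),
    ),
    "Weight pruning (20-30% reduction)": (
        latency_after_reduction(BASELINE_MS, 0.30),
        latency_after_reduction(BASELINE_MS, 0.20),
    ),
    "Framework switch (10-20% reduction)": (
        latency_after_reduction(BASELINE_MS, 0.20),
        latency_after_reduction(BASELINE_MS, 0.10),
    ),
}

print(f"required speedup: {required_speedup:.1f}x")
for name, (best_ms, worst_ms) in estimates.items():
    meets = best_ms <= TARGET_MS + 1e-9
    print(f"{name}: {best_ms:.0f}-{worst_ms:.0f} ms, best case meets target: {meets}")
```

Under these figures, only quantization reaches 50ms, and only at the top of its range. Pruning and a framework switch stay well above 100ms even in their best case.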