A financial institution needs to deploy a TensorFlow model for fraud detection with strict latency requirements (<100ms). The model uses custom ops that are not available in standard TF Serving. What is the most appropriate serving solution?
NVIDIA Triton supports custom backends and is designed for high-performance inference with low latency.
Why this answer
Option C is correct because NVIDIA Triton Inference Server supports custom backends written in C++ or Python, so custom ops that standard TensorFlow Serving cannot load can still be served. Combined with GPU acceleration and Triton's optimized inference pipeline, this makes the strict latency requirement (<100ms) achievable while avoiding the limitations of TF Serving's fixed op registry.
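Whichever server is chosen, the <100ms requirement only means something once it is checked against measured latencies. The following is a minimal sketch of such a check using only the Python standard library. The function names, the nearest-rank percentile method, and the choice of the 99th percentile are assumptions made for illustration; the question itself states only the 100ms budget.

```python
import math

LATENCY_BUDGET_MS = 100.0  # budget taken from the question's stated requirement


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (hypothetical helper)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # 1-based nearest rank: the smallest rank covering pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_budget(latencies_ms, pct=99.0, budget_ms=LATENCY_BUDGET_MS):
    """Return True if the chosen percentile latency is strictly under the budget."""
    return percentile(latencies_ms, pct) < budget_ms
```

In practice the latency samples would come from a load test against the deployed endpoint. Checking a tail percentile rather than the mean matters here because a fraud check that is usually fast but occasionally slow still violates a strict latency requirement.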
Exam trap
The trap is assuming that TF Serving's custom op registration (Option D) is straightforward. The exam tests the understanding that TF Serving cannot load custom ops dynamically without a custom build, which makes Triton's backend architecture the correct choice for production-grade latency requirements.
How to eliminate wrong answers
Option A is wrong because Vertex AI Prediction relies on standard TF Serving or custom containers, and exporting the model as a SavedModel does not automatically include the custom ops; Vertex AI would fail to load the model if the custom ops are not registered in its runtime.
Option B is wrong because, although Cloud Run with a custom container can serve the model, it lacks the specialized inference optimization features (e.g., dynamic batching, model concurrency) needed to hold latency under 100ms under load, and it offers no built-in mechanism for custom op backends.
Option D is wrong because TF Serving's custom op registration requires recompiling TF Serving from source with the custom ops linked in, which is complex and not supported by the standard Docker images; even if done, TF Serving's architecture is less flexible than Triton's custom backends for handling non-standard ops efficiently.
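The dynamic batching and model concurrency features mentioned above are set per model in Triton's `config.pbtxt`. The following is a minimal sketch that writes a model repository skeleton for a TensorFlow SavedModel. The model name `fraud_detector`, the batch size, the queue delay, and the instance count are illustrative assumptions rather than tuned values, and the helper functions are hypothetical. Input and output tensor declarations are omitted, and the sketch does not cover making the custom op library available to the server process.

```python
from pathlib import Path


def render_config(model_name, max_batch_size=32, queue_delay_us=500, gpu_instances=2):
    """Render a Triton config.pbtxt as text (all values are illustrative assumptions)."""
    return (
        f'name: "{model_name}"\n'
        'platform: "tensorflow_savedmodel"\n'
        f'max_batch_size: {max_batch_size}\n'
        # Dynamic batching: wait at most this long to group requests into a batch.
        'dynamic_batching {\n'
        f'  max_queue_delay_microseconds: {queue_delay_us}\n'
        '}\n'
        # Model concurrency: number of model instances running on GPU.
        'instance_group [\n'
        '  {\n'
        f'    count: {gpu_instances}\n'
        '    kind: KIND_GPU\n'
        '  }\n'
        ']\n'
    )


def write_model_repository(root, model_name, version=1, **config_kwargs):
    """Create <root>/<model_name>/<version>/model.savedmodel/ and write config.pbtxt."""
    model_dir = Path(root) / model_name
    # The exported SavedModel files would be copied into this directory.
    (model_dir / str(version) / "model.savedmodel").mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(render_config(model_name, **config_kwargs))
    return config_path
```

A short queue delay trades a small amount of added latency for larger batches, so with a 100ms budget the delay should stay far below the budget and be validated by load testing rather than taken from this sketch.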