This chapter covers ECS Service Discovery and AWS Service Connect, two mechanisms for enabling communication between microservices running on Amazon ECS. For the SAA-C03 exam, these topics appear in roughly 5-8% of questions, often in scenarios involving dynamic port mapping, service meshes, or inter-service communication. Understanding the trade-offs between DNS-based discovery and proxy-based service connectivity is critical for designing high-performance, scalable container architectures. We will dive deep into the mechanisms, configuration, and exam traps.
ECS Service Discovery is like an airline's hub-and-spoke system where each flight (service) announces its arrival gate (IP address and port) to a central registry (Amazon Route 53 or Cloud Map). When a passenger (client service) wants to connect to another flight, they check the registry to find the correct gate and then walk directly there. This works well for static, predictable schedules. AWS Service Connect, on the other hand, is like a direct, non-stop shuttle service between two specific airports. The shuttle company (AWS Service Connect) handles all the logistics: it assigns a dedicated, fixed gate (virtual DNS name) for each destination, and passengers simply show up at that gate. The shuttle automatically routes them to any available aircraft (task) at the destination, even if the aircraft changes gates (IP changes). With Service Connect, the client doesn't need to query a registry; it just sends traffic to a well-known endpoint, and the network handles the rest. The key difference: discovery requires the client to look up and remember the address each time; Service Connect provides a stable, always-available endpoint that abstracts away the underlying task churn.
What are ECS Service Discovery and AWS Service Connect?
ECS Service Discovery and AWS Service Connect are two distinct approaches to enable communication between services running in an Amazon ECS cluster. Both solve the problem of finding and connecting to a service instance, but they do so in fundamentally different ways.
ECS Service Discovery is a feature that integrates with AWS Cloud Map to automatically register each task as a DNS A or SRV record. When a task starts, ECS registers its IP address and port (including dynamically assigned host ports) with Cloud Map. Other services can then resolve the service name via DNS to get the IP address (and, with SRV records, the port) of a healthy task. This is a client-side discovery pattern: the client must perform DNS resolution and then connect directly.
AWS Service Connect is a newer capability that provides a transparent proxy layer between services. ECS injects an Envoy proxy sidecar into each task; the proxy intercepts outbound traffic addressed to a configured service name and load-balances it across healthy tasks. The client sends traffic to a stable endpoint (e.g., service-b:8080), and the local proxy forwards it to a backend task. From the application's point of view this behaves like server-side discovery: the client does not need to know the backend addresses.
How They Work Internally
ECS Service Discovery (DNS-based):
When you enable service discovery on an ECS service, you create a namespace in AWS Cloud Map (either public or private, hosted in Route 53). Each task, upon registration, creates a DNS record with a TTL of 60 seconds by default. The record type can be A (IPv4), AAAA (IPv6), or SRV (for dynamic port mapping). The SRV record includes the port number. Clients perform standard DNS resolution, which returns up to 8 healthy IP addresses (or SRV records) in random order (DNS round-robin). The client then picks one and opens a direct TCP connection.
Example DNS resolution:
$ dig +short my-service.my-namespace.local
10.0.1.45
10.0.2.12
10.0.3.78
If a task becomes unhealthy or stops, Cloud Map automatically deregisters the DNS record. However, due to DNS caching (TTL), clients may still attempt to connect to the old IP for up to 60 seconds. This can cause brief connection failures during deployments or scaling events.
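The staleness window described above can be illustrated with a toy client-side cache. This is a minimal sketch, not a real resolver: the `CachingResolver` class, the `registry` dict standing in for Cloud Map, and the explicit `now` timestamps are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CachingResolver:
    """Toy client-side DNS cache (hypothetical; not a real resolver)."""
    ttl_seconds: int = 60
    _cache: dict = field(default_factory=dict)  # name -> (expires_at, ips)

    def resolve(self, name, registry, now):
        cached = self._cache.get(name)
        if cached and now < cached[0]:
            return cached[1]  # served from cache, possibly stale
        ips = list(registry[name])  # "fresh" lookup against the registry
        self._cache[name] = (now + self.ttl_seconds, ips)
        return ips


NAME = "my-service.my-namespace.local"
# Hypothetical registry state standing in for Cloud Map.
registry = {NAME: ["10.0.1.45", "10.0.2.12"]}
resolver = CachingResolver(ttl_seconds=60)

first = resolver.resolve(NAME, registry, now=0)

# The task at 10.0.1.45 stops and is deregistered shortly afterwards.
registry[NAME] = ["10.0.2.12"]

stale = resolver.resolve(NAME, registry, now=30)  # still inside the TTL
fresh = resolver.resolve(NAME, registry, now=61)  # TTL expired, re-resolved

print(stale)
print(fresh)
```

Until the cached entry expires, the client can still pick the deregistered address, which is the failure mode the TTL discussion refers to.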
AWS Service Connect (Proxy-based):
Service Connect deploys an Envoy proxy as a sidecar container in each task. The proxy listens on a local port (e.g., 15000 for inbound, 15001 for outbound). The application is configured to send traffic to a virtual endpoint (e.g., http://service-b:8080). The proxy intercepts this traffic using iptables rules and forwards it to a backend service. Service Connect uses a control plane to distribute routing tables to all proxies, including the list of healthy tasks and their IP:port combinations. The proxy performs client-side load balancing (round-robin or least-request) and health checks. If a backend task fails, the proxy immediately stops sending traffic to it, without waiting for DNS TTL.
Service Connect also supports mTLS encryption between proxies, and you can configure timeouts, retries, and circuit breakers at the proxy level.
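To contrast with TTL-based caching, the routing-table behavior can be sketched as a round-robin picker whose endpoint list is replaced as soon as an update arrives. This is only an illustration of the idea, not Envoy's implementation; the `RoundRobinBalancer` class and its `update` method are hypothetical names.

```python
class RoundRobinBalancer:
    """Toy stand-in for a sidecar proxy's routing table (hypothetical)."""

    def __init__(self, endpoints):
        self._endpoints = list(endpoints)
        self._next = 0

    def update(self, endpoints):
        # Models a control-plane push: the healthy set is replaced at once.
        self._endpoints = list(endpoints)
        self._next = 0

    def pick(self):
        if not self._endpoints:
            raise RuntimeError("no healthy endpoints")
        endpoint = self._endpoints[self._next % len(self._endpoints)]
        self._next += 1
        return endpoint


balancer = RoundRobinBalancer(
    ["10.0.1.45:8080", "10.0.2.12:8080", "10.0.3.78:8080"]
)
before = [balancer.pick() for _ in range(3)]

# The task at 10.0.1.45 fails; the next push drops it from the table.
balancer.update(["10.0.2.12:8080", "10.0.3.78:8080"])
after = [balancer.pick() for _ in range(4)]

print(before)
print(after)
```

Because the failed endpoint is removed from the table itself, no later pick can return it, regardless of any DNS TTL.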
Key Components, Values, Defaults, and Timers
Cloud Map Namespace:
Types: HTTP (for API calls) or DNS (for DNS-based discovery). DNS namespaces can be private (VPC only) or public.
DNS record TTL: Default 60 seconds, configurable from 1 to 86400 seconds.
Health checks: Cloud Map can perform HTTP/HTTPS/TCP health checks on tasks. Unhealthy tasks are removed from DNS responses.
Service Connect Configuration:
Client alias: The DNS name that the client uses to reach the service. This is resolved locally by the proxy (not via DNS).
Port mapping: You specify the port the proxy listens on and the port the application listens on.
Timeout: The idle timeout defaults to 5 minutes for HTTP, HTTP/2, and gRPC and to 1 hour for TCP; the per-request timeout defaults to 15 seconds. Both are configurable.
Health check: Proxies perform health checks every 5 seconds by default.
mTLS: Requires AWS Private CA (formerly ACM Private CA).
Configuration and Verification Commands
Creating a Service Discovery namespace:
aws servicediscovery create-private-dns-namespace \
--name my-namespace.local \
--vpc vpc-12345678
Creating an ECS service with service discovery:
aws ecs create-service \
--cluster my-cluster \
--service-name my-service \
--task-definition my-task:1 \
--desired-count 3 \
--service-registries registryArn=arn:aws:servicediscovery:us-east-1:123456789012:service/srv-xxx
Verifying service discovery records:
aws servicediscovery list-instances --service-id srv-xxx
Enabling Service Connect on an ECS service:
aws ecs create-service \
--cluster my-cluster \
--service-name my-service \
--task-definition my-task:1 \
--desired-count 3 \
--service-connect-configuration "{\"enabled\": true, \"namespace\": \"arn:aws:servicediscovery:us-east-1:123456789012:namespace/ns-xxx\", \"services\": [{\"portName\": \"http\", \"clientAliases\": [{\"dnsName\": \"service-b\", \"port\": 8080}]}]}"
Verifying Service Connect:
Check the proxy logs in CloudWatch or use the ECS console to view the service connect configuration.
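Hand-escaping the JSON passed to --service-connect-configuration is error-prone. One option, sketched here under the assumption that the command is being assembled from a script, is to build the structure as a Python dict and let `json.dumps` and `shlex.quote` handle the quoting. The values are the same placeholders used in the command above.

```python
import json
import shlex

# Same placeholder values as the CLI example above.
service_connect_config = {
    "enabled": True,
    "namespace": "arn:aws:servicediscovery:us-east-1:123456789012:namespace/ns-xxx",
    "services": [
        {
            "portName": "http",
            "clientAliases": [{"dnsName": "service-b", "port": 8080}],
        }
    ],
}

# json.dumps produces valid JSON; shlex.quote makes it safe as one shell word.
cli_argument = shlex.quote(json.dumps(service_connect_config))
print(f"--service-connect-configuration {cli_argument}")
```

The printed argument can be pasted into the `aws ecs create-service` invocation in place of the backslash-escaped string.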
Interaction with Related Technologies
Application Load Balancer (ALB): Both service discovery and Service Connect can be used alongside an ALB. For external-facing services, you typically front them with an ALB. For internal microservices communication, you use Service Connect or service discovery. The exam often tests whether to use ALB + service discovery vs. Service Connect alone.
AWS App Mesh: App Mesh is a full service mesh that also uses Envoy proxies. Service Connect is a simpler, ECS-native alternative to App Mesh. App Mesh provides more advanced traffic management (weighted routing, retries, circuit breakers) but requires more configuration. For SAA-C03, Service Connect is the recommended approach for simple inter-service communication within an ECS cluster.
VPC Lattice: AWS VPC Lattice is a newer service that provides service-to-service connectivity across VPCs and accounts, and it can be used with ECS. Service Connect, by contrast, is limited to services within the same cluster and namespace.
Performance and Scaling Considerations
Service Discovery: DNS resolution overhead adds latency (typically <10ms in VPC). Caching can reduce this, but TTL-based staleness can cause issues. It scales well because DNS is distributed.
Service Connect: The proxy adds ~1-3ms latency per hop. It scales by adding more proxy sidecars. The control plane updates routing tables quickly (within seconds). For high-throughput applications, ensure the proxy has adequate CPU/memory limits.
Enable Cloud Map Namespace
First, you create a Cloud Map namespace. For private DNS, you specify the VPC. Cloud Map creates a private hosted zone in Route 53 associated with the VPC. The namespace name (e.g., 'my-app.local') becomes the DNS suffix for all services. This step is one-time per application environment.
Register a Service in Cloud Map
For each ECS service you want to discover, you create a Cloud Map service. You define the DNS record type (A or SRV), TTL, and health check configuration (optional). Cloud Map assigns a unique service ID. This step is also one-time per service.
Create ECS Service with Service Discovery
When you create or update an ECS service, you specify the Cloud Map service ARN in the serviceRegistries parameter. ECS automatically registers each task as an instance in Cloud Map. The registration includes the task's private IP and, if using dynamic port mapping, the host port. ECS also deregisters the instance when the task stops.
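As a rough sketch of what a serviceRegistries entry can look like, the helper below builds one as a plain dict without calling AWS. The function name, the container name `web`, and the port are hypothetical. The rule that bridge/host tasks and SRV records carry a container name and port reflects my reading of the ECS API and should be checked against the current reference.

```python
REGISTRY_ARN = "arn:aws:servicediscovery:us-east-1:123456789012:service/srv-xxx"


def build_service_registry(registry_arn, network_mode, record_type,
                           container_name=None, container_port=None):
    """Return one serviceRegistries entry as a dict (no AWS call is made)."""
    entry = {"registryArn": registry_arn}
    # Assumption: bridge/host tasks and SRV records need the container name
    # and port so that the registered instance carries a port.
    if network_mode in ("bridge", "host") or record_type == "SRV":
        if container_name is None or container_port is None:
            raise ValueError("containerName and containerPort are required")
        entry["containerName"] = container_name
        entry["containerPort"] = container_port
    return entry


# awsvpc task with an A record: the ARN alone is enough in this sketch.
awsvpc_entry = build_service_registry(REGISTRY_ARN, "awsvpc", "A")

# bridge-mode task with dynamic host ports: an SRV record plus container info.
bridge_entry = build_service_registry(
    REGISTRY_ARN, "bridge", "SRV", container_name="web", container_port=80
)

print(awsvpc_entry)
print(bridge_entry)
```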
Client Resolves DNS
When a client service needs to call the backend service, it performs a DNS query for the service name (e.g., 'backend.my-app.local'). Route 53 returns up to 8 IP addresses (or SRV records) in random order. The client then opens a TCP connection to one of the IPs. DNS caching at the OS level may reuse the same IP for up to the TTL (default 60 seconds).
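Because DNS only hands back addresses, failover is left to the client. A minimal sketch of that responsibility follows: the client shuffles the resolved addresses and tries them in turn. The `call_with_failover` helper and the `fake_connect` stub are hypothetical; a real client would open a socket or HTTP connection instead.

```python
import random


def call_with_failover(addresses, connect, rng=random):
    """Try each resolved address in random order until one succeeds."""
    candidates = list(addresses)
    rng.shuffle(candidates)
    last_error = None
    for address in candidates:
        try:
            return connect(address)
        except ConnectionError as error:
            last_error = error  # remember the failure and try the next address
    raise ConnectionError("all resolved addresses failed") from last_error


def fake_connect(address):
    # Stub: 10.0.1.45 stands in for a task that has already stopped.
    if address == "10.0.1.45":
        raise ConnectionError(f"{address} refused the connection")
    return f"connected to {address}"


result = call_with_failover(["10.0.1.45", "10.0.2.12"], fake_connect)
print(result)
```

Whichever order the shuffle produces, the stale address is skipped and the call lands on the remaining task.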
Service Connect Proxy Interception
With Service Connect, each task has an Envoy proxy sidecar. When the client application sends a request to a configured service name (e.g., 'service-b:8080'), iptables rules redirect the traffic to the local proxy. The proxy consults its routing table (distributed by the control plane) and forwards the request to a healthy backend task. The proxy also handles retries and timeouts.
Enterprise Scenario 1: E-Commerce Microservices with Service Discovery
A large e-commerce platform runs hundreds of microservices on ECS (Fargate). They initially used service discovery with Cloud Map to allow services like 'inventory', 'pricing', and 'orders' to find each other. However, during flash sales, tasks scaled up rapidly and DNS TTL of 60 seconds caused many client requests to hit stale IPs, leading to increased error rates. They mitigated this by reducing TTL to 10 seconds and implementing client-side retry logic. Still, the architecture required each service to handle DNS resolution and connection management. The team later migrated to Service Connect, which eliminated the staleness issue because the proxy immediately updates routing tables. Performance improved, and error rates dropped by 40%. Configuration: They used a single namespace 'prod.local', and each service had a client alias like 'inventory.prod.local'. The proxies were allocated 256 CPU units and 512 MB memory.
Enterprise Scenario 2: Financial Services with Strict Compliance
A financial institution needed mTLS between all microservices for compliance. They evaluated App Mesh but found it too complex for their 20-service stack. Service Connect provided built-in mTLS with ACM private CA. They configured each service with a client alias and enabled mTLS. The proxies handled certificate rotation automatically. They also used Service Connect's circuit breaker to prevent cascading failures. One misconfiguration: they initially set the idle timeout too low (2 seconds), causing long-running database queries from an API service to be cut off. They increased it to 30 seconds. The team monitors proxy metrics (active connections, request duration) in CloudWatch.
Scenario 3: Hybrid Deployment with Service Discovery and ALB
A media company runs a legacy monolith alongside new microservices on ECS. The monolith is exposed via an ALB. New microservices communicate internally using Service Connect. However, the monolith needs to call a new service. They could not add a proxy to the monolith, so they instead used service discovery: the monolith queries DNS to find the new service. This hybrid approach works, but the monolith's DNS cache (60s) causes occasional connection failures during deployments. They plan to eventually migrate the monolith to ECS with Service Connect.
Exactly What SAA-C03 Tests
Objective 3.3: Design high-performing and scalable application architectures. Sub-objectives include 'Implement service-to-service communication' and 'Choose between service discovery and service mesh.'
Key concepts tested: Differences between DNS-based discovery and proxy-based connectivity; when to use service discovery vs. Service Connect; integration with Cloud Map; dynamic port mapping; health checks; TTL values; mTLS; and the role of Envoy proxy.
Common Wrong Answers and Why
"Service Discovery is always better because it has lower latency." – Wrong. While DNS adds minimal latency, the stale DNS cache can cause failures. Service Connect has slightly higher per-request latency (proxy overhead) but provides faster failover and better resiliency. The exam expects you to choose based on requirements like 'minimize connection failures during deployments' → Service Connect.
"Service Connect requires an Application Load Balancer." – Wrong. Service Connect works independently; it does not require an ALB. The ALB is used for external traffic; Service Connect is for internal.
"Service Discovery supports mTLS." – Wrong. Service Discovery does not provide encryption between services. You would need to implement mTLS at the application level or use a separate service mesh. Service Connect supports mTLS natively.
"Service Connect only works with Fargate." – Wrong. It works with both Fargate and EC2 launch types.
Specific Numbers and Terms
Default DNS TTL: 60 seconds
Service Connect proxy: Envoy
Cloud Map namespace types: DNS (public/private) and HTTP
Service Connect supports up to 1000 services per namespace (soft limit)
mTLS requires ACM Private CA
Service Connect idle timeout: default 5 minutes for HTTP, HTTP/2, and gRPC (1 hour for TCP), configurable
Edge Cases and Exceptions
If you use service discovery with dynamic port mapping, you must use SRV records (not A records). The exam might test that A records only work with static port mapping.
If your client is outside the VPC, it cannot resolve private DNS namespaces. You would need a public namespace or a Route 53 Resolver endpoint.
Service Connect does not support cross-account or cross-VPC communication out of the box. For that, you need VPC Lattice or App Mesh.
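For the dynamic port mapping case, the reason SRV records matter is that each answer carries a port alongside the target. The sketch below parses answers in the standard SRV layout (priority, weight, port, target), as printed by `dig +short SRV`. The sample lines and the `task-a`/`task-b` host names are invented for the illustration.

```python
def parse_srv_lines(dig_output):
    """Turn 'priority weight port target' lines into (host, port) pairs."""
    endpoints = []
    for line in dig_output.strip().splitlines():
        _priority, _weight, port, target = line.split()
        endpoints.append((target.rstrip("."), int(port)))
    return endpoints


# Invented sample answers; real targets and ports will differ.
sample = """\
1 1 32768 task-a.my-service.my-namespace.local.
1 1 32771 task-b.my-service.my-namespace.local.
"""

endpoints = parse_srv_lines(sample)
print(endpoints)
```

An A record would supply only an address, so a client could not learn ports such as 32768 or 32771 from it.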
How to Eliminate Wrong Answers
If the question mentions 'minimize latency' and 'no additional proxy', choose service discovery.
If the question mentions 'fast failover', 'mTLS', or 'reduce application complexity', choose Service Connect.
If the question mentions 'external clients', consider ALB + service discovery or ALB alone.
If the question mentions 'cross-account', look for VPC Lattice or Transit Gateway.
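The elimination rules above can be written down as a small lookup, which may help when drilling practice questions. This is a study aid only: the `recommend` function is hypothetical, and the order in which the rules are checked is an assumption, since the list does not say which rule wins when several apply.

```python
def recommend(requirements):
    """Map requirement phrases from a question to the option the rules suggest."""
    reqs = {r.lower() for r in requirements}
    # Assumed priority: scope constraints first, then feature requirements.
    if "cross-account" in reqs:
        return "VPC Lattice or Transit Gateway"
    if "external clients" in reqs:
        return "ALB + service discovery or ALB alone"
    if reqs & {"fast failover", "mtls", "reduce application complexity"}:
        return "Service Connect"
    if {"minimize latency", "no additional proxy"} <= reqs:
        return "service discovery"
    return "no rule matched"


print(recommend(["mTLS"]))
print(recommend(["minimize latency", "no additional proxy"]))
print(recommend(["cross-account", "fast failover"]))
```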
ECS Service Discovery uses Cloud Map DNS (A or SRV records) with a default TTL of 60 seconds.
AWS Service Connect uses an Envoy sidecar proxy for transparent service-to-service communication.
Service Discovery requires the client to perform DNS resolution and handle load balancing; Service Connect handles it automatically.
Service Connect supports mTLS using ACM Private CA; Service Discovery does not.
Service Connect provides faster failover because proxy routing tables update immediately, unlike DNS TTL.
Dynamic port mapping with Service Discovery requires SRV records; A records are only for static ports.
Service Connect works with both Fargate and EC2 launch types.
For cross-VPC or cross-account communication, use VPC Lattice or App Mesh, not Service Connect.
The exam often asks you to choose between these based on requirements: low latency vs. fast failover vs. mTLS.
These come up on the exam all the time. Here's how to tell them apart.
ECS Service Discovery (DNS-based)
Client-side discovery: client resolves DNS and connects directly.
No additional proxy; minimal overhead (~1ms DNS lookup).
Requires client to handle retries, load balancing, and failover.
DNS caching (TTL) can cause stale connections during scaling.
No built-in mTLS or traffic management.
AWS Service Connect (Proxy-based)
Server-side discovery: proxy handles routing transparently.
Envoy proxy adds ~1-3ms latency per request.
Built-in load balancing, retries, circuit breakers, and mTLS.
Immediate failover via proxy routing table updates.
Tightly integrated with ECS; simpler client code.
Mistake
ECS Service Discovery automatically load balances traffic across tasks.
Correct
Service Discovery only provides DNS resolution with round-robin. The client is responsible for load balancing (e.g., connection pooling, retry logic). Service Connect provides actual load balancing via the Envoy proxy.
Mistake
Service Connect requires an Application Load Balancer.
Correct
Service Connect is independent of ALB. ALB is for external traffic; Service Connect is for internal service-to-service communication within the ECS cluster.
Mistake
Service Discovery works across VPCs by default.
Correct
Private DNS namespaces are only resolvable within the VPC they are associated with. For cross-VPC resolution, you need a Route 53 Resolver or a public namespace.
Mistake
Service Connect adds significant latency (10ms+).
Correct
The Envoy proxy adds approximately 1-3ms per hop, which is negligible for most applications. The benefits of fast failover and mTLS outweigh the slight overhead.
Mistake
You can use A records for dynamic port mapping with service discovery.
Correct
Dynamic port mapping requires SRV records because the port is not the standard container port. A records only contain IP addresses and assume a well-known port.
No. ECS Service Discovery is built on AWS Cloud Map. You must create a Cloud Map namespace and service, then reference the service ARN in your ECS service definition. Cloud Map handles registration and deregistration of tasks.
No. Service Connect is designed for internal service-to-service communication within the same ECS cluster and namespace. For external clients, you typically use an Application Load Balancer or API Gateway.
Service Connect proxies perform health checks every 5 seconds by default. You can configure this in the service connect configuration.
Yes, but it is not recommended. They serve different purposes. If you enable both, the service will be registered in Cloud Map and also have the proxy sidecar. This might be useful during migration, but typically you choose one.
When a new task starts, the proxy registers with the control plane, and routing tables are updated across all proxies within seconds. When a task stops, the proxy deregisters, and traffic is immediately redirected to other healthy tasks. No DNS TTL delays.
Service Connect itself does not have an additional cost beyond the underlying resources (CPU/memory for the proxy sidecar). However, you pay for Cloud Map if you use it for the namespace. There are no per-request charges.
Service Connect is limited to within a single ECS cluster and namespace. It does not support cross-account or cross-VPC communication. It handles HTTP, HTTP/2, gRPC, and TCP traffic, but it does not expose every Envoy feature.