This chapter covers AWS DataSync, a fully managed data transfer service that simplifies, automates, and accelerates moving data between on-premises storage and AWS storage services. DataSync is a core service for high-performance data migration, recurring data processing, and hybrid cloud storage workflows. On the SAA-C03 exam, DataSync appears in approximately 5-8% of questions, often paired with Storage Gateway or Snow Family scenarios. You must understand its architecture, use cases, and how it differs from other data transfer options.
Imagine you run a global e-commerce company that needs to move inventory from a network of regional warehouses (on-premises storage) to a central distribution hub (AWS). Each warehouse uses a different inventory tracking system (NFS, SMB, HDFS, or S3). You need to move data reliably, securely, and fast, but the warehouses have limited outbound bandwidth and sometimes unreliable internet connections. AWS DataSync is like hiring a professional logistics company that sends a fleet of trucks (DataSync agents) to each warehouse. The logistics company first sends a small reconnaissance vehicle to scan the warehouse and identify exactly which items have changed since the last shipment (incremental transfer using change detection). Then, instead of loading one box at a time, they use high-speed conveyor belts (parallel multi-threaded data streams) to load pallets efficiently. They compress the pallets (data compression) and use dedicated toll lanes (AWS Direct Connect or VPN) to bypass congested public roads. If a truck breaks down (network failure), the logistics company automatically resumes from the last successful checkpoint, not from the beginning. The entire operation is orchestrated from a central control tower (DataSync API/Console) that shows real-time progress and can schedule recurring shipments (scheduled tasks). The logistics company also validates each pallet's manifest using checksums (data integrity verification) before signing off. This is precisely how DataSync works: it uses lightweight agents, parallel streams, compression, incremental sync, automatic retry, and checksum validation to move terabytes of data efficiently.
What is AWS DataSync and Why It Exists
AWS DataSync is a managed data transfer service that enables moving large amounts of data between on-premises storage and AWS, or between AWS storage services, with speed, security, and automation. It was designed to overcome the limitations of traditional tools like rsync, scp, or custom scripts, which are often slow, lack monitoring, and require manual error handling. DataSync handles network optimization, data validation, and scheduling automatically, making it ideal for one-time migrations, ongoing replication, and data processing pipelines.
How DataSync Works Internally
For on-premises or other-cloud storage, DataSync operates through a software agent deployed as a virtual machine (VM) on VMware ESXi, Microsoft Hyper-V, or Linux KVM, or as an Amazon EC2 instance when the storage is reachable from within AWS. Transfers between AWS storage services do not need an agent. The agent connects to your local storage (NFS, SMB, HDFS, or S3-compatible object storage) and to your AWS storage destination (S3, EFS, FSx for Windows File Server, FSx for Lustre, or FSx for NetApp ONTAP).
When a transfer task is initiated, DataSync performs the following steps at a low level:
Discovery and Listing: The agent lists all files and objects in the source location, collecting metadata such as size, modification time, and checksums.
Change Detection: For incremental transfers, DataSync compares the source listing against what already exists at the destination to identify new, modified, or deleted files. The comparison relies on file metadata such as size and modification time; checksum verification of the transferred data is a separate, configurable step.
Parallel Multi-Threaded Transfer: DataSync splits the data into chunks and transfers them over multiple concurrent TLS encrypted connections (up to 10 parallel streams per agent, configurable). This parallelism maximizes throughput, especially over high-latency or high-bandwidth links.
Data Integrity Verification: During transfer, each chunk is checksummed using SHA-256. After the entire file is transferred, the agent verifies the checksum against the source. If mismatch occurs, the chunk is retransmitted.
Compression: DataSync compresses data on-the-fly using LZ4 compression (for S3 destinations, it can be disabled). This reduces bandwidth usage, particularly for compressible data like text files or logs.
Error Handling and Retry: If a network error occurs, DataSync automatically retries the failed chunk with exponential backoff (up to 5 retries by default). The task can be configured to continue despite errors or to stop.
Completion and Validation: Once all files are transferred, DataSync performs a final consistency check comparing the number and size of files on both ends. A detailed task report is generated, listing successes, failures, and skipped files.
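The change-detection step above can be illustrated with a small sketch. This is not DataSync's internal implementation; it only assumes that each listing maps a path to its size and modification time, and the names `FileMeta` and `plan_transfer` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Metadata recorded for one file during a listing pass."""
    size: int
    mtime: int  # modification time, seconds since the epoch


def plan_transfer(source: dict[str, FileMeta], dest: dict[str, FileMeta]):
    """Return (to_copy, to_delete) by comparing two path -> metadata listings."""
    # Copy anything that is missing at the destination or whose metadata differs.
    to_copy = sorted(
        path for path, meta in source.items()
        if path not in dest or dest[path] != meta
    )
    # Anything only present at the destination was deleted at the source.
    to_delete = sorted(path for path in dest if path not in source)
    return to_copy, to_delete


source = {
    "a.txt": FileMeta(10, 100),
    "b.txt": FileMeta(20, 200),   # modified since the last run
    "c.txt": FileMeta(30, 300),   # new file
}
dest = {
    "a.txt": FileMeta(10, 100),
    "b.txt": FileMeta(20, 150),
    "old.txt": FileMeta(5, 50),   # no longer present at the source
}
print(plan_transfer(source, dest))  # (['b.txt', 'c.txt'], ['old.txt'])
```

Whether the entries in the second list are actually removed depends on the task's handling of deleted files.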
Key Components, Values, Defaults, and Timers
Agent: A virtual appliance deployed on-premises. Each agent can handle up to 10 Gbps throughput. You can deploy multiple agents for load balancing or failover.
Task: A job that defines source, destination, and transfer options (e.g., preserve metadata, bandwidth limit, schedule). Tasks can be one-time or recurring (cron-like schedule).
Location: A source or destination endpoint. Types: NFS, SMB, HDFS, S3, EFS, FSx for Windows, FSx for Lustre, FSx for ONTAP.
Bandwidth Limit: You can cap transfer speed (e.g., 100 Mbps) to avoid saturating your network. Default is no limit.
Preserve Metadata: DataSync can preserve file permissions, timestamps, and ownership (for NFS/SMB). This is enabled by default.
Schedule: Tasks can be scheduled using cron expressions in the six-field AWS format (e.g., cron(0 2 * * ? *) for daily at 2 AM UTC). Minimum interval is 1 hour.
Task Report: After each task execution, DataSync can generate a detailed report and store it in an S3 bucket. Report includes file-level success/failure, size, and transfer speed.
DataSync Discovery: A feature that scans on-premises storage and generates a report of file metadata, useful for planning migrations. Does not transfer data.
DataSync Transfer: The actual data movement. Default number of parallel streams is 10 per agent, adjustable up to 100.
Encryption: Data in transit is encrypted using TLS 1.2. At rest, data in S3, EFS, or FSx is encrypted according to the destination's encryption settings (e.g., SSE-S3, SSE-KMS).
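Two of the values above are easy to get wrong when moving from the console to the API: the bandwidth limit is expressed in bytes per second (the `BytesPerSecond` option), and the schedule is a six-field `cron(...)` expression (the `ScheduleExpression` field). The helpers below are a minimal sketch; their names are hypothetical and a megabit is assumed to be 10^6 bits.

```python
def mbps_to_bytes_per_second(mbps: float) -> int:
    """Convert a link budget in megabits per second to bytes per second."""
    return int(mbps * 1_000_000 / 8)


def daily_schedule(hour_utc: int) -> dict:
    """Build a Schedule structure that runs once a day at the given UTC hour."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    return {"ScheduleExpression": f"cron(0 {hour_utc} * * ? *)"}


print(mbps_to_bytes_per_second(100))  # 12500000
print(daily_schedule(2))              # {'ScheduleExpression': 'cron(0 2 * * ? *)'}
```

The resulting values would be passed as `Options={"BytesPerSecond": ...}` and `Schedule=...` when creating a task through an SDK.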
Configuration and Verification Commands
DataSync is managed via AWS Management Console, AWS CLI, or SDK. Here are common CLI commands:
1. Create an agent:
aws datasync create-agent \
  --agent-name my-agent \
  --activation-key <activation-key>
The activation key is obtained from the agent's local console, or it is retrieved for you when you activate the agent through the DataSync console.
2. Create a location (NFS source):
aws datasync create-location-nfs \
  --server-hostname 192.168.1.100 \
  --subdirectory /data \
  --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789012:agent/agent-12345678901234567
3. Create a task:
aws datasync create-task \
  --source-location-arn <source-location-arn> \
  --destination-location-arn <destination-location-arn> \
  --name my-migration-task \
  --options VerifyMode=POINT_IN_TIME_CONSISTENT,OverwriteMode=ALWAYS,Atime=BEST_EFFORT,Mtime=PRESERVE,PreserveDeletedFiles=PRESERVE
The --options structure controls behavior such as verification mode, overwrite, and handling of deleted files. Atime accepts BEST_EFFORT or NONE, and must be paired with Mtime=PRESERVE or Mtime=NONE respectively.
4. Start a task:
aws datasync start-task-execution --task-arn <task-arn>
5. Monitor task execution:
aws datasync describe-task-execution --task-execution-arn <task-execution-arn>
This returns status, bytes transferred, files transferred, and errors.
6. List tasks:
aws datasync list-tasks
Interaction with Related Technologies
AWS Direct Connect: For large transfers, use Direct Connect to bypass the public internet. DataSync works over any TCP/IP network, but Direct Connect provides consistent bandwidth and lower latency.
AWS Snow Family: For petabyte-scale data or very low-bandwidth environments, use Snowball Edge or Snowmobile for initial bulk transfer, then DataSync for ongoing incremental sync.
AWS Storage Gateway: Storage Gateway provides low-latency on-premises access to cloud storage, while DataSync is for batch transfers. They complement each other: DataSync for migration, Storage Gateway for ongoing hybrid access.
AWS Transfer Family: Transfer Family provides managed SFTP, FTPS, FTP, and AS2 protocols for third-party data exchange. DataSync is for internal transfers between your own systems and AWS.
Amazon S3 File Gateway: This is a specific Storage Gateway type that provides an NFS/SMB mount point to S3. DataSync can transfer data into an S3 bucket that is also accessed via File Gateway, but DataSync is not a file gateway itself.
Use Cases and Exam Relevance
The SAA-C03 exam tests DataSync in the context of:
Migrating large datasets to AWS (e.g., 50 TB of file data from on-premises NFS to Amazon EFS).
Recurring data transfers for processing (e.g., daily log files from on-premises to S3 for analytics).
Data synchronization across AWS Regions (for example, EFS to EFS or S3 to S3; transfers between AWS storage services do not require an agent).
Hybrid cloud storage with on-premises and cloud storage.
Key exam facts:
DataSync supports NFS, SMB, HDFS, S3, EFS, FSx for Windows, FSx for Lustre, and FSx for ONTAP.
DataSync uses an agent that must be deployed on-premises (or in another cloud) as a VM.
DataSync can transfer up to 10 Gbps per agent (theoretical maximum, actual throughput depends on network and storage).
DataSync automatically handles encryption in transit (TLS 1.2) and can preserve metadata.
DataSync is not a real-time or streaming service; it is designed for batch transfers.
DataSync is not free; you pay per GB transferred (including over AWS Direct Connect).
DataSync does not support FTP, SFTP, or HTTP sources directly; use Transfer Family for those.
DataSync can be used for one-time migration or recurring schedules (minimum 1 hour interval).
DataSync can be used with AWS Snowball Edge as a data source (Snowball Edge can act as an NFS/SMB endpoint) for hybrid scenarios.
Limitations
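DataSync agents must have network access to the source storage; deploy them as close to it as possible to minimize latency.
For S3 destinations, DataSync writes objects in the storage class you select for the location, including the S3 Glacier classes. Archival classes carry minimum storage durations and retrieval costs, so many designs still land data in S3 Standard and transition it with lifecycle policies.
DataSync can preserve NTFS ACLs (owner, DACLs, and optionally SACLs) when copying between SMB shares and FSx for Windows File Server; the level of preservation is a task option.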
DataSync cannot be used to transfer data from one on-premises location to another without an AWS destination in between.
DataSync does not support transferring data from or to Amazon EBS volumes directly; you must use EFS, FSx, or S3 as intermediaries.
Performance and Scaling
For maximum throughput, use multiple agents and multiple tasks. Each agent can handle up to 10 Gbps, but actual performance depends on source storage system speed, network latency, and file size distribution. Small files (e.g., < 1 MB) reduce throughput due to overhead. DataSync is optimized for large files (e.g., > 100 MB). For many small files, consider compressing them into archives before transfer.
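The suggestion to archive small files before transfer can be sketched with the Python standard library. This is an illustration rather than anything DataSync provides: the function name `bundle_small_files` and the 1 MB threshold are assumptions, and the archive should be written outside the directory being scanned.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_small_files(src_dir: Path, archive: Path, max_bytes: int = 1_000_000) -> int:
    """Pack files smaller than max_bytes under src_dir into one gzip-compressed tar.

    Returns the number of files added. Larger files are left where they are.
    """
    count = 0
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file() and path.stat().st_size < max_bytes:
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(path, arcname=str(path.relative_to(src_dir)))
                count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "logs"
    src.mkdir()
    for i in range(3):
        (src / f"event-{i}.log").write_text("x" * 100)
    (src / "big.bin").write_bytes(b"\0" * 2_000_000)  # above the threshold
    added = bundle_small_files(src, Path(tmp) / "small-files.tar.gz")
    print(added)  # 3
```

The trade-off is that archived files are no longer individually addressable at the destination until they are unpacked.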
DataSync provides real-time monitoring through CloudWatch metrics (e.g., BytesTransferred, FilesTransferred, TaskDuration). You can also enable detailed task reports.
Deploy the DataSync Agent
You start by deploying a DataSync agent as a virtual machine (VM) on your on-premises hypervisor (VMware ESXi, Microsoft Hyper-V, or Linux KVM). The agent is a lightweight Linux-based appliance that communicates with the AWS DataSync service. You download the OVA or VHD file from the AWS Management Console. After powering on the VM, it obtains an IP address via DHCP. You then activate the agent by entering an activation key generated in the DataSync console. The agent registers with AWS and appears in the console as an available agent. The agent must have outbound internet access (or a Direct Connect/VPN connection) to reach the DataSync service endpoints.
Create Source and Destination Locations
In the DataSync console or CLI, you define a source location pointing to your on-premises storage (e.g., an NFS export at 192.168.1.100:/data) and a destination location in AWS (e.g., an S3 bucket named `my-bucket` in us-east-1). For NFS, you specify the server hostname, the mount path, and the agent that can access it. For S3, you specify the bucket name, an optional prefix, and the IAM role that DataSync assumes to write to the bucket. The IAM role must have permissions for `s3:PutObject`, `s3:GetObject`, `s3:DeleteObject`, and `s3:ListBucket`. DataSync also supports mounting SMB shares with credentials stored in AWS Secrets Manager.
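As a rough illustration of the S3 permissions named above, the sketch below builds an IAM policy document that grants `s3:ListBucket` on the bucket and the object-level actions on its contents. The helper name `datasync_s3_policy` is hypothetical, and the policy AWS documents for DataSync includes further actions (for example `s3:GetBucketLocation` and the multipart-upload actions), so treat this as a starting point rather than a complete policy.

```python
import json


def datasync_s3_policy(bucket: str) -> dict:
    """Build a minimal IAM policy for the four S3 actions discussed in the text."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: applies to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions: apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = datasync_s3_policy("my-bucket")
print(json.dumps(policy, indent=2))
```

Splitting bucket-level and object-level actions into separate statements avoids the common mistake of attaching `s3:ListBucket` to the `/*` resource, where it has no effect.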
Create and Configure a Task
A task ties together a source location, a destination location, and a set of options. You create a task in the console or CLI, specifying the source and destination ARNs. You can configure options such as: `VerifyMode` (e.g., `POINT_IN_TIME_CONSISTENT`, which verifies the entire source and destination at the end of the transfer), `OverwriteMode` (`ALWAYS` or `NEVER`), `Atime` (`BEST_EFFORT` or `NONE`), `PreserveDeletedFiles` (`PRESERVE` or `REMOVE`), `PosixPermissions` (for NFS), and `BytesPerSecond` (the bandwidth limit, in bytes per second). You can also enable a schedule using a cron expression. The task can be started manually or automatically on schedule.
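Because a mistyped option value is only rejected when the task is created, it can help to validate the options locally first. The sketch below uses the enumerated values of the DataSync API `Options` structure as I understand them; the function `build_options` is hypothetical, it covers only the string-valued options listed in `VALID`, and the Atime/Mtime pairing rule reflects the API's documented constraint.

```python
VALID = {
    "VerifyMode": {"POINT_IN_TIME_CONSISTENT", "ONLY_FILES_TRANSFERRED", "NONE"},
    "OverwriteMode": {"ALWAYS", "NEVER"},
    "Atime": {"NONE", "BEST_EFFORT"},
    "Mtime": {"NONE", "PRESERVE"},
    "PreserveDeletedFiles": {"PRESERVE", "REMOVE"},
    "PosixPermissions": {"NONE", "PRESERVE"},
}


def build_options(**options: str) -> dict:
    """Validate task options and return them as a dict suitable for an SDK call."""
    for key, value in options.items():
        allowed = VALID.get(key)
        if allowed is None:
            raise ValueError(f"option not covered by this sketch: {key}")
        if value not in allowed:
            raise ValueError(f"{key}={value!r} is not one of {sorted(allowed)}")
    # Atime and Mtime must be set together: BEST_EFFORT/PRESERVE or NONE/NONE.
    pair = (options.get("Atime", "BEST_EFFORT"), options.get("Mtime", "PRESERVE"))
    if pair not in {("BEST_EFFORT", "PRESERVE"), ("NONE", "NONE")}:
        raise ValueError("Atime/Mtime must be BEST_EFFORT/PRESERVE or NONE/NONE")
    return dict(options)


print(build_options(VerifyMode="POINT_IN_TIME_CONSISTENT", OverwriteMode="ALWAYS"))
```

A call such as `build_options(Atime="ATIME_NONE")` raises `ValueError`, which catches the typo before any AWS request is made.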
Start the Task Execution
When you start a task (manually or via schedule), DataSync begins the transfer. The agent first lists all files at the source, then compares with the destination (if incremental) or starts a full copy. The agent opens multiple parallel TLS connections to the destination. Data is read from the source, compressed (if enabled), and sent in chunks. Each chunk is checksummed. The destination reassembles the chunks and writes the file. The agent monitors progress and reports to the DataSync service. You can view real-time metrics in CloudWatch, such as throughput and number of files transferred. If a failure occurs, the agent retries up to 5 times with exponential backoff. The task continues until all files are processed or a critical failure stops it.
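The checksum-and-retry behavior described above can be sketched as follows. This is a simplified model, not the agent's actual protocol: `send` is a hypothetical callable that delivers a chunk and returns the digest computed by the receiver, and the retry count and base delay are illustrative defaults.

```python
import hashlib
import time


def send_with_retry(chunk: bytes, send, max_retries: int = 5,
                    base_delay: float = 0.5, sleep=time.sleep) -> int:
    """Send one chunk until the receiver's SHA-256 digest matches ours.

    Returns the number of attempts used; raises RuntimeError if all attempts fail.
    """
    expected = hashlib.sha256(chunk).hexdigest()
    for attempt in range(max_retries + 1):
        try:
            if send(chunk) == expected:
                return attempt + 1
        except ConnectionError:
            pass  # treat a network error like a failed attempt
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("chunk failed after retries")


calls = {"n": 0}


def flaky_send(chunk: bytes) -> str:
    """Simulated receiver that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network blip")
    return hashlib.sha256(chunk).hexdigest()


delays = []
attempts = send_with_retry(b"payload", flaky_send, sleep=delays.append)
print(attempts, delays)  # 3 [0.5, 1.0]
```

Injecting `sleep` keeps the example fast to run and makes the backoff sequence visible.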
Verify and Monitor Completion
After the task execution completes, DataSync generates a detailed report (if configured) that lists each file, its size, whether it succeeded or failed, and any errors. The report is stored in an S3 bucket you specify. You can also view the task execution status in the console: it ends in `SUCCESS` or `ERROR`, passing through intermediate states such as `TRANSFERRING` and `VERIFYING` while it runs. DataSync provides CloudWatch metrics for each task execution, including `BytesTransferred`, `FilesTransferred`, `FilesFailed`, and `TaskDuration`. You can set CloudWatch alarms for failures. For ongoing tasks, you can schedule them to run periodically (e.g., daily) to keep the destination in sync with the source. DataSync also supports incremental transfers after the initial full sync, reducing time and bandwidth.
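For scripted monitoring, a polling loop around the describe call is a common pattern. The sketch below assumes a `describe` callable with the same keyword argument and `Status` field as the SDK's `describe_task_execution` (with boto3 this would be `boto3.client("datasync").describe_task_execution`); the function name `wait_for_execution` and the fake client are hypothetical.

```python
import time

TERMINAL_STATES = {"SUCCESS", "ERROR"}


def wait_for_execution(describe, execution_arn: str,
                       poll_seconds: float = 30.0, sleep=time.sleep) -> dict:
    """Poll a task execution until it reaches a terminal state, then return it."""
    while True:
        response = describe(TaskExecutionArn=execution_arn)
        if response["Status"] in TERMINAL_STATES:
            return response
        sleep(poll_seconds)


# Stand-in for the real client so the example runs without AWS credentials.
states = iter(["LAUNCHING", "TRANSFERRING", "VERIFYING", "SUCCESS"])


def fake_describe(TaskExecutionArn: str) -> dict:
    return {"TaskExecutionArn": TaskExecutionArn, "Status": next(states)}


result = wait_for_execution(fake_describe, "arn:example", sleep=lambda s: None)
print(result["Status"])  # SUCCESS
```

In production, event-driven notification is usually preferable to polling, but a loop like this is convenient in migration scripts.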
Scenario 1: Migrating 100 TB of Genomics Research Data to Amazon S3
A pharmaceutical company has 100 TB of genomic sequencing files stored on an on-premises NetApp NFS cluster. They want to move this data to Amazon S3 for analysis using AWS Batch and Amazon Athena. The office has a 1 Gbps internet connection with occasional packet loss. They deploy a DataSync agent on a VMware ESXi host in the same data center as the NetApp cluster. They create a source location pointing to the NFS export and a destination S3 bucket with SSE-S3 encryption. They configure a task with bandwidth limit of 800 Mbps to avoid saturating the link. The initial full transfer takes about 12 days. DataSync automatically retries any failed chunks due to network blips. After the initial sync, they schedule a daily incremental task to capture new files. The task report is stored in a separate S3 bucket for auditing. The migration succeeds with zero data corruption verified by SHA-256 checksums.
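The 12-day figure in this scenario follows from simple arithmetic, sketched below. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that sustains the configured limit with no compression, so real transfers may differ.

```python
def transfer_days(terabytes: float, mbps: float) -> float:
    """Estimate transfer time in days for a sustained rate in megabits per second."""
    bits = terabytes * 1e12 * 8          # total payload in bits
    seconds = bits / (mbps * 1e6)        # time at the sustained rate
    return seconds / 86_400              # seconds per day


print(round(transfer_days(100, 800), 1))  # 11.6
```

At the full 1 Gbps the same calculation gives roughly 9.3 days, which shows the cost of reserving 200 Mbps for other traffic.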
Scenario 2: Recurring Log Transfer for a Retail Analytics Pipeline
A large e-commerce company generates 500 GB of web server logs daily across multiple on-premises data centers. They need to transfer these logs to Amazon S3 for processing by Amazon EMR and Amazon Redshift. They deploy one DataSync agent per data center. Each agent has a source location pointing to an SMB share where logs are written. Destination is a single S3 bucket with a prefix per data center (e.g., logs/dc1/). They create a scheduled task that runs at 3 AM daily, after the log files are finalized, so no file is still being written when it is copied. The task uses VerifyMode=POINT_IN_TIME_CONSISTENT, which checks the full destination against the source at the end of the transfer. Bandwidth is limited to 200 Mbps per agent to avoid impacting production traffic. DataSync compresses the logs (text files compress well), reducing the transferred volume by 70%. The task completes within 4 hours. If a task fails, an SNS notification is sent to the operations team.
Scenario 3: Cross-Region Replication for Disaster Recovery
A financial services company uses Amazon EFS for its application data in us-east-1. For disaster recovery, they need to replicate the EFS file system to us-west-2. Because both source and destination are AWS storage services, no DataSync agent is required. The source location is the EFS file system in us-east-1, and the destination is an EFS file system in us-west-2. They create a task that runs every 6 hours. DataSync preserves file permissions and timestamps. The transfer uses the AWS internal network, so no internet bandwidth is consumed. The replication lag is at most 6 hours plus the task's run time, meeting the RPO requirement. They monitor CloudWatch metrics for any failures.
Common Misconfigurations and Pitfalls
Incorrect IAM Role: The IAM role for the destination must have the correct permissions. Missing s3:PutObject causes writes to fail.
Agent Network Access: The agent must have outbound internet access to the DataSync service endpoints (or use VPC endpoints). If the agent cannot reach AWS, activation or transfer fails.
Bandwidth Saturation: Without a bandwidth limit, DataSync can saturate the network link, causing issues for other applications.
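Inconsistent Source Data: If files are modified during transfer, the copied file may be inconsistent and verification can fail. Schedule tasks for times when the source is quiet. VerifyMode=POINT_IN_TIME_CONSISTENT does not skip open files; it checks the whole destination against the source at the end of the transfer.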
Small Files Performance: Transferring millions of small files (e.g., < 64 KB) results in low throughput due to per-file overhead. Consider archiving or combining files before transfer.
SAA-C03 Exam Focus on AWS DataSync
The SAA-C03 exam tests DataSync primarily under Objective 3.5 (High Performance) and also under Objective 2.3 (Storage). You should know:
When to use DataSync vs. other services: The exam often presents a scenario with large on-premises data to migrate. Correct answer is DataSync if the requirement is speed, automation, and incremental sync. Wrong answers include Storage Gateway (for low-latency access, not bulk transfer), Snowball (for offline transfer, not online), or S3 Transfer Acceleration (for client uploads, not server-side transfer).
Supported source/destination types: Memorize that DataSync supports NFS, SMB, HDFS, S3, EFS, FSx for Windows, FSx for Lustre, and FSx for ONTAP. It does NOT support FTP, SFTP, or EBS directly.
Agent deployment: The agent is a VM on-premises, or an EC2 instance when the storage is self-managed or in another cloud but reachable from AWS. Transfers between AWS storage services need no agent. The exam may ask where to deploy the agent for a given scenario.
Incremental transfers: DataSync can do incremental transfers after the first full sync. The exam might ask how to keep data in sync; the answer is to schedule a recurring DataSync task.
Bandwidth limit: You can set a bandwidth limit to avoid impact on production traffic. The exam may present a scenario where the network is shared, and you need to limit DataSync throughput.
Data integrity: DataSync uses checksums to verify data integrity. The exam may ask how DataSync ensures data is not corrupted.
Encryption: Data is encrypted in transit via TLS. At rest, encryption depends on destination service (e.g., SSE-S3).
Cost: You pay per GB transferred. Not free. The exam may include cost considerations when comparing with other services.
Common Wrong Answers on Exam Questions
Choosing Storage Gateway instead of DataSync: Candidates think Storage Gateway is for data transfer, but it is for low-latency access with a local cache. For bulk migration, DataSync is correct.
Choosing Snowball for online transfers: Snowball is for offline (physical) transfer. If the scenario mentions an online connection, DataSync is better.
Thinking DataSync supports real-time streaming: DataSync is batch-oriented, not real-time. For real-time, use Kinesis or DMS.
Assuming DataSync can transfer to EBS directly: DataSync does not support EBS as a destination. Use EFS or S3 as intermediate.
Ignoring the need for an agent: Some questions may ask what is needed for on-premises transfer. The answer is a DataSync agent.
Specific Numbers and Terms on the Exam
Agent throughput: up to 10 Gbps per agent.
Minimum schedule interval: 1 hour.
Compression: LZ4 (on by default for S3).
Encryption: TLS 1.2 in transit.
Default parallel streams: 10 per agent (configurable up to 100).
Retries: up to 5 with exponential backoff.
Supported protocols: NFS v3/v4, SMB v2/v3, HDFS.
Edge Cases and Exceptions
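DataSync can write directly to the S3 Glacier storage classes when one is selected for the S3 location; lifecycle policies remain an option for transitioning data after transfer.
DataSync can preserve NTFS ACLs for FSx for Windows (owner and DACLs by default, SACLs optionally), not only basic file attributes.
For S3 destinations, DataSync sets the storage class of transferred objects through the --s3-storage-class option of create-location-s3.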
DataSync can use AWS PrivateLink (VPC endpoints) for secure transfer without public internet.
How to Eliminate Wrong Answers
If the scenario requires low-latency access to cloud data from on-premises, eliminate DataSync and choose Storage Gateway.
If the scenario requires transferring data from an FTP server, eliminate DataSync and choose Transfer Family.
If the scenario requires real-time data ingestion, eliminate DataSync and choose Kinesis or DMS.
If the scenario involves a one-time transfer of petabytes and the network is very slow, eliminate DataSync and choose Snowball.
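The elimination rules above amount to a short decision procedure, sketched below as a memory aid. The function name `pick_transfer_service` and its flags are hypothetical, and real exam scenarios often combine several requirements, so this only captures the single-requirement cases listed.

```python
def pick_transfer_service(*, low_latency_access: bool = False,
                          ftp_source: bool = False,
                          real_time: bool = False,
                          offline_petabytes: bool = False) -> str:
    """Return the service the elimination rules point to for one requirement."""
    if low_latency_access:
        return "AWS Storage Gateway"
    if ftp_source:
        return "AWS Transfer Family"
    if real_time:
        return "Amazon Kinesis or AWS DMS"
    if offline_petabytes:
        return "AWS Snowball"
    # No disqualifying requirement: online batch transfer fits DataSync.
    return "AWS DataSync"


print(pick_transfer_service(ftp_source=True))  # AWS Transfer Family
print(pick_transfer_service())                 # AWS DataSync
```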
DataSync is a managed service for fast, automated data transfer from on-premises to AWS (or between AWS services).
DataSync uses a software agent deployed as a VM on-premises (or EC2 for cloud-to-cloud).
Supports NFS, SMB, HDFS, S3, EFS, FSx for Windows, FSx for Lustre, and FSx for ONTAP as source/destination.
Transfers are parallelized (up to 10 streams per agent, configurable to 100) and compressed (LZ4) for speed.
Data integrity is ensured with SHA-256 checksums and automatic retry (up to 5 times with exponential backoff).
Encryption in transit uses TLS 1.2; at rest encryption depends on destination service.
Can schedule recurring tasks with minimum 1-hour interval using cron expressions.
Bandwidth limit can be set to avoid network saturation (e.g., 100 Mbps).
DataSync is not real-time; it is batch-oriented. For real-time, use Kinesis or DMS.
DataSync does not support FTP/SFTP sources or EBS destinations directly.
These come up on the exam all the time. Here's how to tell them apart.
AWS DataSync
Designed for batch data transfer and migration (full or incremental).
Uses an agent deployed as a VM on-premises or EC2.
Transfers data directly to S3, EFS, or FSx without caching on-premises.
Supports NFS, SMB, HDFS as sources; S3, EFS, FSx as destinations.
Provides automated scheduling, compression, and checksum validation.
AWS Storage Gateway
Designed for low-latency on-premises access to cloud storage with a local cache.
Uses a gateway appliance (VM or hardware) that caches frequently accessed data locally.
Exposes cloud storage as NFS/SMB/iSCSI mounts to on-premises applications.
Supports file (File Gateway), volume (Volume Gateway), and tape (Tape Gateway) interfaces.
Provides local cache for low-latency reads/writes; data is asynchronously uploaded to AWS.
Mistake
DataSync can transfer data between two on-premises locations directly without going through AWS.
Correct
DataSync always transfers data to or from an AWS storage service. It cannot transfer directly between two on-premises systems; data must pass through an AWS location (S3, EFS, FSx).
Mistake
DataSync is a real-time data replication service.
Correct
DataSync is a batch transfer service, not real-time. It can be scheduled as frequently as every hour, but there is always some latency. For real-time replication, use AWS DMS or Kinesis.
Mistake
DataSync can migrate data to Amazon EBS volumes directly.
Correct
DataSync does not support EBS as a destination. You must use an intermediate service like EFS or S3, then copy to EBS using other tools.
Mistake
DataSync is free to use; you only pay for storage.
Correct
DataSync charges per GB transferred. There are no upfront costs, but you pay for the amount of data moved. Check the pricing page for current rates.
Mistake
DataSync requires a VPN or Direct Connect to function.
Correct
DataSync works over the public internet using TLS encryption. However, for better performance and security, you can use VPN or Direct Connect. It is not a requirement.
DataSync is for batch data transfer and migration, moving large amounts of data from on-premises to AWS quickly with automation and validation. Storage Gateway provides low-latency on-premises access to cloud storage by caching data locally. Use DataSync for one-time or recurring bulk transfers; use Storage Gateway when your on-premises applications need to access cloud storage with low latency as if it were local.
Yes, you can use DataSync to transfer data between AWS Regions. No agent is required when both locations are AWS storage services: configure the source location as an EFS or S3 location in the source Region and the destination as an EFS or S3 location in the target Region. DataSync will transfer the data over the AWS backbone network.
Yes, DataSync supports incremental transfers. After the first full sync, subsequent task executions only transfer files that are new or modified, based on file modification time and size comparison. This reduces transfer time and bandwidth usage. You can schedule tasks to run periodically to keep the destination in sync.
DataSync requires the agent to have outbound internet access to the AWS DataSync service endpoints (or use VPC endpoints). It works over the public internet with TLS encryption, but for better performance and reliability, you can use AWS Direct Connect or a VPN connection. The agent must be able to reach the source and destination storage systems.
DataSync uses SHA-256 checksums to verify data integrity. Each chunk of data is checksummed at the source and verified at the destination. If a checksum mismatch is detected, the chunk is retransmitted. After the transfer, DataSync performs a final consistency check comparing file count and sizes.
Yes, DataSync can write directly to the S3 Glacier storage classes, including S3 Glacier Deep Archive, by selecting that storage class on the S3 location. It can equally write to S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, or S3 One Zone-IA, and you can then transition data to Glacier with an S3 Lifecycle policy after the transfer.
DataSync supports the following source types: NFS, SMB, HDFS, and S3 (for cloud-to-cloud transfers). Destination types: S3, EFS, FSx for Windows File Server, FSx for Lustre, and FSx for NetApp ONTAP. It does not support FTP, SFTP, or EBS directly.