This chapter covers AWS DataSync, a managed data transfer service that simplifies and accelerates moving large amounts of data between on-premises storage and AWS storage services. For the CLF-C02 exam, this topic falls under Domain 3: Cloud Technology Services, Objective 3.2, which focuses on data migration and transfer services. Understanding DataSync is essential because it is a key service for hybrid cloud migrations and is frequently tested in comparison to other data transfer options like AWS Snowball or AWS Transfer Family. This chapter will explain what DataSync is, how it works, its use cases, and what you need to know for the exam.
Imagine your company has two offices: one in New York (your on-premises data center) and one in Los Angeles (the AWS cloud). You regularly need to move hundreds of boxes of documents (data) between them. You could hire a general courier who drives a truck across the country, but that is slow, and boxes might get lost or damaged. Instead, you use a specialized secure courier service (AWS DataSync). This courier has a dedicated, locked truck that carries only your documents (encryption in transit). At the New York office, you load the truck according to a standardized packing list (the task configuration), and the courier drives directly to Los Angeles without stopping elsewhere. The truck has its own GPS and monitoring (the DataSync agent and CloudWatch), so you can track progress as it happens. On arrival, the courier unpacks the boxes and checks each document against the packing list (validation), reporting anything missing or damaged. On later trips, the courier carries only documents that are new or have changed since the last run (incremental transfer). You pay per box moved (per-GB pricing), not per trip. AWS describes the service as up to 10 times faster than open-source copy tools, helped by parallel transfers and in-transit compression, and it protects document integrity with checksums. The key point: the courier doesn't just drop off boxes; it validates what arrived, reports errors, and handles ongoing updates by moving only new or changed documents.
What is AWS DataSync and the Problem It Solves
AWS DataSync is a managed data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS storage services, or between AWS storage services themselves. The core problem it solves is transferring large datasets over the internet or AWS Direct Connect with reliability, speed, and security. Traditional methods like rsync or FTP require manual scripting, lack built-in monitoring, and are prone to failures when handling millions of files. DataSync addresses these issues with a purpose-built agent that can saturate network links, validate data integrity, and automate recurring transfers.
DataSync supports both one-time migrations and ongoing replication. It can transfer data from Network File System (NFS) and Server Message Block (SMB) shares on-premises to Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. It also supports transfers between AWS storage services, such as from S3 to EFS or from EFS to FSx. The service is designed to move data at speeds up to 10 Gbps per agent, and you can use multiple agents in parallel for higher throughput.
How AWS DataSync Works – Mechanism Walkthrough
DataSync operates using a software agent that you install on your on-premises environment, typically on a virtual machine (VM) running on VMware ESXi, Microsoft Hyper-V, or Linux KVM. The agent acts as a bridge between your on-premises storage and AWS. Here's the step-by-step mechanism:
Agent Deployment: You download and install the DataSync agent on a hypervisor in your data center. The agent is a lightweight VM image that includes the DataSync software. It requires outbound internet access or AWS Direct Connect to communicate with the DataSync service endpoint.
Location Configuration: In the AWS Management Console, you define two "locations" – a source location (your on-premises NFS or SMB share) and a destination location (an S3 bucket, EFS file system, or FSx file system). For the source, you specify the server hostname or IP, mount path, and authentication credentials (if SMB). For the destination, you specify the AWS storage resource and optionally a subfolder.
Task Creation: You create a "task" that ties the source and destination locations together. The task includes configuration options such as:
- Schedule: Run once on demand or on a recurring schedule.
- Bandwidth limits: You can throttle the transfer to avoid saturating your network.
- Data validation: Checksum verification of the transferred data.
- Overwrite behavior: Overwrite files that already exist at the destination, or never overwrite them.
- Logging: You can enable CloudWatch logging for detailed transfer logs.
Execution: When you start the task (or it runs on a schedule), the agent reads the source file system, compresses and encrypts the data in transit using TLS, and sends it to the DataSync service endpoint. The service then writes the data to the destination storage service. The agent and service work together to parallelize transfers, splitting large files into chunks and sending them concurrently.
Validation and Reporting: After the transfer completes, DataSync performs a checksum validation to ensure data integrity. It generates a detailed report showing which files were transferred, skipped (due to errors or exclusions), or failed. You can view this report in the console or export it to S3.
Incremental Transfers: For ongoing replication, DataSync can perform incremental transfers after the initial full sync. It uses the source file system's metadata (e.g., modification timestamps) to identify changed files and only transfers those. This is efficient for continuous backup or synchronization.
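The incremental step in the walkthrough above can be pictured with a small sketch. This is not DataSync's actual implementation, which is internal to the service; it only illustrates the idea of comparing per-file metadata and selecting what is new or different. The `FileMeta` type, the `files_to_transfer` helper, and the sample paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Minimal metadata for one file (illustrative, not DataSync's data model)."""
    size: int      # bytes
    mtime: float   # modification time, seconds since the epoch


def files_to_transfer(source: dict[str, FileMeta],
                      destination: dict[str, FileMeta]) -> list[str]:
    """Return source paths that are missing or different at the destination."""
    changed = []
    for path, meta in source.items():
        # A missing destination entry (None) or any metadata difference
        # marks the file for transfer.
        if destination.get(path) != meta:
            changed.append(path)
    return sorted(changed)


source = {
    "reports/q1.csv": FileMeta(1_200, 1_700_000_000.0),
    "reports/q2.csv": FileMeta(1_500, 1_700_500_000.0),  # modified since last run
    "reports/q3.csv": FileMeta(900, 1_701_000_000.0),    # new file
}
destination = {
    "reports/q1.csv": FileMeta(1_200, 1_700_000_000.0),
    "reports/q2.csv": FileMeta(1_400, 1_700_100_000.0),
}

print(files_to_transfer(source, destination))
# ['reports/q2.csv', 'reports/q3.csv']
```

Only two of the three files are selected, which is why a nightly run after the initial sync moves far less data than the first full transfer.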
Key Configurations, Pricing, and Limits
Agent Types: DataSync agents are available as software appliances (free) or as hardware appliances (AWS DataSync appliance, which is a physical device for environments without virtualization). The software agent is the most common for CLF-C02.
Pricing: You pay per gigabyte of data copied. There is no upfront cost and no charge for the software agent. Pricing varies by Region but is typically around $0.0125 per GB. The same per-GB DataSync fee applies when copying between AWS storage services, even within one Region; standard storage, request, and (for cross-Region copies) data transfer charges are billed separately.
Limits: A single agent can support up to 10 Gbps throughput. You can have up to 100 agents per AWS account per region. A single task can transfer up to 10 million files. For larger datasets, you should split into multiple tasks or use multiple agents.
Scheduling: Tasks can be scheduled using cron expressions or triggered via Amazon EventBridge (formerly CloudWatch Events). The minimum interval is 1 hour.
Security: Data is encrypted in transit using TLS 1.2. At rest, encryption depends on the destination service (e.g., S3 server-side encryption). You can also use AWS KMS for encryption keys.
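To make the per-GB pricing model concrete, here is a minimal cost sketch. The $0.0125 rate is the illustrative figure quoted above, not a guaranteed current price, and the choice of 1 TB = 1024 GB is an assumption of this example. The `datasync_fee` helper is hypothetical.

```python
GB_PER_TB = 1024        # assumption: binary terabytes
RATE_PER_GB = 0.0125    # USD per GB; illustrative rate from the text


def datasync_fee(gigabytes: float, rate_per_gb: float = RATE_PER_GB) -> float:
    """Estimate the DataSync transfer fee in USD, rounded to cents."""
    return round(gigabytes * rate_per_gb, 2)


print(datasync_fee(50 * GB_PER_TB))  # 640.0  (a 50 TB initial migration)
print(datasync_fee(20 * GB_PER_TB))  # 256.0  (a 20 TB migration)
```

The estimate covers only the DataSync fee; storage, request, and any data transfer charges at the destination are separate line items.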
Comparison to On-Premises or Competing Approaches
Rsync/SCP: Manual scripting, no built-in monitoring or reporting, and typically single-threaded. AWS cites DataSync as up to 10 times faster than open-source tools, and it adds transfer logs and built-in verification.
AWS Snowball: For large datasets (over 10 TB) where network transfer is too slow or costly. Snowball is a physical device shipped to you. DataSync is for network-based transfers. The exam often tests when to use Snowball vs DataSync: DataSync for ongoing or moderate-sized transfers over network; Snowball for one-time bulk transfers of massive datasets.
AWS Transfer Family: Managed SFTP, FTPS, FTP service with a fully managed endpoint. Transfer Family is for third-party users who need to upload files via standard protocols. DataSync is for automated, scheduled transfers between storage systems.
Amazon S3 Transfer Acceleration: Speeds up uploads to S3 using AWS edge locations. It's a feature of S3, not a separate service. DataSync is a full service with agent-based transfers, validation, and scheduling. For exam: Transfer Acceleration is for client-to-S3 uploads; DataSync is for server-to-storage migrations.
When to Use DataSync vs Alternatives
Use AWS DataSync when:
You need to migrate large datasets (hundreds of GB to tens of TB) from on-premises NFS/SMB to AWS.
You need ongoing replication or synchronization (e.g., for disaster recovery or hybrid cloud workloads).
You require data validation and detailed reporting.
You have a network connection (internet or Direct Connect) with sufficient bandwidth.
Do NOT use DataSync when:
The data is over 10 PB (consider Snowmobile or multiple Snowball devices).
You need real-time file access (consider AWS Storage Gateway with cache).
You only need to transfer a few small files occasionally (use AWS CLI or S3 console).
Deploy the DataSync Agent
Download the DataSync agent virtual machine image from the AWS Management Console. Deploy it on a hypervisor in your on-premises environment. The agent VM requires 4 vCPUs, 80 GB of disk, and 32 GB of RAM (64 GB for tasks covering more than about 20 million files). During deployment, you assign an IP address and ensure the agent has outbound internet access or connectivity through AWS Direct Connect. You then obtain a unique activation key from the agent and enter it in the console to associate the agent with your AWS account. This step is critical because without a properly registered agent, DataSync cannot initiate transfers from on-premises storage.
Configure Source and Destination Locations
In the DataSync console, create a source location by specifying the on-premises NFS or SMB server details. For NFS, you provide the server hostname/IP and the mount path (e.g., /data). For SMB, you also provide the domain, username, and password. Then create a destination location – for example, an S3 bucket. You must specify the S3 bucket name, optionally a subfolder, and the S3 storage class (e.g., Standard, Standard-IA). DataSync will write objects with the same directory structure as the source. You can also configure data encryption options at this step.
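The same two locations can also be described programmatically. The sketch below only builds the request parameters; it does not call AWS. The key names follow the boto3 DataSync operations `create_location_nfs` and `create_location_s3` as I understand them, so check them against the current SDK reference before use. The hostname, ARNs, and helper function names are hypothetical placeholders.

```python
def nfs_location_params(server: str, export_path: str, agent_arn: str) -> dict:
    """Parameters for an on-premises NFS source location."""
    return {
        "ServerHostname": server,                    # NFS server hostname or IP
        "Subdirectory": export_path,                 # exported path, e.g. /data
        "OnPremConfig": {"AgentArns": [agent_arn]},  # agent that reads the share
    }


def s3_location_params(bucket_arn: str, role_arn: str,
                       prefix: str = "/",
                       storage_class: str = "STANDARD") -> dict:
    """Parameters for an S3 destination location."""
    return {
        "S3BucketArn": bucket_arn,
        "Subdirectory": prefix,                          # optional folder prefix
        "S3StorageClass": storage_class,                 # e.g. STANDARD_IA
        "S3Config": {"BucketAccessRoleArn": role_arn},   # role DataSync assumes
    }


# Hypothetical values for illustration only.
source = nfs_location_params(
    "nas01.example.internal", "/data",
    "arn:aws:datasync:us-east-1:111122223333:agent/agent-EXAMPLE",
)
destination = s3_location_params(
    "arn:aws:s3:::example-backup-bucket",
    "arn:aws:iam::111122223333:role/ExampleDataSyncS3Role",
    prefix="/nightly", storage_class="STANDARD_IA",
)
print(source["ServerHostname"], "->", destination["S3BucketArn"])
```

With credentials configured, each dictionary could be passed as keyword arguments, for example `boto3.client("datasync").create_location_nfs(**source)`, which returns the new location's ARN.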
Create and Configure a Task
A task ties the source and destination locations together. In the console, choose the source and destination locations you created. Configure transfer options: enable data validation (default: checksum validation), set overwrite mode (default: always overwrite), and specify bandwidth limits if needed. You can also set up a schedule using cron syntax for recurring transfers, for example a daily sync at 2 AM. Optionally, enable CloudWatch logging to capture detailed logs. Tasks can be started manually or run on a schedule.
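As a rough illustration of these task settings, the following sketch assembles parameters in the shape expected by boto3's DataSync `create_task` operation, without calling AWS. The option names (`VerifyMode`, `OverwriteMode`, `TransferMode`, `BytesPerSecond`) and the `cron(...)` schedule format reflect my understanding of that API and should be verified against the SDK documentation. The `task_params` helper and the ARNs are hypothetical.

```python
from typing import Optional


def task_params(source_arn: str, dest_arn: str, name: str,
                bandwidth_mbps: Optional[int] = None,
                schedule: str = "cron(0 2 * * ? *)") -> dict:
    """Build create_task parameters for a nightly, verified, incremental sync."""
    options = {
        "VerifyMode": "ONLY_FILES_TRANSFERRED",  # checksum-verify what was copied
        "OverwriteMode": "ALWAYS",               # replace changed files at destination
        "TransferMode": "CHANGED",               # copy only new or changed data
    }
    if bandwidth_mbps is not None:
        # The API takes bytes per second; convert from megabits per second.
        options["BytesPerSecond"] = bandwidth_mbps * 1_000_000 // 8
    return {
        "SourceLocationArn": source_arn,
        "DestinationLocationArn": dest_arn,
        "Name": name,
        "Options": options,
        "Schedule": {"ScheduleExpression": schedule},  # daily at 02:00 UTC
    }


params = task_params(
    "arn:aws:datasync:us-east-1:111122223333:location/loc-SOURCEEXAMPLE",
    "arn:aws:datasync:us-east-1:111122223333:location/loc-DESTEXAMPLE",
    "nightly-nfs-to-s3",
    bandwidth_mbps=500,
)
print(params["Options"]["BytesPerSecond"])  # 62500000
```

A 500 Mbps cap works out to 62,500,000 bytes per second. Leaving `bandwidth_mbps` unset omits the limit so the task can use all available bandwidth.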
Start the Transfer and Monitor Progress
Once the task is configured, you can start it immediately or wait for the schedule. During execution, the DataSync agent reads files from the source, compresses and encrypts them, and sends them to the AWS DataSync service endpoint. The service then writes data to the destination. You can monitor progress in the console, which shows the number of files transferred, bytes transferred, and any errors. You can also view CloudWatch metrics such as BytesTransferred and FilesTransferred; the duration of each phase is reported in the task execution details. If errors occur, they are logged and the task continues (skipping problematic files).
Review Reports and Validate Data
After the task completes, DataSync generates a detailed report. You can view it in the console or export it to an S3 bucket. The report lists every file that was transferred, skipped, or failed. It includes file paths, sizes, and checksums. Use this report to verify that all data arrived intact. For incremental tasks, future runs will only transfer new or modified files, based on file modification timestamps. This step is crucial for auditing and compliance, especially for regulated industries.
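The idea behind checksum validation can be shown with a small, self-contained sketch. DataSync's own verification runs inside the service and its algorithm is not described here; using SHA-256 over local directories is purely an assumption of this example, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(source_dir: Path, dest_dir: Path) -> list[str]:
    """Relative paths whose destination copy is missing or differs in checksum."""
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = dest_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    src.mkdir()
    dst.mkdir()
    (src / "a.txt").write_text("alpha")
    (dst / "a.txt").write_text("alpha")     # intact copy
    (src / "b.txt").write_text("bravo")
    (dst / "b.txt").write_text("brav0")     # simulated corruption
    (src / "c.txt").write_text("charlie")   # never arrived at the destination
    result = mismatched_files(src, dst)

print(result)
# ['b.txt', 'c.txt']
```

A transfer report plays the same role at scale: it tells you which files arrived intact and which need attention, without anyone comparing files by hand.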
Scenario 1: Enterprise Backup to AWS
A financial services company has 50 TB of critical data stored on on-premises NFS file servers. They want to back up this data to Amazon S3 for disaster recovery. They choose AWS DataSync because it supports automated, recurring transfers. The team deploys two DataSync agents (for redundancy) and creates a task that runs nightly. The initial full transfer takes about five days over a 1 Gbps Direct Connect link (50 TB at full line rate is roughly 4.6 days). After that, incremental transfers take only 30 minutes each night. The company uses S3 Object Lock to prevent deletion or overwrites. Cost: DataSync charges $0.0125/GB for the initial 50 TB ($640) and then for nightly incremental transfers (around 5 GB per night, costing about $0.06 per night). The team monitors via CloudWatch dashboards and receives alerts if a transfer fails. A common mistake is not setting bandwidth limits, which can saturate the network and impact other business operations. The team learned to throttle transfers to 500 Mbps during business hours.
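A back-of-the-envelope estimate helps when planning a migration window like the one in this scenario. The sketch below computes transfer time at a given link speed, ignoring compression (which can shorten real transfers) and protocol overhead or storage bottlenecks (which can lengthen them). It assumes decimal units (1 TB = 10^12 bytes), and the `transfer_days` helper is hypothetical.

```python
def transfer_days(terabytes: float, link_gbps: float,
                  utilization: float = 1.0) -> float:
    """Days needed to move `terabytes` over a link of `link_gbps` gigabits/s."""
    bits = terabytes * 1e12 * 8                        # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * utilization)   # effective bits per second
    return seconds / 86_400                            # seconds per day


print(round(transfer_days(50, 1.0), 2))   # 4.63 days at a full 1 Gbps
print(round(transfer_days(50, 0.5), 2))   # 9.26 days when throttled to 500 Mbps
```

Throttling to protect business traffic doubles the duration here, which is the trade-off to weigh when setting bandwidth limits for an initial sync.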
Scenario 2: Migration from Windows File Server to FSx
A media production company uses on-premises Windows file servers (SMB shares) for video editing. They want to migrate to Amazon FSx for Windows File Server to reduce hardware costs and enable global collaboration. They use DataSync to perform the migration. The team creates a task from the SMB source to an FSx file system. They first run a test migration with a small dataset to validate permissions and file integrity. After confirming, they run the full migration over a weekend. DataSync preserves file permissions, timestamps, and metadata. After the migration, they cut over to FSx and decommission the on-premises servers. The team uses DataSync's incremental transfer capability to sync changes made during the migration window. Cost: DataSync charges for the data transferred (e.g., 20 TB at $0.0125/GB = $256). The team saves on data center power and cooling costs.
Scenario 3: Multi-Region Data Distribution
A global e-commerce company needs to distribute product catalog data from their primary AWS region (us-east-1) to secondary regions (eu-west-1, ap-southeast-1) for low-latency access. They use DataSync to transfer data between S3 buckets in different regions. They create a DataSync task with an S3 bucket in us-east-1 as the source location and an S3 bucket in eu-west-1 as the destination location. They schedule the task to run every 6 hours. DataSync transfers only changed files (incremental). The company could use S3 Cross-Region Replication (CRR) instead, but DataSync gives them more control over scheduling and validation. Cost: the DataSync per-GB fee, plus standard inter-region data transfer charges (e.g., $0.02/GB) and S3 storage costs. Misconfiguration: If the team forgets to enable data validation, a corrupted file could propagate and cause catalog errors. They always enable checksum validation.
What CLF-C02 Tests on DataSync
The CLF-C02 exam tests your understanding of AWS DataSync as a data transfer service under Domain 3: Cloud Technology Services, Objective 3.2 (Data migration and transfer services). You need to know:
The purpose of DataSync: automated, fast, and secure data transfer between on-premises storage and AWS storage services.
Supported sources: NFS, SMB, and self-managed object storage (via agent).
Supported destinations: Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, and Amazon FSx for Lustre.
The role of the DataSync agent (software or hardware).
That DataSync can perform one-time or recurring transfers.
That DataSync includes data validation and encryption.
Use cases: migration, backup, disaster recovery, and data distribution.
How it differs from AWS Snowball, AWS Transfer Family, and S3 Transfer Acceleration.
Common Wrong Answers and Why Candidates Choose Them
1. "DataSync is used for real-time file access" – Wrong. DataSync is for batch transfers, not real-time. Candidates confuse it with AWS Storage Gateway, which provides low-latency access to cloud-backed storage such as S3. The exam tests this distinction: Storage Gateway for live access, DataSync for bulk transfer.
2. "DataSync can transfer data from any source" – Wrong. DataSync only supports NFS, SMB, and self-managed object storage. It does not support FTP, HTTP, or databases. Candidates might think it's a generic transfer tool. The exam expects you to know the specific protocols.
3. "DataSync requires a hardware appliance" – Wrong. The default is a software agent that you install on a hypervisor. There is a hardware appliance option, but it's not required. Candidates may recall Snowball and assume DataSync also needs hardware.
4. "DataSync transfers data faster than Direct Connect" – Wrong. DataSync uses your existing network (internet or Direct Connect). It can saturate the link, but it doesn't make the link faster. The speed depends on bandwidth. Candidates might think DataSync has its own network.
Specific Terms and Values on the Exam
- Agent: Software agent deployed on-premises.
- Locations: Source and destination.
- Task: Defines the transfer.
- NFS, SMB: Supported source protocols.
- S3, EFS, FSx: Supported destinations.
- Data validation: Checksum verification.
- Encryption in transit: TLS.
- Incremental transfers: Only changed files.
- Pricing: Per GB transferred.
- Throughput: Up to 10 Gbps per agent.
Tricky Distinctions
- DataSync vs. Storage Gateway: DataSync is for bulk transfer; Storage Gateway is for low-latency access with local cache.
- DataSync vs. Snowball: DataSync for network transfers (under 10 TB or recurring); Snowball for offline transfer of large datasets (over 10 TB or limited bandwidth).
- DataSync vs. Transfer Family: DataSync is automated server-to-storage; Transfer Family is for end-user file uploads via SFTP/FTP.
- DataSync vs. S3 Transfer Acceleration: Transfer Acceleration speeds up client uploads to S3 using edge locations; DataSync is for server-to-storage migrations.
Decision Rule for Multi-Choice Questions
If a question asks about moving large amounts of data from on-premises NFS to S3 with automation and validation, choose DataSync. If the question mentions real-time access or file sharing, choose Storage Gateway. If the question mentions physical shipment or no network, choose Snowball.
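The decision rule can be written out as a tiny function, which some readers find easier to memorize than prose. This is a study aid only: the flag names and the order in which the conditions are checked are assumptions of this sketch, and the Transfer Family branch comes from the distinctions listed earlier in this section.

```python
def pick_transfer_service(*, needs_live_access: bool = False,
                          offline_or_no_network: bool = False,
                          end_user_sftp: bool = False) -> str:
    """Map exam-question keywords to the service the chapter recommends."""
    if offline_or_no_network:      # physical shipment, no usable network
        return "AWS Snowball"
    if needs_live_access:          # real-time access or file sharing
        return "AWS Storage Gateway"
    if end_user_sftp:              # partners uploading over SFTP/FTPS/FTP
        return "AWS Transfer Family"
    # Default: automated, validated bulk transfer over the network.
    return "AWS DataSync"


print(pick_transfer_service())                            # AWS DataSync
print(pick_transfer_service(needs_live_access=True))      # AWS Storage Gateway
print(pick_transfer_service(offline_or_no_network=True))  # AWS Snowball
```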
AWS DataSync is a managed data transfer service for moving data between on-premises storage (NFS/SMB) and AWS storage services (S3, EFS, FSx).
DataSync uses a software agent deployed on-premises to read data, compress it, encrypt it in transit (TLS), and transfer it to AWS.
DataSync supports one-time or recurring (scheduled) transfers, with incremental sync after initial full transfer.
DataSync provides data validation via checksum and detailed transfer reports.
Pricing is per GB transferred; there is no cost for the software agent.
DataSync is not for real-time access (use Storage Gateway) or for physical shipment (use Snowball).
DataSync can achieve up to 10 Gbps throughput per agent, and multiple agents can be used in parallel.
DataSync supports destinations: S3 (in the storage class you choose, including the S3 Glacier classes), EFS, FSx for Windows File Server, and FSx for Lustre.
DataSync does not support database migration (use DMS) or FTP/SFTP (use Transfer Family).
The CLF-C02 exam tests the distinction between DataSync, Storage Gateway, Snowball, and Transfer Family.
These come up on the exam all the time. Here's how to tell them apart.
AWS DataSync
Purpose: Bulk data transfer and migration
Access pattern: Batch, not real-time
Source: On-premises NFS/SMB
Destination: S3, EFS, FSx
Data validation: Yes, checksum
AWS Storage Gateway
Purpose: Low-latency access to cloud storage
Access pattern: Real-time with local cache
Source: On-premises applications via NFS/SMB/iSCSI
Destination: S3, EBS, Glacier (via S3)
Data validation: Not built-in (relies on storage)
AWS DataSync
Transfer method: Network (internet or Direct Connect)
Data size: Up to tens of TB per transfer
Speed: Up to 10 Gbps per agent
Recurring transfers: Yes, scheduled
Cost: Per GB transferred
AWS Snowball Edge
Transfer method: Physical device shipped to AWS
Data size: Up to 80 TB per device (Snowball Edge)
Speed: Depends on shipping time
Recurring transfers: No, one-time
Cost: Per device + shipping + data transfer out
AWS DataSync
Protocol: Proprietary (agent-based)
Use case: Server-to-storage migration
Authentication: Agent registration
Automation: Scheduled tasks
User access: No end-user access
AWS Transfer Family
Protocol: SFTP, FTPS, FTP
Use case: End-user file uploads
Authentication: SSH keys, passwords, AD
Automation: Not built-in (uses Lambda for triggers)
User access: Direct end-user access
Mistake
AWS DataSync requires a hardware appliance for all transfers.
Correct
DataSync offers both a software agent (free, deployable on VMware, Hyper-V, or KVM) and a hardware appliance. The software agent is sufficient for most use cases. The hardware appliance is optional for environments without virtualization.
Mistake
DataSync can transfer data from any on-premises storage system, including databases and FTP servers.
Correct
DataSync only supports NFS, SMB, and self-managed object storage. It does not support databases, FTP, or HTTP. For databases, use AWS Database Migration Service (DMS).
Mistake
DataSync transfers are always incremental after the first full sync.
Correct
By default, each task run compares the source and destination and copies only data that is new or has changed, so repeat runs of the same task are effectively incremental. You can instead set the transfer mode to copy all data on every run. The exam tests that you can choose between full and incremental.
Mistake
DataSync compression is a setting you must turn on, and it lowers the per-GB charge.
Correct
DataSync does compress data in transit to improve throughput, but it is not user-configurable. The compression is transparent and always applied. However, it does not reduce the per-GB pricing – you still pay for the original uncompressed size.
Mistake
DataSync cannot write to S3 Glacier storage classes; data must land in S3 Standard first.
Correct
DataSync writes data to an S3 bucket in the storage class you choose for the location, and the choices include the S3 Glacier classes (Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive) alongside Standard, Standard-IA, One Zone-IA, and Intelligent-Tiering. S3 Lifecycle policies remain useful when data should start in a warmer class and move to Glacier later.
AWS DataSync is designed for bulk data transfer and migration between on-premises storage and AWS, operating in batch mode with validation and scheduling. AWS Storage Gateway provides low-latency, real-time access to cloud storage by caching data locally. Use DataSync for one-time or recurring migrations; use Storage Gateway when your on-premises applications need to access cloud storage as if it were local storage. For the exam, remember: DataSync = batch transfer, Storage Gateway = live access.
DataSync can write directly to S3 Glacier storage classes. DataSync writes to an S3 bucket, and you specify the storage class on the S3 location: Standard, Standard-IA, One Zone-IA, Intelligent-Tiering, or one of the Glacier classes (Instant Retrieval, Flexible Retrieval, Deep Archive). An S3 Lifecycle policy is only needed when data should land in a warmer class first and transition to Glacier later. A common exam trap is assuming that DataSync cannot target Glacier classes.
DataSync supports incremental transfers. The first run of a task is a full transfer; subsequent runs copy only files that are new or have changed, which DataSync detects by comparing source and destination metadata such as size and modification time. This is efficient for ongoing sync. The exam may ask about this feature as a benefit over manual rsync.
The DataSync agent requires outbound internet access or AWS Direct Connect to communicate with the DataSync service endpoint. It uses port 443 (HTTPS) for control and data transfer. The agent does not need inbound ports open. For best performance, a dedicated connection is recommended. The exam may test that DataSync works over the internet or Direct Connect; a VPN is not required.
AWS DataSync charges per gigabyte of data copied. As of the CLF-C02 exam, the typical price is $0.0125 per GB. The same per-GB fee applies to copies between AWS storage services, including within a single region. There is no cost for the software agent. For the exam, remember the per-GB pricing model.
DataSync can transfer data between AWS storage services in different regions. For example, you can copy data from an S3 bucket in us-east-1 to an EFS file system in eu-west-1. Inter-region transfers incur DataSync charges plus standard data transfer fees. This is a valid use case for data distribution or disaster recovery.
A single DataSync agent can achieve up to 10 Gbps throughput, depending on network bandwidth and storage performance. You can use multiple agents in parallel to increase throughput. The exam may test this limit as a performance characteristic.