AWS DataSync is a fully managed data transfer service that simplifies, automates, and accelerates moving and replicating data between on-premises storage systems and AWS storage services over the internet or AWS Direct Connect. In a data lake environment, AWS DataSync can automatically and securely sync files from on-premises storage servers, such as NFS servers, to an S3-based data lake.
In this architecture, we walk you through how to use AWS DataSync and the DataSync agent to migrate data to a data lake in Amazon S3.
- You create a Network File System (NFS) file server inside your data center.
- You install an AWS DataSync agent as a virtual machine in a VMware ESXi hypervisor environment. This agent has read access to the NFS server.
- You configure AWS DataSync with the source and destination locations required to perform the synchronization.
- You create and then start an AWS DataSync task to synchronize files from NFS to S3.
- You use an AWS Glue crawler to catalog the S3 location that receives files via AWS DataSync.
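
The configuration steps above can also be scripted. The following is a minimal sketch using boto3, not a definitive implementation. It assumes the DataSync agent is already deployed and activated, and that the IAM roles for bucket access and for the crawler already exist. The function names, the `/raw` prefix, the task name `nfs-to-datalake`, the crawler name `datalake-raw-crawler`, and the database name `datalake_raw` are all hypothetical and chosen only for illustration.

```python
def build_location_requests(agent_arn, nfs_host, nfs_path, bucket_arn, bucket_role_arn):
    """Build keyword arguments for the two DataSync location calls.

    Kept free of AWS calls so the request shapes can be inspected offline.
    """
    nfs_location = {
        "ServerHostname": nfs_host,           # DNS name or IP of the NFS server
        "Subdirectory": nfs_path,             # exported path to read from
        "OnPremConfig": {"AgentArns": [agent_arn]},
    }
    s3_location = {
        "S3BucketArn": bucket_arn,
        "Subdirectory": "/raw",               # hypothetical landing prefix
        "S3Config": {"BucketAccessRoleArn": bucket_role_arn},
    }
    return nfs_location, s3_location


def sync_nfs_to_s3(agent_arn, nfs_host, nfs_path, bucket_arn, bucket_role_arn):
    """Create both locations, create a task, and start one execution."""
    import boto3  # imported lazily so the helper above works without the SDK

    datasync = boto3.client("datasync")
    nfs_kwargs, s3_kwargs = build_location_requests(
        agent_arn, nfs_host, nfs_path, bucket_arn, bucket_role_arn
    )
    source = datasync.create_location_nfs(**nfs_kwargs)
    destination = datasync.create_location_s3(**s3_kwargs)
    task = datasync.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination["LocationArn"],
        Name="nfs-to-datalake",               # hypothetical task name
    )
    execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
    return execution["TaskExecutionArn"]


def crawl_landing_prefix(bucket_name, crawler_role_arn):
    """Create and start a Glue crawler over the prefix DataSync writes to."""
    import boto3

    glue = boto3.client("glue")
    crawler_name = "datalake-raw-crawler"     # hypothetical crawler name
    glue.create_crawler(
        Name=crawler_name,
        Role=crawler_role_arn,
        DatabaseName="datalake_raw",          # hypothetical catalog database
        Targets={"S3Targets": [{"Path": f"s3://{bucket_name}/raw/"}]},
    )
    glue.start_crawler(Name=crawler_name)
```

In practice you would call `sync_nfs_to_s3` first, wait for the task execution to finish, and only then call `crawl_landing_prefix`, so that the crawler sees the transferred files.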