Data Lake using Veeam Data Integration API vs Veeam Tiered/Copied data on AWS S3

Post by **Veeancent** » Jul 13, 2019 10:58 pm

Hi there,

Let's say I want to run either DLaaS on the AWS EMR platform or an on-premises Data Lake using Hadoop, and I'm looking for the most efficient way to leverage my Veeam backup data.

What about presenting Veeam backup files to an on-premises Hadoop cluster through Veeam's upcoming Data Integration API, using the so-called VeeamFLR folder as the primary data source and copying data on request from the Veeam repository to HDFS or S3? How would that compare with presenting data to AWS EMR using Veeam Cloud Tiered/Direct Copied data resting on S3, used directly as the Data Lake data source?

Finally, would there be any further advantages (or disadvantages, apart from high AWS operating costs) to presenting Veeam backup files to AWS EMR from a Veeam B&R instance running on EC2 or on VMware on AWS? That instance would hold the backup files and present data via the Veeam Data Integration API, then copy data on request from the Veeam repository to HDFS or S3 on AWS.

Many thanks

Vincent
TM West Switzerland

Post by **nielsengelen** » Jul 17, 2019 10:55 pm

Hi Vincent,

As this feature is still in development and being fine-tuned, this is a bit hard to tell. Cost is certainly one factor when using AWS EMR; it will mostly depend on the number of disks and the amount of data you want to present. Presenting it to local Hadoop clusters will be cheaper and most likely faster, but this would require some testing closer to GA to confirm.
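
For illustration only, the "copy on request from the mounted backup to HDFS or S3" step discussed in this thread could be sketched roughly as below. This is not Veeam's API: the mount path, the key prefix, and the `plan_copy` / `copy_on_request` helpers are all hypothetical names, and the sketch simply assumes the backup content is already visible as an ordinary folder on the server doing the copy. The actual transfer is left to a caller-supplied function, so the same loop could feed either S3 or HDFS.

```python
import os
from pathlib import Path
from typing import Callable, Iterator, Tuple


def plan_copy(mount_root: Path, key_prefix: str) -> Iterator[Tuple[Path, str]]:
    """Yield (local_file, object_key) pairs for every file under mount_root.

    mount_root is assumed to be the folder where the backup content has been
    presented (hypothetical path, supplied by the caller).
    """
    prefix = key_prefix.rstrip("/")
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in sorted(filenames):
            local = Path(dirpath) / name
            # Use forward slashes so keys look the same on Windows and Linux.
            rel = local.relative_to(mount_root).as_posix()
            yield local, f"{prefix}/{rel}"


def copy_on_request(
    mount_root: Path,
    key_prefix: str,
    upload: Callable[[Path, str], None],
) -> int:
    """Call upload(local_file, object_key) for each file; return the file count."""
    count = 0
    for local, key in plan_copy(mount_root, key_prefix):
        upload(local, key)
        count += 1
    return count
```

The `upload` callable is where the destination-specific part would go, for example a thin wrapper around boto3's `upload_file` for S3, or a call out to `hdfs dfs -put` for HDFS. Whether copying this way is cheaper or faster than reading tiered data directly from S3 is exactly the open question above and would need testing.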
