Comprehensive data protection for all workloads
rcarstens
Service Provider
Posts: 5
Liked: never
Joined: Jan 20, 2014 6:31 pm
Full Name: Robert Carstens
Contact:

Design for large number of VM's

Post by rcarstens » Mar 07, 2016 6:49 pm

We are currently running into time limitations when backing up clients with lots of VMs, and I am curious what others are doing to make Veeam scale when backing up 200+ VMs and 20TB a night. The environment in question is currently about 175 VMs with roughly 300-400GB of changed data nightly, running to a Synology 3400 with WD Red Pro drives. There are 8 jobs, one per host, all standard incremental. The backups themselves are fairly quick, but the merges are painfully slow. Once there is more than one job running and another merging, performance is poor. All in all, it takes 14 hours most nights for everything to complete, which is too long in most cases and leaves no room for growth. The jobs take about 3 hours to merge once the backup is finished. It seems like the merge just kills the Synology in terms of performance.

How are others handling situations like this? Multiple Veeam servers, multiple repositories, better storage, etc?

nmdange
Expert
Posts: 469
Liked: 113 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: Design for large number of VM's

Post by nmdange » Mar 07, 2016 7:25 pm 1 person likes this post

If the merging is taking a long time, it sounds like a performance issue on the target storage. How is this storage accessed by Veeam? Is it a Windows repository or a CIFS share? I would suspect a CIFS share would not provide the same performance as storage directly attached to a Windows computer. Also, how many physical disks are in this storage device? What RAID level, etc.?

rcarstens
Service Provider
Posts: 5
Liked: never
Joined: Jan 20, 2014 6:31 pm
Full Name: Robert Carstens
Contact:

Re: Design for large number of VM's

Post by rcarstens » Mar 07, 2016 10:18 pm

It is a Synology 3400 series with 12 drives in RAID6. The repository is an SMB share served directly from the Synology. Is there a known performance improvement from attaching the Synology via iSCSI to a physical Windows server and then creating repositories on that Windows machine, rather than using CIFS directly from the NAS?

DaveWatkins
Expert
Posts: 349
Liked: 93 times
Joined: Dec 13, 2015 11:33 pm
Contact:

Re: Design for large number of VM's

Post by DaveWatkins » Mar 07, 2016 11:50 pm

I'd think it's more likely you're running up against an IOPS issue with RAID6 or the drives themselves. If you can rebuild it as RAID10 you'd probably be better off, although you'll lose storage space. Ultimately, it might be time to look at some faster drives.

rcarstens
Service Provider
Posts: 5
Liked: never
Joined: Jan 20, 2014 6:31 pm
Full Name: Robert Carstens
Contact:

Re: Design for large number of VM's

Post by rcarstens » Mar 08, 2016 5:25 am

I agree, DaveWatkins, RAID10 would definitely help if we are up against an IO issue on the drives. The environment was just recently moved from a 4-drive Synology to the 12-drive unit and performance was not significantly better, which is what got me wondering whether it was a limitation of SMB on the Synology.

I am curious what others are seeing in terms of performance when doing 300GB+ merges on a Synology over iSCSI. Does anyone have any data on this?
When calculating the number of IO operations needed to complete a merge of this size and factoring in the expected IOPS of the RAID6, I am getting about half of what is expected. Assuming 250 IOPS for a 12-drive RAID6, a 300GB merge should take about 3 hours based on 4 IO operations per block with 512KB blocks, yet we are seeing roughly 6 hours in practice. I would imagine moving this to RAID10 would greatly help; I am just curious whether something else is at play here.
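For what it's worth, that back-of-envelope calculation can be written out as a quick script (the 250 IOPS, 4 IOs per block, and 512KB block size are the assumptions stated in this post, not measured values):

```python
# Rough merge-time model using the figures assumed in this post:
# 512KB backup blocks, 4 IO operations per merged block, and an
# estimated 250 random IOPS for the 12-drive RAID6 array.

def merge_hours(data_gb, block_kb=512, ios_per_block=4, array_iops=250):
    """Estimate hours to merge `data_gb` of incremental data."""
    blocks = data_gb * 1024 * 1024 / block_kb  # data size -> number of blocks
    total_ios = blocks * ios_per_block         # total random IOs for the merge
    return total_ios / array_iops / 3600       # IOs / IOPS = seconds -> hours

print(round(merge_hours(300), 1))                  # ~2.7 h at the assumed 250 IOPS
print(round(merge_hours(300, array_iops=125), 1))  # ~5.5 h at half that rate
```

The observed 6-hour merges would be consistent with the array delivering only ~125 effective random IOPS rather than the assumed 250.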

If you have an environment of this size or larger, what are you using for your repository?

slos
Influencer
Posts: 20
Liked: 3 times
Joined: Jan 21, 2014 3:53 am
Full Name: Steven Los
Contact:

Re: Design for large number of VM's

Post by slos » Mar 08, 2016 7:46 am

The environment below is not as large, but the goal was to create as many restore points per day as possible: a three-host ESXi cluster, a physical VCS server, a physical Veeam server, a Synology NAS, and Direct SAN Access backup mode.

Previously this was configured in network mode with the Veeam server as the proxy and a CIFS share on the NAS as the target. One long job per evening was not a problem, but running a job during production hours was very noticeable.

We modified the Veeam server from a single two-disk RAID1 to one two-disk RAID1 plus one six-disk RAID5, and also moved Veeam to Direct SAN Access. The data drive on the Veeam server became a short-term repository holding a small number of backups; a copy job then moves the data to the NAS, which has a significantly longer retention span.

You'll have to do your own planning to ensure your data drive and copy job target have sufficient space to perform all the actions required to keep the backup chain moving, and that performance meets your time goal.
VMCE, MCSE

nmdange
Expert
Posts: 469
Liked: 113 times
Joined: Aug 20, 2015 9:30 pm
Contact:

Re: Design for large number of VM's

Post by nmdange » Mar 08, 2016 5:10 pm

What is the network interface on this NAS? If it's 1Gbps that could also be a bottleneck.

I use RAID50 in my environment, with local SAS disks in a server (and a SAS JBOD) directly attached to a SAS controller. The backup repository server is also my off-host proxy, and the connection between this server and the virtual environment (mostly Hyper-V but also some VMware) is 10Gbps. I do have a lot more drives (84 vs 12), though. Each RAID50 is a grouping of 3 sets of 7-disk RAID5 arrays, all 4TB 7.2k drives. I prefer to stick with SAS-attached storage because you get a lot more bandwidth compared with SAN storage, be it iSCSI or Fibre Channel.
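As a side note on the 1Gbps question: moving the nightly data alone should not take long at line rate, so if merges dominate the window, the disks rather than the NIC are the likely culprit. A quick sketch (the ~118MB/s usable figure is an assumed practical ceiling for 1GbE after protocol overhead; the 300-400GB nightly change rate comes from the first post):

```python
# How long would the nightly changed data take to cross a 1GbE link
# at an assumed ~118MB/s of usable throughput?

def wire_hours(gb, link_mb_s=118):
    """Hours to move `gb` gigabytes at `link_mb_s` megabytes per second."""
    return gb * 1024 / link_mb_s / 3600

for gb in (300, 400):
    print(f"{gb} GB -> {wire_hours(gb):.1f} h at line rate")
```

Under an hour either way, so hours-long merges point at random IO on the spindles rather than the network.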

meilicke
Influencer
Posts: 22
Liked: 4 times
Joined: Sep 02, 2014 2:51 pm
Full Name: Scott Meilicke
Contact:

Re: Design for large number of VM's

Post by meilicke » Mar 10, 2016 12:45 am

rcarstens wrote: I agree, DaveWatkins, RAID10 would definitely help if we are up against an IO issue on the drives. The environment was just recently moved from a 4-drive Synology to the 12-drive unit and performance was not significantly better, which is what got me wondering whether it was a limitation of SMB on the Synology.

I am curious what others are seeing in terms of performance when doing 300GB+ merges on a Synology over iSCSI. Does anyone have any data on this?
When calculating the number of IO operations needed to complete a merge of this size and factoring in the expected IOPS of the RAID6, I am getting about half of what is expected. Assuming 250 IOPS for a 12-drive RAID6, a 300GB merge should take about 3 hours based on 4 IO operations per block with 512KB blocks, yet we are seeing roughly 6 hours in practice. I would imagine moving this to RAID10 would greatly help, just curious if something else is at play here.

If you have an environment of this size or larger, what are you using for your repository?
I think your 250 IOPS is generous. RAID6 will scale to faster reads as you add disks, but generally acts as a single disk for writes, so I am not surprised you are only seeing ~130 IOPS. If we assume 90 IOPS per disk (I used to assume 180 per 15k FC disk, back in my EMC days), six striped mirrors would give you 540 IOPS; at the ~130 IOPS you are actually seeing per spindle, six mirrors is nearly 800 IOPS, so maybe an hour to complete the merge? However, at that point you start to bump into the 1Gb limits.
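The RAID6-versus-RAID10 gap in that estimate follows the standard write-penalty rule of thumb, which can be sketched like this (the 90 IOPS per disk is the assumption used above; real arrays deviate with controller caching and workload):

```python
# Effective random-write IOPS under the classic RAID write penalties:
# RAID6 costs 6 backend IOs per logical write, RAID10 costs 2.

def write_iops(n_disks, per_disk_iops, write_penalty):
    return n_disks * per_disk_iops / write_penalty

raid6 = write_iops(12, 90, 6)    # 180 IOPS for the 12-drive RAID6
raid10 = write_iops(12, 90, 2)   # 540 IOPS for the same disks as RAID10
print(raid6, raid10)             # RAID10 gives ~3x the write capacity
```

Since a merge mixes reads and writes, the real-world gain would land somewhere below the 3x write-side improvement.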

The other consideration is that in my testing with a queue depth of 1, the Synologys are fast. As soon as you start to pile up the requests, i.e. increase the queue depth, they start to fall over.

foggy
Veeam Software
Posts: 18263
Liked: 1561 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Design for large number of VM's

Post by foggy » Mar 10, 2016 4:33 pm

A Synology NAS is typically not the best at providing the random IOPS that the merge process is all about.

Gostev
SVP, Product Management
Posts: 24793
Liked: 3524 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Design for large number of VM's

Post by Gostev » Mar 10, 2016 7:57 pm

NAS brand makes zero difference to random IOPS capacity... it's all about the number of spindles and their speed (and, much more rarely, the NAS CPU).

csinetops
Expert
Posts: 113
Liked: 15 times
Joined: Jun 06, 2014 2:45 pm
Full Name: csinetops
Contact:

Re: Design for large number of VM's

Post by csinetops » Mar 10, 2016 10:03 pm

I'd have to agree: while your backup target has the capacity to ingest the backups, it sounds like it lacks the power to roll up the data. I had the same issue when I first installed Veeam 3 years ago; I under-spec'd the repository and exceeded my windows. I ended up just getting an HP server and a DAS shelf full of disk, and had no issues after that.

rcarstens
Service Provider
Posts: 5
Liked: never
Joined: Jan 20, 2014 6:31 pm
Full Name: Robert Carstens
Contact:

Re: Design for large number of VM's

Post by rcarstens » Mar 11, 2016 4:53 pm

The Synology has a 1Gbps NIC on it; however, I hardly ever see it saturated, so it does not seem to be the bottleneck.

Meilicke, I have noticed the same with regard to queue depth. Running just 2-3 tasks I can saturate the network heading to the Synology; change this to 6 concurrent tasks and the performance of the Synology drops drastically. However, I am still not clear whether this bottleneck is from CIFS running on the Synology or from the queue depth itself on the box. I am hoping this will be answered when I rebuild the box as an iSCSI target only.

csinetops
Expert
Posts: 113
Liked: 15 times
Joined: Jun 06, 2014 2:45 pm
Full Name: csinetops
Contact:

Re: Design for large number of VM's

Post by csinetops » Mar 11, 2016 8:48 pm

It will be interesting to see if that helps; I bet it will. When I tried to use my NetApp FAS-2240 as a CIFS target for Veeam, performance was abysmal, and I couldn't get jobs to roll up in a 12-hour window. I changed it to an iSCSI RDM LUN (12TB) mounted on the Veeam server and performance has been great.

justyjusty123
Novice
Posts: 7
Liked: 2 times
Joined: Oct 21, 2015 10:16 am
Full Name: Christoph Leitl
Contact:

Re: Design for large number of VM's

Post by justyjusty123 » Mar 14, 2016 5:56 am

We are seeing long merge times and have an open support call, ID# 01709298.

*140 VMs, about 7.5 TB, on a production 4-host ESX cluster.
*Read and write of the incremental completes within 3.5 hours. The merge takes another 9 hours, so total job time is 12-14 hours.
*Using a virtual machine as a backup proxy on the production ESX cluster.
*All in one big job.
*The backup server (repository) is a physical Windows server with 5+1 drives in RAID6 (SATA drives) with hot-spare. We are considering redoing the setup with RAID10 (8 drives) or moving to a different machine because of the long merge times. The server was not built for the high amount of random IOPS Veeam needs for the merge, and does not have a BBU.
*10G NICs both on production and on the backup server.

What I found out, correct me if I'm wrong, is that the incremental backup is fully written to disk before the merge starts (looking at the job details, I can see that the last machine is read after 3.5 hours).
So the merge does not affect my production: it uses only the disk resources of the backup repository to merge the oldest backup into the second oldest.
That has two consequences for me:
1. I am unable to meet an 8-hour window to complete the backup, but:
2. I am still able to meet the requirement that production is not impacted within business hours.

The problems with the merge started when we changed from "forward incremental" to "forward incremental forever", meaning we stopped doing the additional weekly fulls. When we have enough space, we will switch back to forward incremental again.

kryptoem
Influencer
Posts: 11
Liked: 5 times
Joined: Jan 28, 2016 6:36 am
Full Name: Etienne Munnich
Contact:

Re: Design for large number of VM's

Post by kryptoem » Mar 14, 2016 8:55 am

I have a similar issue, however with more VMs and hosts. I've configured more proxies, which improved performance.

Merging of backups is a killer; my solution is to run 7-day retention with an active full on one of the days. Not space efficient, but it has less of a hit on the NAS (target). I've also now specced a 1x SSD upgrade for our Synology. I will update once I've tested synthetic fulls and always-incremental.

lando_uk
Expert
Posts: 306
Liked: 22 times
Joined: Oct 17, 2013 10:02 am
Full Name: Mark
Location: UK
Contact:

Re: Design for large number of VM's

Post by lando_uk » Mar 14, 2016 10:34 am

Our solution is to not do any transforms during the week and save the pain for the weekends.

We manage to protect 350 VMs with about 70TB of front-end data, but we've also hit a wall, as some of the transforms are leaking into Monday. Thankfully they finish by Monday afternoon, but it won't be long until we have to have a rethink. Scrapping RAID6 and going RAID10 for everything is the answer, but that's costly...

As an example, a typical job takes 15 minutes each night, but 9 hours on a weekend to do the synthetic full with rollbacks.

If we were a 7-days-a-week operation, having to run Saturday and Sunday backups would really screw us up; we'd need many more, faster repositories.

ferrus
Veeam ProPartner
Posts: 246
Liked: 31 times
Joined: Dec 03, 2015 3:41 pm
Location: UK
Contact:

Re: Design for large number of VM's

Post by ferrus » Mar 14, 2016 10:45 am 1 person likes this post

Similar size. We have 375 VMs and over 55TB of data, and that should shortly grow even bigger.
Without the guest indexing on the file servers, the whole estate would be backed up in just over a couple of hours.

The merge adds a few more hours onto that, but on local storage - away from the production SAN.
The longest tasks we've run into are the consistency checks, which on the file servers stretch over a couple of days.

Overall though, two hours a night easily beats the window of our previous backup solution - which stretched to almost a full day :roll:

ITP-Stan
Service Provider
Posts: 97
Liked: 11 times
Joined: Feb 18, 2013 10:45 am
Full Name: Stan (IF-IT4U)
Contact:

Re: Design for large number of VM's

Post by ITP-Stan » Mar 14, 2016 10:59 am

I had a similar issue with a smaller number of VMs and a smaller Synology NAS.
We have the Synology connected using iSCSI instead of CIFS/SMB.
Our Synology is a 4-bay system (DS412+) with 4 WD Red (not Pro) 4TB disks in RAID5.
We had about 50 VMs or so and the merge was taking 8-10 hours; keep in mind that we have only 4 disks in RAID5, and they are 5400rpm drives!

We had Veeam support investigate this issue, and after some escalation and analysis they supplied us with registry parameters that tune the merge engine for faster cooperation with Synology NAS devices. This helped reduce the merge time by a couple of hours.

Another option is to avoid the merge altogether by using periodic fulls. If you use active fulls, the source storage and systems take the load. If you use synthetic fulls, the target storage takes the load (similar to a merge), but you can schedule this weekly instead. Of course, this will increase your backup storage capacity requirements.
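To put rough numbers on that capacity trade-off (all figures are illustrative assumptions except the ~20TB full from the first post; the ~0.35TB nightly increment and 14-day retention are guesses):

```python
import math

# Capacity needed for a given retention: forever incremental (one full
# plus a rolling chain, kept short by nightly merges) versus periodic
# weekly fulls (no merges, but several full copies on disk).

def forever_incremental_tb(full_tb, inc_tb, retention_days):
    return full_tb + inc_tb * (retention_days - 1)

def weekly_fulls_tb(full_tb, inc_tb, retention_days):
    fulls = math.ceil(retention_days / 7) + 1  # +1 keeps the oldest chain restorable
    return fulls * full_tb + inc_tb * retention_days

print(forever_incremental_tb(20, 0.35, 14))  # ~24.6 TB, nightly merge load
print(weekly_fulls_tb(20, 0.35, 14))         # ~64.9 TB, merge-free
```

Roughly 2.5x the capacity in this example, which is the "trade storage for performance" point made elsewhere in this thread.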

To support growth we are going to use a recently decommissioned SAN (HP P2000 G3) as our main backup repository, with the Synology perhaps for backup copies.

Gostev
SVP, Product Management
Posts: 24793
Liked: 3524 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Design for large number of VM's

Post by Gostev » Mar 14, 2016 12:48 pm

Transform allows for trading storage system performance for backup size. If there's no performance, there's nothing to trade; simple as that. Just don't use transforms and store multiple full backups instead (which is also a far more reliable approach with low-end storage).

@Stan I am finding DS412+ with LFF hard drives waaay too slow even for my home use (at least after getting used to SSDs in my PC).

JailBreak
Veeam Vanguard
Posts: 22
Liked: 1 time
Joined: Jan 01, 2006 1:01 am
Full Name: Luciano Patrao
Contact:

Re: Design for large number of VM's

Post by JailBreak » Mar 14, 2016 4:14 pm

Hi

We back up around 800 VMs (with several jobs, only 2 concurrent), and the full data set is around 25TB with a huge amount of read and transferred data. Backups start at 7:00 PM and all are finished by around 7 AM.

But yes, we gave up on merging backups because it takes ages to finish. I prefer to do a full backup at the end of each week.
