Host-based backup of VMware vSphere VMs.
ashleyw
Service Provider
Posts: 181
Liked: 30 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson

parallel processing algorithm improvements?

Post by ashleyw »

Hi,

We are brainstorming possibilities for improving our backup throughput.

In the last week we've been lucky enough to refresh our backup technology with a converged storage/compute device (SuperMicro), giving us 22x3TB spindles + hot spare + a 256GB SSD as an L2ARC (ZFS-speak for read cache) + 128GB RAM. We install VMware on this beast and then use hardware pass-through so that the LSI 2008 controller is handed straight to the VM running OmniOS (allocated 64GB RAM). This VMware host has dual 8Gb/s Fibre Channel cards connecting the ESX host to the fibre layer that the rest of our farm is attached to.
So on this VMware server (128GB RAM, 12 cores across 2 sockets) we have:
- SQL Server 2008 R2 (2vCPU, 4GB RAM)
- OmniOS + Napp-it vsan (4vCPU, 64GB RAM, LSI 2008 pass-through, 23x3TB SATA + 1x256GB SSD; ZFS config = 1 zpool of 5 vdevs (5+5+4+4+4 spindles) + SSD cache + 1 hot spare, sync disabled)
- vCenter (2vCPU, 6GB RAM)
- Veeam B&R (4vCPU, 6GB RAM)
- VeeamMonitor (1vCPU, 2GB RAM, for Veeam ONE etc.)
- VeeamProxy01 (4vCPU, 16GB RAM)
- VeeamProxy02 (4vCPU, 16GB RAM)

We present the storage to the Veeam proxies using the ZFS in-kernel CIFS service (not Samba, because Samba runs in user space rather than kernel space).

When we run IOMeter benchmarks from one of the Veeam VMs we get a respectable 2,500 IOPS through to the vsan layer, which we think is reasonable for a backup target.

We believe a converged device like this gives by far the best bang for the buck, as it allows us to run all our backup and management infrastructure on a single device with everything virtualised. The configuration provides over 45TB of usable storage.
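For anyone wondering where that usable figure comes from, here's a quick back-of-envelope sketch in Python (rough numbers, and it assumes the five vdevs are RAIDZ1, since the RAID level isn't spelled out above):

# Hedged capacity sketch: 22 data spindles of 3TB across five vdevs (5+5+4+4+4),
# assumed RAIDZ1, so each vdev loses one disk to parity. The hot spare and the
# SSD cache device contribute no capacity.
DISK_TB = 3                                # marketing terabytes (10^12 bytes)
vdevs = [5, 5, 4, 4, 4]

data_disks = sum(n - 1 for n in vdevs)     # 17 disks actually hold data
raw_tb = data_disks * DISK_TB              # 51 TB decimal
usable_tib = raw_tb * 1e12 / 2**40         # ~46.4 TiB as most tools report it

print(f"data disks: {data_disks}, raw: {raw_tb} TB, usable: ~{usable_tib:.1f} TiB")
# -> data disks: 17, raw: 51 TB, usable: ~46.4 TiB

That lands in the 45-46TB range quoted here, before any filesystem overhead or dedupe/compression gains.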

The relatively small performance penalty of running the SAN virtualised (thanks to hardware pass-through) is offset by the fact that we don't have to worry about connectivity between the backup host and the backup storage, and by the fact that we can run all our management software on a device that sits outside the farm itself.

anyhow... back to the original reason for the post....

We want to push this device to its limits (we want everything to smoke rather than smoulder!), so we have now enabled parallel processing in Veeam and have a few comments.
- Some job types, such as synthetic fulls, are much heavier than others. Could the limit on concurrent jobs per proxy be changed to a weighting system, so that a normal incremental job (which has a relatively low resource footprint) carries a different weight from a synthetic full? That way we could run with more parallelism during the normal day-to-day incrementals, and the number of parallel jobs would automatically drop back when the synthetic fulls kick in, preventing the proxies from being overloaded (a sketch of what we mean follows this list).
- There have been many long-running threads on the RAM usage of the proxies. While it may technically be the OS that is using the RAM to the max, I'm guessing this is down to the dedupe tables held in RAM plus the OS caches (under Windows 2008 R2). The problem is that once RAM is fully utilised, performance starts dropping off on the proxies themselves. Could any of the Veeam processes be re-architected to stop the OS consuming all of its RAM? We find that whatever RAM we throw at the proxies gets fully used.
- Currently a CIFS target can only be presented to Veeam on a single IP/URL. We'd like the parallel job architecture to allow a logical CIFS target to be defined with multiple IPs, and the job scheduling engine to load-balance across those IPs, without us having to manually point each job at a different repository (still a single storage device, just reached via different IPs - see the second sketch after this list). The reason is that it is now easy to build ZFS-based targets like ours with all network interfaces active.
- A long-running issue for us is that the performance of the dedupe containers seems to drop off as they grow, and unfortunately the drop-off is not linear, which means we have to artificially split jobs once a single job is backing up several TB of VMs (say 5TB). It would be great if the dedupe architecture could be revisited so that performance is maintained as containers grow. If artificial splits of the dedupe containers really are necessary to maintain performance, we'd ideally like Veeam to do the splitting automatically.
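To make the weighting idea concrete, here's a minimal sketch (our own illustration, not Veeam code) of how a per-proxy weight budget might behave - the weights and the capacity figure are made-up numbers:

# Minimal sketch of the weighted-concurrency idea (illustrative only, not Veeam
# code): each proxy has a weight budget, heavier job types consume more of it,
# so more incrementals than synthetic fulls can run in parallel.
from dataclasses import dataclass, field

JOB_WEIGHTS = {"incremental": 1, "synthetic_full": 3}   # assumed weights

@dataclass
class Proxy:
    name: str
    capacity: int = 4                       # weight units this proxy can absorb
    running: list = field(default_factory=list)

    def load(self) -> int:
        return sum(JOB_WEIGHTS[j] for j in self.running)

    def try_start(self, job_type: str) -> bool:
        if self.load() + JOB_WEIGHTS[job_type] <= self.capacity:
            self.running.append(job_type)
            return True
        return False

proxy = Proxy("VeeamProxy01")
print([proxy.try_start("incremental") for _ in range(4)])     # four incrementals fit
proxy.running.clear()
print([proxy.try_start("synthetic_full") for _ in range(2)])  # only one synthetic full fits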
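And a second sketch for the multi-IP repository idea (again just an illustration of what we're asking for, with example IPs and share names): one logical CIFS repository backed by several addresses on the same ZFS box, with the scheduler handing each new task the least-busy address.

# Sketch of a "one logical repository, many IPs" target (illustrative only):
# tasks are spread across the addresses with a simple least-connections pick.
from collections import Counter

class MultiPathRepository:
    def __init__(self, share: str, addresses: list[str]):
        self.share = share
        self.active = Counter({ip: 0 for ip in addresses})

    def checkout(self) -> str:
        """Hand out the address with the fewest active tasks."""
        ip = min(self.active, key=self.active.get)
        self.active[ip] += 1
        return ip

    def release(self, ip: str) -> None:
        self.active[ip] -= 1

repo = MultiPathRepository(r"\\zfsbox\backups",
                           ["10.0.0.11", "10.0.0.12", "10.0.0.13"])
print([repo.checkout() for _ in range(6)])   # tasks spread evenly over the 3 IPs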

Any comments would be much appreciated!

thanks!
Ashley
ashleyw
Service Provider
Posts: 181
Liked: 30 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson

Re: parallel processing algorithm improvements?

Post by ashleyw »

Any takers? Luckily our converged storage/compute box seems to be taking the heat without bursting into flames ;-)
In the interim we have increased the number of proxies to four, which has significantly reduced our backup times, but we still need to allocate 16GB to each of them, otherwise the synthetic full jobs fail randomly (that was at 12GB RAM per proxy with 2 parallel jobs per proxy). So we think our questions around sideways scaling and memory footprints are still very valid.
tsightler
VP, Product Management
Posts: 6009
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: parallel processing algorithm improvements?

Post by tsightler » 1 person likes this post

Looking at your config, it appears that you are writing to the repository via CIFS, which is likely why you need so much memory for your proxies; you might want to consider dedicating a server or two as the "gateway" servers for the CIFS share. Whenever you write backups to CIFS, some proxy has to take on the role of the "repository" for each running job/synthetic full process. Assuming you leave the "Proxying server" setting on automatic when configuring the CIFS repository, the server that acts as the repository, and thus performs the CIFS I/O for that job and things like the synthetic full, will basically be the first proxy used by that job. Any other proxies used by that job will have to send their data to this proxy, so you can end up with a lot of cross-proxy traffic.

I actually prefer a more "fan out" approach when using CIFS, which can be achieved in one of two ways:

1. Simply assign jobs to specific proxies. This works because the proxy and repository agents all end up running on the same proxy VM; however, it may not be ideal when using virtual proxies, since you need multiple proxies to be able to back up multiple disks from the same VM in parallel.

2. Designate one or more specific servers as the "proxying" gateway. In this scenario you effectively create the same model as if you were not using CIFS: the server selected as the proxying gateway for the CIFS share performs no function other than receiving data from the other proxies and performing the I/O to the CIFS share. This now becomes the server that needs the memory (typically at least 4GB per vCPU/core), while the proxies likely need no more than 2GB per vCPU/core.

That being said, my understanding is that this hardware is pretty much dedicated to backup. Is that correct? If so, I'd wager you'd be better off just installing Windows on the box and leveraging direct SAN access and native local storage. Also, I'm curious which IOMeter profile you are using to measure your 2,500 IOPS?
ashleyw
Service Provider
Posts: 181
Liked: 30 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson

Re: parallel processing algorithm improvements?

Post by ashleyw »

Thanks - what you're saying all makes sense, but...
- Our backup host does more than just backup: it also provides out-of-band management for the farm, hosting the vCenter and monitoring components, which is why we don't dedicate a single OS instance to Veeam. We have found that, now that hardware controller pass-through is relatively common practice on converged storage units, there is very little performance drop-off in running the SAN virtualised - we tested this by importing the same zpool on physical tin and then via the passed-through disks.
- Having all the Veeam components virtualised gives us a huge amount of flexibility in terms of deployment architecture, but as you say it causes specific headaches with the Veeam proxies (which are virtual) - which is why we don't want to assign specific jobs to specific proxies, as that limits our parallelism.

The proxying gateway model is interesting and something we may look at in the near future.

Right now we have settled on the following architecture and are seeing excellent performance:
- SQL Server 2008 R2 (2vCPU, 4GB RAM)
- OmniOS + Napp-it vsan (4vCPU, 32GB RAM, LSI 2008 pass-through, 23x3TB SATA + 1x256GB SSD; ZFS config = 1 zpool of 5 vdevs (5+5+4+4+4 spindles) + SSD cache + 1 hot spare, sync disabled)
- vCenter (2vCPU, 6GB RAM)
- Veeam B&R (4vCPU, 6GB RAM)
- VeeamMonitor (1vCPU, 2GB RAM, for Veeam ONE etc.)
- VeeamProxy01 (4vCPU, 16GB RAM)
- VeeamProxy02 (4vCPU, 16GB RAM)
- VeeamProxy03 (4vCPU, 16GB RAM)
- VeeamProxy04 (4vCPU, 16GB RAM)

Our jobs are screaming along (including the synthetic fulls), particularly now that we've split our largest job into four smaller ones (we are running 14 jobs at our primary site).

Installing Windows natively on tin would mean we'd lose out on that flexibility, and we'd need more tin for the non-Veeam components (considerably adding to the cost of the solution). It would also mean we'd lose all the nice parts of having a ZFS-based storage unit.

If Veeam has the resources to run comparative performance tests pitting a deployment style like ours against tin, that would be great, as we think our bang-for-buck is pretty good (hardware around NZ$15,000 for about 46TB of usable space and decent compute capacity - 128GB RAM, 12 cores).

cheers
Ashley
tsightler
VP, Product Management
Posts: 6009
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler

Re: parallel processing algorithm improvements?

Post by tsightler »

I wasn't really referring to any loss of performance from pass-through, just to the overhead of CIFS and the sheer amount of data that has to cross the various network stacks and buses in your setup; a lot of data copies end up happening on the underlying physical hardware. With Veeam installed on the physical hardware it is literally read data in, write data to storage. With your model, data has to be read by the virtual proxies, pass through the virtualised network (and thus physical memory), potentially go to another proxy that's running the repository agent, then to the NAS, and finally to the storage. None of those passes across the virtual network stack are resource-free, so anything you can do to reduce them will help. But if you're happy with your setup, that's really all that matters.
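To put rough numbers on it, here's a back-of-envelope sketch; the hop counts are assumptions for illustration, not measured figures from your environment:

# Back-of-envelope illustration (assumed hop counts, not measurements): how many
# times each backed-up byte crosses host memory/network in the two models.
physical_hops = [
    "read from source storage",
    "write to local repository disk",
]

virtual_hops = [
    "proxy reads from source storage",
    "virtual network hop to the proxy acting as repository/gateway",
    "CIFS write across the virtual network to the NAS VM",
    "NAS VM writes to the zpool via the passed-through controller",
]

for name, hops in (("physical install", physical_hops),
                   ("virtualised + CIFS", virtual_hops)):
    print(f"{name}: ~{len(hops)} passes over host memory per byte backed up")
    for hop in hops:
        print(f"  - {hop}")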