Comprehensive data protection for all workloads
stevenrodenburg1
Expert
Posts: 135
Liked: 20 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

NFS mounted to Linux repository: Instant Restore data flow?

Post by stevenrodenburg1 »

Foggy wrote in another topic:
foggy wrote:Regarding the vPower service, it does not need to run on the repository server itself. You simply select any Windows server (typically the Veeam server or a Windows proxy server) to act as the vPower NFS frontend for the Linux repository.
I've built this and it works quite well. I'm just wondering how the data flows through the various components. I've set up the following:

- Veeam B&R 6.5
- A VMware-virtualized Veeam master server (2 vCPUs, 2 GB vRAM), acting only as the master, with the Veeam vPower NFS service enabled.
- A whole bunch of virtualized, Windows-based proxies using HotAdd.
- A physical, powerful ZFS-based NFS appliance exporting a multi-TB NFS share to the Linux repository VM. ZFS compression and dedup are enabled on this exported volume.
- A Linux VM with 1 vCPU and 4 GB vRAM, acting as the central repository for all proxies. It has the aforementioned NFS export mounted with the exact same mount options as would be used with a DataDomain machine (see the example mount command after this list).
- Gigabit Ethernet connectivity
- VMs that are backed up are not compressed or deduped by Veeam (I want to benefit from dedup across all VMs, not just the VMs inside a single job).
- Backup speeds of large, running VMs with thick VMDKs are around 65 MB/s.
- Restore speeds using "Entire VM" are about the same.
- Storage vMotion speeds of a running VM that is "instant restored" and then moved to a production SAN are about 40 to 45 MB/s.
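
For illustration, a minimal sketch of the kind of NFS mount I mean on the Linux repository VM (the appliance hostname, export path and the exact option set are placeholders here, not our real values):

Code: Select all

# mount the appliance's NFS export on the Linux repository VM (placeholder names and options)
mount -t nfs -o rw,hard,tcp,vers=3,rsize=1048576,wsize=1048576,timeo=600 \
  zfs-appliance:/export/veeam-repo /mnt/veeam-repo

# or the equivalent /etc/fstab entry:
# zfs-appliance:/export/veeam-repo  /mnt/veeam-repo  nfs  rw,hard,tcp,vers=3,rsize=1048576,wsize=1048576,timeo=600  0  0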

Performance is not bad. A VM that is "Instant Restored" performs acceptably. It's much slower than normal, but it's not sluggish.

It is clear to me that the 4 GB of memory in the Linux repository machine acts as a nice cache, giving a performance boost.

The Veeam master VM that runs the Veeam NFS service is the one the ESXi servers mount, and from ESXi's point of view it is where the "instant restore VM" is actually running from. But...
...the question I have is: how does the traffic flow exactly, end to end? Via which machines, and using what protocols?

It's clear that it goes from the physical NFS appliance to the Linux "repository" (repo) VM (which mounted the NFS export). But then? How does the data (the disk blocks) get to the Veeam NFS service machine, and over which protocol?
Using netstat on the Linux repository VM, I see 8 TCP sessions going from the repo VM to the Veeam NFS VM, all SSH.

In short, how does it work exactly? (please be technical).

Can the performance be enhanced by increasing the number of those SSH connections? Can I do anything else? (10 Gbit Ethernet is not available to us.) The individual components can saturate a Gigabit pipe easily.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: NFS mounted to Linux repository: Instant Restore data flow?

Post by tsightler » 1 person likes this post

What you really want to look for are the TCP connections used between the VeeamAgent and the Veeam NFS VM. The SSH connection is simply used as a control channel: the Veeam server connects to the Linux host via SSH and pushes a small Linux agent, which is then run, connects back to the Veeam server's control port, and awaits additional commands. For an Instant Restore, the commands from the Veeam server basically tell it to connect to the proxy/vPower NFS server and present the required data files. This generally shows up as a connection on TCP port 2501 from the VeeamAgent on the Linux repository back to the vPower NFS server. You can see this by running a command as follows (this is an example from my small lab running an instant restore):

Code: Select all

# netstat -anp | grep Veeam | grep EST
tcp        0      0 192.168.60.100:2500         192.168.60.30:64508         ESTABLISHED 32652/VeeamAgent45e
tcp        0     28 192.168.60.100:2501         192.168.60.6:53541          ESTABLISHED 32652/VeeamAgent45e
In the above example, 192.168.60.100 is my Linux server acting as the repository, 192.168.60.30 is my Veeam server, and 192.168.60.6 is the proxy that is currently designated as the vPower NFS server for this specific repository. You can see the connections from the VeeamAgent process to the Veeam server on TCP port 2500, and the "data stream" connection to the vPower NFS server on port 2501. The vPower NFS/proxy server would also have a connection back to the Veeam server on port 2500 for it to accept commands as well.

You can actually see more details about the agents, the ports they are listening on, and the commands they are responding to by looking at the agent logs created during the restore process. On the Linux server these should be in /var/log/VeeamAgent, and on the Windows server they are typically under C:\ProgramData\Veeam\Backup.
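
For example, something like this on the Linux repository will show the most recent agent logs and let you follow one during an Instant Restore (a sketch only; the exact log file names vary per session):

Code: Select all

# list the newest Veeam agent logs on the Linux repository
ls -lt /var/log/VeeamAgent | head

# follow the most recent one while the restore is running
tail -f "/var/log/VeeamAgent/$(ls -t /var/log/VeeamAgent | head -n 1)"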

As far as the "protocol" being used, it's a Veeam proprietary protocol, but it's effectively the same protocol used for all proxy-to-repository communications. Effectively, the agent on the vPower NFS side requests a block (based on receiving an NFS I/O request from the ESXi host) and sends this request to the repository, which retrieves that block from the backup file chain.

As far as improving performance, there's nothing you can do from the perspective of creating more connections, etc. By far the biggest impact on vPower NFS performance is the latency in retrieving the information. Think about the request chain here: ESXi sends an NFS request for a block over the network, which must be interpreted and turned into a request for a block from the repository, which must then be retrieved from a compressed/deduplicated storage system (you said you were using ZFS). That's a lot of places to introduce latency, and latency works against an sVMotion because sVMotion does not use read-ahead or a high queue depth. You can monitor the latency from the ESXi host's perspective by looking at datastore stats in vCenter. Anything you can do to lower the latency will improve performance.
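
For example, a rough sketch from the ESXi host side (output paths are placeholders; the column names refer to the standard esxtop disk views):

Code: Select all

# confirm the vPower NFS datastore is mounted on the host
esxcli storage nfs list

# capture esxtop in batch mode (10 s samples, 30 iterations) and review the
# DAVG/GAVG latency columns for the vPower NFS datastore afterwards
esxtop -b -d 10 -n 30 > /tmp/ivr-latency.csv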

However, you really should ask yourself if this is even what you want. Remember that you chose to restore this VM with instant restore, which likely means that getting it online fast is critical. It's already going to be operating at less than ideal performance due to the vPower NFS, and if you start an sVMotion running at full speed, this will actually make its performance even worse during the migration. It's a balance between getting good performance while the VM is running and getting the sVMotion completed in a reasonable time without negatively impacting that performance even more. That's probably why sVMotion intentionally uses a fairly limited queue depth in the first place. How fast are sVMotions between datastores in your environment without Veeam vPower NFS?
stevenrodenburg1
Expert
Posts: 135
Liked: 20 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

Re: NFS mounted to Linux repository: Instant Restore data flow?

Post by stevenrodenburg1 »

tsightler wrote:How fast are sVMotions between datastores in your environment without Veeam vPower NFS?
Do you mean between datastores, with the sVMotion started as a normal "move a VM to another datastore" job? Then, well above 150 MB/s.
It's a VAAI-enabled ZFS array, using FC to connect to the ESXi farm. Because it uses VAAI, the sVMotion traffic technically never leaves the array, so it's not a fair comparison with doing an sVMotion over a Gigabit network ;-)

I know about the tradeoff between Instant Restore and the subsequent sVMotion (which, indeed, we don't want to be "too fast") versus doing a full VM restore, which transfers almost twice as fast but does not provide application services until the transfer is complete.

I was hoping to be able to squeeze a bit more performance out of the solution (read: the communication chain), especially while the hardware is picking its nose during Veeam jobs ;-)

Before I started building this setup, I always avoided dedup and compression on VMs that needed reasonable performance during an Instant Restore (and the subsequent sVMotion), because the performance I have seen with dedup and compression enabled was appalling in this scenario. The VMs being restored with Instant Restore were so slow that they were pretty much useless.
We'd then rather wait for a full VM restore than bang our heads against the wall in aggravation. The "application usability" was so bad that not having the VM running during the full restore made no difference. When users see an hourglass all the time, why bother with Instant Restore to begin with?
So those VMs with SLAs requiring Instant Restore were not compressed or deduped.

We are experimenting with a scenario where the storage reduction is handed over to a bad-ass appliance so Veeam does not have to do it. Using "dedupe-friendly" compression at the source and "decompress before writing" on the Linux repository machine showed reduced performance (we did not expect that). So now we completely avoid Veeam compression and dedup altogether and let the big bad dedupe appliance handle that.


Tip from me:
I use a DRS affinity rule to keep the Veeam NFS VM and the Veeam repository VM (with the NAS appliance's NFS mount) together on the same ESXi host to optimize the network communication between those two (they are in the same VLAN, so they communicate completely inside the vSwitch).
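
If you script your vSphere configuration, a keep-together rule like that can be created roughly as follows with govc (a sketch only: the cluster and VM names are placeholders and the exact flag syntax may differ between govc versions; the same rule can of course be created in the vSphere Client under the cluster's DRS rules):

Code: Select all

# create a VM affinity (keep-together) DRS rule for the vPower NFS VM and the Linux repository VM
govc cluster.rule.create -cluster ProdCluster -name keep-vpower-with-repo -enable -affinity veeam-nfs-vm linux-repo-vm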
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: NFS mounted to Linux repository: Instant Restore data flow?

Post by tsightler »

Yep, VAAI somewhat changes the game, so it's hard to compare, but perhaps you can do one to a local disk or something.

You might be able to do something with some TCP optimizations on the Windows server (for example, disabling IRQ coalescing, delayed ACKs, etc.; these are great for throughput but work against latency). That being said, I haven't seen these help on a consistent basis, and they seem more "environment specific".

When you say you are testing the "big bad" dedupe appliance and saw reduced performance, do you mean for backups and restores, or for IVR? The reality is, I've never seen a "big bad" dedupe appliance that works well with IVR, except for solutions like ExaGrid that can do "post-process" dedupe.
stevenrodenburg1
Expert
Posts: 135
Liked: 20 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

Re: NFS mounted to Linux repository: Instant Restore data flow?

Post by stevenrodenburg1 »

sVMotions before the VAAI days ran at about 40 to 45 MB/s, so that's more or less the same.
My thinking for maximum performance across all restore types is to have a "bad-ass dedupper" that is at least fast enough to handle the incoming streams from multiple simultaneous backups (or restores) in real time.
I noticed that our current Linux VM acting as the repo machine cannot sustain more than about 80 to 85 MB/s when multiple proxies are sending data to it (I tested that with a local RAID-0 in a physical variant before we started using a VM). So when we replace the local RAID-0 with the NFS mount from the dedup appliance, which is capable of handling the 85 MB/s in real time, it's fast enough and does not slow things down.
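
As a rough illustration of how we test that ingest side (a sketch with placeholder paths and hostnames, not our exact benchmark):

Code: Select all

# sequential write test against the repository mount point
# (note: /dev/zero compresses/dedupes trivially, so results against the appliance will be optimistic)
dd if=/dev/zero of=/mnt/veeam-repo/ddtest.bin bs=1M count=8192 oflag=direct

# network throughput between a proxy and the repository VM
iperf -s                        # on the Linux repository VM
iperf -c repo-vm -P 4 -t 30     # on a proxy or test client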

By not using dedup and compression on the Veeam components, we don't introduce extra latency there. The proxies then also don't need 4 vCPUs to handle enough parallel jobs, and IVR runs acceptably.

This, at least, is what testing here has led us to so far. I'm still tuning and testing (and not forgetting lunch :-) and I'm always open to new ideas.

What I have not yet tested is a really fast Windows VM with fast, large storage attached directly via NPIV, running as the Veeam master, the Veeam NFS component, and a proxy at the same time. To the other proxy VMs it would act as their repository.
Then an IVR job can fetch the data directly from the repository, which sits on the same VM on a (local) block device, with no network, agents, etc. in between to slow things down.
stevenrodenburg1
Expert
Posts: 135
Liked: 20 times
Joined: May 31, 2011 9:11 am
Full Name: Steven Rodenburg
Location: Switzerland
Contact:

Re: NFS mounted to Linux repository: Instant Restore data flow?

Post by stevenrodenburg1 »

tsightler wrote:The reality is, I've never seen a "big bad" dedupe appliance that works well with IVR, except for solutions like ExaGrid that can do "post-process" dedupe.
Question: ExaGrid can enhance incoming backup-job speeds by not immediately compressing and dedupping, but storing the incoming data in its "landing zone" first instead. But how does that benefit IVR?
I mean, if I want to restore a VM that no longer has any data in the landing zone, it's all deduped etc., so the ExaGrid system must rehydrate and decompress in real time, just like a DataDomain, for example. I don't see the benefit of ExaGrid in high-performance restore scenarios yet (or I just don't understand them well enough ;-)
dellock6
VeeaMVP
Posts: 6139
Liked: 1932 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: NFS mounted to Linux repository: Instant Restore data flow?

Post by dellock6 »

The difference is that in ExaGrid the last version of every backup stays uncompressed in the landing zone. Since, statistically, the vast majority of IVR activities use the latest backup, IVR can be started from that uncompressed copy.
I work with ExaGrid but do not know the fine details of the technology; they do have a dedicated configuration for Veeam, though, so I must suppose they can "understand" the structure of VBK and VIB backup files and keep the needed version in the landing zone.

Luca.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1