Host-based backup of VMware vSphere VMs.
kjstech
Expert
Posts: 160
Liked: 16 times
Joined: Jan 17, 2014 4:12 pm
Full Name: Keith S
Contact:

Can snapshot consolidation performance be improved anyway?

Post by kjstech »

We have a backup of a SQL server that ends at 1:28 AM. Our website goes down when this backup completes and the snapshot is consolidating. The error log on the website shows:

20-01-2015 01:27:32:649 24492ms [         7] ERROR Log - Module           : Accounts 
Operation        : Provider :: FetchTransactionTypeTemplateForMobil 
Sys. Gen. Message : The connection was not closed. The connection's current state is connecting.
   at System.Data.ProviderBase.DbConnectionBusy.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)
   at System.Data.SqlClient.SqlConnection.Open()
   at vendor.DataComponentsProvider.MSSQL.DataComponentsProvider.get_Conn()
   at vendor.DataComponentsProvider.MSSQL.DataComponentsProvider.AccountsTransactionTypesForMobile() 
App. Gen. Message : Exception raised while trying to Fetch Accounts Transaction Type Template for Mobile

20-01-2015 01:28:43:255 95097ms [        16] ERROR Log - Sys. Gen. Message : The request channel timed out while waiting for a reply after 00:00:59.9844000. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout.
Seems like the performance goes to sh** when snapshots consolidate. We also use AlertBot, and every night we see timeouts or slow page load alerts for the web server and also the SQL server.

Short of buying Enterprise SQL, clustering the servers, putting two front ends behind a load balancer, and backing up each pair in different jobs/times... is there any way to mitigate the intense performance hit when snapshots are consolidated?

The environment is EMC VNX5200 storage array, NFS, 10gbe 9000 mtu, brocade turbo iron 24x switches and QLogic adapters. Each filesystem is on its own subnet with its own vmkernel. 6 hosts each. The backup is on 1gig ethernet, but the vmdk's are on the 10gig (isolated) VNX5200. ESXi 5.0.0, 2312428, Veeam 8.0.0.917 on a Server 2012 R2 virtual machine in the same environment as production virtual machines.
Vitaliy S.
VP, Product Management
Posts: 27055
Liked: 2710 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by Vitaliy S. »

What are the latency and performance (read/write rate) of the datastore during the VM snapshot consolidation process? Do you have a monitoring tool for that?
kjstech
Expert
Posts: 160
Liked: 16 times
Joined: Jan 17, 2014 4:12 pm
Full Name: Keith S
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by kjstech »

Upon further investigation in the vmware.log for our SQL VM that our site depends on, we see these stun times:
2015-01-20T06:18:03.473Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1249990 us ---> 1.24999 seconds
2015-01-20T06:26:50.926Z| vcpu-0| Checkpoint_Unstun: vm stopped for 748057 us ---> 0.748057 seconds
2015-01-20T06:27:37.049Z| vcpu-0| Checkpoint_Unstun: vm stopped for 733033 us ---> 0.733033 seconds
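
For anyone who wants to pull these numbers out of a vmware.log automatically rather than by eye, here is a minimal Python sketch. It is not a VMware or Veeam tool. It assumes the log lines look exactly like the three above (without the hand-written "---> seconds" notes); the regular expression, the `parse_stuns` name and the 1-second threshold are illustrative assumptions.

```python
import re
from datetime import datetime, timezone

# Matches lines shaped like the ones quoted in this post, e.g.
# 2015-01-20T06:18:03.473Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1249990 us
UNSTUN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|.*"
    r"Checkpoint_Unstun: vm stopped for (?P<us>\d+) us"
)

# Assumed alerting threshold in seconds; pick whatever the application tolerates.
THRESHOLD_SECONDS = 1.0


def parse_stuns(lines):
    """Return a list of (UTC timestamp, stun duration in seconds) tuples."""
    events = []
    for line in lines:
        match = UNSTUN_RE.search(line)
        if match is None:
            continue  # not an unstun line
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        stamp = stamp.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
        events.append((stamp, int(match.group("us")) / 1_000_000))
    return events


SAMPLE = [
    "2015-01-20T06:18:03.473Z| vcpu-0| Checkpoint_Unstun: vm stopped for 1249990 us",
    "2015-01-20T06:26:50.926Z| vcpu-0| Checkpoint_Unstun: vm stopped for 748057 us",
    "2015-01-20T06:27:37.049Z| vcpu-0| Checkpoint_Unstun: vm stopped for 733033 us",
]

if __name__ == "__main__":
    for stamp, seconds in parse_stuns(SAMPLE):
        note = "  <-- above threshold" if seconds > THRESHOLD_SECONDS else ""
        print(f"{stamp.isoformat()}  stunned {seconds:.3f} s{note}")
```

Feeding it the real file would be a matter of passing `open(path)` instead of `SAMPLE`; the path depends on the VM's datastore folder.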

What causes the stun times to vary or become too high for the application to handle? Would iSCSI filesystems have the same problem?

We just use the vSphere performance tab. Currently the latency values are:
Read latency: latest 4 ms, maximum 50 ms, minimum 0 ms, average 2.878 ms
Read rate: latest 47 KBps, maximum 3487 KBps, minimum 0 KBps, average 603.65 KBps
Write latency: latest 0 ms, maximum 14 ms, minimum 0 ms, average 1.094 ms
Write rate: latest 107 KBps, maximum 47173 KBps, minimum 44 KBps, average 539.917 KBps
Unfortunately the chart only goes back 1 hour and I cannot figure out how to change it. Even in chart options, anything other than realtime is not an option. All the other options are grayed out.

I'm not sure what determines how long the VM stays in a stunned state. A stun of 1.249 seconds may not bother a human being, but time-sensitive applications can react badly to it. Since this might also involve a few other logs, like the hostd log and perhaps more, I just opened a case with VMware support on this.

We keep our backups in the overnight hours to minimize the impact from this. For now, we just ignore alerts whose timestamps coincide with the backup window. Just curious, from a Veeam perspective: has anyone dealt with these issues with VMware before, and what was their experience with the process?
kjstech
Expert
Posts: 160
Liked: 16 times
Joined: Jan 17, 2014 4:12 pm
Full Name: Keith S
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by kjstech »

Image
Backup starts at 1:18 and finishes at 1:28 (for this particular VM in question).
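
The vmware.log timestamps quoted earlier end in Z (UTC), while the backup times here are local. A small Python sketch of lining the two up follows. The UTC-5 offset is an assumption (the thread never states the time zone; it is simply the offset that makes 06:18Z match a 1:18 start), and `to_local` and `in_backup_window` are hypothetical helper names.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: the site is at UTC-5 in January. Not stated anywhere in the thread.
LOCAL_ZONE = timezone(timedelta(hours=-5))

# Unstun timestamps copied from the vmware.log excerpt earlier in the thread.
STUNS_UTC = [
    "2015-01-20T06:18:03.473Z",
    "2015-01-20T06:26:50.926Z",
    "2015-01-20T06:27:37.049Z",
]


def to_local(stamp):
    """Convert a vmware.log style UTC timestamp to the assumed local zone."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL_ZONE)


def in_backup_window(local_dt, start=time(1, 18), end=time(1, 29)):
    """True if the local wall-clock time falls inside the backup window.

    The end is padded to 01:29 because the job is only known to finish
    "at 1:28", with no seconds given.
    """
    return start <= local_dt.time() <= end


if __name__ == "__main__":
    for stamp in STUNS_UTC:
        local = to_local(stamp)
        print(local.strftime("%H:%M:%S"), "in window:", in_backup_window(local))
```

Under that assumption the first stun lands at the start of the job (snapshot creation) and the other two at the end (snapshot removal).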
loelly
Enthusiast
Posts: 51
Liked: 10 times
Joined: Apr 17, 2014 8:25 am
Full Name: Jens Siegmann
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by loelly »

I would trade for your stun times in a heartbeat. We're facing 20-30 seconds on unstun. :\
kjstech
Expert
Posts: 160
Liked: 16 times
Joined: Jan 17, 2014 4:12 pm
Full Name: Keith S
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by kjstech »

loelly wrote:I would trade for your stun times in a heartbeat. We're facing 20-30 seconds on unstun. :\
I've seen that before on 1 gig NFS to an EMC Celerra. The VNX5200 with auto-tiering storage, flash cache and 10gig with jumbo frames really improved that. I guess we have a very sensitive SQL application, though, that can't take even a second of stun time. I have a ticket open with VMware to investigate and determine whether this would happen on iSCSI or not.

Our VNX5200 has dual port 10gig Ethernet on both the active and failover sides. However, my understanding is that for iSCSI we would need a different network card, so I'm not going to go through the expense unless it's absolutely necessary. It was not budgeted this year anyway, so it would have to be a 2016 project if we decide to migrate to iSCSI. We don't have FC / FCoE hardware, so it would have to be iSCSI over 10gig Ethernet.
Vitaliy S.
VP, Product Management
Posts: 27055
Liked: 2710 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by Vitaliy S. »

kjstech wrote:Just curious from a veeam perspective has anyone dealt with these issues before with vmware and what was their experience with the process?
Yes, there is an existing topic discussing snapshot commit issues in VMware environments, and I know that VMware keeps improving this process in the latest versions, so updating to their latest patch level/version might help. Also, if there is any way to adjust timeouts in the application, I would do that as well.
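
Where the application's data-access code can be changed, one common way to ride out a stun of a second or so is to retry the connection with a short backoff instead of failing on the first error. A minimal Python sketch of that pattern follows. The application in this thread is .NET and vendor-supplied, so this only illustrates the idea; `call_with_retry`, its defaults and the exception types it catches are all assumptions.

```python
import time


def call_with_retry(operation, attempts=4, first_delay=0.5, factor=2.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    Waits first_delay, then first_delay * factor, and so on between attempts.
    With the defaults that is 0.5 s + 1 s + 2 s, comfortably longer than the
    ~1.25 s stun reported above. The last error is re-raised once `attempts`
    calls have failed.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_open():
        # Hypothetical stand-in for opening a DB connection:
        # fails twice (as if the VM were stunned), then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("connection busy")
        return "connected"

    # A no-op sleep keeps the demo instant.
    print(call_with_retry(flaky_open, sleep=lambda seconds: None))
```

Retrying only helps with short stalls; it does nothing for the longer I/O slowdown while the snapshot is being committed.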
stevericks
Novice
Posts: 7
Liked: never
Joined: Jan 30, 2012 10:16 am
Full Name: Steven Ricks
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by stevericks »

I am looking into this again as well.
I keep trying to persuade our DBAs to back up our SQL servers with Veeam. But at the end of every backup, the VM will drop its network connection. Not good for high volume 24x7 SQL Servers.
I am sure that last time I looked into this I had it in writing somewhere that there is nothing we can do. VMware will drop the network connection during the commit of the snapshot at the end of the backup job.

I can't find where I read that now though :?

Can someone from Veeam update me on this?
Is it better with VMware 5.5? We are on 4.1
Is it better with Veeam 8? We are on Veeam 7.

Any help appreciated,

Steve.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by foggy »

Most likely you've read it somewhere in this huge thread. Btw, you can find some hints there that can help to minimize the effect of the VM stun during snapshot commit. Updating to the latest versions is among them, as Vitaliy has stated above.
dellock6
Veeam Software
Posts: 6137
Liked: 1928 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by dellock6 »

Steven,
since these database servers need to be up 24/7, what about using AlwaysOn availability groups? In that case we can back up one of the nodes while the other is still serving the web applications. Or leverage our storage integration to remove stun problems (what storage are you using?)
Other than this, the issue is all on VMware. The stun process is something we can't prevent, because Veeam simply instructs vCenter (and then the ESXi host running the VM at that point in time) to consolidate the snapshot. But one sure step forward is to update to vSphere 5.x; there are quite important improvements over 4.1 in terms of snapshot management.
About the storage protocol, I'm quite sure there is no difference between FC and 10g iSCSI; it's more about the underlying performance of the storage and the technology of VMware snapshots itself.
Luca Dell'Oca
Principal EMEA Cloud Architect @ Veeam Software

@dellock6
https://www.virtualtothecore.com/
vExpert 2011 -> 2022
Veeam VMCE #1
MAA
Expert
Posts: 101
Liked: 3 times
Joined: Apr 27, 2013 12:10 pm
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by MAA »

kjstech
Try this:
Disable "Enable VMware Tools quiescence"
Uninstall "Volume Shadow Copy Services Support" from VMware Tools
(After these settings, I no longer have problems)

Also, you can try "Use Storage Snapshots"
dellock6
Veeam Software
Posts: 6137
Liked: 1928 times
Joined: Jul 26, 2009 3:39 pm
Full Name: Luca Dell'Oca
Location: Varese, Italy
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by dellock6 »

"Use storage snapshots" makes sense only if there is a compatible storage array (HP or NetApp) and he has an Enterprise Plus license.
kjstech
Expert
Posts: 160
Liked: 16 times
Joined: Jan 17, 2014 4:12 pm
Full Name: Keith S
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by kjstech »

Wow guys, sorry I didn't realize this was replied to.

I use EMC VNX5200 storage array on its own 10gbe switches with qlogic 10gbps nics in the vmware esxi hosts. It presents NFS shares to each server over this 10gbps network with jumbo frames.

Would there be any issues without VMWare Tools quiescence and Volume Shadow Copy Services Support? Would this just mean all backups are crash consistent, meaning a restore may or may not necessarily work properly?

There is a VAAI provider for EMC VNX installed in my vSphere infrastructure but as far as I can tell it only provides insight and the ability to provision / deprovision storage from vSphere client.
Vitaliy S.
VP, Product Management
Posts: 27055
Liked: 2710 times
Joined: Mar 30, 2009 9:13 am
Full Name: Vitaliy Safarov
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by Vitaliy S. »

kjstech wrote:Would there be any issues without VMWare Tools quiescence and Volume Shadow Copy Services Support? Would this just mean all backups are crash consistent, meaning a restore may or may not necessarily work properly?
For application consistency or application backup best practices, these options are required.
emachabert
Veeam Vanguard
Posts: 388
Liked: 168 times
Joined: Nov 17, 2010 11:42 am
Full Name: Eric Machabert
Location: France
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by emachabert »

Is the datastore dedicated to the SQL server? If not, is there other activity on the datastore during the backup? Do not forget that the VMware NFS implementation suffers from the "one-lane motorway" problem, since there is only one data channel per mounted datastore per ESX host. And it only uses TCP, which is anything but a low-latency protocol, especially when it comes to transmission error handling.
You could really give an iSCSI setup a try, thus avoiding the Datamover layer.
Last but not least, what is the configuration of the VNX? Perhaps you are hitting its performance limits when the snapshot commit occurs.
Veeamizing your IT since 2009/ Veeam Vanguard 2015 - 2023
kjstech
Expert
Posts: 160
Liked: 16 times
Joined: Jan 17, 2014 4:12 pm
Full Name: Keith S
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by kjstech »

There are 3 NFS file systems, all on their own subnets, each with its own storage vmkernel. 9000 MTU, 10 gigabit Ethernet. It's graphed out and we don't even come close to maxing out the capacity of the links. So with your analogy it would be a 3 lane highway with a speed limit of 200 mph and all buses full of people instead of cars or vans.

We would have to make a big investment into more cards for servers and vnx, plus the expensive twinax cable to do iscsi over 10gbe. After talking to vmware support, and others on vmware user communities, people with FC, iSCSI and NFS all have experienced the same problem.

I think the only way out is to have the application developer assist us with two web servers and two SQL databases doing replication. Then one web/SQL pair is backed up in a different job / time slot than the other, with hardware load balancers outside of VMware to direct access. That way, when a snapshot commit takes down one server, the hardware load balancer will just keep things trucking along with the other server.
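
To make the staggered-window idea concrete, here is a rough Python sketch of deciding which node should receive traffic at a given time. A real hardware load balancer would rely on health checks rather than a clock, so treat this purely as an illustration. The node names `sql-a` and `sql-b` and the second backup window are invented; only the 01:18 to 01:28 window comes from this thread.

```python
from datetime import time

# Hypothetical nodes and schedules. Only the first window is taken from the thread.
BACKUP_WINDOWS = {
    "sql-a": (time(1, 18), time(1, 28)),
    "sql-b": (time(3, 18), time(3, 28)),
}


def in_window(now, start, end):
    """True if `now` is inside [start, end]; also handles windows crossing midnight."""
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end


def available_backends(now, windows=BACKUP_WINDOWS):
    """Return the nodes that are not inside their backup window at `now`."""
    return sorted(
        name
        for name, (start, end) in windows.items()
        if not in_window(now, start, end)
    )


if __name__ == "__main__":
    for probe in (time(1, 20), time(3, 20), time(12, 0)):
        print(probe.strftime("%H:%M"), "->", available_backends(probe))
```

The point of the sketch is only that the two windows must never overlap; otherwise there is a moment when no backend is available.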
emachabert
Veeam Vanguard
Posts: 388
Liked: 168 times
Joined: Nov 17, 2010 11:42 am
Full Name: Eric Machabert
Location: France
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by emachabert »

OK, so the server has multiple VMDKs spread over the 3 NFS mounts.
It looks like the VNX is hitting its maximum random read/write performance at the time of the commit, so your best chance is the one you mentioned: have the application modified to handle multiple backend database servers (I don't see the need for a second web server in that case). It may also cost money in licensing if you double your SQL servers.
If Veeam had storage integration with the VNX like it has with 3PAR and NetApp, you wouldn't have that problem, since the VMware snapshot generally lasts less than 30 seconds.
nielsengelen
Product Manager
Posts: 5618
Liked: 1177 times
Joined: Jul 15, 2013 11:09 am
Full Name: Niels Engelen
Contact:

Re: Can snapshot consolidation performance be improved anywa

Post by nielsengelen »

stevericks wrote:I am looking into this again as well.
I keep trying to persuade our DBAs to back up our SQL servers with Veeam. But at the end of every backup, the VM will drop its network connection. Not good for high volume 24x7 SQL Servers.
I am sure that last time I looked into this I had it in writing somewhere that there is nothing we can do. VMware will drop the network connection during the commit of the snapshot at the end of the backup job.

I can't find where I read that now though :?

Can someone from Veeam update me on this?
Is it better with VMware 5.5? We are on 4.1
Is it better with Veeam 8? We are on Veeam 7.

Any help appreciated,

Steve.
There are improvements made in vSphere 5.x so an upgrade would be a very good idea there :-)
Personal blog: https://foonet.be
GitHub: https://github.com/nielsengelen