Snapshot removal issues of a large VM

Post by **joergr** » Nov 10, 2010 9:11 pm

I bet this problem is related to ESX 4.0 or to storage.

It would be extremely interesting if you could test this with ESXi 4.1 and share the results. It would be even more interesting if you could provide disk I/O data captured during the snapshot commit (especially read and write latency, IOPS, and disk read and write rates). You can get all of these from within the vSphere Client or Veeam Monitor.

Best regards,
Joerg

Post by **tsightler** » Nov 10, 2010 10:26 pm

A couple of things to look at:

1. Several users have reported that other, pre-existing non-Veeam snapshots on the VM can cause it to hang.

2. Like Joerg, I agree that the problem is likely storage related. Snapshot removal significantly lowers the total IOPS available to the VM, both because of additional locks on the VMFS storage caused by the increase in metadata updates and because of the added I/O load of the snapshot removal process itself. In most environments, if you're already over 30-40% IOPS load on your target storage, which isn't uncommon with a busy SQL/Exchange server, then snapshot removal will easily push that past the 80% mark, and likely much higher. Most storage arrays see a significant latency penalty once IOPS load passes 80%, which will of course hurt application performance.

As an example, my older Equallogic storage arrays typically provide 3.5ms read latency at a 40% IOPS load, but at a 90% IOPS load, which can happen during snapshot removal, read latency spikes to 7.5-10ms. That effectively means the I/Os Exchange needs are 2-3x slower, but this doesn't tell the whole story of how much Exchange might slow down. If Exchange cannot get the I/Os required to satisfy one user request before the next request arrives, the queue will grow, because MAPI requests are served more slowly than they come in. This is just like an interstate that cannot carry the traffic flowing onto it. If you're planning to use a replication solution that depends on VMware snapshots, you have to know that your storage can not only serve the IOPS for normal operations but also maintain that performance during snapshot operations.

We've had some success by increasing the storage shares on our IOPS-heavy VMs, but I don't think this is a great solution.

3. One suggestion would be to upgrade to ESX 4.1. We've seen a huge improvement in the snapshot consolidation process with ESX 4.1 on our Equallogic SAN clusters, and I highly recommend it. If this isn't possible, at least apply all of the latest ESX 4.0 patches.
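To make the queueing argument in point 2 concrete, here is a minimal sketch. It assumes requests are serviced one at a time, so the service rate is simply 1000 divided by the latency in milliseconds. The 3.5ms and 7.5ms latencies are the figures quoted above; the arrival rate of 200 requests per second is a made-up number chosen only for illustration.

```python
def queue_depth_over_time(arrivals_per_sec, service_latency_ms, seconds):
    """Return the backlog of outstanding requests at the end of each second.

    Simplifying assumption: one request is serviced at a time, so the
    service rate is 1000 / service_latency_ms requests per second.
    """
    service_rate = 1000.0 / service_latency_ms
    backlog = 0.0
    history = []
    for _ in range(seconds):
        # The backlog grows by whatever arrives beyond what can be served,
        # and it can never drop below zero.
        backlog = max(0.0, backlog + arrivals_per_sec - service_rate)
        history.append(backlog)
    return history


# Hypothetical load of 200 requests/s for one minute.
normal = queue_depth_over_time(arrivals_per_sec=200, service_latency_ms=3.5, seconds=60)
commit = queue_depth_over_time(arrivals_per_sec=200, service_latency_ms=7.5, seconds=60)

print("backlog after 60 s at 3.5 ms:", round(normal[-1]))
print("backlog after 60 s at 7.5 ms:", round(commit[-1]))
```

At 3.5ms the storage keeps up and the backlog stays at zero. At 7.5ms it can serve only about 133 requests per second, so the backlog keeps growing for as long as the snapshot removal lasts, which is why users see far more than a 2x slowdown.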

Post by **Gostev** » Nov 10, 2010 10:55 pm

Tom, do you already have your Equallogic SAN on 5.0.2 firmware? Because in that case, the improvement you see may come from VAAI, which is not available for every storage array.

Post by **matarvai** » Nov 11, 2010 3:47 am

I will upgrade ESXi to 4.1 this weekend. Let's see if that helps.

Post by **joergr** » Nov 11, 2010 7:31 am

Great, please keep us updated once you have upgraded to ESXi 4.1. Could you also provide some details about your DAS? Vendor? SAS or SATA? 15K/10K/7K? Number of spindles? RAID level? And the interface to your host?

best regards,
Joerg

Post by **matarvai** » Nov 11, 2010 7:33 am

Here is the information about the DAS: vendor HP, P400 controller, 8x 146GB 10K SAS, two RAID-10 arrays with four disks each.

Post by **joergr** » Nov 11, 2010 7:52 am

Good lord, that is 4(!) 10K spindles for direct I/O access, which is quite bad from a performance standpoint in my opinion. Don't expect miracles with this setup, but believe me, ESXi 4.1 will get more out of it than ESX 4.

best regards,
Joerg

Post by **tsightler** » Nov 11, 2010 2:48 pm

Gostev wrote: Tom, do you already have your Equallogic SAN on 5.0.2 firmware? Because in that case, the improvement you see may come from VAAI, which is not available for every storage array.

Nope, we're not brave enough to jump on that bandwagon yet, not after the reliability issues we experienced when we jumped on the RAID 6 train with the Equallogic 4.x firmware far too early. We'll let some other customers be the beta testers for these new features for a while, as EQL has already proven to us, with the disaster that was the original 5.x release, that their QC is not that great.

Post by **Gostev** » Nov 11, 2010 3:10 pm

Ah, so these huge improvements in the snapshot consolidation process that you see are actually solely due to ESX 4.1? Interesting information!
Concerning firmware, I heard EQL QC actually took the time to test it properly this time; people seem to be very happy with 5.0.2...

Post by **tsightler** » Nov 11, 2010 6:01 pm

Everything is a judgment call, but I have no faith in Equallogic firmware at this point. We started off with 3.x firmware and the arrays seemed rock solid. We were actually part of the 4.x beta program and were a very early adopter. The 4.x train was only released in Aug 2008, just 2 years ago. We hit quite a few issues as we progressed through the 4.x code, and EQL averaged a code release a month, which is simply too often for an enterprise storage array. Some of the issues were minor, others were catastrophic. I'm sure there were many EQL customers that were very happy with the stability of the firmware version that ate our data, but we weren't too happy with it.

QC should be able to shake out obvious issues (which they didn't with 5.0.1) but has a much harder time flushing out long-term stability issues. Many times we managed to run a firmware version for 4-6 months before we had an issue. The 5.0.2 code hasn't even been out that long, so you can't really make statements about its long-term stability yet.

I'm not saying that 5.0.2 is horrible, it may very well be the best code EQL has ever produced, but we're not going to be jumping ship on our finally stable 4.x code just yet. I'll probably give it a year or so first. I think that's the difference. Some people measure stability in weeks or months, but I measure enterprise stability in years.

Post by **matarvai** » Nov 16, 2010 8:31 am

I updated our ESXi to 4.1 and Veeam to 5.0, and the snapshot removal issue has disappeared. Thanks, everyone!

Post by **joergr** » Nov 16, 2010 9:24 am

You are welcome.

Post by **eriktxstate** » Mar 24, 2011 8:40 pm

We've had a problem with VMs becoming unresponsive, and it was narrowed down to the second that Veeam releases a snapshot. We're using CBT for VMs, hoping that it would help with backup times. We're currently running Veeam B&R 5.0.1.198, VMware ESXi 4.1 on our hosts, and vSphere 4.1.

The issue came up with our Windows admins, who kept seeing clustered VMs bark in SCCM about losing a node in a cluster. I've noticed it on several Linux-based systems and witnessed the unresponsiveness (up to 10 minutes!!!) myself and thought I was crazy. I've suggested increasing the timeout values for those cluster checks on the Windows systems, but it seems odd that this is a persistent issue that's becoming more and more troublesome.

Now, I've seen the issue with NFS and CBT in VMware 4.1, here:

http://kb.vmware.com/selfservice/micros ... 0168771546

But all of our datastores are on an 8Gb SAN. This may be exclusively a VMware issue, but I need help pointing the finger. Veeam - is there anything you have heard of regarding long snapshot release times? I'm only guessing this is due to CBT because that's the only thing I can think of at this stage that would be causing the issue. When I make a regular snapshot, there are no issues.

Thanks,
Erik

Post by **eriktxstate** » Mar 30, 2011 4:29 pm

My above post got merged into this discussion, but I'm still not convinced it's related to size. I think it's more or less related to CBT being slow.

Post by **Gostev** » Mar 30, 2011 4:43 pm

Erik, it was merged into this topic, where this and similar issues are being discussed regardless of VM size (don't pay attention to the topic name). It is good to have everything in the same place.

Did you read the suggestions in this topic on what else can cause unresponsiveness (such as existing snapshots)? 10 minutes of unresponsiveness is a real issue that you have to address. Most people reporting issues here have pretty minor ones (occasional application timeouts).

Post by **mschulte** » Sep 01, 2011 11:04 pm

[merged]

Hello,

I hope someone can help us.
Sometimes we have trouble with backups in Veeam 5.1.
The backup job runs on a VM and hangs with a remaining time of 0:00:00.
The job cannot be stopped.

In the vSphere Client we can see a task on the VM: Remove Snapshot 95%
The snapshot is not removed even after 24h, although the snapshot isn't large (17MB).

The VM is frozen: no network traffic, no console.

In the vSphere Client we cannot stop or restart the VM (because another task is running: Remove Snapshot).

Sometimes we can kill the VM from the SSH console, sometimes not.
After killing the VM from the SSH console, sometimes the VM can be started again, sometimes not (because of locked files etc.).

What could be the problem?

Thanks a lot

Post by **offwire** » Oct 11, 2011 12:37 pm

[merged]

I would like to be able to use Veeam to back up our SQL Server (2008R2) every 4 hours, and then fill in the time in between with SQL backups. However, during working hours, when the Veeam backup is removing the snapshot, random users are disconnected from their applications that use databases on the SQL Server, and I am not sure what, if anything, I can do about it. One is our CRM application, and the other is our order entry/ecommerce application. Users typically receive an error message stating that the network connection has timed out or that the connection was lost. They can close the application and reopen it, and everything is fine again.

Has anyone experienced anything similar or have any ideas how I might work towards resolving this? I don't think this is a Veeam issue as much as an application issue, but it's a thorn in my side and I am out of ideas.

Server: Server 2008R2, x64
SQL: Standard Edition, 2008R2
VMware: ESXi 4.1 U1

Post by **Gostev** » Oct 11, 2011 3:01 pm

Short summary of things which may help (for more information, please read this topic):

1. Make sure the VM does not have any other snapshots (including hidden ones).
2. Increase CPU reservations in the VM settings.
3. Move the snapshot location to a different datastore (via the workingDir parameter), preferably one backed by faster storage (for example, an SSD).
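For item 3, a minimal sketch of setting the workingDir parameter in a copy of a VM's .vmx text. It assumes the usual `key = "value"` line format of .vmx files. The helper name `set_vmx_option`, the VM name, and the datastore path are all hypothetical, and the VM should be powered off and the original file backed up before anything like this is applied. Check VMware's documentation for your ESX version first.

```python
import re


def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with key = "value" set, replacing any existing entry.

    Hypothetical helper: assumes each option sits on its own line
    in the form key = "value".
    """
    new_line = f'{key} = "{value}"'
    pattern = re.compile(rf'^\s*{re.escape(key)}\s*=', re.IGNORECASE)
    lines = vmx_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line
            replaced = True
    if not replaced:
        # The option is not present yet, so append it.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Made-up .vmx content and a made-up SSD-backed datastore path.
sample = 'displayName = "sql01"\nmemsize = "8192"\n'
updated = set_vmx_option(sample, "workingDir", "/vmfs/volumes/ssd-datastore/sql01-snapshots")
print(updated)
```

Running the helper a second time with the same arguments leaves the text unchanged, so it is safe to re-run.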

Post by **KiwiJJ** » Oct 11, 2011 8:53 pm

Hi,
We were having problems where our SQL server (and others) were freezing during snapshot removal. What I have found that works is:

1. In the Veeam backup job, go to the vSphere tab under Advanced Settings and untick "Use changed block tracking data".

2. Power off the virtual machine and edit its settings, go to the Options tab and select Configuration Parameters.
Change "ctkEnabled" to false
Also change any "scsi#:#.ctkEnabled" to false
Power on the virtual machine
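The configuration change in step 2 can be sketched as a small script working on a copy of the .vmx text. This is an illustration only: it assumes the usual `key = "value"` line format, handles just the "ctkEnabled" and "scsi#:#.ctkEnabled" keys named above, and the sample content and the helper name `disable_cbt` are made up. The VM must still be powered off while the real settings are changed.

```python
import re

# Matches ctkEnabled and per-disk scsi#:#.ctkEnabled entries only.
CTK_KEY = re.compile(r'^\s*((?:scsi\d+:\d+\.)?ctkEnabled)\s*=\s*"[^"]*"\s*$', re.IGNORECASE)


def disable_cbt(vmx_text):
    """Return (new_text, changed_keys) with every matched CBT flag set to false."""
    out = []
    changed = []
    for line in vmx_text.splitlines():
        match = CTK_KEY.match(line)
        if match:
            out.append(f'{match.group(1)} = "false"')
            changed.append(match.group(1))
        else:
            # Leave every other option untouched.
            out.append(line)
    return "\n".join(out) + "\n", changed


# Made-up .vmx content for illustration.
sample = (
    'displayName = "sql01"\n'
    'ctkEnabled = "TRUE"\n'
    'scsi0:0.ctkEnabled = "TRUE"\n'
    'scsi0:1.ctkEnabled = "TRUE"\n'
    'scsi0:0.fileName = "sql01.vmdk"\n'
)
new_text, changed = disable_cbt(sample)
print(changed)
print(new_text)
```

The returned list of changed keys makes it easy to confirm that every disk was covered before powering the VM back on.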

cheers,

JJ

Post by **Gostev** » Oct 11, 2011 9:38 pm

John, I am guessing that you probably have NFS storage? Indeed, VMware had issues around CBT and NFS storage, but there is a hotfix available now which resolves it. For other types of storage, CBT should not present any issues. Plus, disabling it kills incremental backup time...

Post by **KiwiJJ** » Oct 11, 2011 10:38 pm

Hi Anton,
No, we have iSCSI storage (Dell MD3000i). We do not use incremental backups, so that is of no concern to us. But this change definitely made a difference. For our Exchange and SQL servers, when CBT was on, users would lose access while the last part of the snapshot was being written back, and our application servers would lose their connection to the SQL server. After I made the change, the users never lost access and neither did the application servers.

cheers,

JJ

Post by **Bunce** » Oct 12, 2011 5:52 am

That's very interesting. Are you positive it was changed block tracking that added that extra pause, JJ? We might give this a test.

Post by **offwire** » Oct 12, 2011 3:31 pm

KiwiJJ wrote: Hi,
We were having problems where our SQL server (and others) were freezing during snapshot removal. What I have found that works is:

1. In the Veeam backup job, go to the vSphere tab under Advanced Settings and untick "Use changed block tracking data".

2. Power off the virtual machine and edit its settings, go to the Options tab and select Configuration Parameters.
Change "ctkEnabled" to false
Also change any "scsi#:#.ctkEnabled" to false
Power on the virtual machine

cheers,

JJ

This is definitely something I can try, if it would be useful to have someone else test it. I am also using iSCSI storage, on IBM DS3300s.

Post by **KiwiJJ** » Oct 12, 2011 7:38 pm

Hi Bunce,
It has to do with the way snapshots get written back when using CBT (you can google this to see how it is done). This is a known issue and apparently VMware is working on it.
It was definitely CBT that was causing the issue. As soon as I made the changes mentioned, the problem went away.

cheers,

JJ

Post by **Nobody** » Oct 13, 2011 1:35 pm

[merged]

Hi everybody,

I'm looking for someone who might be seeing the same issue as we have.
We're backing up around 250 VMs every night with Veeam B&R and Veeam VSS quiescence.

All the systems are monitored with Nagios. Nearly every night, Nagios reports one or two VMs as being offline during the backup window.
After reading through a whole bunch of log files and support calls with VMware and Veeam, we know that the VMs are freezing for ~60 seconds during the snapshot removal process.

Our VMs are stored on 4x IBM V7000 arrays, which are accessed over FC, so the known issue with CBT and NFS does not apply in this case.

In our tracing, we found no performance issue on the VMware side, and the snapshots exist for only a few minutes without growing too much.

Did anyone out there experience the same issue?
With SAP Servers freezing for around a minute - we've started getting unwanted management attention.
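One way to put numbers on freezes like the ~60 second stalls described above is to look for gaps in a series of heartbeat timestamps collected from the VM. The sketch below is a hypothetical illustration, not something taken from Nagios or Veeam: the function name, the one-second interval, and the sample data are all assumptions.

```python
def find_freezes(timestamps, expected_interval=1.0, tolerance=2.0):
    """Return (start, duration) pairs for gaps longer than expected.

    timestamps: sorted heartbeat times in seconds (hypothetical input).
    A gap counts as a freeze when it exceeds expected_interval * tolerance.
    """
    freezes = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > expected_interval * tolerance:
            freezes.append((prev, gap))
    return freezes


# Made-up heartbeats: one per second, with a 60 second stall after t=3.
heartbeats = [0, 1, 2, 3, 63, 64, 65]
print(find_freezes(heartbeats))
```

Lining up the reported start times with the "Remove Snapshot" task times in vCenter would show whether each stall really coincides with a snapshot commit.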

Best regards
Nobody

Post by **daniflexx** » Oct 17, 2011 6:42 am

Did you check the storage for bottlenecks? I suggest you check latencies in particular and compare the IOPS hitting the SAN against your SAN's limits.

I experienced these network disconnections and long snapshot removals when I faced IOPS contention on my old SANs.

I use Veeam Monitor to monitor SAN performance during backup hours.
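The comparison of observed IOPS against the SAN's limits can be sketched roughly as follows. The 40% and 80% thresholds are taken from tsightler's figures earlier in this topic; the function names and the IOPS numbers are invented for illustration, and a real array's limit depends on its disks, RAID level, and workload mix.

```python
def iops_load(observed_iops, array_max_iops):
    """Fraction of the array's IOPS capacity currently in use."""
    return observed_iops / array_max_iops


def classify_load(load, warn_at=0.40, critical_at=0.80):
    """Label a load fraction using thresholds quoted earlier in this topic."""
    if load >= critical_at:
        return "critical"
    if load >= warn_at:
        return "warning"
    return "ok"


# Hypothetical array rated at 4000 IOPS, sampled at three points in time.
for label, observed in [("idle", 1200), ("business hours", 1800), ("snapshot removal", 3600)]:
    load = iops_load(observed, 4000)
    print(f"{label}: {load:.0%} -> {classify_load(load)}")
```

An array that already sits in the "warning" band during normal hours has little headroom left for the extra load of a snapshot commit.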

dani

Post by **Fox54** » Nov 09, 2011 4:08 pm

[merged]

Hi,

I have an application that uses files in a directory as a shared database.
Since I installed Veeam 5.02.230 (32 bits) following a problem with my original Veeam installation, every time a backup or replication runs for the VM where the application's files are located, the application hangs at the end of the backup/replication. The backup/replication uses the VSS options, as it did before.

It was working perfectly before (I had version 5 but can't remember the build number).

Any ideas?

Post by **Gostev** » Nov 09, 2011 4:16 pm

Hi - for version 5, there were only maintenance (bugfix) releases, and they do not change the way the product operates. If you did not have this issue before, you should look for other recent configuration changes in your hardware or VMware environment (for example, the presence of additional snapshots, as described on the first page of this topic). Thanks.

Post by **Fox54** » Nov 09, 2011 7:19 pm

I have a VM that serves as a file server for an application. There is a single directory containing both the files that make up the database and the application executable itself.
Whenever I replicate or back up, the application hangs at the end of the backup/replication. It started with the installation of version 5.02.230 (32 bits). It was working perfectly in older versions like 4.x. Even the first versions of 5 were working fine.

Post by **jjlp** » Nov 21, 2011 10:43 am

[merged]

Hi all,

I'm currently testing Veeam 5 on vSphere 5 and am having some trouble backing up my SQL server.
The backup goes OK, but when Veeam commits the snapshot, the server loses network connectivity.
The snapshot is approx. 10-15 GB.

I have been told that I can set the snapshot for safe removal under advanced settings, but does anyone know what setting/size is best, and also what happens with the backup if I set this size?

I'm really hoping that some of you can help.

Thanks
Jesper
