The safe snapshot removal option only applies to pre-ESX 3.5 U2 hosts, so I doubt that enabling it will resolve your issue.
Please take a look at this short summary, which might help you resolve the network connectivity problems:
Gostev wrote:Short summary of things which may help (for more information, please read this topic):
1. Make sure VM does not have any other snapshots (including hidden).
2. Increase CPU reservations in the VM settings.
3. Move snapshot location to a different datastore (via workingDir parameter), preferably backed by faster storage (for example, SSD disk).
I thought I'd push this to the forum before contacting support.
We are running Veeam Replication and backup to replicate and back up 4 VMs running on ESXi 5 (2 physical servers in the core using an EMC VNX SAN, iSCSI, etc.).
Replication completes without a hitch, but we have encountered some very odd behaviour. The first is that at the completion of the replication on the Terminal Server, the TS session will lock up for a while. This issue has been seen before and we tried the suggested fix, "safe snapshot removal", but it still happens, so we've had to move replication to out-of-hours only. This seems to defeat the object of the exercise!
I've also noticed in the event logs that an application running on the application server (Windows 2003 Enterprise) is consistently crashing (Event ID 1000) during or at the end of the replication cycle.
I saw this quote elsewhere: "It sounds like you have those issues while VM snapshot is being removed."
I suspect both issues are related. Has anyone found a definitive workaround for VM freezing or throttling?
I'm having issues backing up a fairly large Exchange VM for a client (750GB provisioned / 500GB used). It's taking a long time to do the backup (around 12 hours), and I'm working on that by finding the bottlenecks in the process. However, of those 12 hours, I think 2 are spent waiting for VMware to remove the snapshot (does it really take that long?), and the worst part is that at the tail end of the snapshot removal process, the Exchange server freezes (the console reports an MKS error, the server can't be pinged, clients get no mail) for around 5-6 minutes.
I've tried the safe snapshot removal option, but as they are running Veeam B&R 6 and vSphere 5, I don't think it's applicable anymore.
@PRTan, I get the exact same problem/error on a SQL server with an Axapta application.
My backup also runs for 12+ hours, and the SQL server freezes when committing the snapshot.
The safe snapshot removal option won't help, as it only applies to ESX(i) versions prior to 3.5 U2... try the suggestions I've highlighted above; they should help.
I am running Veeam 5.0, and every time a snapshot is created or deleted we get a network disconnect. If you are pinging the server, you might see up to 5 timeouts before the server starts responding again. Is this normal, a bug, or what?
Indexing is performed when the snapshot is created, not when it is removed, so it cannot really affect removal. Your conclusion is likely just an artifact of your testing process. When you perform job runs one after another, there are no changes to process in the VM, so the snapshot only stays open for a brief moment, which is why its commit is very fast. However, for a daily backup, the amount of changes that must be processed is much larger, so the backup job runs longer, which in turn lets the snapshot grow large, resulting in issues on commit. Thanks.
This was a change I implemented in my production environment on daily backups. The disconnects occurred during the transfer of guestindexdata.zip. It also cut my backup time in half or more.
This is the time when the snapshot is being removed; the actual file transfer should not affect anything. But it does make sense that with the backup time cut in half (thanks to indexing being disabled) you no longer see the commit issue. As I explained above, the time it takes to back up the VM directly affects the snapshot size, which in turn affects the commit process (small snapshots are very fast to commit).
Not really a solution, I think, as on days with more changed data to process the backup will still run long enough to produce a large snapshot. Some better solutions were proposed earlier in this thread; not sure if they work for you though. Other than that... we're just waiting for VMware to introduce some snapshot commit optimizations...
We have a similar issue. We currently have an open ticket.
Full backup in Veeam 5 of Exchange 2010 in a DAG (backing up the active server): the snapshot creates, but snapshot removal takes 12+ hours.
We are at our wits' end. Three other backup jobs, containing 3, 4, and 19 VMs, all run fine; just not the Exchange backup. (vSphere 4.1.0)
Previously, Exchange backups were running at 225 MB/s, which is fine. We had to do a restore, and now we are getting 10-17 MB/s, which causes backups to run long... etc. Snapshot removal kills the server, and the CTO yells at me.
We have also been having this problem, using vSphere 4.1, Veeam v5.0.2.2.224, and a Dell MD3220i over iSCSI on a RAID5 LUN with no other VMs running. The VM was 2003 R2 with SQL 2005, with CBT enabled. The backup job would just hang at 95% during the removal of the snapshot.
I finally fixed it last night by adding a workingDir entry to the VM's vmx file (the parameter mentioned in the summary earlier in this thread). It allows you to redirect the snapshots. In my case I redirected the snapshot to local storage on one of the virtual hosts in the cluster. So it seems the RAID5 LUN the virtual machine sits on isn't fast enough to allow the snapshot to be removed in a timely manner.
To add this line you must first power off the VM, then do a df -h and copy the path to a local datastore. Then create a snapshot directory using the vSphere client, or just SSH into the box. You also have to unregister the VM and register it again so the change to the vmx file takes effect. Then run your backup and browse (or cd) into the snapshot directory, and you will soon see the ctk and vmdk snapshot files being created. That way you know for sure the snapshot is being redirected to local storage. I'll eventually have to build RAID10 LUNs for all my SQL VMs, but this workaround gives me plenty of time to plan it out.
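In case it helps anyone, here's a rough sketch of that procedure as shell commands. All paths and the VM name below are made-up examples, and the vim-cmd steps assume you are SSH'd into the ESXi host:

```shell
#!/bin/sh
# Rough sketch of redirecting a VM's snapshots to a local/faster datastore
# via the workingDir vmx parameter. Paths and names are examples only.
VMX="/vmfs/volumes/san-lun1/sqlvm/sqlvm.vmx"        # VM config file (power the VM off first)
SNAPDIR="/vmfs/volumes/local-storage/sqlvm-snaps"   # directory on the local datastore

# 1. Create the snapshot directory on the target datastore.
mkdir -p "$SNAPDIR"

# 2. Point the VM's snapshot working directory at it.
echo "workingDir = \"$SNAPDIR\"" >> "$VMX"

# 3. Unregister and re-register the VM so the vmx change takes effect, e.g.:
#      vim-cmd vmsvc/getallvms          # note the VM's id
#      vim-cmd vmsvc/unregister <vmid>
#      vim-cmd solo/registervm "$VMX"
```

After the next backup run, you can check the snapshot directory for the ctk/vmdk delta files to confirm the redirect took effect.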
We are running Veeam 5.0.2.230 (64-bit). Our big VMs get backed up on time, but the snapshot removal process takes forever, and sometimes it takes the VM down because the datastore fills up. When we remove the snapshot from vCenter directly, it removes properly. The VMs range from 500GB to 800GB in size.
I heard there is a new option in Veeam 6 that terminates the job if it does not complete within the allocated backup window. How would that affect the snapshot removal process?
The backup window setting will not help here, as the snapshot commit operation must be completed in any case. The last thing you want is to leave the VM running off the snapshot.
I had unresponsive VMs due to snapshot commit as well. I finally figured out it had much more to do with the disks backing the LUNs. We had our dev environment on SATA disks, which was fine until the backups ran. There was a large enough change rate that we would sometimes suffer an interruption in service. Thankfully, we have a Compellent storage array, so all we had to do was change the storage profile to include fibre, and data progression then moved the hardest-hit blocks to fibre. Haven't had a problem since. But I was pretty irritated with Veeam when they pushed me to call VMware. VMware took some prodding before they really investigated, but when they did, they noticed the DAVG values were off the chart during snapshots.
Wait, snapshots are definitely a matter for VMware, since Veeam only asks vCenter to take them on selected VMs. It would be the same if you took a snapshot of a VM yourself and committed it after a couple of hours.
I've seen similar cases on high-traffic mail servers: after the backup completed, the VM delta file had already grown to a good size, and when the snapshot needed to be committed it took a long time (even 2 hours in one case), and at the end of the commit the VM was unresponsive for a few seconds.
The only solutions are beefing up the storage (as you did) or choosing a backup schedule that falls in a low-usage period of the day.
Luca Dell'Oca Principal EMEA Cloud Architect @ Veeam Software
We've set up a Veeam Backup & Replication configuration at one of our customers, and they have purchased a Veeam Enterprise license.
It's a pretty standard Veeam v6 setup for DR purposes:
- 1x vCenter server and 1x Veeam backup server (+proxy,...) at local site
- 1x vCenter/Veeam server (proxy) at DR site
As everything is running in a test phase for the moment, we are creating replication jobs and adding running/live VMs to those jobs, which runs fine at an interval of one replication to the DR site every 15 minutes.
First we added some low-impact servers, but as the test proceeded, last week we added the (quite important) database server.
The next day, we received a call that several users were noticing 'hiccups' in programs running on the database server, along with error messages in those same programs. We disabled the replication job for this database server, and there were no more complaints from users. Bizarre.
As a test, I monitored a running VM that is being replicated with a simple ping test (using somewhat larger ping packets). I noticed ping drops at the exact moments of snapshot creation and snapshot deletion (which are processed by Veeam and which you can monitor via the "recent tasks" window in VMware). After 1 to 4 ping drops, the VM of course returns to normal ping replies, which coincides with the snapshot creation or deletion finishing. I've noticed the same behaviour on other VMs, but of course a database server with a running application does not 'appreciate' ping drops, which seems to be the reason the users noticed 'hiccups'.
Has anyone noticed this behaviour before? Is this a known issue, and is there any solution?
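For reference, a minimal sketch of that kind of timestamped ping test (the default host is a placeholder; point it at the replicated VM so the drops can be lined up with the snapshot create/remove entries in the recent tasks window):

```shell
#!/bin/sh
# Prefix every line of ping output with a wall-clock timestamp, so gaps in
# the replies can be matched against snapshot operation times in vCenter.
# HOST and COUNT are placeholders; override them for your environment.
HOST="${HOST:-127.0.0.1}"
COUNT="${COUNT:-5}"
ping -c "$COUNT" "$HOST" | while IFS= read -r line; do
    printf '%s  %s\n' "$(date '+%H:%M:%S')" "$line"
done
```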
Not to be Veeam's advocate, but snapshots are only requested by Veeam, and they are processed by vCenter/vSphere.
If you are losing pings during snapshot creation/deletion, it could be a VMware issue; sometimes this can happen if the underlying storage and ESXi server hardware are heavily loaded, OR if you are using quiescing for that VM.
If you want to be sure, you can take the same snapshot directly from the vCenter interface. There are also some articles on VMware KB explaining what happens during a snapshot.
Luca Dell'Oca Principal EMEA Cloud Architect @ Veeam Software
Thanks to everyone that posted here. What an awesome resource!
Same issue as everyone here. VEEAM replication of a 300GB file server would culminate in 5 ping drops (Request Timed Out). We thought at first it was our switch losing its LAG settings, but that was just a red herring.
Our issue was compounded by the fact that VEEAM had been failing for over a week before we noticed. It turned out THAT was because VEEAM was installed on a server that was running dangerously low on disk space on the C: drive (where the application was installed). Most likely the proxy was pointed at this same drive. Simply by reinstalling the application on a drive on that same server with 500+GB free, we got VEEAM working again.
Coincidentally, because we were having these issues with VEEAM not working, we had made a manual SNAPSHOT! This was on Saturday. Thinking nothing of it, we proceeded to Monday morning, when the client's application was getting tossed every 5 minutes or so. This was exactly when the replications were happening: every 5 minutes. So we shut replication down, to troubleshoot later in the evening.
We found two things.
1. If you set the backups to LOW CONNECTION BANDWIDTH on the first page of the replication job properties and followed through later with all the default settings, the job would work. Slower than normal, but it would NOT "stun" the VM when committing and deleting the snapshot.
2. After removing all the snapshots using the snapshot manager for that particular VM, EVERYTHING returned to normal.
BTW, the log file Gustav was talking about, where you look for STUN times, is VMWARE.LOG. You can right-click on it in the VMware datastore browser and download it to your desktop. Our "stun" cycles were 20,000,000 us, or 20 seconds - about 5 "Request Timed Out"s.
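If anyone else wants to check their own stun times, something like this pulls the durations out of a downloaded vmware.log and converts microseconds to seconds. The filename is an example, and the exact message wording ("... vm stopped for N us") can vary a little between ESXi versions:

```shell
#!/bin/sh
# Extract stun durations from a vmware.log and print them in seconds.
# Assumes lines roughly like "vcpu-0| Checkpoint_Unstun: vm stopped for 20000000 us".
LOG="${LOG:-vmware.log}"
grep -o 'stopped for [0-9]* us' "$LOG" | awk '{ printf "%.1f s\n", $3 / 1000000 }'
```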
That's OK, I got used to that. Just 2 errors in 6 letters... no big deal
luke wrote:1. If you set the backups to LOW CONNECTION BANDWIDTH on the first page of the replication job properties and followed through later with all the default settings, the job would work. Slower than normal, but it would NOT "stun" the VM when committing and deleting the snapshot.
This is definitely unrelated to what fixed the issue. The only thing that check box does is show an additional wizard step (with the replica seeding controls), which is hidden by default to reduce the number of wizard steps.
We are running Veeam 6.0.0.181 (64-bit) as a virtual server with 8 cores and 4 GB RAM on W2008 R2 Enterprise, on HP DL380 G7 diskless (boot from USB) hosts using a Dell EqualLogic PS4000XV SAN. Our backups go to a new HP X1600 device running Windows 2008 R2 Storage Server. Our backup speeds are generally very good, for example backing up 4 x 72GB DCs in 13 minutes! However, when it comes to backing up our Exchange 2007 server it is a different story, with the backup itself (a 1TB drive with 290GB of changes with CBT) taking 13+ hours and, more worryingly, the snapshot removal process taking in excess of 6 hours!
There appear to be a number of issues in this thread around Exchange/SQL in particular, and I just wondered if anyone has got to the bottom of the problem. I have a ticket logged, but understandably Veeam needs the log files, which I haven't been able to download yet as they are > 100MB!
As I wrote in a previous post in this thread, snapshot management is VMware's duty, not Veeam's, so the snapshot removal time comes down to the performance of the vSphere environment. I've seen huge mail servers take hours to commit a snapshot many times, and it is unavoidable if the combination of underlying storage and disk activity on that VM is sub-optimal.
There is no real way to limit this behaviour, as also stated by VMware itself (http://kb.vmware.com/selfservice/micros ... Id=1025279).
Luca Dell'Oca Principal EMEA Cloud Architect @ Veeam Software
Many thanks for the prompt response. I fully accept that VMware is responsible for the snapshot removal, but wouldn't it be fair to say that, because the backup of the server takes so long (using Veeam), and because Veeam doesn't remove the snapshot until the backup is over, that is why (as per the VMware article) the snapshot removal is so difficult? As such, perhaps my question should be: "Why are my Exchange-only backups so slow when they are sourced from the same SAN, going to the same target, etc.?"
Even without knowing your Exchange setup, I would say it's like any other Exchange server: the disks are big, so the storage and Veeam take time to process them, and Exchange databases change many blocks, so even with CBT there is always a fair amount of changed blocks that need to be saved (and committed back to the original vmdk).
What's the size of your Exchange server, and how many CBT blocks does Veeam save daily?
Luca Dell'Oca Principal EMEA Cloud Architect @ Veeam Software
Thanks again for the reply. We only have 700 users with about 700 GB of data and, according to Veeam, about 200GB of changes per day, which does seem a little high! Fundamentally though, I do agree with you, and I guess I am just looking to understand what others do, as my environment is tiny compared to others I have worked in. What is really weird is that a full backup takes less time than a reverse incremental. Perhaps I should switch the Exchange servers to normal incrementals and then just run a full backup at the weekend? A shame though, as the performance on the other servers is generally superb!
Well, if the daily amount of changed data is 200 GB AND you are backing up in reverse incremental mode, we have probably found the reason for the backup time.
Every day, Veeam Backup needs to inject those 200 GB into the previous VBK file to create the new VBK, then read the 200 GB of displaced blocks from the VBK and save them into a new VRB file.
The total amount of data it needs to handle is 600 GB, as reverse incremental generates 3 I/Os per saved block. If the total size of the server is 700 GB, you are doing almost the same amount of I/O as a full backup!
Yes, you'd be better off running forward incrementals and doing a simple full backup weekly. That way, you only write 200 GB per day, and the 700 GB only once per week.
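The arithmetic above can be sketched with the figures from this thread as example inputs (200 GB daily change, a 700 GB server):

```shell
#!/bin/sh
# Sketch of the reverse-incremental I/O arithmetic, using the thread's
# figures (200 GB daily change rate, 700 GB total VM) as example inputs.
CHANGED_GB=200
FULL_GB=700

# Reverse incremental touches each changed block three times on the target:
# write the new block into the VBK, read the displaced block out of the VBK,
# and write the displaced block into the VRB.
reverse_io=$(( CHANGED_GB * 3 ))

# Forward incremental just writes the changed blocks to a new increment file.
forward_io=$CHANGED_GB

echo "reverse incremental: ${reverse_io} GB of target I/O per day"
echo "forward incremental: ${forward_io} GB per day, plus a ${FULL_GB} GB weekly full"
```

With 600 GB of daily target I/O against a 700 GB full, the reverse incremental ends up close to full-backup territory every single day, which is why switching modes helps here.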
Luca Dell'Oca Principal EMEA Cloud Architect @ Veeam Software