Single VM failing

Post by **HaroldC** » Jun 19, 2012 8:37 pm

I've got a case with support but I thought I'd check here just in case someone has run into a similar issue.

I have one very large VM; more precisely, its second virtual hard disk is very large: 1TB provisioned with 750GB used. I have a single job that processes only this one VM. It gets through the first hard disk and then fails after several hours of processing the second, large hard disk with this error:

Code:

6/19/2012 4:01:02 AM :: Error: Client error: Timed out to wait for free pre-read buffer.
Unable to retrieve next block transmission command. Number of already processed blocks: [439254].
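
For a rough sense of how far the job got before failing, the block count in the error can be converted into a data volume. The sketch below is only an estimate: it assumes each "processed block" corresponds to the job's storage optimization block size, which is not stated in the post, so it tries the two block sizes mentioned later in this thread (512K and 1MB). The function and constant names are made up for illustration.

```python
# Rough conversion of the block count in the error message into a data volume.
# Assumption: each "processed block" matches the job's storage optimization
# block size, which the post does not state.
import re

ERROR_TEXT = (
    "Unable to retrieve next block transmission command. "
    "Number of already processed blocks: [439254]."
)

# Hypothetical candidates, taken from block sizes mentioned later in the thread.
CANDIDATE_BLOCK_SIZES_KIB = {"LAN": 512, "Local": 1024}


def processed_blocks(text: str) -> int:
    """Extract the block count from the bracketed number in the error text."""
    match = re.search(r"processed blocks: \[(\d+)\]", text)
    if match is None:
        raise ValueError("no block count found in error text")
    return int(match.group(1))


def blocks_to_gib(blocks: int, block_size_kib: int) -> float:
    """Convert a block count to GiB for a given block size in KiB."""
    return blocks * block_size_kib / (1024 * 1024)


if __name__ == "__main__":
    count = processed_blocks(ERROR_TEXT)
    for name, size_kib in CANDIDATE_BLOCK_SIZES_KIB.items():
        print(f"{name}: {blocks_to_gib(count, size_kib):.1f} GiB")
```

Under either assumption the job stopped well short of the 750GB in use on the second disk.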

I make small changes, such as switching the backup proxy or repository, and then retry the job, only to have it fail after 16 hours of processing.

We have 3 proxies that are Server 2003 machines, the Veeam backup server, and a separate backup proxy that is a Windows 2008 server.

Post by **Gostev** » Jun 19, 2012 9:07 pm

Please include your support case ID.

Post by **unnhem** » Jun 20, 2012 6:34 am

Please post here if you find anything that fixes the error.
I have an ongoing, long-running support request about the same kind of error (ID#5195744).

Post by **HaroldC** » Jun 27, 2012 6:09 pm

My support case ID is 5198986.

I'm really at a loss. All of my other VMs and backup jobs work fine. This one just doesn't.

Post by **Cokovic** » Jun 27, 2012 8:18 pm

I don't know if this will help you, but I recently had a major issue with one of our VMs that I couldn't get backed up. This VM has a total of 29 VMDKs with 13.5TB of provisioned space in total. Always at around hard disk 17 or 18 I got a SAN transport error, and failover to network mode failed too. I had support cases open with both Veeam and VMware, and we finally solved it. In the end it was a setting on the corresponding host, needed because of the large size of this VM. We also have a lot of 750GB hard disks within this VM.

It's just a guess, but try increasing MaxHeapSize on the VMware host where the failing VM is running. By default it's set to 80MB, and it can be increased up to 256MB on ESXi 5. You will find it if you click your host in vSphere Client --> Configuration --> Advanced Configuration --> VMFS3. After changing this value you have to reboot the host. Again, it's really just a guess, but it did the trick for us, and since then my backups of this VM have been running fine.

Post by **cmcc82** » Jul 09, 2012 2:39 pm

Cokovic,
If I decide to change this setting under Advanced Settings, I don't have to restart any ESX host, right?
Thanks

Post by **Cokovic** » Jul 09, 2012 3:27 pm

Actually, you do have to restart the ESX server where you change this value, as it only takes effect after a reboot. If you have vMotion licensed, this shouldn't be a problem.

Post by **jeremyh8** » Jul 14, 2012 2:41 am

Did this resolve your issue?

Post by **pendragoncrw** » Dec 27, 2012 3:47 pm

I am experiencing the same problem as the original poster, also with a VM that has some large drives: 40GB, 800GB, 500GB, and 600GB. My other job, with 29 VMs in it including a few larger ones, runs without any problems. Support case is 00169169, but I understand I need to wait for Level 2 support now (response times were a little slow yesterday due to snow, as I understand it).

We are running ESXi 5.0 U1, which according to VMware should be able to have 8TB of open virtual disk on a single host with the default 80MB heap size. The total storage in use by the host holding our troublesome VM is a little over 4TB, and there have been no problems with numerous normal and storage vMotions, so I'm not convinced the heap size setting is really the culprit. Our other 2 ESXi hosts have on average 1TB of active VMFS storage.
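
That reasoning can be checked with some quick arithmetic. The sketch below assumes the supported amount of open virtual disk scales linearly with the VMFS heap size, which is an assumption for illustration and not a documented formula. The 8TB at 80MB baseline and the 256MB maximum are the figures quoted in this thread, and the function names are hypothetical.

```python
# Back-of-the-envelope check of VMFS heap headroom.
# Assumption: open-VMDK capacity scales linearly with heap size. The 8 TB at
# 80 MB baseline and the 256 MB maximum are the figures quoted in this thread.

BASELINE_HEAP_MB = 80
BASELINE_CAPACITY_TB = 8.0
MAX_HEAP_MB = 256


def estimated_capacity_tb(heap_mb: int) -> float:
    """Estimate open-VMDK capacity for a heap size under the linear assumption."""
    if not 0 < heap_mb <= MAX_HEAP_MB:
        raise ValueError(f"heap size must be between 1 and {MAX_HEAP_MB} MB")
    return BASELINE_CAPACITY_TB * heap_mb / BASELINE_HEAP_MB


def heap_utilization(open_vmdk_tb: float, heap_mb: int = BASELINE_HEAP_MB) -> float:
    """Fraction of the estimated capacity consumed by the open VMDKs."""
    return open_vmdk_tb / estimated_capacity_tb(heap_mb)


if __name__ == "__main__":
    # Roughly 4 TB in use on the host with the troublesome VM.
    print(f"Default heap: {heap_utilization(4.0):.0%} of estimated capacity")
    print(f"Max heap estimate: {estimated_capacity_tb(MAX_HEAP_MB):.1f} TB")
```

With about 4TB in use, the host sits at roughly half of the estimated capacity for the default heap, which is consistent with the doubt expressed above. Storage in use is not necessarily the same as open virtual disk, so treat the result as a rough indicator only.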

Any ideas or updates while support chews on this?

Thanks,
Chris

Post by **foggy** » Dec 27, 2012 4:03 pm

According to the OP's case, the error went away after consolidating the VM.

Post by **pendragoncrw** » Dec 27, 2012 4:17 pm

Before I started any of the backup jobs, this VM did need consolidation, and it was performed. There were no active or orphaned snapshots for this VM prior to our first attempt to back it up. Would you still recommend trying a consolidate?

I reviewed the "Needs Consolidation" status in vCenter for all my VMs and they are all "No".

Thanks,
Chris

Post by **Vitaliy S.** » Dec 27, 2012 8:13 pm

If there are no snapshots then there is nothing to consolidate. Looks like you have a slightly different issue.

Post by **pendragoncrw** » Dec 27, 2012 10:03 pm

I forced the transport mode to network, and it looks like we are having better luck. If network mode works out, support has instructed me to try VA mode with the virtual disk it keeps hanging on excluded. After that test, a clone or migration may be the next step.

Thanks for the help.

Chris

Post by **pendragoncrw** » Dec 28, 2012 3:56 am

Just to update.

I tried running the backup with the proxy forced to network mode. It made it through the disk it usually had problems with (disk 2), but slowed to a crawl on disk 3 (a few hundred megs every 45 minutes to an hour). I'm waiting for the job to time out and will then try cloning the VM and backing up the clone.

First time I've ever had a problem like this with a VM. Any thoughts are welcome.

Thanks,
Chris

Post by **tsightler** » Dec 28, 2012 4:09 am

Can you give me a few more details about the job setup? For example, have you made any changes to the job settings, such as block size (Local, LAN, WAN)? Is the repository SMB/CIFS? How much memory do you have on the proxy and repository? How big is the VBK when you start having performance problems?

Post by **pendragoncrw** » Dec 28, 2012 7:21 am

Job 1: 30 VMs, about 2.5TB of total storage, which compresses and de-dupes down to 600GB for a full backup. It uses the same repository, proxy, and settings as listed below. This job runs flawlessly and includes our 1TB Exchange 2010 database server.
Job 2: Single problem VM
Job is set up as reverse incremental with 30 restore points retained.
Proxy is 8gig RAM VM with 4 vCPUs. It and the repository server hardly break a sweat.
Repository is Windows 2008 storage server with 6gb RAM, 8 cores, running Veeam Agent (not being used as NFS/CIFS/ISCSI device)
Compression is set to LAN, de-dupe left at default.

There doesn't seem to be any consistency in the failure point. In VA mode, it would typically die somewhere on the second disk, and once it died on the first disk. Today in network mode it was on the third disk. No other VM operations are affected. I'm also not sure why Veeam takes so long to fail out.

I'm making enough space to clone it tonight, although I might resize the disks inside the VM and use VMware Converter to get a fresh copy of it. This is one of our main production servers, and it is hard to justify playing with it so much when it works fine for everything except being backed up by Veeam. This weekend will be my only chance to get significant downtime with it for quite a while, so it's either fix it this weekend or put an alternative backup method in place.

Again, this is very atypical of my experience with Veeam (I have it at about 20 individual clients), but it sure hurts when it happens. The only thing unique about this VM is that it had a rough history with snapshots before I came to it, which required a few rounds of taking a manual snapshot, deleting all, shutting down the VM, trying again, and so on. It's clean now.

Post by **pendragoncrw** » Dec 28, 2012 7:55 am

The job finally failed out around 11:45pm, after 6hr 24min spent on the third disk. Same error: "Timed out to wait for free pre-read buffer."

I guess if network mode did not change anything, it's time to look at the VM itself.

Chris

Post by **tsightler** » Dec 28, 2012 2:27 pm

pendragoncrw wrote: Proxy is 8gig RAM VM with 4 vCPUs. It and the repository server hardly break a sweat.
Repository is Windows 2008 storage server with 6gb RAM, 8 cores, running Veeam Agent (not being used as NFS/CIFS/ISCSI device)
Compression is set to LAN, de-dupe left at default.

OK, some good info there, specifically the part about storage optimization being set to LAN (512K Blocks) and server memory. I'd really like to know how big the VBK is after it crashes. I'm guessing it's going to be larger than 1TB. If so, you'll probably need to set the block size back to the storage optimization of Local (1MB) and try again.
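
To put the block-size suggestion in numbers, the sketch below counts how many blocks a backup file of a given size contains at each setting. It assumes the LAN and Local settings map to 512K and 1MB blocks as described above, and that the amount of per-block bookkeeping grows with the block count. It does not model how the agent actually uses memory, and the function name is hypothetical.

```python
# Compare how many blocks a backup file holds at different block sizes.
# Assumption: per-block bookkeeping grows with the number of blocks; this is
# not a model of the backup agent's real memory usage.
import math

# Block sizes in KiB for the storage optimization settings named in the thread.
BLOCK_SIZE_KIB = {"LAN": 512, "Local": 1024}


def block_count(vbk_size_gib: float, block_size_kib: int) -> int:
    """Number of blocks needed to hold vbk_size_gib of data."""
    if block_size_kib <= 0:
        raise ValueError("block size must be positive")
    return math.ceil(vbk_size_gib * 1024 * 1024 / block_size_kib)


if __name__ == "__main__":
    vbk_size_gib = 1024  # a 1 TiB backup file, the threshold discussed above
    for setting, size_kib in BLOCK_SIZE_KIB.items():
        print(f"{setting}: {block_count(vbk_size_gib, size_kib):,} blocks")
```

Halving the block size doubles the number of blocks to track for the same amount of data, which is why switching from LAN to Local is a cheap variable to eliminate.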

Post by **pendragoncrw** » Dec 28, 2012 3:02 pm

The reported size of the VBK file in explorer is usually between 300gb and 400gb when it heads south, much smaller than the VBK generated by our working job when it runs a full (which has the same block size).

Post by **tsightler** » Dec 28, 2012 3:34 pm

OK, not as big as I expected. I might still try running the job with Local instead of LAN optimization, just to rule out VeeamAgent memory consumption as a possible issue.

I know you mentioned that you don't think the heap size is your likely issue, but it might be worth checking the stats by running this command from the ESXi console:

Code:

memstats -r heap-stats | grep "\(vmfs\)\|\(size\)"

Never hurts to be sure. Do you happen to have any other place that you can use as a repository for a test, perhaps even using the same storage server as a CIFS share? Just to change some things up. BTW, what do the realtime bottleneck stats show while the job is running? Do you have a support case opened?

Post by **pendragoncrw** » Dec 28, 2012 3:57 pm

I did check the heap stats using the command you listed, and the counters were in spec according to the KB and forum articles from VMware. I kept an eye on them when the job started, when it slowed down, and at the end. The counter numbers were more or less similar to those of the good job when it was backing up our 1TB Exchange 2010 database server.

On the bad job, the real-time bottleneck stats always show "source" as the bottleneck (between 65 and 85% of the time).
On the good job, the real-time bottleneck stats always show "target" as the bottleneck (between 85-95% of the time) with source coming in second around 45%.

I have a Synology device I can use to test with a different repository; I just have to clear some space on it. Unfortunately, with a VM this big, testing an individual variable is a time-consuming process.

Hopefully I hear back from support today (case was opened on Wednesday morning) since I added last night's logs to it.

Chris

Post by **pendragoncrw** » Dec 30, 2012 6:12 am

Based on support's recommendation, I cloned the VM and tried to back up the clone while it was turned off. Same behavior, and I can clearly see the drop-off in SAN activity when reads drop to almost nil. I'm going to vMotion it to a different host and try again.

The production VM works perfectly fine (as do all other VMs on that host), so I am very perplexed by what is going on.

Chris

Post by **mnaveedishtiaq** » Dec 31, 2012 4:46 am

Hi All,

I was experiencing a similar sort of error with one of the VMs. I tried many options and at last resolved the issue with the following workaround, after a month of effort:

Create a local backup of the VM.
Copy it to the secondary site.
Seed the backup with the production site.

Please try the same; maybe it will work in your case as well.

Regards,

Muhammad Naveed

Post by **Mopad** » Jan 03, 2013 2:31 pm

I had and am still having the same problem. I have two backup jobs: one points to a local backup repository, and the other to an offsite repository. The local job was the first to start having the issues described in this thread. It kept failing on a certain VM. The VM has two VMDKs: one is 50GB and the other is 1.8TB. It always failed on the 1.8TB VMDK. The backup job would run fine on the VM until it hit the 70-72% mark. It would then lock up the repository server and hang for hours (sometimes up to 40-60 hours).

An active full backup would complete successfully, but any incremental after the full would fail. I was running WS2008R2 on both repositories. I also created a new backup job, and it would still fail on the same VM when an incremental was run.

I finally blew away my local repository server and installed Win7 32-bit. My local onsite backups have been working ever since installing Win7 on the local repository.

Now my offsite repository is having the exact same problem.

What's really weird is that the backup job with the offsite repository was failing before Christmas break (I work in a K-12 school). During break, when no one was here, the job completed successfully 11 days straight. The day everyone came back, the job started failing again.

Post by **Vitaliy S.** » Jan 03, 2013 2:41 pm

Hi Benjamin, can you please tell me what our support team says on that behavior? Have you logged a ticket?

Post by **Mopad** » Jan 03, 2013 2:57 pm

Here is what support told me to do.

Let's check that the job does not get stuck on the user profiles. Open properties for the job "Offsite", click Next 3 times, click on the Advanced button, highlight MAHELStaff, click on Edit, open the Indexing tab, and exclude the whole "C:", or just disable indexing.
Please let us know the results of the job.

I doubt that will help, since the job is not failing on the VMDK that holds my C: drive.

Support case # 00168695. There won't be much activity yet, since the job was completing successfully during Christmas break, but I will start giving feedback on the recommendations from support.

Here is the support case for the first one I opened, which is now closed: 00159629. It's closed because I reinstalled the backup repository OS. I don't really feel like doing that to the offsite repository....

Post by **Mopad** » Jan 04, 2013 2:48 pm

After applying support's suggestions, the job is still failing. Logs have been uploaded.

Post by **goldsmith** » Jan 10, 2013 10:10 am

We are also experiencing this problem with one of our 2003 VMs (about 1TB total storage). Error: Client error: Timed out to wait for traffic control event.

Any suggestions are more than welcome.

Post by **Vitaliy S.** » Jan 10, 2013 10:26 am

The best way to troubleshoot this would be to open a support ticket with our technical team.

Post by **goldsmith** » Jan 10, 2013 10:36 am

It might be worth looking at KB940349 from Microsoft, as it is an update for 2003 VSS. I will try applying this patch tonight and let you know the result.

This is a replication job and it is rather large, so it may take a day or two to post an update.
