ashleyw
Service Provider
Posts: 154
Liked: 20 times
Joined: Oct 28, 2010 10:55 pm
Full Name: Ashley Watson
Contact:

Synthetic full rollup failures

Post by ashleyw »

pizang wrote: I guess you are using a 64-bit system for this server. I had a similar issue. It was because the system cache (for disk access) tried to allocate more and more memory until it completely dried it up. Please see another great tool from Sysinternals.
http://technet.microsoft.com/en-us/sysi ... s/bb897561

Set working set max to logical value and you will be fine.
Hi pizang, we have been battling similar memory-related issues (since the v5 days), particularly during the synthetic full rollup process. When you say "set working set max to logical value", what do you mean by the logical value exactly? Do you mean the RAM assigned to the appliance, less the OS overhead, less the normal Veeam proxy overhead? For example, an 8GB RAM appliance less 2GB RAM to be safe = working set max of (6x1024)*1024 = 6291456 KB? (The current working set maximum value is showing as 1073741824 KB.)

I see this link which shows the working set maximum changes do not survive a reboot, so the setting must be set on each boot; http://blogs.technet.com/b/mikelag/arch ... erver.aspx

I also see that the Sysinternals tool will not let me set the working set maximum to the level I want: when I put in a working set maximum of 6291456 KB and hit apply, it is automatically changed to 2097152 KB. Any ideas here?

If this setting is causing a real headache to Veeam v5 and v6 customers (particularly with large backup loads/jobs) is there no better official fix for windows 2008r2 users?


Re: Memory performance

Post by ashleyw »

OK, so I've read the whole post from start to finish. In our situation, running Windows 2008 R2, we have now installed DynCache as suggested; http://support.microsoft.com/kb/976618
and set "MaxSystemCacheMBytes" to decimal 70 to limit the system cache to a maximum of 70% of the total RAM, and we have rebooted the box. However, when we try to start the service, we get a message saying "Error 1153: the specified program was written for an earlier version of windows". Gostev, please can you ask the Veeam development team to compile an x64 binary of this which will work under Windows 2008 R2? I know that earlier in the post you suggest contacting Microsoft, but the reality is that for most people this doesn't lead to a solution from Microsoft...

I see a critical HP support note specifically relating to 2008R2 pointing to the same 976618KB but again saying "contact Microsoft" http://h20000.www2.hp.com/bizsupport/Te ... ical_009_0

Any pointers on how we get a version of DynCache that works with win2008R2 would be great.

tsightler
VP, Product Management
Posts: 5701
Liked: 2527 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Memory performance

Post by tsightler »

ashleyw wrote: Gostev, please can you ask the Veeam development team to compile an x64 binary of this which will work under Windows 2008 R2? I know that earlier in the post you suggest contacting Microsoft, but the reality is that for most people this doesn't lead to a solution from Microsoft...
This wouldn't likely have any impact as several users on the Microsoft forums have already done this and, while they can get the service to run, it doesn't seem to have any measurable impact on this problem. This really is a Microsoft issue and thus Microsoft support really is the best way to resolve the issue. That being said, I've had quite a bit of success with simply having users remove some of the RAM from their systems. There is very little benefit to having tons of unused RAM on the Veeam server as it's only going to be used for write cache, and that's what causes this problem. If you shrink the memory to something more reasonable, i.e. the amount of memory required to run the number of processes you would like in parallel, and a little headroom (maybe 20%), then your system will actually perform better.


Re: Memory performance

Post by ashleyw »

thanks, currently we have 2 proxies with 8GB ram each. We have 2 threads and 4 vCPUs each proxy. We have tried all combinations of ram between 2GB ram and 10GB ram and we hit the same issue that during the synthetic full process, random jobs will fail (especially the larger jobs) with the "cannot allocate memory" errors. I can't see how we can work around this issue. As a gold partner, surely Veeam can arrange to get Microsoft involvement and commitment to fix the issue. I have tried numerous times in the past to get solutions out of Microsoft for similar types of things and a lot of time ends up being wasted for nothing. If that service isn't the solution, does the other alternative of scripting the sysinternals utility to run at start up work?


Re: Memory performance

Post by tsightler »

Are you sure this thread is talking about the same issue? This thread relates to a performance issue that is well known. The issue with synthetic full rebuild and out of memory may very well be a different issue. Are you working a Veeam support case for that issue?

Gostev
SVP, Product Management
Posts: 27144
Liked: 4450 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Memory performance

Post by Gostev »

To add to that, the "Cannot allocate memory" error was resolved in Patch 2 (see the sticky Known Issues topic).


Re: Memory performance

Post by ashleyw »

Thanks guys. Yep, the issue is definitely the same, as the performance on the proxies becomes terrible when they are out of memory, and yes, even though we have applied that specific patch we still get the memory issues. I can only assume that the synthetic full process itself involves a larger amount of reads/writes in a relatively short period due to its design, which leads to this issue showing up more consistently.

I have a support ticket open (5161812).

If Veeam can reproduce the issue (should be easy enough) can't they work with Microsoft to find a fix?

averylarry
Expert
Posts: 264
Liked: 30 times
Joined: Mar 22, 2011 7:43 pm
Full Name: Ted
Contact:

Re: Memory performance

Post by averylarry »

Good luck getting them to believe it's not a storage issue.


Re: Memory performance

Post by ashleyw »

as pizang suggested, we have used cacheset sysinternals tool, and scripted it to start at boot up;
http://blogs.technet.com/b/mikelag/arch ... erver.aspx

(and chosen to ignore the many Microsoft employee comments claiming that the utility is not required!)

and we have dropped each of our proxy servers down to 3GB of ram (4xvCPUs each) and call "CacheSet.exe 1024 3145728" in the boot script.

We consistently see synthetic full failures every week on two of our larger jobs, so I've set the synthetic fulls to run tonight, so I'll give some feedback early next week to see how well things run.


Re: Memory performance

Post by ashleyw »

After reducing the memory and limiting the memory using Cacheset as described above, nearly all the jobs failed with either;
"CompileFIB failed Client error: Insufficient system resources exist to complete the requested service" or
"CompileFIB failed Client error: Cannot allocate memory for an array. Array size: [3145728]. "

The normal daily incremental jobs run fine on the lower memory footprint.
The retry on a failed synthetic full still does not take into account that the incremental part may go through fine but the roll up fails - resulting in a retry that ends with a "success" status even though the synthetic roll up is not retried. Would be great if Veeam can fix this bug.

So what we are hoping for is that Veeam attempts to identify why the rollup process has such a large impact on ram and what the solution is to stop jobs from failing - as I'm running out of ideas here.

I'll continue to work with support on the case in the interim.


Re: Memory performance

Post by tsightler »

ashleyw wrote: and we have dropped each of our proxy servers down to 3GB of ram (4xvCPUs each) and call "CacheSet.exe 1024 3145728" in the boot script.
Well, I have no idea if the CacheSet.exe utility will work, but there's at least one obvious error in the example on the referenced website. You have stated that you lowered the memory to 3GB of RAM, but then you used CacheSet to configure the maximum working set to 3GB as well. It doesn't really make much sense to use this tool if you simply set the maximum working set to 100% of memory, as what good could the tool possibly do in that case (assuming it will do anything anyway)? The referenced article suggests using 10% of maximum memory, but then their own example sets the entire memory, so I agree that article is confusing.

That being said, I would suggest taking 10% of 3GB if you really want to prove whether this will help or not, so perhaps "CacheSet.exe 1024 314572", which would set the maximum cache size to ~300MB. Have you monitored your caches and free memory during the synthetic full process? How many processes do you have running at the same time?
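
The arithmetic behind the two CacheSet command lines quoted in this post can be checked with a short sketch. It assumes only what the thread states: the two arguments are a minimum and a maximum working set size in KB. The helper name cacheset_args and its parameters are hypothetical, and the sketch only builds the command line; it does not run the tool.

```python
def cacheset_args(ram_gb, percent, min_kb=1024):
    """Build the CacheSet.exe argument list for a cache cap of `percent` of RAM.

    Assumption taken from the thread: the two arguments are the minimum and
    maximum working set sizes, both expressed in KB.
    """
    total_kb = ram_gb * 1024 * 1024        # GB -> KB
    max_kb = total_kb * percent // 100     # integer maths, truncates to whole KB
    return ["CacheSet.exe", str(min_kb), str(max_kb)]


# 100% of 3GB: the value originally used in the boot script
print(" ".join(cacheset_args(3, 100)))   # CacheSet.exe 1024 3145728
# 10% of 3GB: the value suggested above, roughly 300MB
print(" ".join(cacheset_args(3, 10)))    # CacheSet.exe 1024 314572
```

The same helper gives 6291456 KB for the earlier "8GB less 2GB" question, i.e. cacheset_args(6, 100).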


Re: Memory performance

Post by ashleyw »

Thanks. After the failures over the weekend, I've left CacheSet limiting memory usage to 3GB but increased each proxy to 8GB ram and again set the job to start (2 threads per proxy, 2 proxies).
I'll run some more tests over the course of today.
However, I can already see a bug in the Veeam scheduling engine which is compounding this memory problem.
We have 2 x proxies with 2 threads each, which means that we should have 4 jobs running in total. However, two jobs are currently in roll up status (some jobs roll up for several hours), but an additional 4 jobs are running.

So on top of the issues I have, the scheduler is not taking into account that only 4 jobs should run at a time (including the rollup portion of the synthetic full). This will result in yet more memory pressure on the proxies. Are Veeam aware of this issue?


Re: Memory performance

Post by tsightler »

Rollups are performed by repositories, not proxies. Are your proxies also repositories?


Re: Memory performance

Post by Gostev »

Gostev wrote: To add to that, the "Cannot allocate memory" error was resolved in Patch 2 (see the sticky Known Issues topic).
Just wanted to make sure you caught that, and have the patch installed. Thanks!


Re: Memory performance

Post by ashleyw »

We are backing up to CIFS storage (a fast NexentaStor ZFS 10GbE backup target), and we have set the backup repository to use "write data to this share through the proxying server".
So from what we have been told previously, the rollups happen on the backup proxy (when we first started using Veeam they were running on the Veeam console server, which was causing us all sorts of headaches). We are not aware of any other way of configuring CIFS stores in Veeam.

We are definitely running the patch - I put the patch in at the end of December.


Re: Memory performance

Post by tsightler »

ashleyw wrote: We are backing up to CIFS storage (a fast NexentaStor ZFS 10GbE backup target), and we have set the backup repository to use "write data to this share through the proxying server".
So from what we have been told previously, the rollups happen on the backup proxy (when we first started using Veeam they were running on the Veeam console server, which was causing us all sorts of headaches). We are not aware of any other way of configuring CIFS stores in Veeam.
Got it, selecting the "write data to this share through the proxying server" is the CIFS equivalent of creating a repository on a remote machine so your setup is right. I just wanted to make sure.


Re: Memory performance

Post by tsightler »

Anton, any chance that the "memory" fix that was implemented on the server wasn't included in the client transport agents that would be used in the case of a CIFS repository configured the way Ashley's are? I suspect this is unlikely, but just a thought.


Re: Memory performance

Post by ashleyw »

I can confirm after setting up a test today - jobs are still failing due to "Compile FIB failed: Client error: Insufficient system resources exist to complete the requested service" despite supposedly limiting the cacheset to 3GB and giving each of the 2 proxies 8GB ram.

At one stage in the job I saw 5 roll ups happening at the same time and 4 back up jobs running simultaneously (due to the scheduler bug not taking into account running roll ups on the backup proxy as active jobs) - when only 4 jobs should have been running simultaneously.

The retry runs again but then wrongly returns a "success" status because, due to a retry bug, it does not attempt to rerun the roll up (as previously discussed).

So "Help me Obi-Wan Kenobi, you're my only hope"!


Re: Memory performance

Post by Gostev »

@Tom we have the single data processing agent that is used everywhere (in every component).

I've split this into a new topic since this seems to be an isolated issue in a single environment and unrelated to the original topic about the Windows system cache taking all available RAM when heavy disk I/O takes place.


Re: Synthetic full rollup failures

Post by ashleyw »

thanks. So should I just continue working with the Veeam team for the APAC region (things have gone silent since the support call was routed to them) and forget about the forums until I can update with the solutions etc?


Re: Synthetic full rollup failures

Post by tsightler »

If the issues you are seeing are bugs (certainly possible) then support is going to be the only way to get a resolution.


Re: Synthetic full rollup failures

Post by ashleyw »

Is there anyone else that can confirm what we are seeing here in terms of the job scheduling engine? We have 2 x backup proxies with 2 threads each, so 4 jobs should run at once. But once a job reaches the 99% phase (i.e. the start of the rollup), the scheduler thinks the job has completed and tries to schedule another job to the engine. In our case, where a job's synthetic roll up can take hours, we end up with 9 jobs actually running and overloading the engines, making any memory issue much worse than it should be. The screen shot shows 4 jobs in the incremental stage and 5 that are busy with the synthetic full rollup.

See screen shot below;
Image

I'm battling to get Veeam support to understand this and it's fundamental to getting these synthetic full failures resolved IMHO. If the synthetic fulls are so resource intensive then a single synthetic full running on an engine should perhaps stall/queue up all other jobs on that engine.
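
The effect described in this post can be illustrated with a toy model. This is only a sketch of the slot accounting as described in the thread, not Veeam's actual scheduler: the function peak_concurrency, the tick-based timing and the job durations are all invented for illustration.

```python
def peak_concurrency(jobs, slots, count_rollup):
    """Return the peak number of jobs active at once in a toy scheduler.

    jobs: list of (incremental_ticks, rollup_ticks) tuples (made-up durations).
    slots: total task slots, e.g. 2 proxies x 2 threads = 4.
    count_rollup: if False, a job frees its slot as soon as its incremental
    phase ends, even though its rollup is still running.
    """
    pending = list(jobs)
    running = []          # each entry is [incremental_left, rollup_left]
    peak = 0
    while pending or running:
        # A job holds a slot while in its incremental phase, and during the
        # rollup only when rollups are counted against the limit.
        held = sum(1 for job in running if count_rollup or job[0] > 0)
        while pending and held < slots:
            running.append(list(pending.pop(0)))
            held += 1
        peak = max(peak, len(running))
        # Advance every running job by one tick.
        for job in running:
            if job[0] > 0:
                job[0] -= 1
            else:
                job[1] -= 1
        running = [job for job in running if job[0] > 0 or job[1] > 0]
    return peak


jobs = [(1, 5)] * 9   # nine jobs: short incremental, long rollup
print(peak_concurrency(jobs, slots=4, count_rollup=False))  # 9
print(peak_concurrency(jobs, slots=4, count_rollup=True))   # 4
```

With rollups left out of the count, all nine made-up jobs end up active at once despite the limit of 4; counting the rollup against the slot keeps the peak at 4.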


Re: Synthetic full rollup failures

Post by Gostev »

The above is "normal" because backup proxies do not take part in the synthetic full processing, so they get assigned to the other tasks. If these jobs are going to the same backup repository, you may want to limit the number of concurrent tasks per repository instead (because it is the backup repository agent that handles synthetic full processing).


Re: Synthetic full rollup failures

Post by tsightler »

He is using a CIFS repository. In that case, either the main Veeam server performs the rollup (because it "owns" the repository), or the proxy performs the rollup if it is configured with the "use this proxy as a gateway" option, which is useful for spreading the load of synthetic fulls across proxies when using CIFS. If you configure a limit on the number of sessions on a repository, does it count a synthetic full against that limit? It probably should. If so, then there needs to be some similar limit for CIFS repositories as well.

Currently the best workaround is to simply spread the jobs out, ideally scheduling your synthetic fulls for different days of the week.
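
As a rough sketch of that workaround, jobs can be assigned to synthetic full days round-robin so that no single night carries every rollup. The job names and the helper stagger_synthetic_fulls are hypothetical; the resulting days would still need to be entered in each job's schedule by hand.

```python
from itertools import cycle

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def stagger_synthetic_fulls(job_names, days=DAYS):
    """Assign each job a synthetic full day, cycling through the week."""
    schedule = {day: [] for day in days}
    for job, day in zip(job_names, cycle(days)):
        schedule[day].append(job)
    return schedule


# Nine made-up job names, matching the nine jobs seen running at once.
jobs = [f"job-{n}" for n in range(1, 10)]
for day, assigned in stagger_synthetic_fulls(jobs).items():
    print(day, assigned)
```

Listing the largest jobs first places them on different days, which is probably the more important property given that the larger jobs were the ones failing.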


Re: Synthetic full rollup failures

Post by ashleyw »

Thanks. I noticed that the load control option on the repository was set to the default of 4 concurrent jobs - I have lowered that to 2 on each engine now. I suspect this doesn't include jobs at the rollup stage, as the screen shot I pasted had 9 running jobs, whereas this should have limited it to 8 (but each engine is still set to 2 threads). As @Tom says, we use the option on the repository to "Write data to this share: through the following proxy server" to specifically load balance the rollups over 2 proxies rather than have the Backup console doing this. We have the same CIFS share visible on 2 different IP addresses (10.42.3.105\backups and 10.42.4.105) so that we can make full use of an active/active 10GbE connection to our storage.

I increased the RAM on each of the backup proxies to 12GB and even though there were still 9 jobs running simultaneously, the jobs went through with only a single failure of "RPC function call failed. Function name: [DoRpcWithBinary]", which is apparently a known issue in the process of being resolved according to support.

So to be able to see if there is an issue with the memory side of things, there needs to be;
1. Some changes to Veeam around the limits on rollup jobs as @Tom describes precisely. (Nice one Tom!)
2. Some changes to the job retry code, so failed synthetic jobs are retried correctly.

Once we have fixes for these then we can look at why the memory requirements for Veeam synthetic rollups are so steep.


Re: Synthetic full rollup failures

Post by ashleyw »

I received confirmation from support that the transformation activity is not counted as part of concurrent jobs in the repository limits which causes a potentially serious resource issue with multiple transformations running simultaneously in addition to standard backup tasks. It was suggested this will be resolved in a future release most likely 6.0.1 - what timescales are we looking at for this? Is it possible for a hotfix before this if the timescales are long?

We created a pair of additional VMs to separate out the repository layer from the backup proxy layer and after extensive testing, we found that our jobs run fine with the 2 x backup proxies set to 4 vCPU, 4GB ram each (2 threads each) and the pair of repository VMs set to 2 x vCPU, 12GB ram each (2 threads each). If we drop the repository VMs down to 8GB ram each we start to see errors during the transformations with "Compile FIB failed Client error: Insufficient system resources exist".

I still think the RAM requirements specifically during a transformation process are extremely heavy - perhaps this needs to be noted in the FAQ?
A failed transformation should either not trigger a job retry, or the retry should attempt the transformation again - at the moment the retry results in no transformation and a success status. Is this something that can be fixed?


Re: Synthetic full rollup failures

Post by Gostev »

We are not planning a 6.0.1 release, and as always I cannot publicly share any dates for the next "normal" release. Inclusion in a hotfix is unlikely for this kind of enhancement, especially since it touches the job scheduling engine - I will check, but very unlikely.

Could you please lookup the actual memory consumption of Veeam agent process during the transform, and see what you are getting there?

Retry behavior should be fixed for sure, if it indeed works the way you have described.
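
One rough way to collect the agent memory figure requested above is to capture the output of the Windows command "tasklist /FO CSV /NH" while a transform is running and total the "Mem Usage" column for the agent process. The sketch below only parses such output. The image name VeeamAgent.exe is an assumption, the sample rows are made up for illustration rather than measured, and the "Mem Usage" text format varies with locale.

```python
import csv
import io


def total_mem_kb(tasklist_csv, image_name):
    """Sum the 'Mem Usage' column (KB) for every process named image_name.

    Expects text in the layout printed by `tasklist /FO CSV /NH`:
    "Image Name","PID","Session Name","Session#","Mem Usage"
    """
    total = 0
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 5 and row[0].lower() == image_name.lower():
            # Keep digits only, so "1,234,567 K" becomes 1234567.
            digits = "".join(ch for ch in row[4] if ch.isdigit())
            total += int(digits) if digits else 0
    return total


# Made-up sample rows, not real measurements.
sample = (
    '"VeeamAgent.exe","4120","Services","0","1,234,567 K"\n'
    '"VeeamAgent.exe","5332","Services","0","765,433 K"\n'
    '"svchost.exe","812","Services","0","10,000 K"\n'
)
print(total_mem_kb(sample, "VeeamAgent.exe"))  # 2000000
```

Sampling this every few minutes during a transform would show whether the agent processes themselves or the system cache account for the memory pressure.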


Re: Synthetic full rollup failures

Post by ashleyw »

Anton, do you know if the change to the scheduling engine to include jobs in the synthetic transformation stage as part of the concurrent job count will be included in Patch #3 that you referred to in your weekly digest?


Re: Synthetic full rollup failures

Post by Gostev »

Hi Ashley, no - as I've said above, this particular fix appears to be quite complex (touches too many things), so unfortunately this issue will only be fixed in the next minor release (Q2 this year). Thanks!

clint
Novice
Posts: 4
Liked: never
Joined: Jan 16, 2012 12:00 am
Full Name: Clint
Contact:

Re: Synthetic full rollup failures

Post by clint »

Hi,

Is there a workaround for this problem? It has been causing me serious issues for weeks (open support case, just found this thread which is exactly what my issue is).

Is it the case that this was not an issue in V5? This only seemed to start happening when we went to V6.

Thanks,

Clint
