Host-based backup of VMware vSphere VMs.
bg.ranken
Expert
Posts: 121
Liked: 21 times
Joined: Feb 18, 2015 8:13 pm
Full Name: Randall Kender
Contact:

NetApp Application Consistent Storage Snapshots During Production Hours

Post by bg.ranken »

We currently have a NetApp SnapCenter server set up and configured to do application consistent snapshots for many of our SQL servers throughout the day. However, we'd like to transition these jobs into Veeam to run periodic application consistent storage snapshots. This means a job that runs periodically (some set for every hour, some set for every 4 hours) and only triggers a storage snapshot, not any Veeam backups. This was suggested by quite a few Veeam reps and even a NetApp rep: if Veeam creates the storage snapshot itself, it does not have to go through the mounting process to scan snapshots as it currently does every time SnapCenter creates one.

However, one issue we're running into is that when Veeam does the same thing, the VSS freeze on the SQL server is so long that we can't run the jobs during production hours like we can with the SnapCenter jobs. Currently the SnapCenter jobs trigger the VSS freeze, create the storage snapshot, and then unfreeze, and that takes anywhere from 1-3 seconds. However, when the same thing is done in Veeam with a job using the "ONTAP Snapshot (Primary Storage Snapshot Only)" setting, we are looking at freeze times between 15-40 seconds.

Is this expected behavior? Does anyone else also have a NetApp with Veeam storage snapshot jobs that run during production hours without this long of a freeze? Right now the freeze is too long and many of our applications time out or throw errors due to this and we've had to disable the jobs.

A colleague has a case open (04716753) but so far we haven't been able to figure out what is causing the long freeze or if it is expected.
foggy
Veeam Software
Posts: 21069
Liked: 2115 times
Joined: Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson
Contact:

Re: NetApp Application Consistent Storage Snapshots During Production Hours

Post by foggy »

Hi Randall, in the general case, snapshot-only job creates a VMware snapshot after quiescing the VM. However, in case the volume doesn't contain disks of any other VMs from the same job, the VM can be processed without VMware snapshot. I suspect it takes longer due to VMware snapshot processing - could you please check if this is the case?

Post by bg.ranken »

So we did some testing yesterday and confirmed that the job was creating VMware snapshots.

We went ahead and tried relocating a VM to its own datastores and were able to obtain freeze-only mode without it doing the VMware snapshot. However, the freeze on the system still lasted 20+ seconds even without the VMware snapshot being processed.
orb
Service Provider
Posts: 126
Liked: 27 times
Joined: Apr 01, 2016 5:36 pm
Full Name: Olivier
Contact:

Re: NetApp Application Consistent Storage Snapshots During Production Hours

Post by orb »

Hi,

We had something similar years ago with a customer and a very busy MS-SQL where NetApp Snapshot was involved with DirectNFS. We had some major stuns, timeouts and client disconnections during our backup. The VSS was forcing a memory flush to disk, which created massive I/O.

We never got to the bottom of this. The system was running 24/7 intensely and classical dumps were enough for our customers.

Did you use the SQL Agent from NetApp as well with SnapCenter? It is not very clear to me.

Oli

Post by bg.ranken »

So here's what the support rep said are the times in regard to our freeze-only jobs:

Code:

It takes 12 seconds to freeze the vm, 18 seconds to take the storage snapshots then 5 seconds to complete the unfreeze. 
So the long time to freeze is fine, but 18 seconds for the storage snapshots and 5 seconds for the unfreeze doesn't sound right, especially when SnapCenter is able to do those same tasks in 1-3 seconds. So far the support rep just gave us the times so still waiting on why it is taking so long to do those steps.

Post by orb »

bg

You can find a file with all steps/timing on the SQL server in %ProgramData%\Veeam for
What model do you have? How many and what type of disks do you have in your aggregate which supports your SQL volumes? Your NetApp may be very busy also.

Oli
Andreas Neufert
VP, Product Management
Posts: 6707
Liked: 1401 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany
Contact:

Re: NetApp Application Consistent Storage Snapshots During Production Hours

Post by Andreas Neufert »

Hi Randall,

we are processing many additional things like metadata collection and setting restore awareness settings which SnapCenter does not.

During normal VSS writer processing the application should not go down. Can you please describe what issues you face?
Our support can give you a VSS snapshot tool with which you can run native VSS processing (without our software in the mix) to verify that you do not have an issue with the native Microsoft VSS commands.

Post by bg.ranken »

Well the problem is that the freeze is so long that applications start to time out. Most of our applications have a 15 second timeout window. Some of the freezes we're seeing on some SQL servers are in the 45+ second range.

And I know there are other things that Veeam does that SnapCenter does not, but I would assume that once you get to the point where the server is frozen, the only thing that needs to be done at that point is the storage snapshots and the unfreeze. Support has not been able to tell why it takes Veeam so long to trigger the snapshots against the NetApp.

We can try the VSS snapshot tools, but at this point looking at the timings that support has given us it doesn't seem to be a VSS issue and is more of a communications issue between Veeam and NetApp. But it appears we've escalated the case as high as we can go, and so far support has simply given us the times from the logs and have not provided us with any direction or possible solutions at all. If you look at the case notes the support reps have many times just sent back the Veeam logs to us to review.

Post by foggy »

According to the logs, 15 out of those 18 seconds takes the connection to the storage, actual snapshots are quite fast - I think this is the issue that should be investigated. Could you please also elaborate on the timeout value - 15 seconds looks quite short, I believe the default VSS writers timeout is 60 sec (20 sec for Exchange).

Post by Andreas Neufert »

Overall the application should not be affected by the VSS processing itself, nor by the storage snapshot processing.
VSS writers can slow down an application but that should not lead to any issues. SnapCenter consistency processing should take the same time when it comes to VSS creation.

Post by bg.ranken »

foggy wrote: Apr 09, 2021 9:58 am According to the logs, 15 out of those 18 seconds takes the connection to the storage, actual snapshots are quite fast - I think this is the issue that should be investigated. Could you please also elaborate on the timeout value - 15 seconds looks quite short, I believe the default VSS writers timeout is 60 sec (20 sec for Exchange).
So that 15 second delay is definitely odd and we're going to start looking into it, thank you for pointing that out.

Not sure if it's related or not but I tested editing the NetApp SVM under the storage integration, going to credentials, and pressing next, and there was almost exactly a 15 second delay where it said "Checking connection..." before it flashed away and started "saving to storage configuration...". But interestingly enough if I cancel out the window and do the same thing again, it only takes a second or two now to do the "Checking connection..." part. But if I leave it for a few hours and come back it takes 15 seconds again. We'll see if we can get support to review the connection to see if something is causing the delay.

Regarding the timeout issue, here's one of the errors generated from one of our applications, but we've gotten similar errors from other applications when the freeze took too long:

Code:

Commit failed with SQL exception
Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
The wait operation timed out

Post by Andreas Neufert »

Strange. Looks like you run into a 30 sec default timeout for SQL operations (incl. SQL VSS Writer release processing).

Do you have something like NLB-Cluster in use that use the same IP address in multiple servers?

Post by bg.ranken »

I do not believe so. So far all of our testing has been with single node SQL servers; no SQL always-on or failover clustering involved as of yet.
mcz
Veeam Legend
Posts: 835
Liked: 172 times
Joined: Jul 19, 2016 8:39 am
Full Name: Michael
Location: Rheintal, Austria
Contact:

Re: NetApp Application Consistent Storage Snapshots During Production Hours

Post by mcz »

bg.ranken wrote: Not sure if it's related or not but I tested editing the NetApp SVM under the storage integration, going to credentials, and pressing next, and there was almost exactly a 15 second delay where it said "Checking connection..." before it flashed away and started "saving to storage configuration...". But interestingly enough if I cancel out the window and do the same thing again, it only takes a second or two now to do the "Checking connection..." part. But if I leave it for a few hours and come back it takes 15 seconds again.
Guys, have you ever run a trace during that operation to see if there's maybe a delay on the storage side, packet loss, or something similar?
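Short of a full packet trace, a rough first check is to time how long a fresh TCP connection to the storage management port takes, then repeat it right away and again after a few idle hours. The Python sketch below is only an illustration of that idea: the function names and the host passed on the command line are placeholders, and it measures name resolution plus the TCP handshake only, not TLS negotiation or the ONTAPI login that Veeam performs.

```python
import socket
import sys
import time


def time_tcp_connect(host, port, timeout=30.0):
    """Return the seconds needed to resolve `host` and open a TCP connection."""
    start = time.perf_counter()
    # create_connection resolves the name and completes the TCP handshake.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.perf_counter() - start


def sample_connects(host, port, attempts=3, pause=1.0, connect=time_tcp_connect):
    """Time several back-to-back connections so a slow first attempt stands out."""
    results = []
    for attempt in range(attempts):
        results.append(connect(host, port))
        if attempt < attempts - 1:
            time.sleep(pause)
    return results


if __name__ == "__main__":
    # Usage (hypothetical host name): python connect_timing.py svm.example.local 443
    if len(sys.argv) == 3:
        for i, seconds in enumerate(sample_connects(sys.argv[1], int(sys.argv[2])), 1):
            print(f"attempt {i}: {seconds:.2f}s")
    else:
        print("usage: connect_timing.py HOST PORT")
```

If the first attempt after a long idle period is slow and the following ones are fast, that would point at something outside the snapshot itself (name resolution, a firewall or session cache, and so on), but it does not replace a proper trace.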

Post by bg.ranken »

We actually just got off the phone with the support reps.

We have not done a trace yet. We actually brought it up to support but they said they do not need it yet.

They had us enable extra logging for the NetApp integration via reg keys from this KB: https://www.veeam.com/kb2409

After adding the key we tried to see if we could reproduce the delay in the console but were unable to produce the issue enough for it to stand out in the logs. We ended up running the job again and the delay in contacting the NetApp after the freeze was still there with no additional logging (at least at the job log level). The actual full snapshot process only takes 4-5 seconds once Veeam is finished connecting to the NetApp. We are sending them the logs again and they are going to be looking into it further to see if there's more logging for what's going on during the delay.

Code:

[13.04.2021 12:07:28] <01> Info         [CAutoSnapshot] Finished VSS Freeze, freezed: 'True'
[13.04.2021 12:07:28] <01> Info         [NetApp] Connecting to NetApp server 'svm***'. SVM: 'svm***' API version '1.15'. User: '***'. Port: '443'. Protocol: 'HTTPS'.
[13.04.2021 12:07:43] <01> Info         [NetApp] Getting ONTAPI version.
[13.04.2021 12:07:43] <01> Info     Invoke:
[13.04.2021 12:07:43] <01> Info         <system-get-ontapi-version/>
[13.04.2021 12:07:43] <01> Info     Response:
They did recommend some alternatives, such as using native SQL backups to avoid the freezes, but for obvious reasons we would prefer not to go down that path. Worst case, we would change the storage snapshots that Veeam is doing to crash consistent, or continue using SnapCenter for SQL, which is still working.
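For anyone wanting to spot this kind of stall in their own job logs, here is a small Python sketch that flags large gaps between consecutive timestamped lines. It assumes the `[dd.mm.yyyy hh:mm:ss]` prefix seen in the excerpt above; the sample lines are copied from that excerpt, and `find_gaps` and its threshold are illustrative names and values, not anything Veeam ships.

```python
import re
from datetime import datetime

# Sample lines copied from the job log excerpt posted in this thread.
LOG = """\
[13.04.2021 12:07:28] <01> Info         [CAutoSnapshot] Finished VSS Freeze, freezed: 'True'
[13.04.2021 12:07:28] <01> Info         [NetApp] Connecting to NetApp server 'svm***'. SVM: 'svm***' API version '1.15'. User: '***'. Port: '443'. Protocol: 'HTTPS'.
[13.04.2021 12:07:43] <01> Info         [NetApp] Getting ONTAPI version.
[13.04.2021 12:07:43] <01> Info     Invoke:
"""

# Assumed prefix format: [dd.mm.yyyy hh:mm:ss] followed by the message.
STAMP = re.compile(r"^\[(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})\]\s*(.*)$")


def find_gaps(text, threshold_seconds=5.0):
    """Return (gap_seconds, earlier_message) for consecutive timestamped
    lines that are at least `threshold_seconds` apart."""
    gaps = []
    previous = None
    for line in text.splitlines():
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp prefix
        stamp = datetime.strptime(match.group(1), "%d.%m.%Y %H:%M:%S")
        if previous is not None:
            delta = (stamp - previous[0]).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, previous[1]))
        previous = (stamp, match.group(2))
    return gaps


if __name__ == "__main__":
    for seconds, message in find_gaps(LOG):
        print(f"{seconds:.0f}s gap after: {message[:60]}")
```

Run against the excerpt, the only gap it reports is the one following the "Connecting to NetApp server" line, which is the delay discussed in this thread.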

Post by bg.ranken »

So I just wanted to give an update to this in case anyone else is looking into the same issue.

Unfortunately Veeam support was never able to determine why this extra 15 seconds was happening when connecting to the storage. We ran multiple tests but they weren't able to determine what was causing it. Because the actual freeze was still under Microsoft's recommendation of 60 seconds, all of the support techs seemed to consider this a non-issue, which is somewhat frustrating for us as a customer. They also continued to think the issue was on the target SQL servers, even though it was pretty easy to prove it wasn't, as regular backup jobs without storage integration didn't have the same issue.

That being said, the issue did go away. However, multiple changes were made between the last test and the issue disappearing:
  • April 2021 Windows Updates, including SQL 2016 SP2 CU17 (All Veeam servers were previously on March updates and SQL 2016 SP2 CU16)
  • NetApp upgrade from 9.7P8 to 9.7P12
With the issue gone a lot of the servers that were previously taking 20-26 seconds to freeze were now taking 4-6 seconds.

Long term, the best way to prevent this issue from happening in the first place (because it's so hard to detect, it may actually be happening to other customers without them being aware) seems to be to move the connection process for storage outside of the freeze process. Something like having Veeam connect to the storage before it starts the freeze, then stay connected while the freeze is happening so it's ready to initiate the storage snapshots the moment the freeze completes. Obviously this is probably a much bigger consideration on the engineering side as it could actually introduce problems, but there may be some performance benefit from doing this. Looking at our freeze times currently with things fixed, moving the connection to storage outside of freeze could still reduce freeze times by 10-15%, and obviously for people having the same connection issues we were having it would reduce freeze times by almost 75% or more.
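To make the arithmetic behind that estimate concrete, here is a tiny Python model of the two orderings. It is only a sketch: the class and function names are made up, and the numbers are the rough figures from this thread (about 15 seconds to connect, about 5 seconds for the snapshot work once connected), not measurements of how Veeam actually schedules these steps.

```python
from dataclasses import dataclass


@dataclass
class FreezeSteps:
    connect_s: float   # time to connect to the storage
    snapshot_s: float  # time for the snapshot work once connected


def frozen_seconds(steps, preconnect):
    """Seconds the guest stays frozen after the VSS freeze completes.

    With preconnect=True the storage connection is assumed to be opened
    before the freeze starts, so only the snapshot work happens inside it.
    """
    if preconnect:
        return steps.snapshot_s
    return steps.connect_s + steps.snapshot_s


def reduction_percent(steps):
    """Percentage by which pre-connecting would shorten the frozen window."""
    current = frozen_seconds(steps, preconnect=False)
    proposed = frozen_seconds(steps, preconnect=True)
    return 100.0 * (current - proposed) / current


if __name__ == "__main__":
    slow = FreezeSteps(connect_s=15.0, snapshot_s=5.0)
    print(f"connect inside freeze:  {frozen_seconds(slow, False):.0f}s frozen")
    print(f"connect before freeze:  {frozen_seconds(slow, True):.0f}s frozen")
    print(f"reduction: {reduction_percent(slow):.0f}%")
```

The actual saving depends on how long the connection takes in a given environment, which is why the benefit is much smaller once the connection delay is gone.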

Not sure if this can be a feature request or something that engineering could consider, but seeing as it could reduce freeze times for all customers if it works it may be worth looking into.

Post by foggy »

Hi Randall, glad it is finally resolved for you and appreciate the update. Also thank you for the detailed request - we'll estimate the feasibility of your suggestion.