VSS SQL Writer Failing

Availability for the Always-On Enterprise

VSS SQL Writer Failing

Veeam Logoby Sushil_BA » Wed Dec 21, 2016 10:41 pm

Hi All,

We have application-aware backup of a SQL server failing intermittently. The SQL Server is a secondary node in an always-on cluster, with total of 9 databases, of which 7 are synced and 2 are standalone.

Error reported on backup job:

Code: Select all
21/12/2016 1:06:15 AM :: Unable to release guest. Details: Failed to call RPC function 'Vss.Unfreeze': Error code: 0x80004005. Failed to invoke func [Unfreeze]: Unspecified error. Unfreeze error (over VIX): [Backup job failed. Cannot create a shadow copy of the volumes containing writer's data. A VSS critical writer has failed. Writer name: [SqlServerWriter]. Class ID: [{a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}]. Instance ID: [{1860ad42-e7fa-4b97-8331-806abfee0c4e}]. Writer's state: [VSS_WS_FAILED_AT_POST_SNAPSHOT]. Error code: [0x800 


A restart of VSS SQL Writer service stabilizes the writer and backup works fine for 3-4 days and the above error is reported again.

Windows Event logs:
Code: Select all
Source:        SQLVDI
Date:          21/12/2016 1:06:07 AM
Event ID:      1
Level:         Error
Description:
SQLVDI: Loc=TriggerAbort. Desc=invoked. ErrorCode=(0). Process=1380. Thread=572. Server. Instance=DB01. VD=Global\{F8269FBE-DC09-4853-9C78-E03CF39A8FEA}1_SQLVDIMemoryName_0.
Starting up database 'SecurityAnalyticsEngine'.


Source:        SQLWRITER
Date:          21/12/2016 1:06:07 AM
Event ID:      24583
Level:         Error
Description:
Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80040e14. SQLSTATE: 42000, Native Error: 3013
Error state: 1, Severity: 16
Source: Microsoft SQL Server Native Client 11.0
Error message: BACKUP DATABASE is terminating abnormally.
SQLSTATE: 42000, Native Error: 3271
Error state: 1, Severity: 16
Source: Microsoft SQL Server Native Client 11.0
Error message: A nonrecoverable I/O error occurred on file "{F8269FBE-DC09-4853-9C78-E03CF39A8FEA}1:" 995(The I/O operation has been aborted because of either a thread exit or an application request.).
SQLSTATE: 01000, Native Error: 4035
Error state: 1, Severity: 0
Source: Microsoft SQL Server Native Client 11.0
Error message: Processed 0 pages for database 'AHP', file 'AHP_mdb' on file 1.

Source:        MSSQL$DB01
Date:          21/12/2016 1:06:07 AM
Event ID:      18210
Level:         Error
Description:
BackupVirtualDeviceFile::SendFileInfoBegin:  failure on backup device '{F8269FBE-DC09-4853-9C78-E03CF39A8FEA}1'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).


Initially I thought there was high rate of I/O during the snapshot, so VSS storage is not able to keep up with the I/O. Keeping this in mind I increased the shadow storage limit, although there were no volsnap out of storage error. But this didn't work. The primary SQL server and secondary SQL server were backing at the same time, so I changed the schedule to run secondary earlier than primary, but this didn't work either.

I had a call opened with support # 01998583. The support engineer referred me to Microsoft KB article - https://msdn.microsoft.com/en-us/librar ... 30%29.aspx. We already have max worker threads setup to 512, so this wasn't applicable. He has now asked me to contact Microsoft as the issue is not with Veeam components but with VSS writer. I'm in the process of contacting Microsoft.

However, in the meantime I want to ask has anybody come across this before? Any SQL guru out there who can advise me to things to look at in order to troubleshoot this.

Another thing I noticed in Windows logs, after the SQL writer, VDI errors, there is an event for starting up database.

Starting up database 'SecurityAnalyticsEngine'.

I understand this is reported because auto-close in SQL is on, could this be in any way contributing towards SQL writer failure?

Thanks
Sushil
Sushil_BA
Novice
 
Posts: 8
Liked: never
Joined: Wed Dec 21, 2016 9:29 pm
Full Name: Sushil Gambhir

Re: VSS SQL Writer Failing

Veeam Logoby foggy » Thu Dec 22, 2016 12:32 pm

Hello Sushil, yes, this looks like a Microsoft Issue, so contacting them is indeed the right step. You can search the forums for similar errors, chances are you will get some hints to check while waiting for reply from them.
foggy
Veeam Software
 
Posts: 15303
Liked: 1133 times
Joined: Mon Jul 11, 2011 10:22 am
Full Name: Alexander Fogelson

Re: VSS SQL Writer Failing

Veeam Logoby nmdange » Thu Dec 22, 2016 7:16 pm

Auto-close is pretty unusual so you may want to try turning that off. It looks like it's deprecated anyway so you probably shouldn't be using it.
nmdange
Expert
 
Posts: 233
Liked: 60 times
Joined: Thu Aug 20, 2015 9:30 pm

Re: VSS SQL Writer Failing

Veeam Logoby Sushil_BA » Thu Dec 22, 2016 9:17 pm

foggy wrote:Hello Sushil, yes, this looks like a Microsoft Issue, so contacting them is indeed the right step. You can search the forums for similar errors, chances are you will get some hints to check while waiting for reply from them.


Thanks foggy, that's what I'm hoping for.
Sushil_BA
Novice
 
Posts: 8
Liked: never
Joined: Wed Dec 21, 2016 9:29 pm
Full Name: Sushil Gambhir

Re: VSS SQL Writer Failing

Veeam Logoby Sushil_BA » Thu Dec 22, 2016 9:17 pm

nmdange wrote:Auto-close is pretty unusual so you may want to try turning that off. It looks like it's deprecated anyway so you probably shouldn't be using it.


Thanks nmdange, I'll ask the DBAs to turn off auto-close, will see how it goes.
Sushil_BA
Novice
 
Posts: 8
Liked: never
Joined: Wed Dec 21, 2016 9:29 pm
Full Name: Sushil Gambhir

Re: VSS SQL Writer Failing

Veeam Logoby Sushil_BA » Mon Jan 09, 2017 11:36 pm

Hi Guys,

I've heard back from the DBA's after the long holiday period. Interestingly auto-close is set to off on all the databases but still starting up database messages are reported, could this mean database is crashing when SQL writer reports error?
Sushil_BA
Novice
 
Posts: 8
Liked: never
Joined: Wed Dec 21, 2016 9:29 pm
Full Name: Sushil Gambhir

Re: VSS SQL Writer Failing

Veeam Logoby Mike Resseler » Tue Jan 10, 2017 7:41 am

Hi Sushil,

I would not expect a DB to crash when the SQL VSS writer reports an error. This most likely 'hides' another problem. What is the status with your MSFT support call?

Thanks
Mike
Mike Resseler
Veeam Software
 
Posts: 3382
Liked: 384 times
Joined: Fri Feb 08, 2013 3:08 pm
Location: Belgium, the land of the fries, the beer, the chocolate and the diamonds...
Full Name: Mike Resseler

Re: VSS SQL Writer Failing

Veeam Logoby skrause » Tue Jan 10, 2017 8:44 pm

I have gotten similar errors with file servers that had VSS snapshots enabled on their volumes, though those VSS errors all related to "not enough space on target volume" even though there was ample space.

I ended up setting my AAIP settings on those jobs to ignore failures instead of require success so the backups would complete with a warning instead of erroring out all the time.
Steve Krause
Veeam Certified Architect
skrause
Expert
 
Posts: 308
Liked: 45 times
Joined: Mon Dec 08, 2014 2:58 pm
Full Name: Steve Krause

Re: VSS SQL Writer Failing

Veeam Logoby Sushil_BA » Wed Jan 11, 2017 5:53 am

Mike Resseler wrote:I would not expect a DB to crash when the SQL VSS writer reports an error. This most likely 'hides' another problem. What is the status with your MSFT support call?


Hi Mike,

Unfortunately I've not been able to raise a case with microsoft due to some technical difficulties at our end.

However, looking more into the SQL logs , I've found that whenever the backup ends up in warning with VSS writer failing, there is a timeout error about losing connection with primary replica. And then there is a log entry of restarting the database to re-synchronize it with primary database, which explains the starting up database messages.

Typically this is the sequence when backup is successful:
Code: Select all
2017-01-05 01:04:12.34 spid71      I/O is frozen on database WebPasswordSafe. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-05 01:04:12.35 spid70      I/O is frozen on database master. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-05 01:05:07.19 spid60      I/O was resumed on database CTAudit. No user action is required.
2017-01-05 01:05:07.19 spid66      I/O was resumed on database msdb. No user action is required.
2017-01-05 01:05:10.11 Backup      Database backed up. Database: AHP, creation date(time): 2016/07/21(07:53:11), pages dumped: 1704588, first LSN: 6542:205784:1, last LSN: 6542:217493:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{352A65EB-814F-4E2B-9686-D807547E48F7}1'}). This is an informational message only. No user action is required.
2017-01-05 01:05:10.11 Backup      Database backed up. Database: CTAudit, creation date(time): 2016/08/03(11:56:17), pages dumped: 2301, first LSN: 911:1389:1, last LSN: 911:2868:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{352A65EB-814F-4E2B-9686-D807547E48F7}2'}). This is an informational message only. No user action is required.


And when the backup is warning and SQL writer fails:

Code: Select all
2017-01-04 01:04:18.50 spid67      I/O is frozen on database WebPasswordSafe. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-04 01:04:18.50 spid68      I/O is frozen on database master. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-04 01:05:08.22 spid50s     A connection timeout has occurred on a previously established connection to availability replica 'PRD-SQL-01\DB01' with id [C76F5774-E079-4351-B9C8-C99A92745E7B].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
2017-01-04 01:05:08.22 spid50s     AlwaysOn Availability Groups connection with primary database terminated for secondary database 'WebPasswordSafe' on the availability replica 'PRD-SQL-01\DB01' with Replica ID: {c76f5774-e079-4351-b9c8-c99a92745e7b}. This is an informational message only. No user action is required.
2017-01-04 01:05:08.22 spid32s     AlwaysOn Availability Groups connection with primary database terminated for secondary database 'CTAudit' on the availability replica 'PRD-SQL-01\DB01' with Replica ID: {c76f5774-e079-4351-b9c8-c99a92745e7b}. This is an informational message only. No user action is required.
2017-01-04 01:05:08.57 spid26s     The recovery LSN (1840:1343:1) was identified for the database with ID 9. This is an informational message only. No user action is required.
2017-01-04 01:05:08.57 spid75s     The recovery LSN (198475:106583:1) was identified for the database with ID 12. This is an informational message only. No user action is required.
2017-01-04 01:05:08.57 spid38s     The recovery LSN (13588:57546:1) was identified for the database with ID 11. This is an informational message only. No user action is required.
2017-01-04 01:05:11.08 spid71      I/O was resumed on database ReportServerTempDB. No user action is required.
2017-01-04 01:05:11.08 spid60      I/O was resumed on database CTData. No user action is required.
2017-01-04 01:05:11.09 spid62      I/O was resumed on database CTAudit. No user action is required.
2017-01-04 01:05:11.22 Backup      Error: 3041, Severity: 16, State: 1.
2017-01-04 01:05:11.22 Backup      BACKUP failed to complete the command BACKUP DATABASE ORY. Check the backup application log for detailed messages.
2017-01-04 01:05:11.23 spid69      Error: 18210, Severity: 16, State: 1.
2017-01-04 01:05:11.23 spid69      BackupVirtualDeviceFile::SendFileInfoBegin:  failure on backup device '{2C0B396F-97A0-4489-ACFE-709AF03F1061}7'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2017-01-04 01:05:11.23 Backup      Error: 3041, Severity: 16, State: 1.
2017-01-04 01:05:11.28 spid50s     Availability database 'AHP', which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
2017-01-04 01:05:11.28 spid78s     Nonqualified transactions are being rolled back in database AHP for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2017-01-04 01:05:11.28 spid78s     State information for database 'AHP' - Hardended Lsn: '(6462:178090:1)'    Commit LSN: '(6462:178088:5)'    Commit Time: 'Jan  4 2017  1:04AM'
2017-01-04 01:05:11.29 spid81s     AlwaysOn Availability Groups connection with primary database terminated for secondary database 'AHP' on the availability replica 'PRD-SQL-01\DB01' with Replica ID: {c76f5774-e079-4351-b9c8-c99a92745e7b}. This is an informational message only. No user action is required.
2017-01-04 01:05:11.30 spid75s     Error: 35278, Severity: 17, State: 1.


I see these connection timeout errors on primary node as well, it's backup works fine. Do you know if there is a way to increase the connection timeout between secondary and primary replica?

Thanks
Sushil
Sushil_BA
Novice
 
Posts: 8
Liked: never
Joined: Wed Dec 21, 2016 9:29 pm
Full Name: Sushil Gambhir

Re: VSS SQL Writer Failing

Veeam Logoby skrause » Wed Jan 11, 2017 1:48 pm

There are a couple of ways to adjust the timeouts. The KB below was linked in an old thread about backing up AlwaysOn.

https://www.veeam.com/kb1744
Steve Krause
Veeam Certified Architect
skrause
Expert
 
Posts: 308
Liked: 45 times
Joined: Mon Dec 08, 2014 2:58 pm
Full Name: Steve Krause


Return to Veeam Backup & Replication



Who is online

Users browsing this forum: Google [Bot] and 1 guest