VSS SQL Writer Failing

Sushil_BA · Post by **Sushil_BA** » Dec 21, 2016 10:41 pm this post

Hi All,

We have application-aware backup of a SQL server failing intermittently. The SQL Server is a secondary node in an always-on cluster, with total of 9 databases, of which 7 are synced and 2 are standalone.

Error reported on backup job:

Code: Select all

21/12/2016 1:06:15 AM :: Unable to release guest. Details: Failed to call RPC function 'Vss.Unfreeze': Error code: 0x80004005. Failed to invoke func [Unfreeze]: Unspecified error. Unfreeze error (over VIX): [Backup job failed. Cannot create a shadow copy of the volumes containing writer's data. A VSS critical writer has failed. Writer name: [SqlServerWriter]. Class ID: [{a65faa63-5ea8-4ebc-9dbd-a0c4db26912a}]. Instance ID: [{1860ad42-e7fa-4b97-8331-806abfee0c4e}]. Writer's state: [VSS_WS_FAILED_AT_POST_SNAPSHOT]. Error code: [0x800

A restart of VSS SQL Writer service stabilizes the writer and backup works fine for 3-4 days and the above error is reported again.

Windows Event logs:

Code: Select all

Source:        SQLVDI
Date:          21/12/2016 1:06:07 AM
Event ID:      1
Level:         Error
Description:
SQLVDI: Loc=TriggerAbort. Desc=invoked. ErrorCode=(0). Process=1380. Thread=572. Server. Instance=DB01. VD=Global\{F8269FBE-DC09-4853-9C78-E03CF39A8FEA}1_SQLVDIMemoryName_0. 
Starting up database 'SecurityAnalyticsEngine'.


Source:        SQLWRITER
Date:          21/12/2016 1:06:07 AM
Event ID:      24583
Level:         Error
Description:
Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80040e14. SQLSTATE: 42000, Native Error: 3013
Error state: 1, Severity: 16
Source: Microsoft SQL Server Native Client 11.0
Error message: BACKUP DATABASE is terminating abnormally.
SQLSTATE: 42000, Native Error: 3271
Error state: 1, Severity: 16
Source: Microsoft SQL Server Native Client 11.0
Error message: A nonrecoverable I/O error occurred on file "{F8269FBE-DC09-4853-9C78-E03CF39A8FEA}1:" 995(The I/O operation has been aborted because of either a thread exit or an application request.).
SQLSTATE: 01000, Native Error: 4035
Error state: 1, Severity: 0
Source: Microsoft SQL Server Native Client 11.0
Error message: Processed 0 pages for database 'AHP', file 'AHP_mdb' on file 1.

Source:        MSSQL$DB01
Date:          21/12/2016 1:06:07 AM
Event ID:      18210
Level:         Error
Description:
BackupVirtualDeviceFile::SendFileInfoBegin:  failure on backup device '{F8269FBE-DC09-4853-9C78-E03CF39A8FEA}1'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

Initially I thought there was high rate of I/O during the snapshot, so VSS storage is not able to keep up with the I/O. Keeping this in mind I increased the shadow storage limit, although there were no volsnap out of storage error. But this didn't work. The primary SQL server and secondary SQL server were backing at the same time, so I changed the schedule to run secondary earlier than primary, but this didn't work either.

I had a call opened with support # 01998583. The support engineer referred me to Microsoft KB article - https://msdn.microsoft.com/en-us/librar ... 30%29.aspx. We already have max worker threads setup to 512, so this wasn't applicable. He has now asked me to contact Microsoft as the issue is not with Veeam components but with VSS writer. I'm in the process of contacting Microsoft.

However, in the meantime I want to ask has anybody come across this before? Any SQL guru out there who can advise me to things to look at in order to troubleshoot this.

Another thing I noticed in Windows logs, after the SQL writer, VDI errors, there is an event for starting up database.

Starting up database 'SecurityAnalyticsEngine'.

I understand this is reported because auto-close in SQL is on, could this be in any way contributing towards SQL writer failure?

Thanks
Sushil

Post by **foggy** » Dec 22, 2016 12:32 pm this post

Hello Sushil, yes, this looks like a Microsoft Issue, so contacting them is indeed the right step. You can search the forums for similar errors, chances are you will get some hints to check while waiting for reply from them.

nmdange · Post by **nmdange** » Dec 22, 2016 7:16 pm this post

Auto-close is pretty unusual so you may want to try turning that off. It looks like it's deprecated anyway so you probably shouldn't be using it.

Sushil_BA · Post by **Sushil_BA** » Dec 22, 2016 9:17 pm this post

foggy wrote:Hello Sushil, yes, this looks like a Microsoft Issue, so contacting them is indeed the right step. You can search the forums for similar errors, chances are you will get some hints to check while waiting for reply from them.

Thanks foggy, that's what I'm hoping for.

Sushil_BA · Post by **Sushil_BA** » Dec 22, 2016 9:17 pm this post

nmdange wrote:Auto-close is pretty unusual so you may want to try turning that off. It looks like it's deprecated anyway so you probably shouldn't be using it.

Thanks nmdange, I'll ask the DBAs to turn off auto-close, will see how it goes.

Sushil_BA · Post by **Sushil_BA** » Jan 09, 2017 11:36 pm this post

Hi Guys,

I've heard back from the DBA's after the long holiday period. Interestingly auto-close is set to off on all the databases but still starting up database messages are reported, could this mean database is crashing when SQL writer reports error?

Post by **Mike Resseler** » Jan 10, 2017 7:41 am this post

Hi Sushil,

I would not expect a DB to crash when the SQL VSS writer reports an error. This most likely 'hides' another problem. What is the status with your MSFT support call?

Thanks
Mike

skrause · Post by **skrause** » Jan 10, 2017 8:44 pm this post

I have gotten similar errors with file servers that had VSS snapshots enabled on their volumes, though those VSS errors all related to "not enough space on target volume" even though there was ample space.

I ended up setting my AAIP settings on those jobs to ignore failures instead of require success so the backups would complete with a warning instead of erroring out all the time.

Sushil_BA · Post by **Sushil_BA** » Jan 11, 2017 5:53 am this post

Mike Resseler wrote:I would not expect a DB to crash when the SQL VSS writer reports an error. This most likely 'hides' another problem. What is the status with your MSFT support call?

Hi Mike,

Unfortunately I've not been able to raise a case with microsoft due to some technical difficulties at our end.

However, looking more into the SQL logs , I've found that whenever the backup ends up in warning with VSS writer failing, there is a timeout error about losing connection with primary replica. And then there is a log entry of restarting the database to re-synchronize it with primary database, which explains the starting up database messages.

Typically this is the sequence when backup is successful:

Code: Select all

2017-01-05 01:04:12.34 spid71      I/O is frozen on database WebPasswordSafe. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-05 01:04:12.35 spid70      I/O is frozen on database master. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-05 01:05:07.19 spid60      I/O was resumed on database CTAudit. No user action is required.
2017-01-05 01:05:07.19 spid66      I/O was resumed on database msdb. No user action is required.
2017-01-05 01:05:10.11 Backup      Database backed up. Database: AHP, creation date(time): 2016/07/21(07:53:11), pages dumped: 1704588, first LSN: 6542:205784:1, last LSN: 6542:217493:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{352A65EB-814F-4E2B-9686-D807547E48F7}1'}). This is an informational message only. No user action is required.
2017-01-05 01:05:10.11 Backup      Database backed up. Database: CTAudit, creation date(time): 2016/08/03(11:56:17), pages dumped: 2301, first LSN: 911:1389:1, last LSN: 911:2868:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {'{352A65EB-814F-4E2B-9686-D807547E48F7}2'}). This is an informational message only. No user action is required.

And when the backup is warning and SQL writer fails:

Code: Select all

2017-01-04 01:04:18.50 spid67      I/O is frozen on database WebPasswordSafe. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-04 01:04:18.50 spid68      I/O is frozen on database master. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2017-01-04 01:05:08.22 spid50s     A connection timeout has occurred on a previously established connection to availability replica 'PRD-SQL-01\DB01' with id [C76F5774-E079-4351-B9C8-C99A92745E7B].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
2017-01-04 01:05:08.22 spid50s     AlwaysOn Availability Groups connection with primary database terminated for secondary database 'WebPasswordSafe' on the availability replica 'PRD-SQL-01\DB01' with Replica ID: {c76f5774-e079-4351-b9c8-c99a92745e7b}. This is an informational message only. No user action is required.
2017-01-04 01:05:08.22 spid32s     AlwaysOn Availability Groups connection with primary database terminated for secondary database 'CTAudit' on the availability replica 'PRD-SQL-01\DB01' with Replica ID: {c76f5774-e079-4351-b9c8-c99a92745e7b}. This is an informational message only. No user action is required.
2017-01-04 01:05:08.57 spid26s     The recovery LSN (1840:1343:1) was identified for the database with ID 9. This is an informational message only. No user action is required.
2017-01-04 01:05:08.57 spid75s     The recovery LSN (198475:106583:1) was identified for the database with ID 12. This is an informational message only. No user action is required.
2017-01-04 01:05:08.57 spid38s     The recovery LSN (13588:57546:1) was identified for the database with ID 11. This is an informational message only. No user action is required.
2017-01-04 01:05:11.08 spid71      I/O was resumed on database ReportServerTempDB. No user action is required.
2017-01-04 01:05:11.08 spid60      I/O was resumed on database CTData. No user action is required.
2017-01-04 01:05:11.09 spid62      I/O was resumed on database CTAudit. No user action is required.
2017-01-04 01:05:11.22 Backup      Error: 3041, Severity: 16, State: 1.
2017-01-04 01:05:11.22 Backup      BACKUP failed to complete the command BACKUP DATABASE ORY. Check the backup application log for detailed messages.
2017-01-04 01:05:11.23 spid69      Error: 18210, Severity: 16, State: 1.
2017-01-04 01:05:11.23 spid69      BackupVirtualDeviceFile::SendFileInfoBegin:  failure on backup device '{2C0B396F-97A0-4489-ACFE-709AF03F1061}7'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2017-01-04 01:05:11.23 Backup      Error: 3041, Severity: 16, State: 1.
2017-01-04 01:05:11.28 spid50s     Availability database 'AHP', which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
2017-01-04 01:05:11.28 spid78s     Nonqualified transactions are being rolled back in database AHP for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2017-01-04 01:05:11.28 spid78s     State information for database 'AHP' - Hardended Lsn: '(6462:178090:1)'    Commit LSN: '(6462:178088:5)'    Commit Time: 'Jan  4 2017  1:04AM'
2017-01-04 01:05:11.29 spid81s     AlwaysOn Availability Groups connection with primary database terminated for secondary database 'AHP' on the availability replica 'PRD-SQL-01\DB01' with Replica ID: {c76f5774-e079-4351-b9c8-c99a92745e7b}. This is an informational message only. No user action is required.
2017-01-04 01:05:11.30 spid75s     Error: 35278, Severity: 17, State: 1.

I see these connection timeout errors on primary node as well, it's backup works fine. Do you know if there is a way to increase the connection timeout between secondary and primary replica?

Thanks
Sushil

skrause · Post by **skrause** » Jan 11, 2017 1:48 pm this post

There are a couple of ways to adjust the timeouts. The KB below was linked in an old thread about backing up AlwaysOn.

https://www.veeam.com/kb1744

R&D Forums

VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Re: VSS SQL Writer Failing

Who is online