Due to a catastrophic local storage failure, all performance extents suffered "unplanned removal" from the SOBR. We added a temporary extent and relied on the capacity tier while the local storage was rebuilt. Once rebuilt, the three local performance extents were re-added to the SOBR and rescanned, the temp extent was removed, and the SOBR was rescanned again. At that point the DB, indexes, etc. should all have been in sync - that's the purpose of a rescan: "go find what's there and make it work".
Since then we've been plagued by ongoing "Local index is not synchronized with object storage, please rescan the scale-out backup repository" errors, and offloads have had a high failure rate.
We've applied DB changes like:
Code:
UPDATE [dbo].[Backup.Model.BackupArchiveIndices] set version = '210' where archive_index_id = '2a1aa577-b9d7-4043-9662-fd4fd933674a'
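(For anyone following along: before running that UPDATE we check what's currently stored - a minimal sanity-check query, assuming the same table and columns as the UPDATE above; the GUID is just the affected archive index from our environment.)

Code:
-- Check the recorded index version for the affected extent before changing it
SELECT archive_index_id, version
FROM [dbo].[Backup.Model.BackupArchiveIndices]
WHERE archive_index_id = '2a1aa577-b9d7-4043-9662-fd4fd933674a';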
The issues persist. Today I noticed this:
Code:
Index has been recently resynchronized and can only be accessed in 4 hours 45 minutes
Does that mean "I see a resync is needed, but one was done recently, so I'm going to ignore the new results for 4.75 hours"? Why?
Generally - SOBR offloads generate errors in normal operation, and they should not. If there is a "normal" level of errors, perhaps there should also be a threshold at which something MUCH more vocal happens than logging each error to the Windows event log and putting the results in a semi-hidden job that no-one looks at.
Please remember that for many SOBR + object storage use cases, the object tier is "the" offsite copy. It's important that offload happens in a timely manner and that real failures are reported loudly.
Without wanting to be unreasonable here - SOBR offload, index sync, etc. just need to be more resilient and more reliable. Errors should not be "expected" in normal production operation. I see there are changes to object storage in v12, but I've not yet had a chance to catch up on whether they affect SOBR operation, and what the migration path to "the new world" is, especially with immutability configured in S3.
Thanks guys