mjohnsonn
Novice
Posts: 9
Liked: never
Joined: Sep 01, 2018 6:56 am
Contact:

REFS data corruption detection

Post by mjohnsonn »

Who here has actually tested the ability of ReFS to detect corruption and repair it? For my small team's large amount of scientific data we can't have bit-rot. Tests show ReFS fails at the tasks miserably. We create corruption and almost nothing shows in the event log and nothing gets fixed on Windows 10 Pro for Workstations.
Gostev
Chief Product Officer
Posts: 31457
Liked: 6647 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

ReFS by itself cannot repair corruption; repair is a feature of Storage Spaces Direct. However, ReFS will detect corruption on files with data integrity streams enabled, either when the corrupted block is read or when the ReFS data scrubber encounters it during its patrol reads.
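To make the detection half concrete, here is a minimal Python sketch of detect-on-read. It is a toy model, not ReFS internals: the `ChecksummedStore` class and the use of CRC32 are assumptions chosen for illustration. It shows that a checksum kept apart from the data can flag a changed block when that block is read, and that with a single copy there is nothing to repair from.

```python
import zlib


class ChecksummedStore:
    """Toy block store that keeps a CRC32 per block, separate from the data."""

    def __init__(self):
        self.blocks = []      # raw block contents ("the disk")
        self.checksums = []   # checksum recorded at write time ("the metadata")

    def write_block(self, data: bytes) -> int:
        """Store a block and remember its checksum; return the block index."""
        self.blocks.append(bytearray(data))
        self.checksums.append(zlib.crc32(data))
        return len(self.blocks) - 1

    def read_block(self, index: int) -> bytes:
        """Return the block, or raise if it no longer matches its checksum."""
        data = bytes(self.blocks[index])
        if zlib.crc32(data) != self.checksums[index]:
            # Detection only: with a single copy there is nothing to repair from.
            raise IOError(f"checksum mismatch in block {index}")
        return data


store = ChecksummedStore()
good = store.write_block(b"untouched block")
bad = store.write_block(b"scientific data" * 10)
store.blocks[bad][0] ^= 0xFF          # flip bits behind the store's back

try:
    store.read_block(bad)
    detected = False
except IOError:
    detected = True
```

Nothing is noticed until `read_block` is called on the damaged block, which mirrors the point above that detection happens on read or during a scrub pass.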

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mjohnsonn »

While this is what ReFS is supposed to do, my tests show it fails miserably. Again, my question is: has anybody else here actually generated some corruption and seen ReFS do what it is supposed to do?

Here is an overview of some tests done on W10 Pro for Workstations:
https://social.technet.microsoft.com/Fo ... erverfiles

ReFS corruption detection does not work properly. This problem gets no attention because:

(A) Nobody is looking for it--everyone assumes the product would have to be performing this most basic function.

(B) An error is caught and reported every now and then which makes ReFS look like it works.

(C) It’s difficult for some to verify proper corruption detection through testing so nobody tests.

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev »

For Veeam users it's likely A, because Veeam's backup file format has built-in checksumming that allows us to detect backup file corruption regardless of the underlying file system capabilities. ReFS corruption detection only becomes important on S2D deployments due to self-healing capabilities, but few use S2D for backup repositories today for various reasons.

In my own similar tests with simple volumes back when we were working on ReFS integration, the corruption was detected correctly. The reason I was doing it was to get a screenshot of the corresponding System event log entry for my presentation.

I'm going to split this discussion into a separate thread so as not to derail the original one, which talks about different types of issues.

Please let me know the Microsoft support case ID for this if you end up opening one. I have some contacts who can potentially provide insight, but a support case ID is the first thing they're going to ask for.
tsightler
VP, Product Management
Posts: 6009
Liked: 2842 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS data corruption detection

Post by tsightler »

I've actually seen ReFS detect corruption and report it in the event log. The corruption was caused by an inopportune power loss and was caught during a Veeam health check: ReFS reported errors in the event log and Veeam reported corruption in the backup, as expected. There was a 1:1 relationship between what Veeam reported and the corrupt data. In my case it was a simple volume so no repair was possible, so I forced active full backups for the impacted backups.

Re: REFS data corruption detection

Post by mjohnsonn »

That's good to hear tsightler. Sounds like you have encountered Item B in my previous post. Have you actually performed tests where you have corrupted the data on a mirrored drive and then seen how ReFS/Storage Spaces handles it? I'm performing tests with Windows 10 Pro for Workstations (my IT guy will bite off his cigar if I mess with our servers any more). I can have dozens of corrupt files and only see one being detected as such in the event log. Nothing gets repaired. Integrity is enabled for all files.

I know that ReFS should be detecting the corruption I generated because if I turn on Enforce, checksums are reported to the display during a copy operation. Still nothing in the event log though and files don't get repaired from the mirror.

Nobody but me has actually run corruption tests? We are all just supposed to assume this stuff works?

Re: REFS data corruption detection

Post by Gostev »

I think you missed my earlier reply. Again, ReFS integration for repairing data corruptions (inline during reads, and with the periodic patrol reads) is a Storage Spaces Direct feature. But Windows 10 includes classic (legacy) Storage Spaces only. You should be testing with Windows Server 2016, since Storage Spaces Direct is only available there. Thanks!

Re: REFS data corruption detection

Post by tsightler »

Actually, I have done some testing with this back when Windows 2016 was released and testing always showed proper detection and correction when the setup allowed. I'll admit my testing was not extensive and it was with Windows 2016 vs Windows 10, so Anton's comment could apply.

Re: REFS data corruption detection

Post by mjohnsonn »

I was being dense! Thanks Gostev.

Storage Spaces: repair
ReFS: detection

We paid extra for the "for Workstations" SKU of 10 Pro which advertises ReFS and which should at least detect corruption and make log entries. On our workstation desktops we could do without the repair features that require S2D and simply use a manual backup if we encounter bad checksums with integrity streams enabled and enforced. Problem is if we run a background process with enable on and enforce off, my tests show we're unlikely to get any indication in the log.

IMO it's unacceptable that most errors during testing went unlogged given that all the installation procedures for classic Storage Spaces and ReFS went smoothly on the "for Workstations" version. Again, not even expecting repair, but detection is required.

tsightler, how did you create corrupt files not seen as legitimate to the OS during tests? Seems odd that Server would detect better than 10 Pro for Workstations. I get it that the repair isn't there on 10.

Thanks folks
DonZoomik
Service Provider
Posts: 368
Liked: 120 times
Joined: Nov 25, 2016 1:56 pm
Full Name: Mihkel Soomere
Contact:

Re: REFS data corruption detection

Post by DonZoomik »

Gostev wrote:I think you missed my earlier reply. Again, ReFS integration for repairing data corruptions (inline during reads, and with the periodic patrol reads) is a Storage Spaces Direct feature. But Windows 10 includes classic (legacy) Storage Spaces only. You should be testing with Windows Server 2016, since Storage Spaces Direct is only available there. Thanks!
This used to be a feature already back with Windows 2012 (Storage Spaces classic/legacy + ReFS v1), so unless Storage Spaces (non-Direct) has been neutered in newer releases, it should still work.

Re: REFS data corruption detection

Post by mjohnsonn »

so unless Storage Spaces (non-Direct) has been neutered in newer releases, it should still work.
According to Microsoft’s description of Windows 10 Pro for Workstations, that SKU is supposed to include Storage Spaces/ReFS that will repair corrupted data across drives. Here is their exact wording on the page where they are selling the Workstations SKU:

“Microsoft's Resilient File System (ReFS) combined with Storage Spaces provides highly resilient storage for large volumes of data that can be automatically backed up to multiple mirrored drives. ReFS detects if data becomes corrupt on any one of them, and then repairs it across all drives, which helps ensure you're working with clean data.”

https://www.microsoft.com/en-us/p/windo ... 7GMGF0DW9S

My tests show it doesn’t work. The data integrity scan also fails.

Re: REFS data corruption detection

Post by Gostev »

I recommend you open a support case with Microsoft then, and let them review your deployment. Thanks!

Re: REFS data corruption detection

Post by tsightler »

mjohnsonn wrote:tsightler, how did you create corrupt files not seen as legitimate to the OS during tests? Seems odd that Server would detect better than 10 Pro for Workstations. I get it that the repair isn't there on 10.
Pretty basic, used a disk editor to change bytes on disk. Basically, unmount disk, change bytes on disk within files, mount disk (unmount is important in case the files are cached in memory). I don't remember what specific disk editor I used, just some freeware one I ran across that had ReFS support. I'm assuming you had to do something similar as writing corruption to the file itself wouldn't really do anything, you have to corrupt the disk data without the filesystem being aware.
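The difference between editing a file and editing the disk underneath it can be shown with a small Python sketch. This is an illustrative model only: `disk`, `checksums`, `fs_write`, and `fs_verify` are hypothetical names, and SHA-256 stands in for whatever checksum the filesystem really uses. A write that goes through the filesystem updates the checksum along with the data, so it looks legitimate; only a change made to the raw blocks leaves a stale checksum behind for the filesystem to notice.

```python
import hashlib

disk = {}       # block number -> bytes, standing in for raw sectors
checksums = {}  # block number -> digest recorded by the "filesystem"


def fs_write(block, data):
    """Write through the filesystem: data and checksum change together."""
    disk[block] = data
    checksums[block] = hashlib.sha256(data).digest()


def fs_verify(block):
    """Return True if the stored data still matches its recorded checksum."""
    return hashlib.sha256(disk[block]).digest() == checksums[block]


# Case 1: "corrupting" through the filesystem is just an ordinary overwrite.
fs_write(0, b"Corrupt me.")
fs_write(0, b"Corrupted.")
via_fs_ok = fs_verify(0)      # still consistent, nothing to detect

# Case 2: changing the raw block underneath leaves a stale checksum behind.
fs_write(1, b"Corrupt me.")
disk[1] = b"Corrupted.."
raw_edit_ok = fs_verify(1)    # mismatch, this is what a test needs to create
```

This would also be consistent with a disk editor producing changes the OS treats as legitimate if the editor writes through the mounted volume rather than to the offline disk.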

Re: REFS data corruption detection

Post by mjohnsonn »

used a disk editor to change bytes on disk.
The first two editors I tried made changes that were seen as legitimate to the OS so I did something pretty brutish. Put a mirrored drive in another machine and ran a destructive write test from Hard Disk Sentinel Pro for about thirty seconds (set to hit randomly all over the disk). I could see a few spots get hit at various places on the displayed map. I then put the drive back into the machine with the storage space where another disk had an uncorrupted mirror for testing.

I'll quiet down here and leave you IT experts to it. I'm just a physicist with a ton of data and I'll of course move it over to servers. I still don't think enough people bother to test and with just one error it looks like it works. I certainly hope this is limited to the Workstations version.

Re: REFS data corruption detection

Post by mjohnsonn »

with just one error it looks like it works
That should have read "with one error reported from time to time, even though there are really dozens, people get the idea it works--well it certainly doesn't."
NTmatter
Influencer
Posts: 21
Liked: 8 times
Joined: Mar 14, 2014 11:16 am
Full Name: Thomas Johnson
Contact:

Re: REFS data corruption detection

Post by NTmatter » 1 person likes this post

mjohnsonn wrote:Who here has actually tested the ability of ReFS to detect corruption and repair it?
I've done some basic testing of Windows 10 build 1709 and Server 2016. The Server editions will repair data during periodic scrubs (scheduled task) or when specifically directed to (PowerShell Repair-FileIntegrity). The Windows 10 versions (Pro, Workstation, Enterprise) don't repair anything at all, and will only fail once your data is no longer recoverable. I'm not certain if there are any logs prior to the failure, however.

Testing Methodology - Could be improved by using a Linux VM to modify data, allowing for snapshots and rollbacks.
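As a rough picture of what a scrub with repair is expected to do on a two-way mirror, here is a Python sketch. It is a simplified model under stated assumptions, not the actual scrubber: the `scrub` function, the in-memory lists standing in for the two mirror copies, and CRC32 as the checksum are all made up for illustration. Each block is checked against its recorded checksum, a bad copy is overwritten from the good one, and a block is reported as unrecoverable only when both copies fail.

```python
import zlib


def scrub(mirror_a, mirror_b, checksums):
    """Patrol every block; repair a bad copy from the good one when possible.

    Returns a list of (block_index, outcome) events, where outcome is
    "repaired" or "unrecoverable".
    """
    events = []
    for i, expected in enumerate(checksums):
        a_ok = zlib.crc32(mirror_a[i]) == expected
        b_ok = zlib.crc32(mirror_b[i]) == expected
        if a_ok and b_ok:
            continue
        if a_ok:
            mirror_b[i] = mirror_a[i]
            events.append((i, "repaired"))
        elif b_ok:
            mirror_a[i] = mirror_b[i]
            events.append((i, "repaired"))
        else:
            events.append((i, "unrecoverable"))
    return events


blocks = [b"block-%d" % n for n in range(4)]
checksums = [zlib.crc32(b) for b in blocks]
mirror_a = list(blocks)
mirror_b = list(blocks)

mirror_a[1] = b"Block-1"     # one copy damaged: repairable from mirror_b
mirror_a[3] = b"Block-3"     # both copies damaged: nothing good to copy from
mirror_b[3] = b"blocK-3"

events = scrub(mirror_a, mirror_b, checksums)
```

In this model every damaged block produces an event, so a test that damages dozens of blocks would expect dozens of events. That expectation is the yardstick being applied to the event log in this thread.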
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS data corruption detection

Post by ejenner » 1 person likes this post

This is interesting stuff.

It's fun when non-computer guys start asking questions about the way the technology works as we look at problems from different perspectives.

I think I can see a problem with your testing method which might explain why you're not seeing the results you think you want to see.

Because you're randomly corrupting data and not sure of what you're corrupting it's not possible to verify whether or not the data is in fact corrupt. You might be corrupting empty space on the disk. You might be corrupting unimportant areas of a file which aren't critical to its usability.

Apologies if I'm not understanding your process fully, but I think you have to know 'WHAT' you're corrupting rather than just randomly corrupting anything.

So the process should be to find some content to test with: images, videos, text files... whatever it is. Corrupt it. Try to use the file by opening it and receiving an error or seeing degradation in the file so you KNOW it is corrupt. Then check to see if ReFS can detect and repair it.

Just my immediate impression of what's going wrong here. I might be wrong about what I think looks wrong. :wink:
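One way to know exactly which files were hit, independent of anything the filesystem reports, is to record a hash of every test file while the data is known good and compare again after the corruption step. The Python sketch below illustrates the idea on a throwaway directory; the helper names and the `run1.dat` / `run2.dat` files are hypothetical. On a real mirrored volume a read may be served from the healthy copy, or may fail outright with enforcement on, so this is only a starting point.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB pieces so multi-GB files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for piece in iter(lambda: handle.read(1 << 20), b""):
            digest.update(piece)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's relative path to its SHA-256, taken while data is known good."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def changed_files(root: Path, manifest: dict) -> list:
    """Return relative paths whose current hash differs from the manifest."""
    return [name for name, digest in manifest.items()
            if sha256_of(root / name) != digest]


# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run1.dat").write_bytes(b"A" * 1000)
    (root / "run2.dat").write_bytes(b"B" * 1000)
    manifest = build_manifest(root)

    # Simulate damage to one file, then re-check.
    (root / "run2.dat").write_bytes(b"B" * 999 + b"X")
    damaged = changed_files(root, manifest)
```

The resulting list can then be compared file by file against whatever shows up in the event log.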

Re: REFS data corruption detection

Post by mjohnsonn »

You might be corrupting empty space on the disk. You might be corrupting unimportant areas of a file which aren't critical to its usability.
Thanks for the suggestion, but I am definitely creating corruption right where the files are.

1. If I set integrity to "Enforced" a mass copy operation stops at each of several dozen files and ReFS reports a "checksum error" to the display each time. This shows exactly which files were corrupted.

2. I test with a set of mirrored drives and everything is okay. Drives are about 80% full of multi-GB files. I remove one of the drives and randomly corrupt a whole bunch of spots using another machine. The utility map shows the drive peppered with destructive writes all over the place. No way I missed every file--just no way. This is proven by Item 1 above when the drive is put back in the original storage space.

Again, all on the Workstations SKU.

Re: REFS data corruption detection

Post by mjohnsonn »

I'm typing kind of quickly. I hope I'm not leaving something out. My testing was more rigorously done than this but basically:

Set up two-way mirrored storage space.
Load with files.
All files integrity enabled and NOT enforced.
Remove one drive from mirror and corrupt many of its files using different machine.
Put drive back in storage space.
Now have mirrored storage space with one good drive and one with corrupted files.
Copy all the files from the storage space over to another known good volume.
Examine event log and see maybe one error and no repairs. Perhaps only had one error? No way (see previous post) because...

Set integrity enforce to $True for all the files in the storage space.
Again copy all the files from the storage space over to another known good volume. Copy stops dozens of times showing checksum errors for files. Hit skip at each stop. Examine log. Maybe one error logged. Nothing repaired.

I welcome an examination of my testing methodology, but nothing I did should lead to almost nothing being in the event log yet a whole bunch of checksum errors on my screen.

The scrubber is a failure as well.
kjo@deif.com
Influencer
Posts: 13
Liked: 1 time
Joined: Feb 21, 2019 4:00 pm
Full Name: Kim Johansen
Contact:

Re: REFS data corruption detection

Post by kjo@deif.com » 1 person likes this post

Sorry for replying to an old post, however I could not find any conclusive evidence online, everyone just says it "should" work. So I decided to make my own test.

Here is my PowerShell code:

# Create 2 VHDs and mount them
1..2 |%{ New-VHD -Path C:\$_.vhdx -SizeBytes 1TB } | Mount-VHD

# Create a storage pool with them
New-StoragePool -FriendlyName Test -PhysicalDisks (Get-PhysicalDisk |? Size -eq 1TB) -StorageSubsystemFriendlyName "Windows Storage*"

# Create ReFS volume
New-Volume -FriendlyName Test -DriveLetter T -FileSystem ReFS -StoragePoolFriendlyName Test -ResiliencySettingName Mirror -UseMaximumSize

# Enable file integrity
Set-FileIntegrity T: -Enable $true

# Create some data
$data = [system.Text.Encoding]::Default.GetBytes("Corrupt me.")
[io.file]::WriteAllBytes("T:\test.txt", $data)

# Dismount VHDs
1..2 |%{ Dismount-VHD C:\$_.vhdx }

# Find the data in VHD 1 (scan the raw VHDX file in 4 MB chunks)
$chunk = New-Object byte[] 4MB
$found = $false
$stream = [System.IO.File]::OpenRead("C:\1.vhdx")
$stream.Position = 507510784 # The data was here for me. This was to speed things up, you might have to remove this line.
while ($n = $stream.Read($chunk, 0, $chunk.Length)) {
    $chunkStart = $stream.Position - $n   # use the bytes actually read, not the buffer size
    Write-Host Position: $chunkStart
    # Note: a match that straddles two chunks is not found by this simple scan
    for ($i = 0; $i -le $n - $data.Length; $i++) {
        for ($j = 0; $j -lt $data.Length -and $chunk[$i+$j] -eq $data[$j]; $j++) {}
        if ($j -eq $data.Length) {
            $position = $chunkStart
            $offset = $i
            $length = $n
            $found = $true
            break
        }
    }
    if ($found) { break }
}
$stream.Close()
if (-not $found) { throw "Test data not found in C:\1.vhdx" }

# Corrupt the data (overwrite the start of the test string inside the chunk)
$data = [system.Text.Encoding]::Default.GetBytes("Corrupted.")
for ($i = 0; $i -lt $data.Length; $i++) {
    $chunk[$offset + $i] = $data[$i]
}

# Write the corrupt data to the VHD (only as many bytes as were read)
$stream = [System.IO.File]::OpenWrite("C:\1.vhdx")
$stream.Position = $position
$stream.Write($chunk, 0, $length)
$stream.Close()

# Mount VHDs
1..2 |%{ Mount-VHD C:\$_.vhdx }

# Run manual scrub/repair
Repair-FileIntegrity "T:\test.txt"

# Verify that it has been fixed
1..2 |%{ Dismount-VHD C:\$_.vhdx }
$data = [system.Text.Encoding]::Default.GetBytes("Corrupt me.")
$chunk = New-Object byte[] 4MB
$found = $false
$stream = [System.IO.File]::OpenRead("C:\1.vhdx")
$stream.Position = 507510784 # The data was here for me. This was to speed things up, you might have to remove this line.
while ($n = $stream.Read($chunk, 0, $chunk.Length)) {
    Write-Host Position: ($stream.Position - $chunk.Length)
    for ($i = 0; $i -lt $chunk.Length; $i++) {
        for ($j = 0; $j -lt $data.Length -and $chunk[$i+$j] -eq $data[$j]; $j++) {}
        if ($j -eq $data.Length) {
            Write-Host The corruption was fixed!
            $found = $true
            break
        }
    }
    if ($found) { break }
}
$stream.Close()

# Clean up (the VHDs were already dismounted before the verification step)
1..2 |%{ del C:\$_.vhdx }

I performed my test on Windows Server 2019 Standard. The steps:
- Created 2 VHDs
- Created a storage pool
- Formatted a mirror with ReFS
- Created a test file
- Corrupted 1 vhd
- I could open the file without problem
- Verified that the corrupted bits had been fixed

So ReFS file integrity works as expected.

I also tried corrupting both VHDs, that resulted in not being able to open the file.

If you want to test this yourself, I recommend copying the code into PowerShell ISE and running the steps one by one so you understand what happens.
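One detail worth noting for anyone adapting the script above: a chunk-by-chunk scan can miss the test string if it happens to straddle two chunks. A common way around that is to carry the last few bytes of each chunk into the next search window. The sketch below shows the idea in Python rather than PowerShell; `find_pattern` is a hypothetical helper, and the demo uses a tiny in-memory image with 8-byte chunks so the boundary case is easy to see.

```python
import io


def find_pattern(stream, pattern: bytes, chunk_size: int = 4 * 1024 * 1024) -> int:
    """Return the absolute offset of the first occurrence of pattern, or -1.

    The last len(pattern) - 1 bytes of each window are carried into the next
    one so that a match straddling a chunk boundary is still found.
    """
    overlap = len(pattern) - 1
    tail = b""
    base = stream.tell()          # absolute offset of the start of the window
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        hit = window.find(pattern)
        if hit != -1:
            return base + hit
        tail = window[-overlap:] if overlap else b""
        base += len(window) - len(tail)


# Demo: the pattern starts at offset 5 and crosses several 8-byte chunk boundaries.
image = io.BytesIO(b"\x00" * 5 + b"Corrupt me." + b"\x00" * 20)
offset = find_pattern(image, b"Corrupt me.", chunk_size=8)
missing = find_pattern(io.BytesIO(b"abc"), b"zz")
```

With the 4 MB chunks used in the script the chance of hitting a boundary is small, so this only matters if a search unexpectedly finds nothing.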
aich365
Service Provider
Posts: 296
Liked: 23 times
Joined: Aug 10, 2016 11:10 am
Full Name: Clive Harris
Contact:

Re: REFS data corruption detection

Post by aich365 » 1 person likes this post

Has anyone used the W2019 inbuilt utility "refsutil.exe"?

We had a couple of volumes go offline and show as RAW. We could still see the data with "refsutil salvage" and managed to copy this to another volume.
We were unable to recover the volume though and had to reformat and copy the data back.

Has anyone used refsutil.exe to fix ReFS volumes?

The error we were seeing in Event Viewer was:
Volume I: is formatted as ReFS but ReFS is unable to mount it; ReFS encountered status The volume repair was not successful.

thanks

Re: REFS data corruption detection

Post by Gostev »

It's great to hear you were able to successfully salvage the data with "refsutil salvage". I've been wondering how useful it really is since Microsoft released this tool. Based on your experience, it sounds like it just works!

Re: REFS data corruption detection

Post by aich365 » 1 person likes this post

Hi Gostev - it worked on one volume but for the other one we have had to use Reclaime.

refsutil salvage -D I: C:\salvage -x -v => this indicated that the volume was OK just not accessible

A full scan produced a file list = foundfiles.667112FD.txt
refsutil salvage -FS I: C:\salvage\ -v -x

This command recovered the files
refsutil salvage -SL <source volume> <working directory> <target directory> <options>
e.g.
refsutil salvage -SL I: C:\salvage\ F:\Backups\agenda C:\salvage\foundfiles.667112FD.txt -v -x

Reclaime is a lot simpler to use and has a GUI.

Re: REFS data corruption detection

Post by aich365 » 2 people like this post

Oh and it took nearly 24 hours to recover 3TB

Re: REFS data corruption detection

Post by aich365 »

We have had to use Reclaime and this has recovered data that Refsutil.exe could not.
This data we saved on a New Volume. It ran most of the weekend to recover 33TB.

One thing we have observed with both recoveries is a lot of tmp files, and the space is reported differently. e.g. a 33TB backup is showing as 150TB in Reclaime.

We have recovered just the files listed in the tenant's VBR Console under Backups - Cloud.
When we mapped the New Volume to the tenant we had to edit the file path and also use one of the tmp vbm files as the runtime vbm to get Veeam to recognise the chain.

So far it has been 8 days and we are just about there.

We are now seriously considering replacing Windows ReFS Volumes with Linux XFS which on V11 has immutability

Phew!
Borgquite
Novice
Posts: 4
Liked: never
Joined: Jan 25, 2022 1:16 pm
Full Name: Chris Hill
Location: Ashford, Kent, UK
Contact:

Re: REFS data corruption detection

Post by Borgquite »

kjo@deif.com wrote: Oct 16, 2019 12:49 pm Sorry for replying to an old post, however I could not find any conclusive evidence online, everyone just says it "should" work. So I decided to make my own test.
Thank you for this code, it was really helpful! I took it and modified it to test some more scenarios, based on some extensive testing previously performed on Reddit:

You can download my updated script here (have given you credit!)
https://github.com/Borgquite/Test-ReFSD ... uption.ps1

The results from my testing are below. While ReFS repair works some of the time, on the latest builds of Windows 10 21H2 error reporting and self-repair still have serious bugs. Feel free to use this script for reporting the issues as part of a Microsoft Professional support case:

Testing ReFS data integrity streams / corrupt data functionality automatically using PowerShell

Re: REFS data corruption detection

Post by Borgquite »

To follow up on my previous post, if you're concerned about this, I have now posted the issue to the Windows Feedback hub to get Microsoft's attention - please upvote it here: https://aka.ms/AAice7g
mkaec
Veteran
Posts: 462
Liked: 133 times
Joined: Jul 16, 2015 1:31 pm
Full Name: Marc K
Contact:

Re: REFS data corruption detection

Post by mkaec »

I followed the link, signed in, then got an error that "Your Account Doesn't Have Access to This Feedback". I'm not sure why some feedback would be hidden from some accounts.

Re: REFS data corruption detection

Post by Borgquite »

Hmm... looks like this could be due to whether (or not) your account is an 'Insider' account
https://answers.microsoft.com/en-us/win ... 3278f08501
I don't remember joining in the past, but I may have done. Try this?
https://insider.windows.com/en-us/about ... er-program

Re: REFS data corruption detection

Post by Borgquite »

mkaec wrote: Oct 10, 2022 6:18 pm I followed the link, signed in, then got an error that "Your Account Doesn't Have Access to This Feedback". I'm not sure why some feedback would be hidden from some accounts.
Just encountered this with my own feedback(!) today. As well as checking that you are a Windows Insider (see above) you can try opening the 'Feedback Hub' in Windows and making sure you can access other feedback before clicking the link.