mjohnsonn
Novice
Posts: 9
Liked: never
Joined: Sep 01, 2018 6:56 am
Contact:

REFS data corruption detection

Post by mjohnsonn » Sep 01, 2018 7:08 am

Who here has actually tested the ability of ReFS to detect corruption and repair it? My small team has a large amount of scientific data, and we can't have bit rot. My tests show ReFS failing miserably at both tasks: we deliberately create corruption on Windows 10 Pro for Workstations, and almost nothing shows up in the event log and nothing gets repaired.

Gostev
SVP, Product Management
Posts: 25147
Liked: 3700 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 01, 2018 11:23 pm

ReFS by itself cannot repair corruption; repair is a feature of Storage Spaces Direct. However, ReFS will detect corruption in files that have data integrity streams enabled, either when the corrupted block is read or when the ReFS data scrubber encounters the corrupted block during its patrol reads.

Re: REFS issues (server lockups, high CPU, high RAM)

Post by mjohnsonn » Sep 02, 2018 12:07 pm

While this is what ReFS is supposed to do, my tests show it fails miserably. Again, my question is: has anybody else here actually generated some corruption and seen ReFS do what it is supposed to do?

Here is an overview of some tests done on W10 Pro for Workstations:
https://social.technet.microsoft.com/Fo ... erverfiles

ReFS corruption detection does not work properly. This problem gets no attention because:

(A) Nobody is looking for it--everyone assumes the product would have to be performing this most basic function.

(B) An error is caught and reported every now and then which makes ReFS look like it works.

(C) It’s difficult for some to verify proper corruption detection through testing so nobody tests.

Re: REFS issues (server lockups, high CPU, high RAM)

Post by Gostev » Sep 02, 2018 2:06 pm

For Veeam users it's likely A, because Veeam's backup file format has built-in checksumming that allows us to detect backup file corruption regardless of the underlying file system capabilities. ReFS corruption detection only becomes important on S2D deployments because of their self-healing capabilities, but few use S2D for backup repositories today for various reasons.

In my own similar tests with simple volumes, back when we were working on ReFS integration, the corruption was detected correctly. I was doing this to get a screenshot of the corresponding System event log entry for my presentation.

I'm going to split this discussion into a separate thread so as not to derail the original one, which talks about different types of issues.

Please let me know the Microsoft support case ID for this if you end up opening one. I have some contacts who can potentially provide insight, but a support case ID is the first thing they're going to ask for.

tsightler
VP, Product Management
Posts: 5476
Liked: 2294 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: REFS data corruption detection

Post by tsightler » Sep 03, 2018 1:38 am

I've actually seen ReFS detect corruption and report it in the event log. The corruption was caused by an inopportune power loss and was caught during a Veeam health check: ReFS reported errors in the event log and Veeam reported corruption in the backup, as expected. There was a 1:1 relationship between what Veeam reported and the corrupt data. In my case it was a simple volume, so no repair was possible, and I forced active full backups for the impacted backups.

Re: REFS data corruption detection

Post by mjohnsonn » Sep 03, 2018 1:04 pm

That's good to hear, tsightler. Sounds like you have encountered Item B in my previous post. Have you actually performed tests where you corrupted the data on a mirrored drive and then watched how ReFS/Storage Spaces handles it? I'm performing tests with Windows 10 Pro for Workstations (my IT guy will bite off his cigar if I mess with our servers any more). I can have dozens of corrupt files and see only one of them reported as such in the event log. Nothing gets repaired. Integrity is enabled for all files.

I know that ReFS should be detecting the corruption I generated because if I turn on Enforce, checksum errors are reported on screen during a copy operation. Still, nothing shows up in the event log and the files don't get repaired from the mirror.

Nobody but me has actually run corruption tests? We are all just supposed to assume this stuff works?

Re: REFS data corruption detection

Post by Gostev » Sep 03, 2018 2:36 pm

I think you missed my earlier reply. Again, ReFS integration for repairing data corruptions (inline during reads, and with the periodic patrol reads) is a Storage Spaces Direct feature. But Windows 10 includes classic (legacy) Storage Spaces only. You should be testing with Windows Server 2016, since Storage Spaces Direct is only available there. Thanks!

Re: REFS data corruption detection

Post by tsightler » Sep 03, 2018 5:47 pm

Actually, I did some testing with this back when Windows Server 2016 was released, and it always showed proper detection and correction when the setup allowed. I'll admit my testing was not extensive, and it was with Windows Server 2016 rather than Windows 10, so Anton's comment could apply.

Re: REFS data corruption detection

Post by mjohnsonn » Sep 03, 2018 9:30 pm

I was being dense! Thanks Gostev.

Storage Spaces: repair
ReFS: detection

We paid extra for the "for Workstations" SKU of 10 Pro, which advertises ReFS and which should at least detect corruption and make log entries. On our workstation desktops we could do without the repair features that require S2D and simply use a manual backup if we encounter bad checksums with integrity streams enabled and enforced. The problem is that if we run a background process with integrity enabled but not enforced, my tests show we're unlikely to get any indication in the log.

IMO it's unacceptable that most errors during testing went unlogged given that all the installation procedures for classic Storage Spaces and ReFS went smoothly on the "for Workstations" version. Again, not even expecting repair, but detection is required.

tsightler, how did you create corrupt files not seen as legitimate to the OS during tests? Seems odd that Server would detect better than 10 Pro for Workstations. I get it that the repair isn't there on 10.

Thanks folks

DonZoomik
Expert
Posts: 130
Liked: 32 times
Joined: Nov 25, 2016 1:56 pm
Contact:

Re: REFS data corruption detection

Post by DonZoomik » Sep 04, 2018 12:19 pm

Gostev wrote: I think you missed my earlier reply. Again, ReFS integration for repairing data corruptions (inline during reads, and with the periodic patrol reads) is a Storage Spaces Direct feature. But Windows 10 includes classic (legacy) Storage Spaces only. You should be testing with Windows Server 2016, since Storage Spaces Direct is only available there. Thanks!
This used to be a feature already back with Windows 2012 (Storage Spaces classic/legacy + ReFS v1), so unless Storage Spaces (non-Direct) has been neutered in newer releases, it should still work.

Re: REFS data corruption detection

Post by mjohnsonn » Sep 04, 2018 4:43 pm

DonZoomik wrote: so unless Storage Spaces (non-Direct) has been neutered in newer releases, it should still work.
According to Microsoft’s description of Windows 10 Pro for Workstations, that SKU is supposed to include Storage Spaces/ReFS that will repair corrupted data across drives. Here is their exact wording on the page where they are selling the Workstations SKU:

“Microsoft's Resilient File System (ReFS) combined with Storage Spaces provides highly resilient storage for large volumes of data that can be automatically backed up to multiple mirrored drives. ReFS detects if data becomes corrupt on any one of them, and then repairs it across all drives, which helps ensure you're working with clean data.”

https://www.microsoft.com/en-us/p/windo ... 7GMGF0DW9S

My tests show it doesn’t work. The data integrity scan also fails.

Re: REFS data corruption detection

Post by Gostev » Sep 04, 2018 10:51 pm

I recommend you open a support case with Microsoft then, and let them review your deployment. Thanks!

Re: REFS data corruption detection

Post by tsightler » Sep 05, 2018 2:00 am

mjohnsonn wrote: tsightler, how did you create corrupt files not seen as legitimate to the OS during tests? Seems odd that Server would detect better than 10 Pro for Workstations. I get it that the repair isn't there on 10.
Pretty basic: I used a disk editor to change bytes on disk. Basically, unmount the disk, change bytes on disk within files, then mount the disk again (the unmount is important in case the files are cached in memory). I don't remember which specific disk editor I used, just some freeware one I ran across that had ReFS support. I'm assuming you had to do something similar, as writing corruption to the file itself wouldn't really do anything; you have to corrupt the on-disk data without the file system being aware.

Re: REFS data corruption detection

Post by mjohnsonn » Sep 05, 2018 5:26 am

tsightler wrote: used a disk editor to change bytes on disk.
The first two editors I tried made changes that the OS saw as legitimate, so I did something pretty brutish: I put a mirrored drive in another machine and ran a destructive write test from Hard Disk Sentinel Pro for about thirty seconds (set to hit randomly all over the disk). I could see a few spots get hit at various places on the displayed map. I then put the drive back into the machine with the storage space, where another disk held an uncorrupted mirror for testing.

I'll quiet down here and leave you IT experts to it. I'm just a physicist with a ton of data and I'll of course move it over to servers. I still don't think enough people bother to test and with just one error it looks like it works. I certainly hope this is limited to the Workstations version.

Re: REFS data corruption detection

Post by mjohnsonn » Sep 07, 2018 3:30 am

mjohnsonn wrote: with just one error it looks like it works
That should have read "with one error reported from time to time, even though there are really dozens, people get the idea it works--well it certainly doesn't."

NTmatter
Influencer
Posts: 15
Liked: 7 times
Joined: Mar 14, 2014 11:16 am
Full Name: Thomas Johnson
Contact:

Re: REFS data corruption detection

Post by NTmatter » Sep 10, 2018 8:50 am

mjohnsonn wrote: Who here has actually tested the ability of ReFS to detect corruption and repair it?
I've done some basic testing of Windows 10 build 1709 and Server 2016. The Server editions will repair data during periodic scrubs (scheduled task) or when specifically directed to (PowerShell Repair-FileIntegrity). The Windows 10 editions (Pro, Pro for Workstations, Enterprise) don't repair anything at all, and will only fail once your data is no longer recoverable. I'm not certain whether anything is logged prior to the failure, however.

Testing Methodology - Could be improved by using a Linux VM to modify data, allowing for snapshots and rollbacks.

ejenner
Expert
Posts: 437
Liked: 69 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London
Contact:

Re: REFS data corruption detection

Post by ejenner » Sep 10, 2018 10:22 am

This is interesting stuff.

It's fun when non-computer guys start asking questions about the way the technology works as we look at problems from different perspectives.

I think I can see a problem with your testing method which might explain why you're not seeing the results you think you want to see.

Because you're randomly corrupting data and aren't sure what you're corrupting, it's not possible to verify whether the data is in fact corrupt. You might be corrupting empty space on the disk. You might be corrupting unimportant areas of a file which aren't critical to its usability.

Apologies if I'm not understanding your process fully, but I think you have to know 'WHAT' you're corrupting rather than just randomly corrupting anything.

So the process should be: find some content to test with (images, videos, text files, whatever it is). Corrupt it. Try to use the file by opening it, and confirm you get an error or see degradation in the file so you KNOW it is corrupt. Then check to see if ReFS can detect and repair it.

Just my immediate impression of what's going wrong here. I might be wrong about what I think looks wrong. :wink:
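
A minimal sketch of that "know WHAT you corrupted" check, assuming you can hash the files through the file system before and after the corruption step. `New-HashBaseline` and `Get-ChangedFile` are hypothetical helper names, not part of any Windows module; only `Get-ChildItem` and `Get-FileHash` are built in.

```powershell
# Hypothetical helper: record a SHA-256 hash for every file under $Path.
function New-HashBaseline {
    param([Parameter(Mandatory)][string]$Path)
    $baseline = @{}
    Get-ChildItem -LiteralPath $Path -File -Recurse | ForEach-Object {
        $baseline[$_.FullName] = (Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256).Hash
    }
    return $baseline
}

# Hypothetical helper: list the files whose current hash differs from the baseline.
function Get-ChangedFile {
    param(
        [Parameter(Mandatory)][hashtable]$Baseline,
        [Parameter(Mandatory)][string]$Path
    )
    $current = New-HashBaseline -Path $Path
    foreach ($file in ($Baseline.Keys | Sort-Object)) {
        # A file that is now missing or unreadable also counts as changed.
        if ($current[$file] -ne $Baseline[$file]) { $file }
    }
}
```

Take the baseline before corrupting anything, then run `Get-ChangedFile` afterwards to get the list of files you expect ReFS to flag. On a mirror, a read may be served from the intact copy, so a clean hash does not prove that both copies are intact; it only shows what the file system currently returns.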

Re: REFS data corruption detection

Post by mjohnsonn » Sep 17, 2018 1:15 am

ejenner wrote: You might be corrupting empty space on the disk. You might be corrupting unimportant areas of a file which aren't critical to its usability.
Thanks for the suggestion, but I am definitely creating corruption right where the files are.

1. If I set integrity to "Enforced" a mass copy operation stops at each of several dozen files and ReFS reports a "checksum error" to the display each time. This shows exactly which files were corrupted.

2. I test with a set of mirrored drives and everything is okay. Drives are about 80% full of multi-GB files. I remove one of the drives and randomly corrupt a whole bunch of spots using another machine. The utility map shows the drive peppered with destructive writes all over the place. No way I missed every file--just no way. This is proven by Item 1 above when the drive is put back in the original storage space.

Again, all on the Workstations SKU.

Re: REFS data corruption detection

Post by mjohnsonn » Sep 17, 2018 7:36 am

I'm typing kind of quickly. I hope I'm not leaving something out. My testing was more rigorously done than this but basically:

Set up two-way mirrored storage space.
Load with files.
All files integrity enabled and NOT enforced.
Remove one drive from mirror and corrupt many of its files using a different machine.
Put drive back in storage space.
Now have mirrored storage space with one good drive and one with corrupted files.
Copy all the files from the storage space over to another known good volume.
Examine event log and see maybe one error and no repairs. Perhaps only had one error? No way (see previous post) because...

Set integrity enforce to $True for all the files in the storage space.
Again copy all the files from the storage space over to another known good volume. Copy stops dozens of times showing checksum errors for files. Hit skip at each stop. Examine log. Maybe one error logged. Nothing repaired.

I welcome an examination of my testing methodology, but nothing I did should lead to almost nothing being in the event log yet a whole bunch of checksum errors on my screen.

The scrubber is a failure as well.
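
To put a number on the gap between on-screen checksum errors and logged events, the System log can be filtered for ReFS entries written during the copy. This is a rough sketch, not a definitive method: it assumes the relevant events carry a provider name containing "ReFS" (check Event Viewer for the exact source on your build), and `Select-RefsEvent` and `Get-RecentRefsEvent` are hypothetical helper names. `Get-WinEvent` itself is a standard Windows cmdlet.

```powershell
# Hypothetical helper: keep only events whose provider name mentions ReFS.
# The '*ReFS*' match is an assumption; confirm the exact source name in Event Viewer.
function Select-RefsEvent {
    param([Parameter(ValueFromPipeline)]$InputObject)
    process {
        if ($InputObject.ProviderName -like '*ReFS*') { $InputObject }
    }
}

# Hypothetical wrapper (Windows only): ReFS-related System log entries since $Since.
function Get-RecentRefsEvent {
    param([Parameter(Mandatory)][datetime]$Since)
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; StartTime = $Since } -ErrorAction SilentlyContinue |
        Select-RefsEvent |
        Select-Object TimeCreated, Id, LevelDisplayName, Message
}
```

Record `$start = Get-Date` before the copy, then compare `@(Get-RecentRefsEvent -Since $start).Count` with the number of files the copy stopped on.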

kjo@deif.com
Lurker
Posts: 1
Liked: 1 time
Joined: Feb 21, 2019 4:00 pm
Full Name: Kim Johansen
Contact:

Re: REFS data corruption detection

Post by kjo@deif.com » Oct 16, 2019 12:49 pm

Sorry for replying to an old post; however, I could not find any conclusive evidence online. Everyone just says it "should" work, so I decided to run my own test.

Here is my PowerShell code:

Code:

# Create 2 VHDs and mount them
1..2 |%{ New-VHD -Path C:\$_.vhdx -SizeBytes 1TB } | Mount-VHD

# Create a storage pool with them
New-StoragePool -FriendlyName Test -PhysicalDisks (Get-PhysicalDisk |? Size -eq 1TB) -StorageSubsystemFriendlyName "Windows Storage*"

# Create ReFS volume
New-Volume -FriendlyName Test -DriveLetter T -FileSystem ReFS -StoragePoolFriendlyName Test -ResiliencySettingName Mirror -UseMaximumSize

# Enable file integrity
Set-FileIntegrity T: -Enable $true

# Create some data
$data = [system.Text.Encoding]::Default.GetBytes("Corrupt me.")
[io.file]::WriteAllBytes("T:\test.txt", $data)

# Dismount VHDs
1..2 |%{ Dismount-VHD C:\$_.vhdx }

# Find the data in VHD 1
$chunk = New-Object byte[] 4MB
$found = $false
$stream = [System.IO.File]::OpenRead("C:\1.vhdx")
$stream.Position = 507510784 # The data was here for me. This was to speed things up, you might have to remove this line.
while ($n = $stream.Read($chunk, 0, $chunk.Length)) {
    # Start of the chunk just read; use $n because the last read can be shorter than the buffer
    $position = $stream.Position - $n
    Write-Host Position: $position
    # Only scan the bytes actually read (a match straddling two chunks is not detected)
    for ($i = 0; $i -le $n - $data.Length; $i++) {
        for ($j = 0; $j -lt $data.Length -and $chunk[$i+$j] -eq $data[$j]; $j++) {}
        if ($j -eq $data.Length) {
            $offset = $i
            $found = $true
            break
        }
    }
    if ($found) { break }
}
$stream.Close()

# Corrupt the data
$data = [system.Text.Encoding]::Default.GetBytes("Corrupted.")

# Write only the replacement bytes to the VHD, at the exact location of the match
if (-not $found) { throw "Test data not found in C:\1.vhdx" }
$stream = [System.IO.File]::OpenWrite("C:\1.vhdx")
$stream.Position = $position + $offset
$stream.Write($data, 0, $data.Length)
$stream.Close()

# Mount VHDs
1..2 |%{ Mount-VHD C:\$_.vhdx }

# Run manual scrub/repair
Repair-FileIntegrity "T:\test.txt"

# Verify that it has been fixed
1..2 |%{ Dismount-VHD C:\$_.vhdx }
$data = [system.Text.Encoding]::Default.GetBytes("Corrupt me.")
$chunk = New-Object byte[] 4MB
$found = $false
$stream = [System.IO.File]::OpenRead("C:\1.vhdx")
$stream.Position = 507510784 # The data was here for me. This was to speed things up, you might have to remove this line.
while ($n = $stream.Read($chunk, 0, $chunk.Length)) {
    # Use $n, not $chunk.Length: the last read can be shorter than the buffer
    Write-Host Position: ($stream.Position - $n)
    # Only scan the bytes actually read
    for ($i = 0; $i -le $n - $data.Length; $i++) {
        for ($j = 0; $j -lt $data.Length -and $chunk[$i+$j] -eq $data[$j]; $j++) {}
        if ($j -eq $data.Length) {
            Write-Host "The corruption was fixed!"
            $found = $true
            break
        }
    }
    if ($found) { break }
}
$stream.Close()
if (-not $found) { Write-Host "Original data not found after the start position; the corruption may not have been repaired." }

# Clean up
1..2 |%{ Dismount-VHD C:\$_.vhdx; del C:\$_.vhdx }

I performed my test on Windows Server 2019 Standard. The steps:
- Created 2 VHDs
- Created a storage pool
- Formatted a mirror with ReFS
- Created a test file
- Corrupted 1 VHD
- I could open the file without problem
- Verified that the corrupted bits had been fixed

So ReFS file integrity works as expected.

I also tried corrupting both VHDs, which resulted in not being able to open the file.

If you want to test this yourself, I recommend copying the code into PowerShell ISE and running the steps one by one so you understand what happens.
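
One caveat with the search loop in the script: it scans one 4 MB chunk at a time, so it would miss test data that happens to straddle two chunks. Below is a sketch of a boundary-safe search, under the assumption that the pattern is short relative to the chunk size. `Find-BytePattern` is a hypothetical helper name, not a built-in cmdlet.

```powershell
# Hypothetical helper: return the absolute offset of the first occurrence of
# $Pattern at or after the stream's current position, or -1 if it is absent.
function Find-BytePattern {
    param(
        [Parameter(Mandatory)][System.IO.Stream]$Stream,
        [Parameter(Mandatory)][byte[]]$Pattern,
        [int]$ChunkSize = 4MB
    )
    $overlap     = $Pattern.Length - 1
    $buffer      = New-Object byte[] ($ChunkSize + $overlap)
    $kept        = 0                  # bytes carried over from the previous read
    $bufferStart = $Stream.Position   # absolute offset of $buffer[0]
    while (($n = $Stream.Read($buffer, $kept, $ChunkSize)) -gt 0) {
        $valid = $kept + $n
        for ($i = 0; $i -le $valid - $Pattern.Length; $i++) {
            $j = 0
            while ($j -lt $Pattern.Length -and $buffer[$i + $j] -eq $Pattern[$j]) { $j++ }
            if ($j -eq $Pattern.Length) { return $bufferStart + $i }
        }
        # Keep the tail so a match that straddles two reads is still found.
        $kept = [Math]::Min($overlap, $valid)
        [Array]::Copy($buffer, $valid - $kept, $buffer, 0, $kept)
        $bufferStart += $valid - $kept
    }
    return -1
}
```

With a stream opened via `[System.IO.File]::OpenRead("C:\1.vhdx")`, calling `Find-BytePattern -Stream $stream -Pattern $data` would give the absolute offset to overwrite, replacing the separate `$position` and `$offset` bookkeeping.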
