REFS issues (server lockups, high CPU, high RAM)

graham8 · Apr 03, 2017 8:17 pm

Gostev wrote: Do you have a support case open with Microsoft? I would like to forward the case ID to the ReFS team, as it looks like your servers might be a good subject for investigation, with the issue consistently reproduced even with the patch installed.

My current problem is that I still don't have a single good example to show to them. But even I am not convinced at this time whether the issue is real, or is some corner case that has to do with special settings, special hardware, a lack of a certain system resource, or something along these lines (for example, the issue Nate has just mentioned). The ratio of customers having great success using ReFS vs. customers having this deadlock issue actually suggests it might be a corner case.

Will do, thanks. I opened a case earlier. I'll PM you the case ID.

My only visibility into this issue is my own experience and this thread. Are there lots of people out there doing synthetic fulls via block clone of 5-10ish+ TB backups with 16-32 GB of RAM who are having no sporadic issues whatsoever? (I ask because throwing extreme amounts of RAM at this use case seems to work around the issue, at least often.) That would be interesting to know.

Oh, and another question to everyone - has anyone else observed that your ReFS volumes seem to have ludicrously high disk fragmentation statistics? I checked on this server of ours that keeps locking the most frequently and it reported 100% fragmentation (which would imply that not one single block of data is contiguous, which sounds statistically near-impossible...). The default defrag scheduled task has been successfully running, so I thought it was fine till I manually checked via the UI.

Also, something I've noticed: I think the deadlocks only occur when the scheduled data scrub tasks under Task Scheduler -> Microsoft -> Windows -> "Data Integrity Scan" are running while a block clone operation is issued. If I don't forcibly prevent the Veeam services from starting at boot while the crash-recovery scrub is running, it IO-locks 100% of the time. I'll work with MS to confirm, but it would be helpful if other people could check the History tab of those two tasks to see whether your own deadlocks fall in the period between Task Started (Event ID 100) and Task Completed (Event ID 102).
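For anyone who wants to do that comparison on more than a handful of dates, here is a minimal Python sketch of the check. It assumes you have already copied the task history out by hand as (timestamp, event ID) pairs; the timestamps below are made up for illustration, and the function names are hypothetical, not part of any Windows or Veeam tooling.

```python
from datetime import datetime

def parse(ts):
    # Timestamp format is an assumption; adjust to however you copied the data.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

def scan_windows(events):
    """Pair each Task Started (100) with the next Task Completed (102)."""
    windows = []
    start = None
    for ts, event_id in sorted(events):
        if event_id == 100:
            start = ts
        elif event_id == 102 and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        # The scan started but never logged a completion (e.g. the box locked up).
        windows.append((start, None))
    return windows

def lockups_during_scan(windows, lockups):
    """Return the lockup times that fall inside any scan window."""
    hits = []
    for lockup in lockups:
        for start, end in windows:
            if start <= lockup and (end is None or lockup <= end):
                hits.append(lockup)
                break
    return hits

# Hypothetical sample data, not taken from any real server.
events = [
    (parse("2017-04-01 23:00"), 100),
    (parse("2017-04-02 05:30"), 102),
    (parse("2017-04-03 02:00"), 100),
]
lockups = [
    parse("2017-04-02 01:15"),
    parse("2017-04-02 12:00"),
    parse("2017-04-03 03:10"),
]

windows = scan_windows(events)
hits = lockups_during_scan(windows, lockups)
print(f"{len(hits)} of {len(lockups)} lockups happened while a scan was running")
```

A high hit rate would only show correlation, of course; it would not prove the scan is the trigger.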

kubimike · Apr 03, 2017 9:09 pm

Can you shut down the integrity scans for now?

mickyv · Apr 04, 2017 2:42 am

Is this ReFS issue still a major problem for everyone? I have a 2016 server running Veeam with a 40 TB cluster on it, local RAID controller on an X3650 M5 server. The server had been running fine for a few weeks, then BAM, it randomly kept freezing with huge memory spikes. I used RAMMap and found that the Metafile usage was the cause... The only way to fix it was to wipe the server and reinstall 2016, and then it was fine. Now it has happened again... No matter what I do to try to fix it, all the registry edits and KB installs, nothing fixes it...

I am unsure what to do at this point besides just wipe and reinstall, as this is our backup server and it hasn't been running for 10 days due to these stupid issues with Server 2016... My next resort is just to go back to 2012 R2...

Mike Resseler · Apr 04, 2017 5:59 am

Hi Michael,

First: Welcome to the forums!

About your issue. It seems that most issues are fixed but since it is not in your case... Do you have a support case with Microsoft? (And obviously installed all the necessary updates...)

Mike

mickyv · Apr 04, 2017 6:26 am

Mike Resseler wrote: Hi Michael,

First: Welcome to the forums!

About your issue. It seems that most issues are fixed but since it is not in your case... Do you have a support case with Microsoft? (And obviously installed all the necessary updates...)

Mike

Thanks Mike

It is a rather frustrating issue to be having. I've actually just wiped and done a fresh install of the server today (currently installing Veeam B&R as I type this), so the issue is not occurring right now... It appeared to start only after I installed KB 4013429, but this could be a coincidence. It's odd, though: I presumed all was fine after installing it, then a few days later noticed backups had stopped over the weekend, and on Monday morning found the server frozen with 100% RAM usage.

Currently I have no open cases with Microsoft, and frankly I'm unsure how best to open one, as I am in Australia.

Mike Resseler · Apr 04, 2017 6:31 am

Hi Michael,

I am sure Microsoft has a support department in Australia.

What I certainly would do (now that you just did a fresh install) is go through this thread again and make sure you have applied the right stuff (and obviously go to 64k, but I assume you did that). Check the memory also. If it is a physical server, there is not much you can do immediately, but there is also a report (in this thread) that adding a bit of additional memory can solve a lot!

mickyv · Apr 04, 2017 6:37 am

I am sure Microsoft have a good support team here in Australia, just never logged an issue with them directly before!

As for the actual server itself, it is a physical server, X3650 M5, brand new, with 32 GB RAM. Currently the ReFS drive is still on 4k cluster size. Should I reformat this partition with 64k clusters, given it is > 40 TB in size?

Mike Resseler · Apr 04, 2017 6:48 am

That would certainly be our recommendation!

mickyv · Apr 04, 2017 6:53 am

Thanks Mike, I have done this now, so I will get everything back up and running and report if I have any further issues.

Mike Resseler · Apr 04, 2017 6:54 am

Great! We are looking forward to your experience!

Delo123 · Apr 04, 2017 7:13 am

So are all the reported issues related to servers having very little physical RAM? Do you guys have unlimited swapfiles set up? Or do the swapfiles get used heavily, thus slowing things down?

adruet · Apr 04, 2017 8:50 am

mkretzer wrote:
- 128 -> 384 GB RAM
- Patch with Option 1
- No per-VM

And backups and merges are FAST....

Hi mkretzer,

If you would be kind enough I have a couple of questions:

1) Are you using a 64K allocation unit size for the ReFS volume?

2) Even though you have 384 GB of RAM, you implemented Option 1 (RefsEnableLargeWorkingSetTrim = 1), which should only be useful in case of heavy RAM usage?

3) From what I read, you only have 3 big jobs running. Do you have per-VM backup chains? How many tasks in parallel do you allow on your repository?

4) What is your RAID controller cache ratio? (if you use DAS as repository storage)

I am asking because I'm struggling with an infrastructure that should have worked ok:
- HP DL380 Gen9 with dual Intel E5-2660 v4 2 GHz CPUs, 64 GB of RAM, RAID 1 SSD for the OS
- Dual 10 Gbit network cards (HP 560FLR) supporting SMB v3 offloading (RDMA capable)
- Per server : 2 x DAS HP D3700 with 1.8TB 12G 10K SAS disks configured as Raid 6 with HP p441 controller
- OS : Windows Server 2016 Standard with March CU KB4013429
- ReFS File system for the backup repository partition with 4K (in progress to move to 64K or back to NTFS)
- Each 75TB Repository configured as SMB Share to use Fast Cloning

We have not implemented any of the registry keys from https://support.microsoft.com/nb-no/hel ... windows-10 yet, as we have 64 GB of RAM and the metafile uses at most 20 GB according to RAMMap.

The difference from your setup is that we have one job per VM, with about 300 jobs running as forever incremental.
We had no problem until merge started. We have some jobs with 30 recovery points, some with 7.
We used to run 20 jobs in parallel on our former infrastructure (with only two backup servers with 16 GB of RAM and everything was fine but the storage was a small SAN and NTFS file system)

Before implementing all this we did quite some benchmarking using diskspd, and the overall throughput reached 2 GB/s using a forward incremental profile (100% writes) and 270 MB/s using a transform profile (50% writes). So we can see that the random reads kill the overall throughput, but that's logical with our setup. That should be OK thanks to fast cloning, though?

Now that we hit the merge critical point (the 30 recovery points in our case), every night is a backup nightmare, they do complete but it takes almost 20 hours, when it used to take only 8 hours before this new setup. We do not experience major BSOD or OS lockup, even if browsing the ReFS volume is almost impossible during backups and don't even try to rescan the repository.

So I am down to 5 jobs in parallel on the repository, and I can barely see any job going higher than 35 MB/s.

I have a case opened with Veeam (02092423) and I am not sure I will be willing to involve Microsoft in this as long as we are not running on 64K. And it takes a very long time to move 75 TB of backups when you only have 4 hours a day without the backups running...
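To put a rough number on "a very long time", here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes) and plugs in two rates mentioned above purely as stand-ins: the 35 MB/s seen per job and the 270 MB/s transform-profile benchmark. The actual copy rate during a migration is unknown and would likely differ from both.

```python
def migration_days(data_tb, rate_mb_s, window_hours_per_day):
    """Days needed to copy data_tb at rate_mb_s, copying only
    window_hours_per_day hours each day (decimal units assumed)."""
    total_bytes = data_tb * 10**12
    seconds = total_bytes / (rate_mb_s * 10**6)
    hours = seconds / 3600
    return hours / window_hours_per_day

# 75 TB repository, 4-hour daily window, two illustrative rates.
for rate in (35, 270):
    days = migration_days(75, rate, 4)
    print(f"{rate} MB/s: {days:.1f} days")
```

Under those assumptions the move takes roughly 19 days at 270 MB/s and about 149 days at 35 MB/s, which is why waiting for the 64K migration before involving Microsoft is not a small ask.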

ivordillen · Apr 04, 2017 8:56 am

@Gostev

So I don't think most of us have problems because

NOTE: RSS is disabled by default on VMXNet3 virtual NICs in Windows. VMware encourages enabling RSS only for applications requiring high network throughput and large bursts.

adruet · Apr 04, 2017 9:05 am

Also, I migrated a backup to a 64K ReFS repository yesterday, and even with only one backup running at a time I get SMB Server event ID 1020:

File system operation has taken longer than expected.

Client Name: \\1.1.1.1
Client Address: 1.1.1.1:55764
User Name: DOMAIN\VeeamUser
Session ID: 0x700010000485
Share Name: \\*\VeeamBackupRepository
File Name: BACKUPNAME\BACKUPNAMED2017-03-03T191009.vbk
Command: 9
Duration (in milliseconds): 72796
Warning Threshold (in milliseconds): 15000

Guidance:

The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.
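For anyone collecting a pile of these events, a small Python sketch like the following can pull the fields out of the event text and show how far over the warning threshold each operation ran. The parsing approach and function names are my own illustration; the sample text is a trimmed copy of the event above.

```python
def parse_event(text):
    """Split 'Key: value' lines of an event message into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'Key: value' pairs
            fields[key.strip()] = value.strip()
    return fields

def overrun_ratio(fields):
    """How many times the warning threshold the operation took."""
    duration = int(fields["Duration (in milliseconds)"])
    threshold = int(fields["Warning Threshold (in milliseconds)"])
    return duration / threshold

EVENT_TEXT = r"""File system operation has taken longer than expected.

Client Name: \\1.1.1.1
Client Address: 1.1.1.1:55764
Share Name: \\*\VeeamBackupRepository
Command: 9
Duration (in milliseconds): 72796
Warning Threshold (in milliseconds): 15000
"""

fields = parse_event(EVENT_TEXT)
ratio = overrun_ratio(fields)
print(f"Command {fields['Command']} ran {ratio:.1f}x over the warning threshold")
```

For this event the operation took about 4.9 times the 15-second threshold.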

The vbk backup file is 420 GB, with 30 recovery points; each vib file is between 10 GB and 40 GB. The backup operation was quick: 69 MB/s processing rate (well, it only transferred 19 GB), and the merge was really fast: 12 minutes. This is why I was very surprised to see the SMB error above in that case. I even saw one with a job that took only 58 seconds to merge.

So my next move will probably be to migrate all my data to an NTFS repository.

graham8 · Apr 04, 2017 1:35 pm

kubimike wrote: can you shutdown the integrity scans for now ?

I'm not sure; I haven't tried disabling them. It's set up by default, so I figured it's best to assume it's necessary to run...

I know with ZFS, scrubs are there just to proactively discover problems, but if those same issues are encountered during the course of normal file IO, they'll be reported and resolved then - so it's not critical a full scrub is run after unexpected power loss... not sure if I can make the same assumption with ReFS though.

I wish the crash recovery scan only scanned metadata and left the grueling slog through all the integrity-enabled data alone.

kubimike · Apr 04, 2017 2:54 pm

@graham8 I found this on another forum, try disabling them see if it helps. At this point what do we have to lose.
https://www.reddit.com/r/Windows10/comm ... lesystems/

Also, I have a fresh install of Windows 2016 on fresh HP hardware. I looked in Task Scheduler and that process hasn't kicked off yet. I think I'll disable it until all of this is figured out. I think you stumbled onto something, though. I'm going to look at my other HP server that was running Veeam at the time and see if these events kicked off during backups. It might be the cause of the problem.

kubimike · Apr 04, 2017 3:55 pm

Looked at my old Veeam server: the "Data Integrity Scan" date lines up with the date I opened my Microsoft ticket. Coincidence?

FWIW

 <Actions Context="LocalSystem">
    <ComHandler>
      <ClassId>{DCFD3EA8-D960-4719-8206-490AE315F94F}</ClassId>

= DISCAN.DLL

I exported the task in XML

graham8 · Apr 04, 2017 5:34 pm

Just got off the line with Microsoft. They're needing to go talk to the ReFS team to ask them how they want to gather logs/memory dumps to best assist. One thing, though - regarding the following article:

https://support.microsoft.com/en-us/hel ... windows-10

They indicated that Options 1, 2, and 3 should be used in combination to most aggressively target this. I had only been enabling them one at a time.
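If you want to apply several values in one go, one way is to generate a .reg file and review it before importing. The sketch below is only that, a sketch: the registry path is my assumption of where the article places these values and should be checked against the article itself, and the only value name taken from this thread is RefsEnableLargeWorkingSetTrim (Option 1). The names and values for Options 2 and 3 are not given in this thread, so they are deliberately left out; add them from the article.

```python
# Assumed key path; verify against the Microsoft article before importing.
REFS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem"

def render_reg(key_path, dword_values):
    """Render a .reg file body that sets the given DWORD values under key_path."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key_path}]"]
    for name, value in dword_values.items():
        # .reg files store DWORDs as 8 hex digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"

# Only Option 1 is named in this thread; extend the dict with the
# Option 2 and Option 3 values from the article.
settings = {"RefsEnableLargeWorkingSetTrim": 1}

reg_text = render_reg(REFS_KEY, settings)
print(reg_text)
```

Keeping all of the values in one dict makes it easy to see exactly which combination is active on a given server when comparing notes with support.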

Also, I've noticed that as this has been occurring, my drive has been filling up. It's been producing large synthetic fulls, but because the block clone operation doesn't finish and the server crashes, each file is treated as an independent file that consumes the full amount of space. I'm going to call Veeam support to get clarification on how best to clean this up.

kubimike · Apr 04, 2017 6:13 pm

@graham8 cool, yeah, I asked that question a while back. The instructions didn't say whether you need to apply the registry keys by themselves or together. Someone mentioned in the past that they did them individually.

Don't you have per-VM turned off? Perhaps that's why it's filling? Did you ask about the integrity checks?

graham8 · Apr 04, 2017 6:23 pm

kubimike wrote: Don't you have per-vm turned off?? perhaps thats why its filling ??

Per-VM is turned off. The issue is that new synthetic fulls are supposed to issue block clone operations referencing the blocks shared with previous fulls and incrementals, so that the newly produced full consumes only a small amount of space. When the server goes down during the block clone operation, however, the filesystem treats the newly created multi-TB *.VBK as though it has no shared blocks, so overall volume space usage keeps growing. I'll work with support to run whatever consistency checks Veeam supports to see which, if any, of these points in time are corrupted and need to be nuked.
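To illustrate the space accounting, here is a toy Python model of reference-counted blocks. It is not how ReFS is implemented internally, and the class and file names are invented for the example; it only shows why a cloned synthetic full costs almost nothing while an unshared copy of the same size costs the full amount.

```python
class Volume:
    """Toy volume: space used is the number of distinct blocks on disk."""

    def __init__(self):
        self.refcount = {}  # block id -> number of file references
        self.files = {}     # file name -> list of block ids
        self._next_block = 0

    def write_file(self, name, n_blocks):
        """Write a file made of brand-new blocks (no sharing)."""
        blocks = []
        for _ in range(n_blocks):
            self.refcount[self._next_block] = 1
            blocks.append(self._next_block)
            self._next_block += 1
        self.files[name] = blocks

    def clone_file(self, name, sources):
        """Build a file by referencing the existing blocks of the sources."""
        blocks = [b for src in sources for b in self.files[src]]
        for b in blocks:
            self.refcount[b] += 1
        self.files[name] = blocks

    def used_blocks(self):
        return len(self.refcount)

vol = Volume()
vol.write_file("full.vbk", 1000)
vol.write_file("inc.vib", 50)
before = vol.used_blocks()

# Successful synthetic full: only references are added, no new blocks.
vol.clone_file("synthetic.vbk", ["full.vbk", "inc.vib"])
after_clone = vol.used_blocks()

# What the interrupted case looks like to the volume: a same-sized
# file with no shared blocks at all.
vol.write_file("orphan.vbk", 1050)
after_orphan = vol.used_blocks()

print(before, after_clone, after_orphan)
```

In the model, the clone leaves usage unchanged while the unshared file doubles it, which matches the growth pattern described above.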

kubimike wrote: Did you ask about the integrity checks ???

I mentioned it, but it sounded like they needed to talk to the ReFS people to get more information before being able to answer anything detailed about ReFS operation. Waiting on them to get together and call me back.

kubimike · Apr 04, 2017 6:44 pm

Nice, OK! My first integrity check is supposed to occur 4/19... Ticking bomb, methinks... So far my new setup has been running for two weeks on 32 GB of RAM and hasn't bombed. I have not patched with KB401xxxx.

mkretzer · Apr 04, 2017 6:50 pm

@adruet:

1) You are using 64K Allocation Unit Size for the ReFS volume ?

Yes! We started this thread when we had 4K block size and regular crashes.

2) Even though you have 384 GB of ram you implemented the option 1 (RefsEnableLargeWorkingSetTrim = 1) which should be only useful in case of heavy Ram usage?

We had 128 GB RAM before and only had a small portion of our backup storage used. I figured that if we want to use our full backup storage, not even 384 GB might be enough (we are currently on our ReFS test storage (105 TB) and will transfer to our production storage with ~200 TB if everything works).

3) From what I read you only have 3 big jobs running. Do you have per VM backup chains ? How many task in parallel do you allow on your repository ?

No, we have about 30 primary backup jobs, 30 copy jobs, 1 tape job, and 1650 VMs. We had per-VM, but as I said, this created the first issues with our 4K ReFS repo (deletions took FOREVER). Concurrent tasks: 18.

4) What is you raid controller cache ratio ? (if you use DAS as repository storage)

We currently use a Fujitsu DX60 S3 Fibre storage. What do you mean by cache ratio? There is no fixed cache percentage with this as there is with HP controllers.

kubimike · Apr 04, 2017 7:56 pm

Found some good stuff on that data integrity scanning. Looks like it's for servers set up with ReFS on mirrored spaces. Looks like if you're not using any of that, you can safely disable this sweep.

The previous response isn't that helpful. The "Data Integrity Scan" scheduled task actually is specifically for ReFS resilient volumes:
https://technet.microsoft.com/en-us/lib ... _WScmdlets
The Data Integrity Scan scans through all disk sectors that back a ReFS volume. If any data corruption has occurred (i.e. bit rot), it will attempt to repair the corruption; this only works if the volume is a mirrored storage space (it may work on parity spaces, too).
It won't do anything if you aren't using ReFS. Because the task has to read all of the data in a ReFS volume, this can take a very long time and be quite disruptive if you are doing other tasks that access the disk heavily. I believe this is why the task is disabled by default on Windows client systems.

https://answers.microsoft.com/en-us/win ... 4e96271bc9


Gostev · Apr 04, 2017 7:59 pm

ReFS data integrity scans are not specific to Storage Spaces - they are available for simple volumes too, and are in fact one of the big benefits of ReFS as it comes to backup storage.

mickyv · Apr 04, 2017 11:29 pm

Gostev wrote: ReFS data integrity scans are not specific to Storage Spaces - they are available for simple volumes too, and are in fact one of the big benefits of ReFS as it comes to backup storage.

I had no idea about the data integrity scans until just now... Looking at our backup server it is scheduled for every 4th Saturday at 11 PM... This sort of ties in with the server becoming unresponsive on the weekend late at night and coming in on Monday morning to a frozen server...

kubimike · Apr 05, 2017 12:10 am

@mickyv disable it and see if it stops.

mickyv · Apr 05, 2017 12:12 am

kubimike wrote: @mickyv disable it see if it stops

I will see how it goes this weekend; if it freezes, then I know that the integrity scan could be the cause! Fingers crossed it is all fixed now, since I wiped the server and am using 64k clusters instead of 4k.

kubimike · Apr 05, 2017 12:20 am

Mickyv, when's it scheduled to run next?

mickyv · Apr 05, 2017 12:27 am

kubimike wrote: Mickyv when's it scheduled to run next ?

Looking at the schedule, the next run date is 19th April at 1 PM. Last run time was yesterday however.

kubimike · Apr 05, 2017 1:01 am

And your server blew up yesterday?

Re: REFS 4k horror story
