jimmycartrette wrote: I have a 4k 2016 ReFS repo and am experiencing the issues.... Applied the hotfix, set the RefsEnableLargeWorkingSetTrim to 1 ... Saturday ... the repo was locked. Reset the repo, dead shortly after. .. An IO took more than 30000 ms to complete
I'm glad you posted this. Almost exactly the same story here: I applied that update to our 2016 backup copy repo, set RefsEnableLargeWorkingSetTrim = 1, rebooted, and almost immediately had a RAID card reset issued (which had never happened before) and the array dropped out. After rebooting, everything was back online, but this weekend I again got the "IO took more than 30000 ms to complete" event. As I mentioned in a previous post, I've been getting these on all of these 2016/ReFS/Veeam servers. Those events don't always correlate with the lockups.
This weekend, the patched copy repo in question went into an IO-starvation (and presumably memory-starvation) state again and is inaccessible. I had someone on site power-cycle it, and it checked back in briefly and is now inaccessible again. I'm going to wait a few hours, since from past experience the post-crash scan can cause long IO lockups on these servers as well (Task Scheduler -> Microsoft -> Windows -> Data Integrity Scan -> Data Integrity Scan for Crash Recovery). Every time this happens it's incredibly nerve-wracking and makes me afraid the whole setup is going to burn down.
The KB article about this (https://support.microsoft.com/en-us/help/4016173/fix-heavy-memory-usage-in-refs-on-windows-server-2016-and-windows-10) does mention trying the other two options. Maybe I should try Option 3 ("RefsEnableInlineTrim")...
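For anyone who wants to script the change rather than click through regedit, here is a minimal sketch of how Option 3 could be applied. It only wraps the standard `reg add` command. The registry key path is my assumption from reading the KB, and the helper names (`build_reg_add`, `apply_option3`) are made up for illustration, so check both against the article before running it for real. It defaults to a dry run that just returns the command.

```python
import subprocess
import sys

# Assumption: the key the KB article points at for the ReFS tuning values.
# Verify against the KB before using this on a production repo.
REFS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\FileSystem"


def build_reg_add(value_name, data=1):
    """Build a `reg add` argument list that sets a REG_DWORD under REFS_KEY."""
    return [
        "reg", "add", REFS_KEY,
        "/v", value_name,      # value name, e.g. RefsEnableInlineTrim
        "/t", "REG_DWORD",     # value type
        "/d", str(data),       # value data
        "/f",                  # overwrite without prompting
    ]


def apply_option3(dry_run=True):
    """Return the command for Option 3; run it only on Windows when dry_run is False."""
    cmd = build_reg_add("RefsEnableInlineTrim", 1)
    if dry_run or sys.platform != "win32":
        return cmd
    subprocess.run(cmd, check=True)  # needs an elevated prompt
    return cmd


if __name__ == "__main__":
    print(" ".join(apply_option3(dry_run=True)))
```

Calling `apply_option3(dry_run=False)` from an elevated session on the repo server would actually write the value. I rebooted after setting the first option and would do the same here.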
The fact that the KB lists different things to "try" makes me suspect that Microsoft really has no idea what's going on here. If they did, there wouldn't be any need for end-user tuning...if the system were about to go into a memory-starvation spiral of death, it would detect as much and back off. The article mentions "memory pressure that can cause poor performance", but that's a gross misrepresentation, since this essentially kills a server (i.e., waiting 12 hours doesn't help) when it occurs. Even a blue screen would be better than this...easier to debug, certainly.
I'd be happy to send more logs to the ReFS team, but we don't want to take the risk that Microsoft will declare that they already "have a 'fix'" (KB4013429) and charge for the incident...since I haven't tried all the options the KB lists yet.