Will-do, thanks. I opened a case earlier. I'll PM you the case ID.
Gostev wrote: Do you have a support case open with Microsoft? I would like to forward the case ID to the ReFS team, as it looks like your servers might be a good subject for investigation, with the issue consistently reproduced even with the patch installed.
My current problem is that I still don't have a single good example to show to them. But even I myself am not convinced at this time whether the issue is real - or is some corner case that has to do with special settings, special hardware, lack of a certain system resource or something along these lines (for example, the issue Nate has just mentioned). The ratio of customers having great success using ReFS vs. customers having this deadlock issue actually suggests it might be a corner case.
My only visibility into this issue is my own experience and this thread. Are there lots of people out there doing synthetic fulls via block clone of 5-10ish+ TB backups with 16-32 GB of RAM who are having no sporadic issues whatsoever? (I ask because throwing extreme amounts of RAM at this use case seems to at least often work around the issue...) That would be interesting to know.
Oh, and another question to everyone - has anyone else observed that your ReFS volumes seem to have ludicrously high disk fragmentation statistics? I checked on this server of ours that keeps locking the most frequently and it reported 100% fragmentation (which would imply that not one single block of data is contiguous, which sounds statistically near-impossible...). The default defrag scheduled task has been successfully running, so I thought it was fine till I manually checked via the UI.
Also something I've noticed - I think the deadlocks only occur when the scheduled data scrub tasks under Task Scheduler -> Microsoft -> Windows -> "Data Integrity Scan" are running while a block clone operation is issued. If I don't forcibly prevent the Veeam services from running on startup while the crash-recovery scrub is running, it IO-locks, 100% of the time. I'll work with MS to confirm, but it would be helpful if other people could check the History tab of those two tasks to see whether your own deadlocks fall in the time period between a Task Started event (Event ID 100) and the matching Task Completed event (Event ID 102).
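For anyone who wants to do that comparison less by eye, here is a rough Python sketch of the check. It assumes you have copied the History entries out by hand as (timestamp, event ID) pairs; the function names and the sample timestamps are made up for illustration and are not from any real log. A Task Started with no matching Task Completed (which is what a hang during the scrub would likely leave behind) is assumed to run until the next Task Started, or to still be open if it is the last entry.

```python
from datetime import datetime

TASK_STARTED = 100    # Task Scheduler "Task Started" event
TASK_COMPLETED = 102  # Task Scheduler "Task Completed" event


def scrub_windows(events):
    """Turn (datetime, event_id) pairs into (start, end) scrub windows.

    Assumption: a start with no completion ends at the next start
    (the box was presumably rebooted in between); if it is the last
    event, the window is left open with end=None.
    """
    windows = []
    start = None
    for when, event_id in sorted(events):
        if event_id == TASK_STARTED:
            if start is not None:
                windows.append((start, when))  # earlier run never completed
            start = when
        elif event_id == TASK_COMPLETED and start is not None:
            windows.append((start, when))
            start = None
    if start is not None:
        windows.append((start, None))  # still running, or hung
    return windows


def during_scrub(deadlock_time, windows):
    """Return True if deadlock_time falls inside any scrub window."""
    for start, end in windows:
        if start <= deadlock_time and (end is None or deadlock_time <= end):
            return True
    return False


# Hypothetical sample data, only to show the call pattern.
sample_events = [
    (datetime(2017, 3, 1, 2, 0), TASK_STARTED),
    (datetime(2017, 3, 1, 4, 30), TASK_COMPLETED),
    (datetime(2017, 3, 2, 2, 0), TASK_STARTED),  # never completed
]
sample_windows = scrub_windows(sample_events)
print(during_scrub(datetime(2017, 3, 1, 3, 0), sample_windows))  # True
print(during_scrub(datetime(2017, 3, 1, 5, 0), sample_windows))  # False
```

If most of your deadlock times come back True, that would at least be consistent with the scrub-plus-block-clone theory; it would not prove it.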