Indeed, it seems stable enough for most users with 64KB volumes. We can now say this with confidence, because after nearly two months of purposely trying to kill multiple ReFS repositories in our labs (dozens of continuous jobs producing and deleting hundreds of restore points, plus a specially created Clonezilla tool running on top of those), ReFS has remained stable on all of our test backup repositories but one. This is all without any private fixes or tweaks, just vanilla Windows Server 2016 with all updates.
Even the repository where we finally reproduced the issue had been working fine for weeks, until a few days ago. This is good news, as it finally lets us analyze a live occurrence and try to understand what is different about this particular repository, as well as test some possible workarounds we can implement on our side (we have some ideas on what could potentially help). Of course, our main hope is that Microsoft fixes this on their side; I know they've been working hard investigating this one (I asked them for another update a few days ago).
We are certain the issue only occurs during retention processing, when backup files are being deleted from the disk: some ReFS metadata update operation appears to be long-running and blocks other I/O to the same volume. That is the essence of the issue.
One recommendation we can give based on our observations so far is to avoid scheduling synthetic fulls too often (or to disable them completely), and not to use per-VM chains. Both measures reduce the number of files with cloned blocks that get deleted at once. In fact, one of the workarounds we're testing right now is simply throttling backup file deletions by adding a timeout after each file deletion command, and I've already heard that the first results are promising.
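To illustrate the throttling idea in the abstract, here is a minimal Python sketch of deleting files one at a time with a pause after each delete, so the file system gets a window to complete its metadata housekeeping before the next deletion is issued. The function name and delay value are purely illustrative; the actual product implementation is internal and may work differently.

```python
import os
import tempfile
import time

def delete_with_throttle(paths, delay_seconds=1.0):
    """Delete files one at a time, sleeping between deletions.

    The pause is the whole point: it spaces out the metadata updates
    the file system must perform for each deleted file, instead of
    issuing all deletes back to back. (Illustrative sketch only.)
    """
    for path in paths:
        os.remove(path)
        time.sleep(delay_seconds)  # let the volume's metadata work settle

# Usage: create a few scratch files, then delete them with a short pause.
scratch = []
for _ in range(3):
    fd, name = tempfile.mkstemp()
    os.close(fd)
    scratch.append(name)

delete_with_throttle(scratch, delay_seconds=0.1)
print(all(not os.path.exists(p) for p in scratch))  # True once all are gone
```

The trade-off is obvious: retention processing takes longer overall, but each individual deletion is less likely to pile its metadata work on top of the previous one.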