adapterer
Expert
Posts: 227
Liked: 46 times
Joined: Oct 12, 2015 11:24 pm
Contact:

Re: REFS 4k horror story

Post by adapterer » 1 person likes this post

FWIW, we are a Cloud Connect provider and found 'Option 1' of the memory usage fix didn't work for us, as our storage is never idle.

Option 2 is looking good so far with a value of 128.
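
In case it saves someone a lookup: here's a minimal sketch (mine, not an official snippet) of setting that 'Option 2' value on the machine hosting the ReFS volume. I'm assuming the value name RefsNumberOfChunksToTrim and the HKLM\SYSTEM\CurrentControlSet\Control\FileSystem path from Microsoft's ReFS tuning KB; run it elevated and reboot afterwards for it to take effect.

# Sketch: apply the 'Option 2' ReFS tuning value discussed above.
# Assumes the value name/path from Microsoft's ReFS KB; run elevated,
# then reboot so the ReFS driver picks the value up.
import winreg

FS_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FS_KEY, 0,
                    winreg.KEY_SET_VALUE) as key:
    # Option 2: number of metadata chunks trimmed per pass (128 worked for us)
    winreg.SetValueEx(key, "RefsNumberOfChunksToTrim", 0,
                      winreg.REG_DWORD, 128)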
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

tsightler wrote:Deleting files definitely seems to be one of the big triggers. In my testing that was always the point where Windows seemed to get crazy, either when Veeam was deleting lots of files or even if I just started deleting lots of block cloned files manually. I've almost wondered if it would be worthwhile to throttle file deletions on ReFS until Microsoft gets to the root of this problem.
This is exactly the timing of my 4K lockup issue. My jobs are backup copies; during normal synthetics they don't lock up, but when a job performs the GFS delete followed by the weekly roll, it only gets through a certain percentage (44% on one job in particular) before the OS goes to 100% CPU. After disabling the job, rebooting the OS, letting it sit for about an hour to think and clean up, then re-enabling the job, it is able to finish successfully.
Gostev wrote:One theory I have that would explain why some users have issues and others don't is the difference in the number of concurrent tasks. So one troubleshooting step for those experiencing lockups would be to reduce the number of concurrent tasks on the repository by half and see if that makes any difference to stability. Perhaps even change it to 1 task if you have a lab environment where this issue reproduces. Thanks!
I have two different 4K repos: one is set to 2 concurrent tasks, the other to 4, and both are experiencing the 100% lockup. Also, the repository I mentioned in my other post that is limited to 2 concurrent tasks has only 1 job writing to it, and that job includes only 3 VMs, though 2 of those 3 are huge file stores of over 3 TB each.
pinkerton
Enthusiast
Posts: 82
Liked: 4 times
Joined: Sep 29, 2011 9:57 am
Contact:

Re: REFS 4k horror story

Post by pinkerton »

Hi Guys,

Just found this thread after opening a new one. It seems we're affected by the same issue:

vmware-vsphere-f24/slow-active-fulls-on ... 42775.html

Will install more RAM and reduce proxy slots now to see whether this helps.

Regards,
Michael
Delo123
Veteran
Posts: 361
Liked: 109 times
Joined: Dec 28, 2012 5:20 pm
Full Name: Guido Meijers
Contact:

Re: REFS 4k horror story

Post by Delo123 »

Good... :) Please keep us informed :)
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal »

richardkraal wrote:I've upgraded my dedicated backend (SMB share server hosting the ReFS volume) from 64 GB to 384 GB and backups are running fine now.
The gateway server is running on a different dedicated machine.
Also, the perfmon logs seem to be fine now (no gaps in the logs), and the system no longer locks up. RAMMap shows a Metafile usage of 50 GB (!).
Let's see what happens over the next few weeks.

fingers crossed

Hardware used:
DL380 Gen9, dual CPU, 384 GB RAM, 12x 8 TB, P841/4 GB (64 KB stripe, RAID 6). Windows Server 2016, ReFS 64K.
This evening one of the jobs had the same issue again...

8-5-2017 22:53:48 :: Synthetic full backup creation failed Error: Agent: Failed to process method {Transform.CompileFIB}: The handle is invalid.
Failed to duplicate extent. Target file: \\xxx\xxx\xxx.vbk, Source file: \\xxx\xxx\xxx.vbk, TgtOffset: 38785449984, SrcOffset: 38812909568, DataSize: 327680

At the moment the issue occurred:
- the disk latencies are fine (~20 ms average)
- the CPU of the file server is fine
- 256 GB of free memory (yes, 256 GB)
- other backup jobs are running at that moment, combined throughput ~500 MB/s
- ReFS/Explorer is responding slowly


My feeling is that ReFS is doing some strange things...
Updated my Veeam case; I hope they will contact me now. Last contact was on 27 April (plus an automated message on 1 May).

case ID# 02134458
Gostev
Chief Product Officer
Posts: 31524
Liked: 6700 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

@Richard, honestly it does not seem like your issue has anything to do with the issue discussed in this thread. It could be a simple I/O error due to heavy concurrent load on the target volume. By the way, you may consider increasing your RAID stripe size to 128 KB or 256 KB to better match the Veeam workload (avg. block size is 512 KB); this will cut IOPS from the backup job significantly (and your backup storage's IOPS capacity is not fantastic, so it could really use this).
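
To make the arithmetic concrete (a rough model I'm assuming here, not Veeam's exact I/O pattern): a single ~512 KB Veeam block written to a RAID volume turns into roughly one write per stripe it touches, so a larger stripe means fewer write I/Os per block.

# Back-of-the-envelope only: writes per ~512 KB Veeam block for a few
# RAID stripe sizes, assuming one write I/O per stripe touched.
AVG_VEEAM_BLOCK_KB = 512

for stripe_kb in (64, 128, 256):
    ios = -(-AVG_VEEAM_BLOCK_KB // stripe_kb)  # ceiling division
    print(f"{stripe_kb:>3} KB stripe -> ~{ios} write I/Os per block")

# 64 KB stripe -> ~8 I/Os, 128 KB -> ~4, 256 KB -> ~2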
alesovodvojce
Enthusiast
Posts: 61
Liked: 9 times
Joined: Nov 29, 2016 10:09 pm
Contact:

Re: REFS 4k horror story

Post by alesovodvojce »

We opened an MS support ticket (#117050215676939) with the goal of increasing the overall bug priority (and, as a side effect, we can give the ReFS team whatever they need for the investigation).
But our ticket seems to have had no effect on prioritization, nor on our chances of becoming a contact sample for the ReFS team. And in our case, maybe no effect at all.

Today's reply:
Unfortunately, what you are expecting from our support is something we cannot perform. Third parties have access to TSAnet, which is a specific support channel for integration between Microsoft products and external software. That team is really above our level and we do not have a communication line with them.

Also, if the developers/product team is investigating an issue like this one (which is another team to which we do not have access), there is nothing we can do. Normally, when we identify a bug, we escalate it to a Technical Advisor so they can share it with the developers through internal tools. In case there is already an open investigation for a bug, nothing else is done. What we do for our customers when we reach that point (which is the status we are at) is to inform them that the developers are aware of it. Then customers must wait for a hotfix for the incident, which will be delivered through regular updates (if there is any possible solution).

That being said, I have been in touch with my Technical Lead, who has instructed me to archive this Service Request as the investigation is underway. Of course, that will be done without charging you any costs.
The questions are:
- Does Veeam have more leverage (via TSAnet)?
- If one or more members of this thread are in contact with MS, is it worth trying any further?
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » 1 person likes this post

My ticket's (117040315547198) current status is that the manual memory dumps I took and uploaded to MS are being examined by the "debug team". At this point the tier-2 support engineer on the ticket "will not be able to comment if this is the Bug With REFS, till we hear back from the debug team" (though obviously it is, since I can see the ReFS kernel driver consume all the memory by using poolmon and looking at the tags when the issue occurs; I understand they have to confirm these things).

So from what it sounds like, my memory dumps aren't being looked at by the ReFS team yet, pending another team running the dumps through windbg, etc. There have been long (weeks-long) delays in correspondence on my ticket. It took a few weeks just to get them to send me a link to somewhere to upload the memory dumps. I'd hope the priority on the debug portion of this is high, considering we're talking about a massive problem in the underlying storage filesystem code itself, but... Anyway, they said they'll keep me updated on the results from the debug team. I'll keep everyone posted.
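
While waiting on MS, I've been running a crude watchdog on the repository server so the memory drain is visible before the box locks up. It's no substitute for poolmon/RAMMap (it only sees total available physical memory, not pool tags); just a quick sketch assuming Python is available on the box.

# Crude watchdog: log available physical memory once a minute so a
# runaway ReFS metadata working set is visible before the server hangs.
import ctypes, time

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [("dwLength", ctypes.c_ulong),
                ("dwMemoryLoad", ctypes.c_ulong),
                ("ullTotalPhys", ctypes.c_ulonglong),
                ("ullAvailPhys", ctypes.c_ulonglong),
                ("ullTotalPageFile", ctypes.c_ulonglong),
                ("ullAvailPageFile", ctypes.c_ulonglong),
                ("ullTotalVirtual", ctypes.c_ulonglong),
                ("ullAvailVirtual", ctypes.c_ulonglong),
                ("ullAvailExtendedVirtual", ctypes.c_ulonglong)]

def available_gib():
    stat = MEMORYSTATUSEX()
    stat.dwLength = ctypes.sizeof(stat)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))
    return stat.ullAvailPhys / 2**30

while True:
    print(f"{time.strftime('%H:%M:%S')}  available physical: {available_gib():.1f} GiB")
    time.sleep(60)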
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

Hi

I have some questions I'm hoping someone can answer.

1. On the page with the patch notes from MS, they list 3 options.
Are the options additive, so I can start with Option 1, then add Option 2 and then the most aggressive Option 3?
Is there any way to check that the actual patch is in place and that the registry values are being "used"?

2. Is there any way to check the ReFS internals and see when it is performing the above tasks?

As it is now, we are experiencing bad performance on our ReFS volumes; we have run Windows Update in the "hope" that the patch is installed and that it is accepting the registry values.
It's a bit like walking around a dark room trying to find the light switch :)
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » 2 people like this post

Nilsn wrote:Are the options additive
Yes, I'm told they are.
Nilsn wrote:As it is now we are experiencing bad performance on our ReFS volumes
If you're only experiencing bad performance, I don't think you have the issue that this thread describes and that the patch is trying (unsuccessfully) to resolve. If anything, my understanding is that those options would reduce your performance.
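
Regarding your other question, checking that the registry values are in place: here's a minimal read-only sketch. The value names are the ones I believe the MS KB uses for Options 1-3, so treat them as assumptions; it can only tell you whether the values exist, not whether the patched refs.sys actually honours them.

# Read-only check for the ReFS tuning values from the MS patch.
import winreg

FS_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"
NAMES = ("RefsEnableLargeWorkingSetTrim",   # Option 1 (assumed name)
         "RefsNumberOfChunksToTrim",        # Option 2 (assumed name)
         "RefsEnableInlineTrim")            # Option 3 (assumed name)

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FS_KEY) as key:
    for name in NAMES:
        try:
            value, _ = winreg.QueryValueEx(key, name)
            print(f"{name} = {value}")
        except FileNotFoundError:
            print(f"{name} is not set")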
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

hey there,

I've been experiencing decreasing performance on weekly synthetic fulls... for example, on the Exchange VM job:

4 weeks ago : took 1 hour
3 weeks ago : took 4 hours
2 weeks ago : took 8 hours
last weekend : 13 hours

Will try some of the reg thingies, but I don't have a good feeling about this :?
JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS 4k horror story

Post by JimmyO » 1 person likes this post

Exactly the same scenario as I have! (About the same times, too.) The only difference is that I do forward incremental forever and merge daily (with the daily merge taking about as long as your weekly synthetic full).

The only difference from 4 weeks ago until now is that I have installed the latest Server 2016 updates (the May update). Of course, we can also expect fragmentation after many runs, but according to MS this shouldn't be an issue with ReFS.

It was a huge job for me to go from NTFS to ReFS since I have a lot of data (350 TB). Now ReFS seems to mess everything up. I have 200 GB of RAM in my server, and about half of it is available, so it's not the ReFS memory issue (also, I'm using 64 KB clusters).

What's happening here? Does Veeam work closely with MS to resolve this? Who knows where it may end up...
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

After installing this month's updates and setting RefsEnableLargeWorkingSetTrim to 1, I triggered a synthetic full, which took only 2 hours.

Not sure if there's a correlation here; from my understanding, the reg tweaks are supposed to reduce memory usage, not have an impact on performance.
JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS 4k horror story

Post by JimmyO »

In my experience, restarting the server speeds up performance for a day or two, then it gets worse again.
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

Exactly the same behavior we are seeing at the moment.
The ReFS volume is almost unreachable in the morning; after restarting the proxy, the volume is more responsive.

64K blocks.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

JimmyO wrote:In my experience, restarting the server speeds up performance for a day or two, then it gets worse again.
not really reassuring :shock:


BTW, 64K clusters here as well.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » 1 person likes this post

In our case the synthetics are still very fast after 4 weeks - it really feels like faster backend storage and lots of RAM solved the issue for us!
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

How big are the volumes you are running, if I might ask?
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

72 TB here (27 TB used)

Veeam ONE shows 88 TB worth of full backups and 7.4 TB of increments.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

mkretzer wrote:In our case the synthetics are still very fast after 4 weeks - it really feels like faster backend storage and lots of RAM solved the issue for us!
I saw your previous posts about this and plan to buy more RAM for the Veeam boxes (currently both at 64 GB), but in the meantime I'm wondering if I should switch synthetics to monthly instead of weekly.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

After re-reading the whole thread, I see that some people suggest that running jobs in parallel makes things worse.

Until now I had all my synthetics set to the same day, and a few weeks ago I added a big, nasty (10 TB) file-sharing VM (which was previously handled by a NetApp box), so that could have been the trigger for my degrading performance...

I've now changed my jobs to create synthetics on different days, and for the sake of testing I removed the RefsEnableLargeWorkingSetTrim value that I enabled yesterday...

Will report back ASAP.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

192 TB, 100 TB used
dmayer
Influencer
Posts: 18
Liked: 9 times
Joined: Apr 21, 2017 6:16 pm
Full Name: Daniel Mayer
Contact:

Re: REFS 4k horror story

Post by dmayer »

We are going to be speccing out our first VBR system for a customer, and I'm not really sure ReFS will be the way to go given these issues; it would mainly be for the savings on synthetics. We're looking at an 8-12 TB repo using one concurrent task, probably around a 1.5-2 TB full backup with maybe 25-30 GB of daily changes. The current solution doesn't dedupe Windows files and such, so the full will probably end up smaller. Would 32 GB of RAM work for this, or should we just stick with good old NTFS for now? We haven't specced out the hardware, but I'm not keen on tossing a ton of RAM into the solution just to work around a Microsoft issue.
DaveWatkins
Veteran
Posts: 370
Liked: 97 times
Joined: Dec 13, 2015 11:33 pm
Contact:

Re: REFS 4k horror story

Post by DaveWatkins »

Our server has only 32 GB and we've got about 180 TB of total space. Only about 60-70 TB of that is used, but our daily change rate is more than 20-30 GB, so you'll probably be fine. Hard to say definitively, of course, but we don't have any blue-screen issues anymore with all the 2016 updates applied and the reg key set.
dmayer
Influencer
Posts: 18
Liked: 9 times
Joined: Apr 21, 2017 6:16 pm
Full Name: Daniel Mayer
Contact:

Re: REFS 4k horror story

Post by dmayer »

Dave,

Thanks for the reply. I was originally going to spec the machines with 16 GB, but I might do 32 GB to be safe, and I was definitely going to use the reg tweaks. This would be our first VBR deployment for a customer and I don't want it to go south. I was hoping to use ReFS for the space savings and not have to load up on drives; we deal mostly with SMBs, so budgets can be tight.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

antipolis wrote: I've now changed my jobs to create synthetics on different days, and for the sake of testing I removed the RefsEnableLargeWorkingSetTrim value that I enabled yesterday...

Will report back ASAP.
Creating my synthetics on different days seems to have solved the issue: last night the fast cloning part of my Exchange job lasted only 1 hour, which is roughly the same as what I had a few months ago.

As a side note, if I set RefsEnableLargeWorkingSetTrim to 1, fast cloning takes 2 hours.
Gostev
Chief Product Officer
Posts: 31524
Liked: 6700 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » 2 people like this post

All, I've received the second set of test fixes from Microsoft. According to the description, one of them addresses the scenario I suspected could be causing the issue. Since the only lab where we've managed to reproduce the issue (the personal lab of one of our engineers) is no longer available, our support will be reaching out to some of those with open Veeam support cases to see if it helps. If you want to participate, post your Veeam support case ID here (I will delete your post once I forward your case ID to support). Thanks!
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

Got some more weird behavior today.
Yesterday I extended a ReFS volume from 70 TB to 80 TB; today when I checked the nightly backups, 19 jobs were still running.
The volume is "sluggish" to access in Windows on the proxy.
60 GB of RAM and 4 vCPUs are assigned to the proxy VM.

Anyone else seeing "slow" performance who has recently extended the size of the volume?

We are planning to begin the extremely tedious job of rolling 140 TB of backup data back to NTFS...
rfn
Expert
Posts: 141
Liked: 5 times
Joined: Jan 27, 2010 9:43 am
Full Name: René Frej Nielsen
Contact:

Re: REFS 4k horror story

Post by rfn » 1 person likes this post

We switched to Veeam Backup & Replication 9.5 Update 1 (from another product) in March/April and opted to use ReFS for the space savings. We installed it on an HPE DL380 Gen9 with 32 GB RAM and around 55 TB of storage. We're backing up around 100 VMs from vSphere 6.0 Update 3. We installed the Windows update that was supposed to fix the ReFS problems and implemented solution 1.

It ran well for a week or two, but then the server began to become unresponsive. It would still ping, so at first we didn't know there was a problem, but Veeam ONE did report that it wasn't getting any backup information from the server. I tried to access the server through iLO and physically at the console, but there was no video signal and it didn't respond to anything. A hard reset was the only solution. Since then it has happened many times at varying intervals; most of the time it ran for some days, maybe a week, before becoming unresponsive again, but it could also happen in just one day. We tried upgrading the RAM to 192 GB, but it didn't really help.

What has helped is updating the server with the latest Service Pack for ProLiant, version 2017.04.0 (21 Apr 2017). It updated the BIOS, some firmware and some drivers. Since I did that, it has been rock solid; it has now been running for two weeks, and history suggests it would have crashed by now, so I'm really hoping this has fixed our issue. We originally suspected that the issue was ReFS, but it might not have been, or else HPE has fixed some incompatibility between their hardware/software and ReFS.

This is just meant as a heads-up for other people with a similar setup. If you're running HPE ProLiant Gen9 servers and experiencing problems, then try this update.

EDIT: We're using 64K blocks on ReFS.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » 1 person likes this post

Now, after 4 weeks of a very stable filesystem, we also hit the problem of the system getting extremely slow - so slow it only wrote at 1-5 MB/s instead of 200+ MB/s. At first we thought Veeam Update 2 was the problem and opened a case, but then we found that a simple copy operation showed the same issue. A reboot "solved" it for now.

@Gostev: is the fix also for the "filesystem getting slow over time" issue?