adapterer
Expert
Posts: 227
Liked: 46 times
Joined: Oct 12, 2015 11:24 pm
Contact:

Re: REFS 4k horror story

Post by adapterer » 1 person likes this post

FWIW, we are a Cloud Connect provider and found 'Option 1' of the memory usage fix didn't work for us, as our storage is never idle.

Option 2 is looking good so far with a value of 128.
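
In case it saves someone a lookup: here's a minimal sketch (mine, not an official snippet) of setting that 'Option 2' value on the machine hosting the ReFS volume. I'm assuming the value name RefsNumberOfChunksToTrim and the HKLM\SYSTEM\CurrentControlSet\Control\FileSystem path from Microsoft's ReFS tuning KB; run it elevated and reboot afterwards for it to take effect.

# Sketch: apply the 'Option 2' ReFS tuning value discussed above.
# Assumes the value name/path from Microsoft's ReFS KB; run elevated,
# then reboot so the ReFS driver picks the value up.
import winreg

FS_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FS_KEY, 0,
                    winreg.KEY_SET_VALUE) as key:
    # Option 2: number of metadata chunks trimmed per pass (128 worked for us)
    winreg.SetValueEx(key, "RefsNumberOfChunksToTrim", 0,
                      winreg.REG_DWORD, 128)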
kb1ibt
Influencer
Posts: 14
Liked: never
Joined: Apr 24, 2015 1:40 pm
Contact:

Re: REFS 4k horror story

Post by kb1ibt »

tsightler wrote:Deleting files definitely seems to be one of the big triggers. In my testing that was always the point where Windows seemed to get crazy, either when Veeam was deleting lots of files or even if I just started deleting lots of block cloned files manually. I've almost wondered if it would be worthwhile to throttle file deletions on ReFS until Microsoft gets to the root of this problem.
This is exactly the timing of my 4K lockup issue. My jobs are backup copies; during normal synthetics they don't lock up, but when a job performs the GFS delete followed by the weekly roll, it only gets through a certain percentage (44% on one job in particular) before the OS goes to 100% CPU. After disabling the job, rebooting the OS, letting it sit for about an hour to think and clean up, then re-enabling the job, it is able to finish successfully.
Gostev wrote:One theory I have that would explain why some users have issues and others don't is the difference in the number of concurrent tasks. So one troubleshooting step for those experiencing lockups would be to reduce the number of concurrent tasks on the repository by half and see if that makes any difference to stability. Perhaps even change it to 1 task if you have a lab environment where this issue reproduces. Thanks!
I have two different 4K repos: one is set to 2 concurrent tasks, the other to 4, and both are experiencing the 100% lockup. Also, the repository I mentioned in my other post that is limited to 2 concurrent tasks has only 1 job writing to it, and that job includes only 3 VMs, though 2 of those 3 are huge file stores of over 3 TB each.
pinkerton
Enthusiast
Posts: 82
Liked: 4 times
Joined: Sep 29, 2011 9:57 am
Contact:

Re: REFS 4k horror story

Post by pinkerton »

Hi Guys,

Just found this thread after opening a new one. It seems we're affected by the same issue:

vmware-vsphere-f24/slow-active-fulls-on ... 42775.html

Will install more RAM and reduce proxy slots now to see whether this helps.

Regards,
Michael
Delo123
Veteran
Posts: 361
Liked: 109 times
Joined: Dec 28, 2012 5:20 pm
Full Name: Guido Meijers
Contact:

Re: REFS 4k horror story

Post by Delo123 »

Good... :) Please keep us informed :)
richardkraal
Service Provider
Posts: 13
Liked: 2 times
Joined: Apr 05, 2017 10:48 am
Full Name: Richard Kraal
Contact:

Re: REFS 4k horror story

Post by richardkraal »

richardkraal wrote:I've upgraded my dedicated backend (SMB share server hosting the ReFS volume) from 64 GB to 384 GB and backups are running fine now.
The gateway server is running on a different dedicated machine.
Also, the perfmon logs seem to be fine now (no gaps in the logs), and the system no longer locks up. RAMMap shows a Metafile usage of 50 GB (!).
Let's see what happens over the next few weeks.

fingers crossed

Hardware used:
DL380 Gen9, dual CPU, 384 GB RAM, 12x 8 TB, P841/4 GB (64 KB stripe, RAID 6). Windows Server 2016, ReFS 64K.
This evening one of the jobs had the same issue again...

8-5-2017 22:53:48 :: Synthetic full backup creation failed Error: Agent: Failed to process method {Transform.CompileFIB}: The handle is invalid.
Failed to duplicate extent. Target file: \\xxx\xxx\xxx.vbk, Source file: \\xxx\xxx\xxx.vbk, TgtOffset: 38785449984, SrcOffset: 38812909568, DataSize: 327680

At the moment the issue occurred:
- the disk latencies are fine (~20 ms average)
- the CPU of the file server is fine
- 256 GB of free memory (yes, 256 GB)
- other backup jobs are running at that moment, combined throughput ~500 MB/s
- ReFS/Explorer is responding slowly


My feeling is that ReFS is doing some strange things...
Updated my Veeam case; I hope they will contact me now. Last contact was on 27 April (plus an automated message on 1 May).

case ID# 02134458
Gostev
Chief Product Officer
Posts: 31524
Liked: 6700 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev »

@Richard, honestly it does not seem like your issue has anything to do with the issue discussed in this thread. It could be a simple I/O error due to heavy concurrent load on the target volume. By the way, you may consider increasing your RAID stripe size to 128 KB or 256 KB to better match the Veeam workload (avg. block size is 512 KB); this will cut IOPS from the backup job significantly (and your backup storage's IOPS capacity is not fantastic, so it could really use this).
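
To make the arithmetic concrete (a rough model I'm assuming here, not Veeam's exact I/O pattern): a single ~512 KB Veeam block written to a RAID volume turns into roughly one write per stripe it touches, so a larger stripe means fewer write I/Os per block.

# Back-of-the-envelope only: writes per ~512 KB Veeam block for a few
# RAID stripe sizes, assuming one write I/O per stripe touched.
AVG_VEEAM_BLOCK_KB = 512

for stripe_kb in (64, 128, 256):
    ios = -(-AVG_VEEAM_BLOCK_KB // stripe_kb)  # ceiling division
    print(f"{stripe_kb:>3} KB stripe -> ~{ios} write I/Os per block")

# 64 KB stripe -> ~8 I/Os, 128 KB -> ~4, 256 KB -> ~2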
alesovodvojce
Enthusiast
Posts: 61
Liked: 9 times
Joined: Nov 29, 2016 10:09 pm
Contact:

Re: REFS 4k horror story

Post by alesovodvojce »

We opened an MS support ticket (#117050215676939) with the goal of increasing the overall bug priority (and, as a side effect, we can give the ReFS team whatever they need for the investigation).
But our ticket seems to have had no effect on prioritization, nor on our chances of becoming a contact sample for the ReFS team. And in our case, maybe no effect at all.

Today's reply:
Unfortunately, what you are expecting from our support is something we cannot perform. Third parties have access to TSAnet, which is a specific support channel for integration between Microsoft products and external software. That team is really above our level and we do not have a communication line with them.

Also, if the developers/product team is investigating an issue like this one (which is another team to which we do not have access), there is nothing we can do. Normally, when we identify a bug, we escalate it to a Technical Advisor so they can share it with the developers through internal tools. In case there is already an open investigation for a bug, nothing else is done. What we do for our customers when we reach that point (which is the status we are at) is to inform them that the developers are aware of it. Then customers must wait for a hotfix for the incident, which will be delivered through regular updates (if there is any possible solution).

That being said, I have been in touch with my Technical Lead, who has instructed me to archive this Service Request as the investigation is underway. Of course, that will be done without charging you any costs.
The questions are:
- Does Veeam have more leverage (via TSAnet)?
- If one or more members of this thread are in contact with MS, is it worth trying any further?
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » 1 person likes this post

My ticket's (117040315547198) current status is that the manual memory dumps I took and uploaded to MS are being examined by the "debug team". At this point the tier-2 support engineer on the ticket "will not be able to comment if this is the Bug With REFS, till we hear back from the debug team" (though obviously it is, since I can see the ReFS kernel driver consume all the memory by using poolmon and looking at the tags when the issue occurs; I understand they have to confirm these things).

So from what it sounds like, my memory dumps aren't being looked at by the ReFS team yet, pending another team running the dumps through windbg, etc. There have been long (weeks-long) delays in correspondence on my ticket. It took a few weeks just to get them to send me a link to somewhere to upload the memory dumps. I'd hope the priority on the debug portion of this is high, considering we're talking about a massive problem in the underlying storage filesystem code itself, but... Anyway, they said they'll keep me updated on the results from the debug team. I'll keep everyone posted.
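
While waiting on MS, I've been running a crude watchdog on the repository server so the memory drain is visible before the box locks up. It's no substitute for poolmon/RAMMap (it only sees total available physical memory, not pool tags); just a quick sketch assuming Python is available on the box.

# Crude watchdog: log available physical memory once a minute so a
# runaway ReFS metadata working set is visible before the server hangs.
import ctypes, time

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [("dwLength", ctypes.c_ulong),
                ("dwMemoryLoad", ctypes.c_ulong),
                ("ullTotalPhys", ctypes.c_ulonglong),
                ("ullAvailPhys", ctypes.c_ulonglong),
                ("ullTotalPageFile", ctypes.c_ulonglong),
                ("ullAvailPageFile", ctypes.c_ulonglong),
                ("ullTotalVirtual", ctypes.c_ulonglong),
                ("ullAvailVirtual", ctypes.c_ulonglong),
                ("ullAvailExtendedVirtual", ctypes.c_ulonglong)]

def available_gib():
    stat = MEMORYSTATUSEX()
    stat.dwLength = ctypes.sizeof(stat)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))
    return stat.ullAvailPhys / 2**30

while True:
    print(f"{time.strftime('%H:%M:%S')}  available physical: {available_gib():.1f} GiB")
    time.sleep(60)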
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

Hi

I have some questions I'm hoping someone can answer.

1. On the page with the patch notes from MS, they list 3 options.
Are the options additive, so I can start with Option 1, then add Option 2 and then the most aggressive Option 3?
Is there any way to check that the actual patch is in place and that the registry values are being "used"?

2. Is there any way to check the ReFS internals and see when it is performing the above tasks?

As it is now, we are experiencing bad performance on our ReFS volumes; we have run Windows Update in the "hope" that the patch is installed and that it is accepting the registry values.
It's a bit like walking around a dark room trying to find the light switch :)
graham8
Enthusiast
Posts: 59
Liked: 20 times
Joined: Dec 14, 2016 1:56 pm
Contact:

Re: REFS 4k horror story

Post by graham8 » 2 people like this post

Nilsn wrote:Are the options additive
Yes, I'm told they are.
Nilsn wrote:As it is now we are experiencing bad performance on our ReFS volumes
If you're only experiencing bad performance, I don't think you have the issue that this thread describes and that the patch is trying (unsuccessfully) to resolve. If anything, my understanding is that those options would reduce your performance.
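
Regarding your other question, checking that the registry values are in place: here's a minimal read-only sketch. The value names are the ones I believe the MS KB uses for Options 1-3, so treat them as assumptions; it can only tell you whether the values exist, not whether the patched refs.sys actually honours them.

# Read-only check for the ReFS tuning values from the MS patch.
import winreg

FS_KEY = r"SYSTEM\CurrentControlSet\Control\FileSystem"
NAMES = ("RefsEnableLargeWorkingSetTrim",   # Option 1 (assumed name)
         "RefsNumberOfChunksToTrim",        # Option 2 (assumed name)
         "RefsEnableInlineTrim")            # Option 3 (assumed name)

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FS_KEY) as key:
    for name in NAMES:
        try:
            value, _ = winreg.QueryValueEx(key, name)
            print(f"{name} = {value}")
        except FileNotFoundError:
            print(f"{name} is not set")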
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

hey there,

I've been experiencing decreasing performance on weekly synthetic fulls... for example, on the Exchange VM job:

4 weeks ago : took 1 hour
3 weeks ago : took 4 hours
2 weeks ago : took 8 hours
last weekend : 13 hours

Will try some of the reg thingies, but I don't have a good feeling about this :?
JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS 4k horror story

Post by JimmyO » 1 person likes this post

Exactly the same scenario as I have! (About the same times, too.) The only difference is that I do forward incremental forever and merge daily (with the daily merge taking about as long as your weekly synthetic full).

The only difference from 4 weeks ago until now is that I have installed the latest Server 2016 updates (the May update). Of course, we can also expect fragmentation after many runs, but according to MS this shouldn't be an issue with ReFS.

It was a huge job for me to go from NTFS to ReFS since I have a lot of data (350 TB). Now ReFS seems to mess everything up. I have 200 GB of RAM in my server, and about half of it is available, so it's not the ReFS memory issue (also, I'm using 64 KB clusters).

What's happening here? Does Veeam work closely with MS to resolve this? Who knows where it may end up...
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

After installing this month's updates and setting RefsEnableLargeWorkingSetTrim to 1, I triggered a synthetic full, which took only 2 hours.

Not sure if there's a correlation here; from my understanding, the reg tweaks are supposed to reduce memory usage, not have an impact on performance.
JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: REFS 4k horror story

Post by JimmyO »

In my experience, restarting the server speeds up performance for a day or two, then it gets worse again.
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

Exactly the same behavior we are seeing at the moment.
The ReFS volume is almost unreachable in the morning; after restarting the proxy, the volume is more responsive.

64K blocks.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

JimmyO wrote:In my experience, restarting the server speeds up performance for a day or two, then it gets worse again.
not really reassuring :shock:


BTW, 64K clusters here as well.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » 1 person likes this post

In our case the synthetics are still very fast after 4 weeks - it really feels like faster backend storage and lots of RAM solved the issue for us!
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

How big are the volumes you are running, if I might ask?
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

72 TB here (27 TB used)

Veeam ONE shows 88 TB worth of full backups and 7.4 TB of increments.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

mkretzer wrote:In our case the synthetics are still very fast after 4 weeks - it really feels like faster backend storage and lots of RAM solved the issue for us!
I saw your previous posts about this and plan to buy more RAM for the Veeam boxes (currently both at 64 GB), but in the meantime I'm wondering if I should switch synthetics to monthly instead of weekly.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

After re-reading the whole thread, I see that some people suggest that running jobs in parallel makes things worse.

Until now I had all my synthetics set to the same day, and a few weeks ago I added a big, nasty (10 TB) file-sharing VM (which was previously handled by a NetApp box), so that could have been the trigger for my degrading performance...

I've now changed my jobs to create synthetics on different days, and for the sake of testing I removed the RefsEnableLargeWorkingSetTrim value that I enabled yesterday...

Will report back ASAP.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer »

192 TB, 100 TB used
dmayer
Influencer
Posts: 18
Liked: 9 times
Joined: Apr 21, 2017 6:16 pm
Full Name: Daniel Mayer
Contact:

Re: REFS 4k horror story

Post by dmayer »

We are going to be speccing out our first VBR system for a customer, and I'm not really sure ReFS will be the way to go given these issues; it would mainly be for the savings on synthetics. We're looking at an 8-12 TB repo using one concurrent task, probably around a 1.5-2 TB full backup with maybe 25-30 GB of daily changes. The current solution doesn't dedupe Windows files and such, so the full will probably end up smaller. Would 32 GB of RAM work for this, or should we just stick with good old NTFS for now? We haven't specced out the hardware, but I'm not keen on tossing a ton of RAM into the solution just to work around a Microsoft issue.
DaveWatkins
Veteran
Posts: 370
Liked: 97 times
Joined: Dec 13, 2015 11:33 pm
Contact:

Re: REFS 4k horror story

Post by DaveWatkins »

Our server has only 32 GB and we've got about 180 TB of total space. Only about 60-70 TB of that is used, but our daily change rate is more than 20-30 GB, so you'll probably be fine. Hard to say definitively, of course, but we don't have any blue-screen issues anymore with all the 2016 updates applied and the reg key set.
dmayer
Influencer
Posts: 18
Liked: 9 times
Joined: Apr 21, 2017 6:16 pm
Full Name: Daniel Mayer
Contact:

Re: REFS 4k horror story

Post by dmayer »

Dave,

Thanks for the reply. I was originally going to spec the machines with 16 GB, but I might do 32 GB to be safe, and I was definitely going to use the reg tweaks. This would be our first VBR deployment for a customer and I don't want it to go south. I was hoping to use ReFS for the space savings and not have to load up on drives; we deal mostly with SMBs, so budgets can be tight.
antipolis
Enthusiast
Posts: 73
Liked: 9 times
Joined: Oct 26, 2016 9:17 am
Contact:

Re: REFS 4k horror story

Post by antipolis »

antipolis wrote: I've now changed my jobs to create synthetics on different days, and for the sake of testing I removed the RefsEnableLargeWorkingSetTrim value that I enabled yesterday...

Will report back ASAP.
Creating my synthetics on different days seems to have solved the issue: last night the fast cloning part of my Exchange job lasted only 1 hour, which is roughly the same as what I had a few months ago.

As a side note, if I set RefsEnableLargeWorkingSetTrim to 1, fast cloning takes 2 hours.
Gostev
Chief Product Officer
Posts: 31524
Liked: 6700 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: REFS 4k horror story

Post by Gostev » 2 people like this post

All, I've received the second set of test fixes from Microsoft. According to the description, one of them addresses the scenario I suspected could be causing the issue. Since the only lab where we've managed to reproduce the issue (the personal lab of one of our engineers) is no longer available, our support will be reaching out to some of those with open Veeam support cases to see if it helps. If you want to participate, post your Veeam support case ID here (I will delete your post once I forward your case ID to support). Thanks!
Nilsn
Novice
Posts: 9
Liked: never
Joined: Sep 24, 2015 9:12 am
Full Name: Nils
Contact:

Re: REFS 4k horror story

Post by Nilsn »

Got some more weird behavior today.
Yesterday I extended a ReFS volume from 70 TB to 80 TB; today when I checked the nightly backups, 19 jobs were still running.
The volume is "sluggish" to access in Windows on the proxy.
60 GB of RAM and 4 vCPUs are assigned to the proxy VM.

Anyone else seeing "slow" performance who has recently extended the size of the volume?

We are planning to begin the extremely tedious job of rolling 140 TB of backup data back to NTFS...
rfn
Expert
Posts: 141
Liked: 5 times
Joined: Jan 27, 2010 9:43 am
Full Name: René Frej Nielsen
Contact:

Re: REFS 4k horror story

Post by rfn » 1 person likes this post

We switched to Veeam Backup & Replication 9.5 Update 1 (from another product) in March/April and opted to use ReFS for the space savings. We installed it on an HPE DL380 Gen9 with 32 GB RAM and around 55 TB of storage. We're backing up around 100 VMs from vSphere 6.0 Update 3. We installed the Windows update that was supposed to fix the ReFS problems and implemented solution 1.

It ran well for a week or two, but then the server began to become unresponsive. It would still ping, so at first we didn't know there was a problem, but Veeam ONE did report that it wasn't getting any backup information from the server. I tried to access the server through iLO and physically at the console, but there was no video signal and it didn't respond to anything. A hard reset was the only solution. Since then it has happened many times at varying intervals; most of the time it ran for some days, maybe a week, before becoming unresponsive again, but it could also happen in just one day. We tried upgrading the RAM to 192 GB, but it didn't really help.

What has helped is updating the server with the latest Service Pack for ProLiant, version 2017.04.0 (21 Apr 2017). It updated the BIOS, some firmware and some drivers. Since I did that, it has been rock solid; it has now been running for two weeks, and history suggests it would have crashed by now, so I'm really hoping this has fixed our issue. We originally suspected that the issue was ReFS, but it might not have been, or else HPE has fixed some incompatibility between their hardware/software and ReFS.

This is just meant as a heads-up for other people with a similar setup. If you're running HPE ProLiant Gen9 servers and experiencing problems, then try this update.

EDIT: We're using 64K blocks on ReFS.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: REFS 4k horror story

Post by mkretzer » 1 person likes this post

Now, after 4 weeks of a very stable filesystem, we also hit the problem of the system getting extremely slow - so slow it only wrote at 1-5 MB/s instead of 200+ MB/s. At first we thought Veeam Update 2 was the problem and opened a case, but then we found that a simple copy operation showed the same issue. A reboot "solved" it for now.

@Gostev: is the fix also for the "filesystem getting slow over time" issue?