- Expert
- Posts: 227
- Liked: 46 times
- Joined: Oct 12, 2015 11:24 pm
Re: REFS 4k horror story
FWIW, we are a Cloud Connect provider and found 'Option 1' of the memory usage fix didn't work for us, as our storage is never idle.
Option 2 is looking good so far with a value of 128.
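In case it helps anyone trying the same thing, this is roughly how we set it (a sketch only; it assumes Option 2 corresponds to the RefsNumberOfChunksToTrim value and Option 1 to RefsEnableLargeWorkingSetTrim from Microsoft's ReFS tuning guidance, so double-check the value names against the MS patch notes page for your build, and reboot afterwards to be sure it takes effect):

# 'Option 2': raise the number of ReFS metadata chunks trimmed per pass (128 in our case)
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
    -Name 'RefsNumberOfChunksToTrim' -PropertyType DWord -Value 128 -Force

# 'Option 1' would instead be the working-set trim switch:
# New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
#     -Name 'RefsEnableLargeWorkingSetTrim' -PropertyType DWord -Value 1 -Force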
- Influencer
- Posts: 14
- Liked: never
- Joined: Apr 24, 2015 1:40 pm
Re: REFS 4k horror story
tsightler wrote: Deleting files definitely seems to be one of the big triggers. In my testing that was always the point where Windows seemed to go crazy, either when Veeam was deleting lots of files or even if I just started deleting lots of block-cloned files manually. I've almost wondered if it would be worthwhile to throttle file deletions on ReFS until Microsoft gets to the root of this problem.

This is exactly the timing of my 4K lock-up issue. My jobs are Backup Copies, and during normal synthetics it doesn't lock up, but when it performs the GFS delete and then the weekly roll, it only gets through a certain percentage (44% on one job in particular) before the OS goes to 100% CPU. After disabling the job, rebooting the OS, letting it sit for about an hour to think and clean up, and then re-enabling the job, it is able to finish successfully.

Gostev wrote: One theory I have that would explain why some users have issues and others don't is the difference in the amount of concurrent tasks. So one troubleshooting step for those experiencing lockups would be to reduce the amount of concurrent tasks on the repository by half and see if that makes any difference to stability. Perhaps even change it to 1 task if you have a lab environment where this issue reproduces. Thanks!

I have 2 different 4K repos, one set to 2 concurrent tasks and the other set to 4, and both are experiencing the 100% lock-up. Also, the repository I mentioned in my other post that is limited to 2 concurrent tasks has only 1 job writing to it, and that job only has 3 VMs in it, though 2 of those 3 VMs are huge file stores, over 3TB each.
- Enthusiast
- Posts: 82
- Liked: 4 times
- Joined: Sep 29, 2011 9:57 am
Re: REFS 4k horror story
Hi Guys,
just found the thread after opening a new one. Seems we're affected by the same issue:
vmware-vsphere-f24/slow-active-fulls-on ... 42775.html
Will install more RAM and reduce proxy slots now to see whether this helps.
Regards,
Michael
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
Re: REFS 4k horror story
Good... Please keep us informed
- Service Provider
- Posts: 13
- Liked: 2 times
- Joined: Apr 05, 2017 10:48 am
- Full Name: Richard Kraal
Re: REFS 4k horror story
richardkraal wrote: I've upgraded my dedicated backend (the SMB share server hosting the ReFS volume) from 64GB to 384GB and backups are running fine now. The gateway server is running on a different dedicated machine. The perfmon logs also look fine now, no gaps in the logs, and the system does not lock up anymore. RamMap shows a Metafile usage of 50GB (!). Let's see what happens over the next weeks, fingers crossed.

Hardware used: DL380 Gen9, dual CPU, 384GB RAM, 12x 8TB, P841/4GB (64k stripe, RAID 6), Win2016 ReFS 64k.

This evening one of the jobs had the same issue again:

8-5-2017 22:53:48 :: Synthetic full backup creation failed Error: Agent: Failed to process method {Transform.CompileFIB}: The handle is invalid.
Failed to duplicate extent. Target file: \\xxx\xxx\xxx.vbk, Source file: \\xxx\xxx\xxx.vbk, TgtOffset: 38785449984, SrcOffset: 38812909568, DataSize: 327680

At the moment the issue occurred:
- the disk latencies looked fine (~20ms average)
- CPU of the file server was fine
- 256GB of free memory (yes, 256GB)
- other backup jobs were running at that moment, combined throughput ~500MB/s
- ReFS/Explorer was responding slowly

My feeling says ReFS is doing some strange things...
I updated my Veeam case and hope they will contact me now; the last contact was on 27 April (plus an automated message on 1 May).
Case ID# 02134458
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
Re: REFS 4k horror story
@Richard, honestly it does not seem like your issue has anything to do with the issue discussed in this thread. It could be a simple I/O error due to heavy concurrent load on the target volume. By the way, you may consider increasing your RAID stripe size to 128KB or 256KB to better match the Veeam workload (the average block size is 512KB, so a single write spans eight 64KB stripes but only two 256KB ones); this will cut the IOPS from the backup job significantly, and your backup storage IOPS capacity is not fantastic, so it could really use this.
- Enthusiast
- Posts: 63
- Liked: 9 times
- Joined: Nov 29, 2016 10:09 pm
Re: REFS 4k horror story
We opened an MS support ticket (#117050215676939) with the goal of raising the overall bug priority (and, as a side effect, so we can give the ReFS team whatever they might need for the investigation).
But our ticket seems to have had no effect on prioritization, nor on our chances of being a contact sample for the ReFS team. And in our case, maybe no effect at all.
Today's reply:

Unfortunately, what you are expecting from our support is something we cannot perform. Third parties have access to TSAnet, which is a specific support channel for integration between Microsoft products and external software. This team is really above our level and we do not have a communication line with them.
Also, if the developers/product team is investigating an issue like this one (which is another team to which we do not have access), there is nothing we can do. Normally, when we identify a bug, we escalate it to a Technical Advisor so they can share it with the developers through internal tools. In case there is already an open investigation for a bug, nothing else is done. What we do for our customers when we reach that point (which is the status we are at), is to inform them that the developers are aware of it. Then customers must wait for a hotfix for the incident, which will be delivered through regular updates (if there is any possible solution).
That being said, I have been in touch with my Technical Lead, who has instructed me to archive this Service Request as the investigation is on the move. Of course that would be done without charging you any costs.

The question is:
- does Veeam have bigger leverage (TSAnet)?
- if one or more members of this thread are in contact with MS, is it worth trying any further?
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
Re: REFS 4k horror story
The current status of my ticket (117040315547198) is that the manual memory dumps I took and uploaded to MS are being examined by the "debug team". At this point the tier-2 support engineer on the ticket "will not be able to comment if this is the Bug With REFS, till we hear back from the debug team" (though obviously it is, since I can see the ReFS kernel driver consuming all the memory when the issue occurs by running poolmon and looking at the pool tags; I understand they have to confirm these things).
So from what it sounds like, my memory dumps aren't being looked at by the ReFS team yet, pending another team running the dumps through windbg etc. There have been long, long (weeks-long) delays in correspondence on my ticket. It took a few weeks just to get them to send me a link to somewhere to upload the memory dumps. I'd hope that the priority on the debug portion of this is high, considering we're talking about a massive problem in the underlying storage filesystem code itself, but... Anyway, they said they'll keep me updated on the results from the debug team. I'll keep everyone posted.
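For anyone who wants to watch for the same symptom without waiting on a dump, this is roughly how I keep an eye on it (a sketch only: poolmon ships with the Windows Driver Kit, the counters below are standard perfmon counters, and the log path is just an example):

# poolmon.exe (from the Windows Driver Kit): press 'b' inside it to sort pool tags by bytes,
# then watch which tag keeps growing while the ReFS volume is under backup load.

# Log the generic memory counters every 30 seconds for an hour alongside the job:
Get-Counter -Counter @(
    '\Memory\Pool Nonpaged Bytes',
    '\Memory\Pool Paged Bytes',
    '\Memory\System Cache Resident Bytes'
) -SampleInterval 30 -MaxSamples 120 | ForEach-Object {
    $_.CounterSamples | ForEach-Object {
        '{0}  {1}  {2:N0}' -f $_.Timestamp, $_.Path, $_.CookedValue
    }
} | Out-File -Append C:\temp\refs-memory.log   # example path, make sure the folder exists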
- Novice
- Posts: 9
- Liked: never
- Joined: Sep 24, 2015 9:12 am
- Full Name: Nils
Re: REFS 4k horror story
Hi
I have some questions I'm hoping someone can answer.
1. On the page with the patch notes from MS they list 3 options.
Are the options additive, so I can start with Option 1, then add Option 2, and then the most aggressive Option 3?
Is there any way to check that the actual patch is in place and that the REG values are being "used"?
2. Is there any way to check the internals of ReFS and see when it is doing the above tasks?
As it is now we are experiencing bad performance on our ReFS volumes, and we have run Windows Update in the "hope" that the patch is installed and that it is accepting the reg values.
It's a bit like walking around a dark room trying to find the light switch.
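The closest I've come for question 1 is just confirming what's actually set (a rough sketch; it only shows whether any Refs* values exist under the FileSystem key and which updates are installed, not whether ReFS is actively honoring them):

# List any Refs* tuning values that have been set (nothing returned = nothing configured yet)
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' | Select-Object Refs*

# Show the most recently installed updates; compare against the KB number from the MS patch notes page
Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 10 HotFixID, Description, InstalledOn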
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
Re: REFS 4k horror story
Nilsn wrote: Are the options additive?

Yes, I'm told they are.

Nilsn wrote: As it is now we are experiencing bad performance on our ReFS volumes

If you're only experiencing bad performance, I don't think you have the issue that the patch is trying (unsuccessfully) to resolve and that this thread is describing. If anything, my understanding is that those options would reduce your performance.
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
hey there,
I've been experiencing steadily decreasing weekly synthetic full performance... for example, on the Exchange VM job:
4 weeks ago : took 1 hour
3 weeks ago : took 4 hours
2 weeks ago : took 8 hours
last weekend : 13 hours
will try some of the reg thingies but I don't have a good feeling about this
- Enthusiast
- Posts: 55
- Liked: 9 times
- Joined: Apr 27, 2014 8:19 pm
Re: REFS 4k horror story
Exactly the same scenario as mine (and about the same times, too). The only difference is that I do forward incremental forever and merge daily (with the daily merge taking about as long as your weekly synthetic full).
The only thing that changed between 4 weeks ago and now is that I installed the latest Server 2016 updates (the May update). Of course, we can also expect fragmentation after many runs, but according to MS this shouldn't be an issue with ReFS.
It was a huge job for me to go from NTFS to ReFS since I have a lot of data (350TB). Now ReFS seems to be messing everything up. I have 200GB of RAM in my server, with about half of it available, so it's not the ReFS memory issue (also, I'm using 64KB clusters).
What's happening here? Does Veeam work closely with MS to resolve this? Who knows where it may end up...
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
After installing this month's updates and setting RefsEnableLargeWorkingSetTrim to 1, I triggered a synthetic full which took only 2 hours.
Not sure if there's a correlation here; from my understanding the reg tweaks are supposed to reduce memory usage, not have an impact on performance.
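For reference, this is the value I set (a sketch, assuming the same FileSystem registry key as the other ReFS tuning options; RefsEnableLargeWorkingSetTrim is the name from the MS guidance):

# Enable the ReFS working-set trim tweak (DWORD 1); a reboot is the safest way to be sure it takes effect
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
    -Name 'RefsEnableLargeWorkingSetTrim' -PropertyType DWord -Value 1 -Force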
- Enthusiast
- Posts: 55
- Liked: 9 times
- Joined: Apr 27, 2014 8:19 pm
Re: REFS 4k horror story
From my experience, restarting the server speeds up performance for a day or two, then it gets worse again.
- Novice
- Posts: 9
- Liked: never
- Joined: Sep 24, 2015 9:12 am
- Full Name: Nils
Re: REFS 4k horror story
Exactly the same behavior we are seeing at the moment.
The ReFS volume is almost unreachable in the morning; after restarting the proxy the volume is more responsive.
64k blocks.
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
JimmyO wrote: From my experience, restarting the server speeds up performance for a day or two, then it gets worse again.

Not really reassuring...
BTW, 64k clusters here as well.
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
Re: REFS 4k horror story
In our case the synthetics are still very fast after 4 weeks; it really feels like faster backend storage and a lot of RAM solved the issue for us!
- Novice
- Posts: 9
- Liked: never
- Joined: Sep 24, 2015 9:12 am
- Full Name: Nils
Re: REFS 4k horror story
How big are the volumes you are running, if I might ask?
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
72 TB here (27 TB Used)
Veeam ONE shows 88 TB worth of full backups and 7.4 TB of increments.
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
mkretzer wrote: In our case the synthetics are still very fast after 4 weeks; it really feels like faster backend storage and a lot of RAM solved the issue for us!

I saw your previous posts about this and I plan to buy more RAM for the Veeam boxes (currently both at 64 GB of RAM), but in the meantime I'm wondering if I should switch synthetics to monthly instead of weekly.
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
After re-reading the whole thread I see that some people suggest running jobs in parallel makes things worse.
Until now I had all my synthetics set to the same day, and a few weeks ago I added a big, nasty (10 TB) file-sharing VM (which was previously handled by a NetApp box), so that could have been the trigger for my degrading performance...
I've now changed my jobs to create synthetics on different days, and for the sake of testing I removed the RefsEnableLargeWorkingSetTrim that I enabled yesterday...
will report back asap
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
Re: REFS 4k horror story
192 TB, 100 TB used
- Influencer
- Posts: 18
- Liked: 9 times
- Joined: Apr 21, 2017 6:16 pm
- Full Name: Daniel Mayer
Re: REFS 4k horror story
We are going to be speccing out our first VBR system for a customer, and I'm not really sure ReFS is the way to go given these issues; the appeal would be the savings on synthetics. We're looking at an 8TB-12TB repo using one concurrent task, with probably around a 1.5TB to 2TB full backup and maybe 25-30GB of daily changes; the current solution doesn't dedupe Windows files and such, so the full will probably end up smaller. Would 32GB of RAM work for this, or should we just stick to good old NTFS for now? We haven't specced out the hardware yet, but I'm not keen on tossing a ton of RAM into the solution just to work around a Microsoft problem.
- Veteran
- Posts: 370
- Liked: 97 times
- Joined: Dec 13, 2015 11:33 pm
Re: REFS 4k horror story
Our server only has 32GB and we've got about 180TB of total space. Only about 60-70TB of that is used, but our daily rate is more than 20-30GB, so you'll probably be fine. Hard to say definitively, of course, but we don't have any blue screen issues anymore with all the 2016 updates applied and the reg key set.
- Influencer
- Posts: 18
- Liked: 9 times
- Joined: Apr 21, 2017 6:16 pm
- Full Name: Daniel Mayer
Re: REFS 4k horror story
Dave,
Thanks for the reply. I was originally going to spec the machines out with 16GB, but I might do 32GB to be safe, and I was definitely going to use the reg tweaks. This would be our first VBR deployment for a customer and I don't want it to go south. I was hoping to use ReFS for the space savings and not have to load up a lot of drives; we deal mostly with SMBs, so budgets can be tight.
- Enthusiast
- Posts: 73
- Liked: 9 times
- Joined: Oct 26, 2016 9:17 am
Re: REFS 4k horror story
antipolis wrote: I've now changed my jobs to create synthetics on different days, and for the sake of testing I removed the RefsEnableLargeWorkingSetTrim that I enabled yesterday... will report back asap

Creating my synthetics on different days seems to have solved the issue: last night the fast-cloning part of my Exchange job only lasted 1 hour, which is roughly the same as what I had a few months ago.
As a side note, if I set RefsEnableLargeWorkingSetTrim to 1, then fast cloning lasts 2 hours.
- Chief Product Officer
- Posts: 31814
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
Re: REFS 4k horror story
All, I've received the second set of test fixes from Microsoft. According to the description, one of them addresses the scenario I suspected might be causing the issue. Since the only lab where we've managed to reproduce the issue (the personal lab of one of our engineers) is no longer available, our support will be reaching out to some of those with open Veeam support cases to see if it helps. If you want to participate, post your Veeam support case ID here (I will delete your post once I forward your case ID to support). Thanks!
- Novice
- Posts: 9
- Liked: never
- Joined: Sep 24, 2015 9:12 am
- Full Name: Nils
Re: REFS 4k horror story
Got some more weird behavior that happened today.
Yesterday I extended a ReFS volume from 70TB to 80TB; today when I checked the nightly backups, 19 jobs were still running.
The volume is "sluggish" to access in Windows on the proxy.
60GB RAM and 4 vCPUs are assigned to the proxy VM.
Is anyone else seeing "slow" performance after recently extending the size of the volume?
We are planning to begin the extremely tedious job of rolling back 140TB of backup data to NTFS...
- Expert
- Posts: 141
- Liked: 5 times
- Joined: Jan 27, 2010 9:43 am
- Full Name: René Frej Nielsen
Re: REFS 4k horror story
We switched to Veeam Backup & Replication 9.5 Update 1 (from another product) in March/April and opted to use ReFS for the space savings. We installed it on an HPE DL380 Gen9 with 32 GB RAM and around 55 TB of storage. We're backing up around 100 VMs from vSphere 6.0 Update 3. We installed the Windows update that was supposed to fix the ReFS problems and implemented solution 1.
It ran well for a week or two, but then the server began to become unresponsive. It would still ping, so at first we didn't know there was a problem, but Veeam ONE did report that it wasn't getting any backup information from the server. I tried to reach the server through iLO and physically at the console, but there was no video signal and it didn't respond to anything. A hard reset was the only solution. Ever since, it has done that many times at varying intervals; most of the time it ran for some days, maybe a week, before becoming unresponsive again, but it could also happen within a day. We tried upgrading the RAM to 192 GB, but it didn't really help.
What has now helped is updating the server with the latest Service Pack for ProLiant, version 2017.04.0 (21 Apr 2017). It updated the BIOS, some firmware and some drivers. Since I did that it has been rock solid; it has now been running for two weeks, and history suggests it would have crashed by now, so I'm really hoping this has fixed our issue. We originally suspected that the issue was ReFS, but it might not have been, or else HPE has fixed some incompatibility between their hardware/software and ReFS.
This is just meant as a heads-up to other people with a similar setup. If you're running HPE ProLiant Gen9 servers and experiencing problems, try this update.
EDIT: We're using 64K blocks on ReFS.
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
Re: REFS 4k horror story
Now, after 4 weeks of a very stable filesystem, we also hit the problem of the system getting extremely slow, so slow it only wrote at 1-5 MB/s instead of 200+ MB/s. At first we thought Veeam Update 2 was the problem and opened a case, but then we found out that a simple copy operation showed the same issue. A reboot "solved" it for now.
@gostev: Is the fix also for the "filesystem getting slow over time" issue?
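In case anyone wants to reproduce the comparison, this is roughly how the slow copy can be put into numbers (a sketch only; the file paths are just examples, substitute a multi-GB file and your own ReFS repository volume):

# Time a copy of a large existing file onto the ReFS volume and work out the throughput
$src = 'D:\temp\bigtestfile.vbk'        # example: any multi-GB file already on fast storage
$dst = 'R:\speedtest\bigtestfile.vbk'   # example: target on the ReFS repository volume
New-Item -ItemType Directory -Path (Split-Path $dst) -Force | Out-Null
$elapsed = Measure-Command { Copy-Item -Path $src -Destination $dst }
'{0:N1} MB/s' -f ((Get-Item $src).Length / 1MB / $elapsed.TotalSeconds)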