-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
REFS issues (server lockups, high CPU, high RAM)
[UPDATE] October 15, 2018
The solution is to install September 2018 Windows Update (KB4343884) or later, since Windows Updates are cumulative.
Hello,
I have posted several threads already in the last few days, but I have to post again about what happened to us last night.
First of all, we are in the middle of migrating to ReFS repos. We made the mistake of using 4K blocks on our temporary 120 TB repo. We thought it was no big deal, as at first it seemed to affect only the performance of file operations. We monitored memory and CPU usage and did not see the memory pressure others saw, because the system is thankfully oversized. So we continued to migrate to the new repos successfully.
All went well for a few days. We have to wait 28 days before we can format our "production" backup storage, and we were optimistic that we would "survive" that time because of the ReFS space savings.
Then I got a message from our monitoring system last night: our Veeam server was completely unreachable. I went on-site and found that I could move the mouse but not much more, so I had to do a hard reset. After the system came up I saw that it was trying to create 3 synthetic fulls at the same time, plus a tape backup and some copy jobs. All in all nothing unusual; this had worked well the nights before. So I disabled the tape job, set a limit of 12 concurrent tasks on the repos (before, there was no limit) to regulate the load a little, and drove back home.
10 minutes later the next alert came in, so we had another crash. I drove back to the company, did a hard reboot and then limited the ReFS repos to 1 concurrent task, so that at least our BCJs can finish at some point in the future. I also started to roll back to our old NTFS repository with active fulls, which I have to do for 1600 machines/140 TB.
Opening an Explorer window on the ReFS volume now takes half a minute even without any load, so it is definitely the ReFS volume that has issues...
BTW I opened a sev1 case with MS. No response yet...
Markus
-
- Enthusiast
- Posts: 82
- Liked: 11 times
- Joined: Nov 11, 2016 8:56 am
- Full Name: Oliver
- Contact:
Re: REFS 4k horror story
thx for sharing this!
Would appreciate it, if you can update us on the status!
regards
oliver
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
MS called. Interestingly, MS seems to know about the 4K issues; at least the engineer told me he had heard something about issues...
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Ok this hotfix was recommended: https://support.microsoft.com/en-us/hel ... -kb3216755
Anyone already tried this? I asked for more information about this hotfix...
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
Well we're glad we're not the only ones having these issues.
veeam-backup-replication-f2/9-5-refs-se ... 25-15.html
We can also confirm that, although memory usage seemed better at first, the patch does not solve the problem. Even our 64 TB LUNs formatted with 64KB clusters are seeing these symptoms. Performance is very poor as well.
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
@rendest are your 64K volumes on the same server as the 4K volumes?
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
Not anymore since they were taking the kernel hostage.
We have now isolated the 4K volumes on a separate host and are migrating the data to the 64K ones.
-
- Expert
- Posts: 172
- Liked: 20 times
- Joined: Oct 03, 2016 12:41 pm
- Full Name: Robert
- Contact:
Re: REFS 4k horror story
I am also migrating to ReFS. So reading the forum, I assume it is absolutely best to use 64K volumes and stay away from 4K?
-
- Product Manager
- Posts: 20343
- Liked: 2281 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: REFS 4k horror story
Correct.
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Robvil wrote: I am also migrating to ReFS. So reading the forum, I assume it is absolutely best to use 64K volumes and stay away from 4K?
And from all I have read in the past 36 hours, you should test it really well before you throw all your backups on it... In our case everything looked great until there was a larger number of files on the disk...
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
OK, I just got a very long email from Microsoft with a lot of links, where the general recommendation is "use NTFS because ReFS has many limitations". Only one thing was directly targeted at our situation:
"You should avoid volumes bigger than 64 TB". I find this pretty bad because SOBR is not an option for us at the moment, as we also had some issues with per-VM. And right now we have quite big backup files... For us, bigger volumes are a must-have right now; if we split our 200 TB backup repo into 4 ReFS repos we might lose a lot of the ReFS space saving benefits...
Markus
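For anyone following the arithmetic in the post above: the "4 repos" figure is what you get when a 200 TB repository has to fit under a 64 TB per-volume ceiling. A minimal Python sketch of that calculation follows. The function names are made up for illustration, and the even split is an assumption; nothing in the thread says how the volumes would actually be sized.

```python
import math

def min_volumes(total_tb: float, max_volume_tb: float) -> int:
    """Smallest number of volumes such that none exceeds max_volume_tb."""
    return math.ceil(total_tb / max_volume_tb)

def even_split(total_tb: float, max_volume_tb: float) -> tuple[int, float]:
    """Return (volume count, size per volume in TB) for an even split.

    Assumption: capacity is divided equally across the volumes.
    """
    count = min_volumes(total_tb, max_volume_tb)
    return count, total_tb / count

if __name__ == "__main__":
    count, size_tb = even_split(200, 64)
    print(f"{count} volumes of {size_tb} TB each")
```

With the numbers from the post this gives 4 volumes of 50 TB each. The space-saving concern is that backup chains placed on different volumes cannot share blocks with each other.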
-
- Chief Product Officer
- Posts: 31630
- Liked: 7128 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
Markus, can you share the Microsoft support case ID where this was stated? I wonder if the development team behind ReFS agrees with this statement, or perhaps this is the opinion of the specific support engineer who is simply trying to close the case, as often happens. The best way to find out is to ask the dev team behind ReFS directly, which I can easily do. Thanks!
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Gostev,
that would be so great - number is 117020115253831
Especially the 64 TB thing is kind of a deal breaker for us...
Markus
-
- Service Provider
- Posts: 56
- Liked: 14 times
- Joined: Jan 10, 2012 8:53 pm
- Contact:
Re: REFS 4k horror story
perhaps he meant avoid >64TB partitions *while using 4k cluster size* ? Because that, while not directly, sort of lines up with the inertia that veeam and microsoft have about using 64k for really large volumes.
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
No. In his mail there was not one mention of the 4K cluster size... That is also a reason I am kind of cautious about this recommendation.
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
mkretzer wrote: No. In his mail there was not one mention of the 4K cluster size... That is also a reason I am kind of cautious about this recommendation.
We more or less found a way to keep the filesystem from getting stuck as shown in the screenshot below.
Edit: Yes it looks like I just drew a white box, it's actually white-space where the windows kernel forgets the disk is ReFS.
We noticed that the backup repos (now newly formatted with 64KB cluster size) still cause the filesystem to become unresponsive, so we tried throttling the repositories to a lower throughput. Surprisingly, this significantly improved our performance. Since the volume doesn't become unresponsive, Veeam can now back up consistently without being interrupted by the unresponsiveness of the volume.
The throughput is still significantly lower than it would have been on NTFS (we are going to log a case for this as well), but at least it's stable.
We suspect that the smaller block size on the previously formatted volume caused the volume to get stuck even faster (in fact, 16x faster). We are backing up from all-flash storage arrays, so our bottleneck is almost always our destination storage target.
We are currently monitoring the incoming I/Os, and as soon as the volume reaches its limit and causes storage latency on the backup target, the filesystem becomes unresponsive. So throttling temporarily circumvents this issue. This, however, isn't a permanent solution since, even with storage latency, the filesystem should keep on working. A 20-30 ms hiccup on the storage LUN causes 20 seconds of unresponsiveness on the ReFS volume, which in turn brings Veeam to a halt...
@Mkretzer, have you tried throttling as well ?
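The 16x figure in the post above lines up with the ratio of the two cluster sizes (64 KB / 4 KB). A rough Python sketch of the cluster counts is below. The 64 TiB volume size is only an illustrative assumption (the ratio is the same at any size), and the idea that ReFS overhead scales with the number of clusters is the poster's suspicion, not something this calculation confirms.

```python
KIB = 1024
TIB = 1024 ** 4

def cluster_count(volume_bytes: int, cluster_bytes: int) -> int:
    """Number of whole clusters that fit on a volume of the given size."""
    return volume_bytes // cluster_bytes

# Assumption: a 64 TiB volume, chosen only as an example size.
VOLUME_BYTES = 64 * TIB

clusters_4k = cluster_count(VOLUME_BYTES, 4 * KIB)
clusters_64k = cluster_count(VOLUME_BYTES, 64 * KIB)
ratio = clusters_4k // clusters_64k

if __name__ == "__main__":
    print(f"4K clusters:  {clusters_4k}")
    print(f"64K clusters: {clusters_64k}")
    print(f"ratio:        {ratio}x")
```

So a 4K-formatted volume has 16 times as many clusters to track as the same volume formatted with 64K clusters.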
-
- Enthusiast
- Posts: 62
- Liked: never
- Joined: Nov 03, 2011 2:55 pm
- Full Name: Ivor Dillen
- Contact:
Re: REFS 4k horror story
Maybe a confirmation.
I have 2 repos with 64KB cluster size and was testing some backup jobs and some backup copy jobs. Everything was acting normally until I ran 2 backup copy jobs at the same time (to the same repo). Then I saw drops (Veeam job timeline) in both jobs at the same time. In the Windows Resource Monitor I saw a lot of writes at disk level but no file (and the memory consumption went straight up). Stopping one of the jobs allowed the other job to proceed as normal.
Ivor
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
@rendest No, we did not try to throttle, as our main problem was that with 4K and without the (bad) patch the whole system crashes. But you might be right, as the unresponsiveness happened as soon as there was some kind of load.
Do you already have KB3216755 installed? I have the feeling this update makes the volume much more stable under load - but crashes Veeam services after 12 hours or so...
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
Good to hear, so now it's up to Veeam to patch whatever Microsoft broke. Since you mentioned it causes Veeam to crash, and our setups are identical, we are waiting for feedback from Veeam before attempting to install the patch.
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
KB4010672 doesn't seem to fix the time-out issues when experiencing latency... so throttling it is for now
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
But can you throttle a fast-clone? In our system the fast-clone still led to high load and the problem described here.
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
mkretzer wrote: But can you throttle a fast-clone? In our system the fast-clone still led to high load and the problem described here.
As I understood it, fast clones are just commands to move block pointers, so that shouldn't be that intensive.
But there are other maintenance tasks, which do not follow the throttling (for example cleanup/rollback tasks after a failed backup). Those were quite intensive & took our repository hostage overnight.
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
So sadly this is not a good solution for us...
@Gostev: do you see any efforts from Microsoft to get these strange latency issues under control? Was this reproduced by Veeam with 64K blocks?
-
- Enthusiast
- Posts: 62
- Liked: never
- Joined: Nov 03, 2011 2:55 pm
- Full Name: Ivor Dillen
- Contact:
Re: REFS 4k horror story
on what numbers do you throttle?
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
10 MB/s below the point where ReFS craps its pants. Depends on the array.
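A rough way to turn that rule of thumb into a number, sketched in Python. Everything here is an assumption for illustration: the sample measurements are invented, the 20 ms latency limit is borrowed from the 20-30 ms hiccups mentioned earlier in the thread, and the function is not part of Veeam or any other tool.

```python
from typing import Iterable, Optional, Tuple

def throttle_limit(
    samples: Iterable[Tuple[float, float]],
    latency_limit_ms: float,
    margin_mb_s: float = 10.0,
) -> Optional[float]:
    """Pick a throughput cap a fixed margin below the saturation point.

    samples: (throughput in MB/s, observed target latency in ms) pairs.
    Returns the cap in MB/s, or None if no sample exceeded the latency limit.
    """
    # Walk the samples from lowest to highest throughput and stop at the
    # first one where the backup target's latency goes over the limit.
    for throughput, latency in sorted(samples):
        if latency > latency_limit_ms:
            return max(throughput - margin_mb_s, 0.0)
    return None

# Hypothetical measurements, not taken from any real array.
SAMPLES = [(100, 4), (200, 6), (300, 9), (400, 25), (500, 60)]

if __name__ == "__main__":
    print(throttle_limit(SAMPLES, latency_limit_ms=20))
```

With these made-up samples latency first crosses 20 ms at 400 MB/s, so the cap comes out 10 MB/s lower. As the post says, the real numbers depend on the array.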
-
- Enthusiast
- Posts: 62
- Liked: never
- Joined: Nov 03, 2011 2:55 pm
- Full Name: Ivor Dillen
- Contact:
Re: REFS 4k horror story
We need a latency throttling feature on the repository side instead of the source side.
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: REFS 4k horror story
"Especially the 64 TB thing is kind of a deal breaker for us..."
Just catching up here: why is the 64 TB a deal breaker? You can have a lot of thin-provisioned volumes with Storage Spaces, for example.
Ps. the 64TB "limit" is a VSS limit...
-
- Influencer
- Posts: 20
- Liked: 6 times
- Joined: Feb 01, 2017 8:36 pm
- Full Name: Stef
- Contact:
Re: REFS 4k horror story
Delo123 wrote:
Just catching up here: why is the 64 TB a deal breaker? You can have a lot of thin-provisioned volumes with Storage Spaces, for example.
Ps. the 64TB "limit" is a VSS limit...
The 64 TB volumes aren't relevant anymore, since we're experiencing these issues at any LUN size.
What mkretzer means is that we'd rather have larger jobs on larger volumes for more space savings. (so no per-vm backup file or scale out repo)
-
- Veeam Legend
- Posts: 1192
- Liked: 412 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Our problem is that we have BIG backup files (up to 8 TB), and they just would not fit very well on such small volumes with all the incrementals.
Per-VM backup files are no solution for us right now...
And BTW storage spaces is not supported on disks behind RAID/FC/SAN controllers... Or did that change with 2016?
-
- Service Provider
- Posts: 128
- Liked: 27 times
- Joined: Apr 01, 2016 5:36 pm
- Full Name: Olivier
- Contact:
Re: REFS 4k horror story
Out of curiosity, are you using a hardware RAID controller, attached storage, or JBOD with a storage pool?
Olivier