-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
@Gostev: Then MS should publish a recommendation (GB needed per TB of storage, or something like that).
I always find it odd that in the open source world it is well known and documented when something requires extensive resources (for example, btrfs deduplication), but Microsoft just doesn't seem to care.
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
Gostev wrote: I am more and more inclined to think that simply bumping RAM by a few GB is the way to go for everyone still having the issue even with the fix applied.
The copy repo I had the problem on over the weekend and yesterday was a 16GB server doing nothing else. That's not as much as our other servers, but it's not anemic, and the server was utterly destroyed by the issue when blockclone requests landed while anything else (like a crash-recovery scrub) was in operation. I mean, we have monitoring agents that send a single UDP heartbeat packet every few seconds. Even THAT stopped. The clock in the system tray halted and stopped updating time. It didn't manage to send off a single additional UDP packet for 2 consecutive days.
There's just no excuse for a server to literally halt the way this one does. Less memory should mean less performance in the days of multitasking operating systems and CPU/IO schedulers... not system halts. There's a very, very serious issue in the underlying code still. More memory might help, but what happens when something else running on the system consumes a little extra memory at some point, or multithreaded requests land in just the right way... the server pseudo-crashes again?
Sorry, but this is my nightmare scenario - that Microsoft might wipe the chalk off their hands and consider it done, and it ends up being one of those things that's terrible-but-unresolved for the next 2-3 iterations of Windows.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@Graham8
This may have already been asked, but are your repositories 4K or 64K?
Also, if you've applied Option 1 ('HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem') but then decide to use Option 2, does the previous option need to be removed from the registry?
I ended up having to blow away my repository and start over with no backups; all the crashing ended up causing my volume to fail. I didn't have time to do backup copy jobs because of how often my old setup was failing. I did see the system clock thing you saw - this behavior only started as my repository grew in backup size. Now I'm trying to learn from your experience to mitigate this issue going forward. I haven't even applied KB4013429 yet, for fear it will cause more issues. I think the time has come now that I'll need to apply it, though... This is a mess for sure; if it happens again I'll have to pull Veeam and go back to my previous backup solution.
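(For anyone unsure which cluster size an existing repo volume was formatted with, here's a quick way to check - a minimal sketch assuming drive letter E:; 4096 means the problematic 4K default, 65536 means 64K.)

```powershell
# Report filesystem type and cluster size of the repository volume
Get-Volume -DriveLetter E |
    Select-Object DriveLetter, FileSystemType, AllocationUnitSize

# fsutil reports the same value as "Bytes Per Cluster" on ReFS volumes
fsutil fsinfo refsinfo E:
```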
-
- Influencer
- Posts: 14
- Liked: 2 times
- Joined: Feb 02, 2017 2:13 pm
- Full Name: JC
- Contact:
Re: REFS 4k horror story
With my repo (32GB RAM, with the hotfix, with Option 1 and Option 2 set to 8) going non-responsive, it looks like the backup chain for my only production job got corrupted, and I had to start a clone to seed an active full. Due to some current space constraints, I had to delete all of my other, non-production backups to get a new active full for my production chains. Thankfully, I appear to still be able to do restores from the previous chain.
So it looks like I won't be able to offer much insight on my progress because the heavy lifting won't be done until this weekend.
I've now set the option 2 key to 32 and upped the memory from 32GB to 64GB. From reading the responses here, that may not be enough. Seems like we have two priorities here:
1. Get MS to prevent lockups, and to fail gracefully when resources run low or requests can't be fulfilled.
2. Get Veeam to do testing and advise us on the expected amount of memory required based on the relevant factors (repo size, size and # of synthetic fulls, I assume).
After I get my new storage array I plan on:
Creating a new 64K formatted ReFS repository
Upping the memory to 256GB out of an abundance of caution until we get better guidance.
(support case 02111363)
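For what it's worth, step 1 of that plan would look roughly like this in PowerShell - a sketch only; the drive letter and label are placeholders, and this wipes the volume:

```powershell
# Format the new repository volume as ReFS with 64KB clusters (destructive!)
Format-Volume -DriveLetter E `
    -FileSystem ReFS `
    -AllocationUnitSize 65536 `
    -NewFileSystemLabel 'VeeamRepo' `
    -Confirm:$false
```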
-
- Chief Product Officer
- Posts: 31816
- Liked: 7303 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
All, if you have lockups with the update installed and the registry values created, please also include the Microsoft support case ID so I can refer the ReFS dev team to it. Also, please mention whether you use 4KB or 64KB clusters on the problematic volume.
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
kubimike wrote: This may have already been asked but are your repositories 4K or 64K?
4K, unfortunately. I wasn't aware that the default had changed with 2016 until it was too late to wipe everything out and start from scratch.
kubimike wrote: Also, if you've applied Option 1 ('HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem') but then decide to use Option 2, does the previous option need removal from the registry?
I removed Option 1's reg key and then went straight to Option 3, since I don't care (at the moment) about performance - I just wanted to maximize stability. Unfortunately, judging by my experience yesterday, it appears even Option 3 doesn't fix the IO deadlock issue.
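For anyone juggling these options, here's a sketch of what switching looks like in PowerShell. The value names below (RefsEnableLargeWorkingSetTrim, RefsNumberOfChunksToTrim, RefsEnableInlineTrim) are the ones Microsoft lists for this ReFS fix, but verify them against the KB for your build before applying, and reboot afterwards so they take effect:

```powershell
# All three tuning options live under the same key
$key = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'

# Drop Option 1 before switching to another option
Remove-ItemProperty -Path $key -Name 'RefsEnableLargeWorkingSetTrim' -ErrorAction SilentlyContinue

# Option 2: number of metadata chunks to trim (values of 8 and 32 were tried above)
New-ItemProperty -Path $key -Name 'RefsNumberOfChunksToTrim' -PropertyType DWord -Value 32 -Force

# Option 3: inline trimming of the ReFS working set
New-ItemProperty -Path $key -Name 'RefsEnableInlineTrim' -PropertyType DWord -Value 1 -Force
```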
kubimike wrote: This is a mess for sure, if it happens again I'll have to pull Veeam and go back to my previous backup solution
With you there... Veeam seems great and all, but we don't have the budget to purchase drives to provision extreme multiples of the consumed network storage in order to use it in a way that doesn't involve ReFS/blockclone.
Previously there were ZFS servers set up. Those were fabulous... no performance issues, very minimal memory requirements (we weren't doing dedupe), rock-solid stability, and ZFS send/receive + snapshots allowed many grandfather-father-son snapshot-based backups in near real-time with no space overhead beyond data growth itself. The only reason I switched away from it is that I thought it was the responsible thing to move off something (Solaris, cron-based send/receive scripts, etc.) that only I was able to support well.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@graham8 do you have a MSFT ticket open for this problem? Also, I have another Windows 2016 machine that I patched with '4013429', and I don't see the registry keys they are referring to - I guess they need manual creation. Yeah, I moved away from DPM because the setup was nearly 8 years old and we needed something that could do faster restores. Now I just want stability, like yourself.
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
kubimike wrote: @graham8 do you have a msft ticket open for this problem?
I didn't want to at first because I hadn't tried all the options yet, but now I have (Option 3) and have had the issue reoccur, so I guess I will when I get a chance.
kubimike wrote: I don't see the registry keys they are referring to. I guess they need manual creation.
Yep, they do.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@graham8 this might sound nuts, but if you want to add stability, turn on Driver Verifier - it may make it crash less. Worked for me a month back:
veeam-backup-replication-f2/server-2016 ... ml#p231993
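(If you want to try this, something like the following should enable it - targeting refs.sys specifically is my assumption here, and note that Driver Verifier adds overhead and will bugcheck the box rather than letting it hang, so treat it strictly as a diagnostic aid:)

```powershell
# Enable standard Driver Verifier checks against the ReFS driver (reboot required)
verifier /standard /driver refs.sys

# Disable verifier again once done diagnosing (reboot required)
verifier /reset
```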
-
- Chief Product Officer
- Posts: 31816
- Liked: 7303 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
We are seeing good results with 64KB clusters, so that's certainly the easiest way to achieve the wanted stability, as the topic name implies. But if you can't migrate, there are not too many options left, except for all of us to continue working with Microsoft to help them make systems with 4KB volumes stable.
-
- Influencer
- Posts: 14
- Liked: 2 times
- Joined: Feb 02, 2017 2:13 pm
- Full Name: JC
- Contact:
Re: REFS 4k horror story
Warning - rant coming.
Is anyone aware of tools to view stats or visualizations of ReFS? I recently deleted a bunch of backups to try to make room for a new chain (these ReFS lockup issues led to corruption of my existing fileserver backup chain). But after deleting them through the Veeam interface, the 'free' space did not show as freed. That makes no sense - jobs should not be sharing pointers with other jobs, so why not show more free space?
Then I created a new job and ran a backup; it failed when it ran out of space, and I deleted it through Veeam again. However, this time I see the space slowly being freed (it was a quick delete in the Veeam interface, so it appears some sort of background process is slowly reclaiming the space on the ReFS server).
So currently, if I total my used space from file sizes, I get 101TB used. On the Windows Explorer side, it's 9.25TB free of 26.9TB. That's understandable - we know about ReFS's ability to share pointers. But from googling around, there appears to be no way to know how much of that space is pointers and how much is 'real'.
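If it helps anyone, the same discrepancy can be measured from PowerShell - a rough sketch, with E: and E:\Backups as placeholder locations:

```powershell
# Sum the apparent size of all backup files, then compare against what the
# volume itself reports; the gap is space shared via block-clone pointers.
$fileTB = (Get-ChildItem 'E:\Backups' -Recurse -File |
    Measure-Object -Property Length -Sum).Sum / 1TB
$vol = Get-Volume -DriveLetter E
'{0:N1} TB of file data on a {1:N1} TB volume ({2:N1} TB free)' -f
    $fileTB, ($vol.Size / 1TB), ($vol.SizeRemaining / 1TB)
```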
As an update on my situation, all signs point to me having to nuke my entire backup set and reformat at 64KB, and hope I don't get requests for restores for the next 3 months.
So, when setting up the Veeam repo, I went with the then-recommended cluster size of 4K. I had lockups requiring reboots and did not know why, so I disabled everything but my production backups, which started working.
I was waiting on a Microsoft fix, which came; I applied it and sorta guessed at which 'options' to use - still with lockups. It appears it may be best to add more memory, but we don't know exactly how much.
During all of this, my production backup chain got corrupted, and I just don't have enough space to start anew while still keeping my old backups - I'll have to nuke it.
Veeam needs to own this problem even if they are relying on Microsoft - create a check on ReFS volumes for cluster size, and warn about or disable repos on 4K.
Do thorough testing on 64KB (if that's the only really supported cluster size, which sounds like the case), so guidance on creating ReFS repo servers is clear.
I find it absurd that my little environment and all of ours found issues that were not found during Veeam's testing. And I guess I'm the first to have a corrupt chain and a missed RPO due to crashing ReFS servers. And I don't feel like playing the game with Microsoft over whether the ReFS issues are a bug, 'by design', or 'Veeam's problem', where if I lose, I lose $500 over Veeam not doing enough testing to weed out obvious issues. I don't have the time, and I can't sacrifice RPO going forward, since I need to move on and start creating backups again.
I feel like this isn't being taken seriously enough - we are talking about businesses' backups here.
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
This is "normal" with ReFS/StorageSpaces, unfortunately. It runs some kind of metadata cleanup on a delayed loop in a background process, and you only slowly get acknowledgement of freed space after some time goes by. That behavior doesn't exactly give me the warm-fuzzies about the solution being well-constructed. ZFS manages to instantly reflect freed space. At any rate, it seems it's expected behavior.jimmycartrette wrote:this time I see the space slowly being free (it was a quick delete on the veeam interface, it appears some sort of background process is slowly reclaiming the space on the ReFS server).
Regarding the broader issue... I don't particularly blame Veeam. It would have been nice if this had been caught by their QA during stress testing, etc., but to be fair, the issue is pretty random. The bigger fault in my mind is with Microsoft... they have access to the source code, after all, and for something as important as an underlying filesystem driver, insanely rigorous testing should have been done. Unfortunately, this isn't the first time ReFS and Storage Spaces have shipped massive, utterly broken features upon release. Anyone else remember how Storage Spaces used to just instantly go offline and refuse to mount again if you consumed close to 100% of the space on a volume? I mean... good grief. It's utterly criminal that *that* went out the door at Microsoft, but it did. I guess it's my fault for not remembering that, and for deciding to use technology from Microsoft that I knew was new (new ReFS API + blockclone), making the assumption that it wasn't badly broken.
Regarding the 4k thing...Veeam is in a tough spot there, since 4k is the default deployment choice, and people might not be exclusively using their storage pools for Veeam data. I think a warning when first adding a repo would be a very good idea, though. That way people who weren't irrevocably invested in the 4k volume could be aware of it before even getting started, and redo it. The underlying filesystem fair-scheduling bug exists regardless of the 4k vs 64k option, though, so that's really just a bandaid. Actually, the warning should probably indicate that the entire solution is currently unreliable due to a still-unresolved Microsoft bug. I agree it's serious business to make certain users are aware of that before they get started with their first repos in new deployments.
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@jimmy you're not the only one. I had to dump my volume.
-
- Enthusiast
- Posts: 93
- Liked: 6 times
- Joined: Nov 17, 2011 7:55 am
- Contact:
Re: REFS 4k horror story
To all those having to dump their full backup volume and start again: I feel your pain. I have been there in the past, and it's not fun, and it doesn't allow for happy nights full of sleep.
It's exactly because of this, though, that I run two identical backup routines every night. Instead of relying on one routine and then running copy jobs from it, I run the primary backup routine - about 5 separate backup jobs - to my primary Veeam repository. I then run a second, identical routine to my secondary DR repository, and finally I run a copy job with GFS to my third repository. The copy jobs are therefore not dependent on the success of the primary job run.
The benefit is exactly that: if something bad happens to your backup routine, you can nuke the whole repository and still be safe. The one backup is not dependent on the data from the second backup.
Now, I know this may be impractical for some of you with very large backup runs (my backups only amount to about 8TB), but if everything works the way it should, with the power of Veeam (and ReFS, when fixed) your backup runs should finish quite quickly - at least much more so than in the past - so consider this an option going forward. There are plenty of advantages. For example, I have one repository as ReFS (64K), one as 2016 NTFS with dedup, and one as 2012 NTFS. Eggs all in different baskets. I can even choose to run one backup routine as forward and another as reverse if I want and if it suits my needs. If you can't fit both runs in one night, then at least do the second run on weekends, so if you have to nuke your primary repository you still have something at most a week old to fall back on.
Just a thought to consider
-
- Service Provider
- Posts: 315
- Liked: 41 times
- Joined: Feb 02, 2016 5:02 pm
- Full Name: Stephen Barrett
- Contact:
Re: REFS 4k horror story
It seems I'm now caught up in this too. The Veeam server that keeps locking up for me is trying to perform a 6TB backup to a 64K ReFS volume - it never completes; either the server blue screens or locks hard with Core 0 at 100% kernel time.
EDIT: It appears I'm wrong! Looking at the SAN activity, the backup is still running - it's just that Windows is completely frozen; even the clock doesn't update.
Note: memory is dynamic - actual memory demand is 5GB.
-
- Chief Product Officer
- Posts: 31816
- Liked: 7303 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
graham8 wrote: Regarding the 4k thing...Veeam is in a tough spot there, since 4k is the default deployment choice, and people might not be exclusively using their storage pools for Veeam data. I think a warning when first adding a repo would be a very good idea, though.
Added.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
To the ones still having issues with 64k + hotfix: Do you use per-VM backup chains?
In addition to adding RAM and enabling patch Option 1, we disabled per-VM chains (deleting hundreds of files took hours). Now the filesystem reacts much more like NTFS: directories can be browsed while a backup is running, and so on... Merge performance is even MUCH faster (I don't have the exact time from before, but on NTFS our weekend merge took 90 hours in total, and with ReFS it's under 7 hours).
I have the feeling ReFS can handle bigger files better than many, many small files (we have 1600 VMs, and last time it crashed after two weeks = 22,400 backup files on disk).
-
- Chief Product Officer
- Posts: 31816
- Liked: 7303 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
Unless we're talking millions of files, the quantity of files cannot be a problem for any file system. What you really did by disabling per-VM chains is remove concurrent processing. If this helped, then theoretically you should be able to achieve the same result by simply reducing the concurrent tasks setting on the repository instead.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
Sorry Gostev, but it is also not normal for a FS to crash an OS because you are not throwing RAM at it! We are far away from "normal problems".
The thing is, in our case the first sign of trouble was that deleting a few hundred files for retention took hours. The other problems came after that. And since other customers have reported laggy filesystem browsing even with the patch, and we no longer see this at all, it might be a valid question...
-
- Service Provider
- Posts: 10
- Liked: never
- Joined: Feb 20, 2017 2:34 pm
- Full Name: James Law
- Contact:
Re: REFS 4k horror story
Hi Gostev,
For us, it is the CPU maxing out at 100% that is causing the issue. This only happens on the repo servers.
RAM usage is always low and never hits 100%.
The hotfix document states it is high memory usage.
Would this solve my CPU 100% issue?
Regards
James L
-
- Influencer
- Posts: 15
- Liked: 4 times
- Joined: Jan 06, 2016 10:26 am
- Full Name: John P. Forsythe
- Contact:
Re: REFS 4k horror story
Hi again.
Well, there is good and bad news.
Starting off with the good: the server does not crash any more after installing the HP and Microsoft updates and the first registry key.
But... the server is so freaking slow!
Backing up one of my fileserver's volumes (1.2TB) took 58 hours, with a throughput of about 6MB/s.
The server tells me this about the bottleneck: 03/04/2017 08:21:08 :: Busy: Source 12% > Proxy 3% > Network 5% > Target 94%
Before, on my 2012 R2 server with NTFS (5-year-old hardware), the backup of 7TB was done in less than 48 hours; now I am at 71 hours and still running.
Could it be that the combination of ReFS and Storage Spaces is causing more problems?
My second repo (an iSCSI target on a Synology NAS) is way faster than my 24-HDD locally attached NL-SAS volume.
I was in the situation of dumping backups as well; it totally sucks! Veeam is doing a lot of advertising on how great ReFS is - and it might be, if it works. If not, you lose production backups...
That's all for now folks,
John
-
- Veteran
- Posts: 361
- Liked: 109 times
- Joined: Dec 28, 2012 5:20 pm
- Full Name: Guido Meijers
- Contact:
Re: REFS 4k horror story
Maybe memory after all? Everything is smooth here; however, "mapped file" in RAMMap shows 400MB active and 208GB standby (396GB total memory here). After a while it updates and goes to 62GB standby, but still...
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@mkretzer are you saying that with the patch it's no longer laggy to browse?
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
Well... another weekend, another deadlocked server. *sigh*
I'm now researching my best path to move back to ZFS or some other vendor solution that doesn't rely on ReFS for space-efficient backup chains...
I wish I could just continue to use Veeam and not take fulls, but it seems like, even with backup chain integrity verification/healing, it's not recommended to never take fulls? That was my perception when researching Veeam... is that true? Is there no recommended way to do forever-incremental backups reliably? I don't care about speed - I just want reliability. If this is supported in a safe and reliable manner, speed aside, I'd appreciate someone setting me straight.
With our past backup product (ShadowProtect), it was reliable to never need additional fulls, even over years of operation... for smaller (<~2TB) servers. It just didn't scale in performance like we needed (performance is much slower than Veeam's, but it would be fine, except that it runs into network timeout issues during the occasional highly extended crash-recovery ("diffgen") job, aborts, and never completes successfully).
-
- Veteran
- Posts: 391
- Liked: 56 times
- Joined: Feb 03, 2017 2:34 pm
- Full Name: MikeO
- Contact:
Re: REFS 4k horror story
@graham8 why not just turn off forever-forward incremental and the integrity/healing checks, and do incrementals + daily synthetics? From my understanding, if there were corruption, the daily synthetics wouldn't necessarily succeed. This is how I do backups.
-
- Enthusiast
- Posts: 59
- Liked: 20 times
- Joined: Dec 14, 2016 1:56 pm
- Contact:
Re: REFS 4k horror story
Without ReFS and blockclone, synthetics would just be full backups constructed by assembling existing fulls + incrementals at the repo, correct? In that case, if a client wanted 12 monthly retention points and 14 daily retention points, that would be source_volume * (12+14), which in this case (32TB volume, or 64TB raw) would be nearly 1PB (2PB raw)? That would make the backup server (aaaand the necessary off-site repo) ludicrously expensive in disk costs. Just in the cost of disks (let's say $150 / 4TB), that would be $75,000 for the primary repo and $75,000 for the offsite repo - all just to provide backup for a server whose storage cost was only a measly $2,400. More, really, since it would involve adding ~20 48-bay servers to accommodate all those (4TB) disks... I honestly don't understand how anyone was doing this with Veeam prior to ReFS+blockclone, given the primary-storage vs backup-storage cost differential...
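To make the arithmetic explicit, here's the back-of-the-envelope model behind those figures (my own restatement of the assumptions above: mirrored storage, so raw = 2x usable, and $150 per 4TB disk):

```powershell
$sourceTB      = 32       # protected volume (usable)
$restorePoints = 12 + 14  # monthly + daily fulls, no block clone
$rawMultiple   = 2        # mirrored, so raw capacity is double
$diskTB        = 4
$diskCost      = 150

$usableTB = $sourceTB * $restorePoints   # 832 TB of fulls per repo
$rawTB    = $usableTB * $rawMultiple     # ~1.7 PB raw, i.e. "2PB" rounded
$cost     = [math]::Ceiling($rawTB / $diskTB) * $diskCost
"{0} TB usable, {1} TB raw, ~`${2:N0} in disks per repo" -f $usableTB, $rawTB, $cost
```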
I work in a different market segment (SMB), and for some I guess those kinds of figures are nothing, but our customers would just say "you're fired. we'll do it ourselves" and then just not do backups if faced with that kind of proposition.
With ZFS (+ GFS snapshots and snapshot send/receive) or ShadowProtect (when it works reliably, i.e. with smaller <2TB servers), the backup servers only require a very minor inflation of storage above the base volume itself, because they never need redundant fulls and the data turnover rate is low (mainly just the addition of data). Not knocking Veeam... it does some things better than those options. It's just that those two options are what our clients are accustomed to.
Maybe I'm missing some big picture here...hopefully!
-
- Chief Product Officer
- Posts: 31816
- Liked: 7303 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
law999 wrote: For us, it is the CPU maxing out at 100% that is causing the issue. This only happens on the repo servers. RAM is always low and never 100%. The hotfix document states it is high memory usage. Would this solve my CPU 100% issue?
No - the CPU issue is "self-inflicted", so to speak. To reduce CPU usage, you should just lower the number of concurrent tasks in the backup repository settings.
-
- Service Provider
- Posts: 315
- Liked: 41 times
- Joined: Feb 02, 2016 5:02 pm
- Full Name: Stephen Barrett
- Contact:
Re: REFS 4k horror story
If it is of any use: I found that 100% kernel usage on Core 0 was locking up one of my repo VMs. After setting the RSS profile in the VM's NIC settings to "NUMA Scaling", the load spread across all cores. This allowed the VM to continue to function under heavy load.
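For anyone wanting to make the same change from PowerShell rather than the NIC properties dialog, something like this should work - the adapter name is a placeholder:

```powershell
# Inspect the current RSS profile, then spread RSS processing across NUMA nodes
Get-NetAdapterRss -Name 'Ethernet' | Format-List Name, Profile, Enabled
Set-NetAdapterRss -Name 'Ethernet' -Profile NUMA
```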
Nate
-
- Chief Product Officer
- Posts: 31816
- Liked: 7303 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: REFS 4k horror story
graham8 wrote: Well... another weekend, another deadlocked server. *sigh*
Do you have a support case open with Microsoft? I would like to forward the case ID to the ReFS team, as your servers look like a good subject for investigation, with the issue consistently reproduced even with the patch installed.
My current problem is that I still don't have a single good example to show them. And even I myself am not convinced at this time that the issue is real - rather than some corner case that has to do with special settings, special hardware, the lack of a certain system resource, or something along those lines (for example, the issue Nate just mentioned). The ratio of customers having great success with ReFS vs. customers having this deadlock issue actually suggests it might be a corner case.
@Nate FYI, I actually mentioned this one in the previous weekly forum digest:
Gostev wrote: Just before the weekend, VMware issued this rush post regarding an issue with VMware Tools versions 9.10.0 up to 10.1.5 that causes network packets to drop, due to the fact that all network traffic is serviced by a single guest CPU. It immediately occurred to me that the issue may be especially impactful on virtual backup proxies, because they pump hundreds of MB per second through the network stack while their CPU is already very busy with all the data processing. So I thought this bug could potentially be the reason for those intermittent backup job failures some customers have been having for a while now (the issue where a job retry always helps)? I guess we'll find out soon, as soon as VMware Tools are updated.
-
- Veeam Legend
- Posts: 1203
- Liked: 417 times
- Joined: Dec 17, 2015 7:17 am
- Contact:
Re: REFS 4k horror story
kubimike wrote: @mkretzer are you saying with the patch it's no longer laggy browsing?
Yes, that's what I am saying. For the first time:
- 128 -> 384 GB RAM
- Patch with Option 1
- No per-VM
And backups and merges are FAST....