ekisner
Expert
Posts: 202
Liked: 34 times
Joined: Jul 26, 2012 8:04 pm
Full Name: Erik Kisner
Contact:

RAM drive backup targets?

Post by ekisner »

So, I had a thought.

If I had a B&R server with say 256GB of memory in it, I could allocate 200GB of that to a RAM drive. Now... this isn't big enough to support all of my backups... but it IS big enough to support an incremental pass of jobs that run every 20 minutes. To say that "it isn't big enough" is merely a matter of scale at that point. A terabyte of memory is certainly a real thing.
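
Just to put rough numbers on the scale argument, here's a quick back-of-the-envelope (the ingest and drain rates below are assumptions pulled out of the air, not anything measured) showing how big an incremental pass a 200GB staging area could absorb before the repository has to catch up:

Code: Select all

# Rough sizing sketch -- every number here is an assumption, not a measurement.
ram_drive_gb = 200       # staging buffer carved out of server RAM
ingest_gb_s = 1.25       # ~10GbE line rate from production, in GB/s
drain_gb_s = 0.4         # assumed sustained write rate of the repository

net_fill_rate = ingest_gb_s - drain_gb_s          # rate the buffer actually grows at
seconds_until_full = ram_drive_gb / net_fill_rate
burst_capacity_gb = ingest_gb_s * seconds_until_full

print(f"Buffer absorbs full-speed ingest for ~{seconds_until_full:.0f} s")
print(f"One incremental pass can be ~{burst_capacity_gb:.0f} GB before spilling over")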

Thus, B&R could be writing from the source to the RAM drive, whilst trickling the data into the actual repository at the slower rate that the repository storage can support. As we've got a replica job that runs immediately afterward, it could funnel that data out exceptionally fast (easily in tandem with the actual backup-to-disk rather than waiting for the B2D to finish). It could feed additional backup copy jobs as well. (edit: once the linked jobs and the repository are all finished writing, the RAM drive data is purged.)

The advantage is that the destination would literally never be a bottleneck unless you're maxing out your network connection (which is hard to do if you've got decent 10GbE or more). It might take time for Veeam to commit the writes, and I'd certainly not want it for something with transaction logs (Exchange, SQL, etc.), but for something generic like a file server where you aren't doing any kind of log truncation on the server, this would be pretty insane.

Obviously, memory is volatile. Which is why I'd want to avoid transaction log stuff. But for any other kind of backup, if the server crashes mid-backup (or before it has finished trickling into non-volatile storage) you're losing the recovery point, and that point alone.

The implementation would require that Veeam allow data writes to be cached/staged in high-performance storage, be it M.2 drives, SSDs, RAM drives, whatever. There'd need to be contingency handling for when you exceed the capacity of the cache, but ultimately, it doesn't feel like it would be a huge amount of work. The performance gain, however, feels significant, especially when you've got sequentially linked jobs.
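
To make the staging idea concrete, here's a minimal sketch (plain Python, nothing to do with how Veeam is actually built; the buffer size and timings are made up) of a bounded buffer that absorbs fast ingest while a background thread drains it to the slower repository. The contingency case falls out naturally: once the buffer is full, the producer stalls and ingest degrades to whatever the repository can sustain instead of failing.

Code: Select all

import queue
import threading
import time

BUFFER_BLOCKS = 8              # pretend capacity of the RAM/NVMe staging area

staging = queue.Queue(maxsize=BUFFER_BLOCKS)

def drain_to_repository():
    """Consumer: writes staged blocks to the repo at its slower pace."""
    while True:
        block = staging.get()
        if block is None:       # sentinel: the backup job has finished
            break
        time.sleep(0.5)         # simulate a slow repository write
        staging.task_done()

def run_backup(num_blocks):
    """Producer: production storage hands us blocks far faster than the repo takes them."""
    drainer = threading.Thread(target=drain_to_repository)
    drainer.start()
    for i in range(num_blocks):
        # put() blocks once staging is full -- the contingency case where
        # ingest slows down to whatever the repository can actually sustain.
        staging.put(f"block-{i}")
        time.sleep(0.1)         # simulate fast reads from production
    staging.put(None)
    drainer.join()

if __name__ == "__main__":
    run_backup(20)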
csydas
Expert
Posts: 193
Liked: 47 times
Joined: Jan 16, 2018 5:14 pm
Full Name: Harvey Carel
Contact:

Re: RAM drive backup targets?

Post by csydas »

While it's a cute idea, to me it seems like you'd be eternally fighting with the production side of things, as the snapshot is always going to be your major pain in the ass. Since it's volatile, you're losing the full backup from each run and have to put the VM under the stress of a full backup far more often than you'd like to or ideally should.

I'm not saying your idea is without merit, but I personally don't see the point except for the most extreme of situations where every second of snapshot time counts for the applications running on your VM. For the machines we have running, even a terabyte is mincemeat compared to what we're moving on an hourly basis, so at that point the RAM becomes a SPOF, as I have to ensure that I can offload all of its contents before I let the RAMdisk clear.

Again, not dismissing your idea outright, but for my shop and my setup, I cannot imagine this being practical. It would be such a choke point that it would be more costly to have a RAMdisk repo for all but the most insignificant of servers, and for those I wouldn't care if they struggled regardless.
ekisner
Expert
Posts: 202
Liked: 34 times
Joined: Jul 26, 2012 8:04 pm
Full Name: Erik Kisner
Contact:

Re: RAM drive backup targets?

Post by ekisner »

Your full backup is only at risk if you're using forward incremental; reverse incrementals are perfectly safe. In the case of production storage, it helps because your snapshot is open for less time... basically, as fast as your production storage can feed data, B&R can take it. Typically your production storage will be faster than your repository, which means the repository ends up being the bottleneck and slows the whole pass down.

As for memory, it really isn't that expensive these days. Remember, we don't need premium server memory; Veeam is easily able to do its own checksums on content (and because it's in memory, it'll still be insanely fast). It is of course a SPOF, but in any case where you're writing to it, a failure will result in an outage anyways; bad memory will invariably cause a bluescreen. As it's only a temporary staging area, the risk you face increases only in that you're risking the backup point you're currently creating. While that's a crucial thing for something with a truncated log, it's far less so for a reverse incremental pass on something which does not get truncated.
csydas
Expert
Posts: 193
Liked: 47 times
Joined: Jan 16, 2018 5:14 pm
Full Name: Harvey Carel
Contact:

Re: RAM drive backup targets?

Post by csydas »

Well, I think you're kind of missing my point.

Our normal backup load is around 30-some TB for the primary servers and somewhere around 6 TB for the secondary and tertiary servers. Even though RAM is cheap, I don't know of too many boards that support 1 TB (or more? I'm not even sure).

Of my primary backups, one of our MSSQL servers is the one that absolutely needs regular backups, both the VM backup and the log backups; the logs can be offloaded elsewhere, but even for that server, the full backup file already exceeds what our current RAM maximum on our servers can support. So, okay, we'll look for some of the lower-priority primary servers that do fit. Even with that, I still need to break this out into a separate backup job that uses this RAM disk, and it's not even the most important server I have.

Sure, I reduce the overall snapshot time, but since these servers are all extremely snapshot-sensitive anyways, we're doing a mixture of Snapshot Only jobs on the primary storage and BfSS during off-hours, and even then we still get micro-stuns if we aren't careful about our scheduling of HA events and which server is the primary node at that time. With BfSS we're already at the shortest time I can get these VMs off of VMware snapshots as is, as far as I know. The storage snapshots don't add nearly as much pressure, and by that point the VMware snapshot is already gone by the time we're writing to the storage, so it doesn't seem like having a RAM disk is any better here; all it does is add some processing overhead for me, since now there's a server (or three) that has to be processed separately.

If there's a use case that works for you on this, then go for it. In my mind the use case range is just pretty narrow.
ekisner
Expert
Posts: 202
Liked: 34 times
Joined: Jul 26, 2012 8:04 pm
Full Name: Erik Kisner
Contact:

Re: RAM drive backup targets?

Post by ekisner »

Don't forget that this is a buffer. It's still going to be writing into your target storage as fast as your target storage can take it (which in turn frees up that space for additional buffering). All it does is let your production storage keep feeding at full speed when your DR storage can't keep up (and frankly, writes are never as fast as reads).

I'm going to guess that for the volume of data you deal with, your prod and DR storage are both exceptionally fast. Even so, I'd be willing to bet that in most cases DR storage is "fast enough to make the window", with the rest of the focus on capacity over performance.

Coupled with a couple of NVMe SSDs (I'm sure they'd get bad sectors pretty quickly with this volume of writes, but technology is continually improving), you've got a wonderful cache right on the server. Fill up the memory, fill up the SSD, then just slow down to the rate the DR storage can keep up with.
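
The spillover order I'm imagining is nothing more complicated than this (just an illustration with made-up tiers and free-space checks, not anything Veeam does today):

Code: Select all

def pick_write_target(block_size, ram_free, nvme_free):
    """Stage to RAM first, spill to NVMe next, and only write straight
    to the DR storage (at whatever rate it sustains) once both are full."""
    if ram_free >= block_size:
        return "ram_drive"
    if nvme_free >= block_size:
        return "nvme_cache"
    return "dr_storage"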

In all honesty, the only reason I say a RAM drive rather than just "use memory" is to allow for things like NVMe drives.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: RAM drive backup targets?

Post by tsightler » 1 person likes this post

I don't really think RAM is that necessary, since network bandwidth is likely going to be the bottleneck. You say that it's hard to saturate 10GbE, but that's really not true; 10GbE is usually the first bottleneck you hit when trying to design repos for performance. It only takes 2x SATA SSDs in a RAID 0 stripe (or 4 in a RAID 10) to ingest at nearly 10GbE. Even a single NVMe SSD can do it. Heck, even 12-14 SATA spinning disks can do it.
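
For anyone who wants the rough math behind that (the per-device write rates below are ballpark assumptions, not benchmarks):

Code: Select all

link_mb_s = 10 * 1000 / 8     # 10GbE is roughly 1250 MB/s of ingest to absorb

sata_ssd_mb_s = 500           # typical SATA SSD sequential write
nvme_ssd_mb_s = 2000          # a middling NVMe drive
sata_hdd_mb_s = 100           # conservative 7.2k spinning disk

print(link_mb_s / sata_ssd_mb_s)   # ~2.5 -> a couple of SATA SSDs striped
print(link_mb_s / nvme_ssd_mb_s)   # ~0.6 -> a single NVMe SSD covers it
print(link_mb_s / sata_hdd_mb_s)   # ~12.5 -> roughly a dozen spindles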

It's why boxes like the S3260 have 2x 40GbE connections: you need lots of network bandwidth to keep even the spinning SATA disks in there busy. And SSDs offer far more protection in the event of a power failure (at least enterprise-class SSDs).