Comprehensive data protection for all workloads
dcolpitts
Veeam ProPartner
Posts: 119
Liked: 24 times
Joined: Apr 01, 2011 10:36 am
Full Name: Dean Colpitts
Location: Atlantic coast of Canada
Contact:

Re: V10 & XFS - all there is to know (?)

Post by dcolpitts »

Soncscy - I guess I said 3520; that's the previous generation. As of today, the entry level for the current generation is the 3620, which has 14.5TiB available out of the box. List price is $20k USD. The 10GbE-T card is $2500 (USD list). Then your services care pack is on top of that, depending on the support level you want. And I'm Canadian based, so by the time you convert that from USD to CDN, get your big deal discounts, and add a bit of margin, yeah, $30k (Canadian dollars) is a realistic entry price for the customer. That said, you are getting a DL380 Gen10, data encryption, and paid support for when things go wrong (instead of rolling your own stuff and being stranded with Google when things go horribly wrong). And most of my customers are acquiring all-flash SANs, servers, and related infrastructure via an HPEFS 3-year 0% lease, so it's just another line item on the invoice for most of them. :-)

I guess what I'm really looking for is the overall dedup ratio on the entire XFS volume rather than for individual files - I'm spoiled by StoreOnce, where it's right there in front of you on the Catalyst Stores summary page...
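In the meantime, a rough volume-wide number can probably be eyeballed from the shell - just a sketch, assuming the repository is mounted at /mnt/backups (placeholder path): du totals the blocks mapped by each file (so shared reflink extents get counted once per file that references them), while df shows what is physically allocated, and the ratio between the two gives an approximate space-savings factor.

Code: Select all

# logical footprint of the backup files (shared reflink extents counted per file)
du -sh /mnt/backups

# physical usage of the volume
df -h /mnt/backups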

BTW - I'm planning to document my configuration and configuration scripts on my blog page (blog.jbgeek.net) when I get time...

dcc
dcolpitts
Veeam ProPartner
Posts: 119
Liked: 24 times
Joined: Apr 01, 2011 10:36 am
Full Name: Dean Colpitts
Location: Atlantic coast of Canada
Contact:

Re: V10 & XFS - all there is to know (?)

Post by dcolpitts » 2 people like this post

pirx wrote: Mar 18, 2021 5:57 pm According to the Apollo/ReFS thread, 192GB RAM would be enough for a 500TB repository with ReFS. What about XFS? Is this also a reasonable number for XFS with reflinks?
Pirx - Maybe this has already been mentioned, but as an FYI: I was on a call with Federico (the HPE author of the original Apollo reference document) a few weeks ago, and he mentioned they didn't need that much RAM - the box was populated with that much due to HPE's memory population rules and guidelines for Gen10 (and basic memory optimization on the platform).

Apollo 4510 Memory population guidelines:
  • Install DIMMs only if the corresponding processor is installed.
  • If only one processor is installed in a two-processor system, only half of the DIMM slots are available.
  • To maximize performance, it is recommended to balance the total memory capacity between all installed processors.
  • When two processors are installed, balance the DIMMs across the two processors.
  • White DIMM slots denote the first slot to be populated in a channel.
  • Mixing of DIMM types (UDIMM, RDIMM, and LRDIMM) is not supported.
  • The maximum memory speed is a function of the memory type, memory configuration, and processor model.
  • The maximum memory capacity is a function of the number of DIMM slots on the platform, the largest DIMM capacity qualified on the platform, and the number and model of installed processors qualified on the platform.
Notes: Intel Xeon Platinum/Gold 82xx/62xx processors that support 2933 MT/s DIMMs do so only at 1 DIMM per channel; configuring 2 DIMMs per channel drops memory speed back to 2666 MT/s.
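To put those guidelines in numbers for the Apollo config being discussed - my own arithmetic, assuming second-generation Xeon Scalable with six memory channels per socket:

Code: Select all

2 CPUs x 6 channels/CPU x 1 DIMM per channel = 12 DIMMs
12 DIMMs x 16 GB                             = 192 GB total
(1 DPC keeps 2933 MT/s on 62xx/82xx CPUs; 2 DPC drops to 2666 MT/s)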

dcc
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

tsightler wrote: Mar 18, 2021 7:56 pm Also, a lot of that memory recommendation on Windows was an attempt to work around the crazy Windows buffer behavior, which was about 10x worse with ReFS due to various bugs. Now, with v11 bypassing the OS buffer for writes on Windows, memory usage should be more normal. Still, I would never recommend going with less than the 4GB per core recommended in best practice - and arguably 4GB per planned task (generally core = task, but if you want to oversubscribe cores on your repo, having more memory is the best way to make sure you can do this). Admittedly, 4GB is probably overkill, but if your environment is large, VeeamAgent processes can grow quite a bit larger. There can be quite a difference in memory usage when you are backing up 50 VMs with 100GB disks vs 50 VMs with 4TB of disks.

I can pretty much say this: I've had far more customers regret not putting memory in the box than the other way around! If you want fewer problems, don't skimp on it. Admittedly, I don't think you need to go overboard either - I've had customers buy boxes with ~1TB of RAM, which seems like the other extreme! If I was buying a big box like that, I'd probably look at no less than 256GB if it's repo only, 384GB if it's proxy+repo.

Note that the customers I work with are on the larger side, almost all have >10,000 systems being protected, some a LOT more, and many have dozens of PB of data under protection, but the lessons are mostly valid for customers of all sizes.
Our environment is mixed: the majority are smaller VMs (100-300GB), but there are also 10+TB VMs. I get the feeling that the 192GB from the Apollo example might not be enough. Still, I guess this is the optimal price/performance configuration (12 x 16GB DIMMs), even if the server will only be used as a repository for copy jobs.
Gostev
Chief Product Officer
Posts: 31527
Liked: 6702 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V10 & XFS - all there is to know (?)

Post by Gostev »

Tom's valid recommendations are based on V10 and earlier. If you're planning to go with V11 right away, then 192GB is plenty for a repository role, because V11 is much less memory-hungry.

Also, you should not be worrying about this all that much in general - just go with the optimal config, because you cannot possibly go wrong! Keep in mind that Veeam is completely hardware agnostic and can be made to work optimally with ANY server configuration at all. There are close to a million active installations out there today, and their RAM amounts vary by two orders of magnitude! While you can afford to buy a shiny new server with 192GB RAM, other folks are lucky if they can repurpose some old server with 16GB RAM. But in both cases, Veeam is super easy to tune to the available memory by controlling the number of concurrent tasks (the recommendation for repositories is 4GB RAM per task).
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V10 & XFS - all there is to know (?)

Post by tsightler »

192GB is only 16GB less than the best practice recommendation of 4GB per core (2 x 26 cores = 52 cores = 208GB), so I think it's likely to be plenty. I wouldn't fret about this personally unless I was going to heavily oversubscribe tasks on the box, which is what usually gets people in trouble memory-wise (well, resource-wise in general). And indeed, the memory reduction in v11 is significant and dramatically lowers the chances of hitting the "worst case" memory consumption scenario, which is what best practice attempts to plan for. It's still possible to have large VeeamAgent processes, even in v11, but the fact that those processes aren't also bloating Windows OS memory usage goes a long way!
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

Thanks for all your feedback - I know I tend to spam a bit. As 192GB with 32GB DIMMs seems not much more expensive than with 16GB DIMMs, we will probably start with that. Currently we overcommit our 20c/128GB servers with 30+ tasks, and CPU/memory only rarely reaches 80% usage.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V10 & XFS - all there is to know (?)

Post by tsightler »

I'm assuming that has to be with v10, correct? I'd guess that v11 would reduce that quite a lot, as a good bit of that usage is probably OS cache rather than actual VeeamAgent process usage.
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

Yes, all V10, as we have to wait on the IBM storage integration.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V10 & XFS - all there is to know (?)

Post by mkretzer » 2 people like this post

As I stated before, we have a ~490 TB XFS test environment to which we copy ~2000 VMs daily with 32 concurrent tasks.

RAM is 384 GB. Free RAM fluctuates between 373 and 379 GB, so about 188 MB per task! And this is V10.
The only thing we have not yet seen is GFS creation; I will keep you posted.

But I think Gostev is right! :-)
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

Yeah, I just remember a couple of years ago when I had Samba file servers with several dozen TB of data, and xfs_repair was horribly slow and got OOM-killed due to insufficient memory. But those servers were far from 192GB RAM.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V10 & XFS - all there is to know (?)

Post by mkretzer » 1 person likes this post

Interesting. I found:
https://linux-xfs.oss.sgi.narkive.com/z ... ent-per-tb

I did this on our ~490 TB filesystem, filled with 515 TB:
xfs_repair -n -vv -m 1 /dev/mapper/veeamxfs-xfslv
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
- max_mem = 1024, icount = 41728, imem = 163, dblock = 129018783744, dmem = 62997453
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 61569.

According to xfs.org:
"The numbers reported by xfs_repair are the absolute minimum required, and approximate at that; more RAM than this may be required to complete successfully. Also, if you only give xfs_repair the minimum required RAM, it will be slow; for best repair performance, the more RAM you can give it the better."

So it needs 60 GB as an absolute minimum. The question is: will that increase with more block cloned data, or does it depend purely on what percentage of the filesystem is used?
We were planning to use an old server with 128 GB for our 620 TB production repo, which has much more block cloning, so if it does depend on the amount of block cloned data we will need to increase that :-(

We will now test it without -m 1 (so it actually performs the check).
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V10 & XFS - all there is to know (?)

Post by mkretzer » 1 person likes this post

OK, the "real" repair took 42 GB - but there was no corruption to fix. I wonder if that (and the amount of block cloning) can also change the RAM requirement.

After all, that means the Veeam components are not your issue - even with V10. The problem comes when you need to repair the filesystem! What I find bad about this is that you won't know you need the RAM until you have an issue!

@Gostev: Is that not perhaps a reason to update the official FAQ? A user cannot know that they might run into a lack of RAM until they need to run xfs_repair!
Gostev
Chief Product Officer
Posts: 31527
Liked: 6702 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V10 & XFS - all there is to know (?)

Post by Gostev »

Indeed, that would be good information to include. But I wonder where we can get solid information on the RAM requirements, including all the dependencies.
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V10 & XFS - all there is to know (?)

Post by mkretzer » 3 people like this post

I mean, at least the xfs tools tell you how much RAM is needed. Now we just have to run some tests with volumes of different sizes / amounts of block cloned data and so on, and we can extrapolate a rough value per TB.

I already increased the number of GFS points in our test environment so that in about 1-2 weeks we have twice the block cloned data. I will update this thread!
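As a very rough starting point from the dry run earlier in the thread - my own back-of-the-envelope math, and remember xfs.org calls these numbers an absolute minimum:

Code: Select all

dmem   = 62,997,453 KiB            ~ 60 GB estimated repair memory
dblock = 129,018,783,744 x 4 KiB   ~ 528 TB (~480 TiB) of data blocks
=> on the order of 120-130 MB of xfs_repair RAM per TB of filesystem size,
   before any metadata / block cloning effects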
mweissen13
Enthusiast
Posts: 93
Liked: 54 times
Joined: Dec 28, 2017 3:22 pm
Full Name: Michael Weissenbacher
Contact:

Re: V10 & XFS - all there is to know (?)

Post by mweissen13 »

If you want a definitive answer you should probably ask the XFS folks at: http://vger.kernel.org/vger-lists.html#linux-xfs
They are usually pretty knowledgeable and helpful.
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

I already did this, waiting on feedback.
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx » 1 person likes this post

Some feedback from the xfs mailing list.
Dave Chinner:

Filesystem capacity doesn't massively affect repair memory usage
these days.

The amount of metadata and the type of it certainly does, though. I
recently saw a 14TB filesystem require 240GB of RAM to repair
because, as a hardlink based backup farm, it had hundreds of
millions of directories, inodes and hardlinks in it. Resolving all
those directories and hardlinks took 3 weeks and 240GB of RAM....

I've seen other broken backup server filesystems of similar size
that have had close on 500GB of metadata in them, and repair needs
to index and cross-reference all that metadata. Hence memory demands
can be massive, even in today's terms....

Unfortunately, I haven't seen a broken filesystem containing
extensive production use of reflink at that scale, so I can't really
say what difference that will make to memory usage at this point in
time.

So there's no one answer - the amount of RAM xfs_repair might need
largely depends on what you are storing in the filesystem.
Lucas Stach:

xfs_repair can be quite a memory hog, however the memory requirements
are mostly related to the amount of metadata in the FS, not so much
with the overall size of the FS. So a small FS with a ton of small
files will require much more RAM on a repair run than a big FS with
only a few big files.

However, xfs_repair makes linear passes over its workingset, so it
works really well with swap. Our backupservers are handling filesystems
with ~400GB of metadata (size of the metadump) and are only equipped
with 64GB RAM. For the worst-case where a xfs_repair run might be
needed they simply have a 1TB SSD to be used as swap for the repair
run.
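For reference, the SSD-as-swap trick Lucas describes is just standard Linux swap handling - a rough sketch, assuming the spare SSD shows up as /dev/sdc (hypothetical device name, and mkswap wipes it):

Code: Select all

# turn the spare SSD into swap for the duration of the repair (destroys data on /dev/sdc)
mkswap /dev/sdc
swapon /dev/sdc

# run the repair against the unmounted repository device (or the -n dry run first),
# then remove the swap again
xfs_repair /dev/sdb
swapoff /dev/sdc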
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

mkretzer wrote: Mar 21, 2021 10:35 am I mean, at least the xfs tools tell you how much RAM is needed. Now we just have to run some tests with volumes of different sizes / amounts of block cloned data and so on, and we can extrapolate a rough value per TB.

I already increased the number of GFS points in our test environment so that in about 1-2 weeks we have twice the block cloned data. I will update this thread!
Any update?
mkretzer
Veeam Legend
Posts: 1145
Liked: 388 times
Joined: Dec 17, 2015 7:17 am
Contact:

Re: V10 & XFS - all there is to know (?)

Post by mkretzer »

That's strange:

Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
- max_mem = 1024, icount = 46784, imem = 182, dblock = 129018783744, dmem = 62997453
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 61569.

So... exactly the same as before!
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

Block cloning/reflink speeds up synthetic fulls. How do reflinks impact the performance of active fulls or read operations - do they generate more random I/O during reads?
Gostev
Chief Product Officer
Posts: 31527
Liked: 6702 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V10 & XFS - all there is to know (?)

Post by Gostev »

For active fulls, reflink presence doesn't matter.

As for read operations, it depends on what is being read and how.
For example, most restore operations are heavy on random I/O regardless of the underlying file system.
While copying an active full backup file with Windows Explorer results in very sequential I/O ;)
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

My concern is mostly about reading data for copy/offload jobs + SureBackup, where SureBackup is probably the most random?
Gostev
Chief Product Officer
Posts: 31527
Liked: 6702 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: V10 & XFS - all there is to know (?)

Post by Gostev »

Exactly.

All "predictable" workloads (where we know which blocks we will need next) should work fine with any decent enterprise-grade RAID controller even with reflink in use, since we do async I/O, which allows the RAID controller to group multiple random read requests and execute them in an optimal manner.

But this is not possible when running VMs off of backup storage, because we cannot know which disk block the OS will request next - so we cannot start reading the corresponding backup file blocks in advance. Only in some cases can we take a guess: for example, when we see a few consecutive reads, we start to pre-read further blocks (this helps in scenarios such as copying large files in the guest OS of an instantly recovered VM, or during file-level recovery).
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

Can someone give me advice on how to test the performance of a 350TB reflink Linux XFS repository? I did some basic tests with fio, but I'm not sure what realistic parameters to use (block size?), or whether there's a better tool than fio.
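For my own notes, this is the kind of fio run I have been experimenting with - purely a sketch with guessed parameters, not a validated profile (the 512k block size is meant to loosely mimic large sequential backup writes, and /mnt/backup01 is just a placeholder path):

Code: Select all

# large-block sequential write across a few parallel jobs
fio --name=seq-write --directory=/mnt/backup01 --rw=write --bs=512k \
    --size=20G --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --group_reporting

# same layout for sequential read
fio --name=seq-read --directory=/mnt/backup01 --rw=read --bs=512k \
    --size=20G --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --group_reporting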
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

And does anyone have the correct mkfs parameters for xfs on a 28 disk RAID60?

The docs keep it simple: https://helpcenter.veeam.com/docs/backu ... positories

Code: Select all

mkfs.xfs -b size=4096 -m reflink=1,crc=1 /dev/sda1
As far as I can remember from my past Linux adventures, giving mkfs.xfs information about the hardware RAID layout was beneficial for performance. With software RAID, mkfs can detect the number of disks etc., but not with hardware RAID.

From https://xfs.org/index.php/XFS_FAQ#Q:_Ho ... erformance
su = <RAID controllers stripe size in BYTES (or KiBytes when used with k)>
sw = <# of data disks (don't count parity disks)>
So in my case
su = 256k (Smart Array Controller stripe size)
sw = 24 (number of data disks)

This works, but it prints a warning: mkfs.xfs: Specified data stripe width 12288 is not the same as the volume stripe width 6144

Code: Select all

# mkfs.xfs -b size=4096 -m reflink=1,crc=1 -d su=256k,sw=24 /dev/sdb -f
mkfs.xfs: Specified data stripe width 12288 is not the same as the volume stripe width 6144
meta-data=/dev/sdb               isize=512    agcount=350, agsize=268435392 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=93755080704, imaxpct=1
         =                       sunit=64     swidth=1536 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Any recommendations? I would go with mkfs defaults, but I think mkfs should know the RAID setup.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V10 & XFS - all there is to know (?)

Post by tsightler » 2 people like this post

I'm pretty sure that most modern hardware RAID controllers do expose the disk geometry to Linux and allow it to make the correct determination without needing to pass anything explicitly. Actually, I think that's exactly why you are getting the warning - because what you are passing does not match what the RAID controller is advertising as the underlying geometry.

I believe the reason for the mismatch is that you are not taking the RAID60 layout into account. You are passing sw=24; however, you appear to have 28 drives arranged as two RAID6 sets of 14 drives each, and each set has sw=12 (12 data + 2 parity disks). RAID60 creates a RAID0 stripe across the two RAID6 arrays, but each of those still only has sw=12, and the RAID0 layer stripes one full stripe per array, so the effective value is still sw=12. The only way you'd use sw=24 is if you had a single RAID6 array with 26 disks. In other words, I believe what mkfs.xfs is telling you is exactly correct.
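If you did want to spell the geometry out explicitly rather than rely on auto-detection, I'd expect it to look something like the following (a sketch only - same su as before, but sw=12 per the reasoning above; double-check against what the controller actually reports before using it):

Code: Select all

mkfs.xfs -b size=4096 -m reflink=1,crc=1 -d su=256k,sw=12 /dev/sdb -f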
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

Thanks - I think you're right. xfs_info also shows the same values whether I use the su/sw options or just the defaults. There are a lot of references on the internet claiming it's necessary to add this info when using a HW RAID controller - I remember that XFS on Areca controllers needed those options a long time ago.

Any hints for useful benchmarks with fio or other tools, so that I don't screw this up too? :) I've posted something here: veeam-backup-replication-f2/veeam-v11-n ... ml#p414327 - not sure how much time I have for these low-level tests.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V10 & XFS - all there is to know (?)

Post by tsightler »

I don't have access to a hardware RAID controller, but mkfs.xfs uses the minimum_io_size and optimal_io_size hints, which should be provided by the controller. You can check this with blockdev:

Code: Select all

blockdev --getiomin --getioopt /dev/sdb
These should map to sunit/swidth, I believe, and mkfs compares any manually provided values to what it calculates from these parameters, which are surfaced by the underlying block device provider/driver - in this case I believe that is the hpsa kernel module, perhaps? I think most of the articles claiming you need to do this manually were written long before these hints were available in the kernel, but I'm pretty sure they've been common for most hardware RAID drivers for quite a while now - although it would be interesting to pull these values on your hardware and see what it reports.
pirx
Veteran
Posts: 571
Liked: 72 times
Joined: Dec 20, 2015 6:24 pm
Contact:

Re: V10 & XFS - all there is to know (?)

Post by pirx »

This is what I get:

Code: Select all

# blockdev --getiomin --getioopt /dev/sdb
262144
3145728


# xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=350, agsize=268435392 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=93755080704, imaxpct=1
         =                       sunit=64     swidth=768 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: V10 & XFS - all there is to know (?)

Post by tsightler » 1 person likes this post

Code: Select all

# blockdev --getiomin --getioopt /dev/sdb
262144
3145728
Perfect, so exactly as expected: the minimum I/O size is the stripe unit size (256K per device) and the optimal I/O size is the full stripe width (3MB, which is 256K * 12). So that shows the hardware RAID is exposing the geometry correctly and there's no need to manually tweak anything - mkfs.xfs gets the correct values automatically.

Thanks for sharing that - it helps to know I wasn't crazy in my expectation that manual tweaking isn't typically required even for hardware RAID these days. I was about 99% sure that this was the case, and I'd bet there are still some corner cases where manual tuning is required, but this shows that at least the HP Smart Array controllers expose their optimal geometry information.