-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Can somebody that already has some experience give some feedback about the Apollo 4510 with a large amount of disks? The price seems to be fair, even comparable to similar Supermicro or Cisco boxes.
-
- Enthusiast
- Posts: 75
- Liked: 5 times
- Joined: Aug 08, 2018 10:19 am
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Wait a minute... So ReFS with Windows Dedup (aka you shouldn't do this) or with Veeam Dedup?
FedericoV wrote: ↑Feb 15, 2021 8:32 am
How many Disk-array controllers?
This configuration has 2 "HPE Smart Array P408i-p SR Gen10" controllers. I have tested that a single controller cannot write more than 3GB/s. For this reason, I installed two controllers, and this gave me about 6GB/s of write throughput (remember, data is compressed and deduped, so the backup speed is about 2x higher). With 2 controllers it is possible to assign 30 spindles to each one. Each controller has one RAID60 on 2 RAID6 parity-groups of 14 disks each. Each controller is managing 29 disks: 28 for data plus one Hot Spare.
In total there are 58 x "HPE 16TB SAS 12G 7.2K LFF" for a net usable capacity of 768TB
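For anyone sanity-checking the quoted figure: with two 14-disk RAID6 groups per controller, each group effectively contributes 12 data disks, so a rough back-of-the-envelope check (not an exact formatted-capacity calculation) lines up with the 768TB number:
Code: Select all
# 2 controllers x 2 RAID6 groups x 12 data disks x 16TB per disk
echo "$((2 * 2 * 12 * 16)) TB"   # -> 768 TB, matching the quoted net usable capacity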
Any news on this? Want to buy a new server in the next months, and the 4200 Gen10 looks like a good alternative to our full-packed DL380 12LFF... Still not sure if I want the All-in-One again or go with XFS Immutable...
FedericoV wrote: ↑Feb 15, 2021 8:32 am
Are there smaller storage-optimized servers and smaller/scalable configuration options?
The 4U HPE Apollo 4510 Gen10 has a 2U brother, the Apollo 4200 Gen10. This server has 2 front drawers with 12 LFF disks, plus a rear cage for other 4LFF disks. In total there are 28 LFF slots.
On the Veeam V11 optimized configuration, the Apollo 4200 provides up to 320TB net usable.
There are multiple configuration options, based on smaller disks (the most common are 8,10,12,14,16TB), or with the internal disk slots only 50% populated, and ready for a future upgrade.
In the next few days, I'll complete my tests on the Apollo 4200 Gen 10 with V11, and I'll post an update on the performance.
-
- Chief Product Officer
- Posts: 31815
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
This talks about Veeam's own dedupe.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
As we are currently in the process of sizing an All-In-One host (well, two to be precise, as copy targets for ~1400 VMs, each with 768TB), I'm still wondering about the recommended memory configuration. According to https://vse.veeambp.com/ I'd need 3 proxies with 128GB each = ~400GB. Not sure if this already includes the ReFS overhead. How much would I need if there are offload tasks running on these repos? It's sometimes hard to calculate, given that there are backup jobs, copy jobs and even offload tasks.
-
- Chief Product Officer
- Posts: 31815
- Liked: 7302 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Actually, it is super easy to calculate, because every activity you mentioned requires a repository task slot. So if you set those to, for example, 16, you can be sure you will never have more than 16 concurrent activities across backup jobs, copy jobs and even offload tasks. In other words, you can limit Veeam to work well with whatever amount of RAM you end up with.
But it's hard to imagine why you would need more CPU and RAM than what was used in the configuration discussed in this thread, as that was enough to saturate a dedicated test environment where two fast storage arrays did nothing except serve data for the purpose of the test.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Well, a benchmark is a benchmark and does not necessarily translate 1:1 to the real world. I asked because in the past I got a lot of feedback from support that our 7 HW proxies with ~120 cores and >600GB RAM are not sufficient for what we do. And I do not want to discuss this again.
Take offload tasks: they do not generate heavy IO on the repository, but they occupy task slots and CPU/memory too. Sometimes I have >150 agent tasks on a single host due to offloading/backups/copies. This is also because Veeam does not load balance the tasks well over all available resources (current setup with SMB, fixed gateways/mount servers etc.). I can limit the concurrent tasks, but then the jobs take longer, and that is not what I want. For me it's not just about saturating the disks; these All-In-One servers have to be sized right for all tasks.
-
- Technology Partner
- Posts: 36
- Liked: 38 times
- Joined: Aug 21, 2017 3:27 pm
- Full Name: Federico Venier
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
When I compare the system resource utilization for a similar "lab" workload between V10 and V11, the difference is huge, with V11 using fewer resources in a more efficient way.
CPU: Before V11 the second CPU was not very effective at improving performance. The old best practice for Apollo 4000 was to not install it, because the resulting max performance was lower with 2 CPUs than with 1.
With V11 that has dramatically changed. The second CPU contributes to the performance, and the contribution does not come from the CPU alone, but also from the additional RAM bandwidth and PCI slots.
RAM: Before V11 the standard behavior was to use FS buffers for write operations. The effect, visible on the Resource Monitor - Memory tab, was that half the RAM was used as a write cache managed by the OS. I use the words "half the RAM" intentionally, because I tested configurations with different amounts of RAM and the result didn't change: roughly 50% was used as write cache, visible in Resource Monitor as an orange segment. Have you ever seen jobs in pre-V11 versions starting fast for 2-4 minutes, then slowing down for a while, and then showing performance moving up and down apparently without a reason? In my lab it was a common behavior, and I pulled out what little hair I have trying to find a solution. What surprised me about V11 is that the throughput remains stable and high.
I have just run a backup test to measure the RAM utilization. Before the job started, my "in use" RAM was 22,300MB. On a full backup with 42 concurrent threads I measured a peak at 58,400MB. I wanted to test an incremental, but in a lab there are not enough changes to make the test meaningful. In my tests, proxy and repo are on the same Apollo server. I hope this helps for your sizing.
Please note: 128GB is not a good number for Intel-based servers such as the Apollo. Each CPU has 6 memory channels and we want to use them all to maximize performance, so in total we want 12 DIMMs:
step 1: 12 x 8GB = 96GB
step 2: 12 x 16GB = 192GB - I like this better, also because the 16GB DIMMs do not cost twice as much as the 8GB ones
Yes, this is a lab, so I have removed the external bottlenecks: my 3PAR and Nimble, my SAN, LAN and ESXi servers have nothing else to do other than running backups. I understand that in production there might be fluctuations because of the production workload. I agree with you, NEVER SIZE ON MAX PERFORMANCE! In other words, if I see that the Apollo 4510 in steady state ingests backup data at 10.5GiB/s, I will not "sell" a backup window based on that speed. But knowing what the max speed could be, I have better knowledge and control of the environment, and I understand whether it is worth doing some more tuning or whether my HW is already giving all it can give.
P.S. Thanks for your interest in my lab testing!
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Certainly the lab test as designed is a throughput and I/O stress test, not a RAM usage stress test. The test dataset is 45 VMs of 5.5TB total size, so only ~120GB/VM. It's perfect for the test case, and the memory usage Federico notes above seems in line with expectations (about 850MB/task for combined proxy/repo datamover processes) for VMs of that average size. However, most mid-sized and larger environments don't have 45 systems of 120GB each; they instead have a few hundred to a few thousand systems that range in size anywhere from 10's/100's of GBs to 10's of TBs. The amount of RAM used per task grows as the size of the VMs being backed up gets larger, so for example, the very same test would use quite a bit more memory if those 45 VMs were 2TB in size, or 10TB in size, instead of 120GB.
Also, VM backups are only one type of backup; there are agent backups, NAS backups, backup copy jobs, health checks, all with different amounts of memory required and different possible limits. Best practice memory recommendations attempt to look at all of the possible use cases and come up with a baseline recommendation that will work for all cases, no matter what the environment or use case. Because larger systems take longer to back up, what happens in a lot of large environments is that, by the end of the backup window, the vast majority of task slots are used by monster VMs all using 2-4GB RAM/task. Because of these cases, and others, the best practice recommendations might skew larger than seems necessary when looking at a controlled lab case vs a real world scenario.
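To make that concrete, here is a back-of-the-envelope sketch using the per-task figures quoted in this thread (the 42-task count comes from Federico's test; this is an illustration, not an official sizing formula):
Code: Select all
# rough RAM consumed by datamover tasks alone, on top of OS/FS overhead
TASKS=42
echo "small VMs  : ~$(awk "BEGIN{print $TASKS*0.85}") GB"   # ~0.85GB/task -> ~36GB
echo "monster VMs: ~$((TASKS * 4)) GB"                      # up to ~4GB/task -> ~168GB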
Part of what I love about Apollo, and similar servers, is that it's difficult to buy them in configurations that don't include enough memory to meet best practice, unless you oversubscribe the tasks significantly.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
As I'm a bit unsure how much memory we need, I checked https://memoryconfigurator.hpe.com/. To me it looks like the 192GB configuration with 6x32GB instead of 12x16GB would probably not be more expensive, but has the benefit that I can upgrade to 12x32 (P00924-?21) = 384GB if needed (larger repo, longer retention, more VMs, larger VMs, whatever). I still don't get the huge price difference between the P00924-K21, P00924-H21 and P00924-B21 modules that are all the same according to the specs.
FedericoV wrote: ↑Feb 15, 2021 8:32 am
How much memory?
V11 is simply awesome with RAM utilization. If you are running V10, open Windows Resource Monitor on the Memory tab to monitor the RAM utilization. When your server is working at its maximum speed, you can see that the Orange ("Modified") portion of the bar often takes half of the available RAM. On V11 that orange segment is practically invisible.
During my tests, running up to 45 concurrent backup streams in "per-VM file" mode, I have seen the memory utilization stay below 45GB most of the time. Only in rare situations have I seen the utilization go above 100GB. For this reason, as a precaution, I suggest installing a little more than 100GB. The HPE Apollo 4510 Gen10 gives the best performance when there are exactly 12 DIMMs (6 per CPU).
My recommended RAM configuration is 12 * 16GB=192GB. Maybe 12*8GB=96GB is enough, but the savings are not worth the risk of slowdowns.
-
- Technology Partner
- Posts: 36
- Liked: 38 times
- Joined: Aug 21, 2017 3:27 pm
- Full Name: Federico Venier
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Hello Pirx,
To maximize performance we need to enable all 6 memory channels of each processor. This requires 12 DIMMs, whatever their size is.
The other requirement is to have enough memory.
If you install 3 DIMMs per CPU, you get less RAM bandwidth, but enough RAM capacity.
I do not know what the performance degradation would be. Maybe it is acceptable for your workload.
If you want to see more performance screenshots, have a look here (HPE site): https://community.hpe.com/t5/Around-the ... FUCjK9Khfd
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Federico, https://memoryconfigurator.hpe.com/ is a bit misleading here. Even if I choose performance mode for the memory, the tool gives me a configuration with 3x32GB/CPU for 192GB as the best option. Did I select the wrong server type (I checked the offer we have and there are 2 x XL450 Gen10 nodes in it)? On the page it shows the XL450 node with 8 channels, in the QuickSpecs it's 6 channels. I think I'm confused by 6 channels <-> 8 slots per CPU.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
In the meantime I got feedback that the HPE server memory configurator is wrong; it shows 8 channels/CPU, where the Intel Scalable CPU only has 6.
-
- Technology Partner
- Posts: 36
- Liked: 38 times
- Joined: Aug 21, 2017 3:27 pm
- Full Name: Federico Venier
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Yes, this is not the smartest tool.
I have run it now, and I have seen that if I ask for 192GB, it wants to move to the next step. Probably they coded a < rather than a <= in the logic.
Anyway, the fix is easy: I asked for 190GB, the tool rounded up by 2GB and the result looks correct.
P.S. Please teach me how you inserted the picture in your post!
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
I've sent you a PM
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
What would be the impact if we cannot start with V11? We seem to hit many corner cases, and not every hotfix we received for V10 is already fixed in V11 too. Is it just the much higher memory requirements? If we want to use Linux + XFS with V10, what will be the difference to V11 (immutability is not the goal right now)? The servers would be used as copy repositories, maybe also for some backup repos. I know that XFS + reflinks was supported in V10; I'm just looking at what has changed in Linux support other than the hardened repo + immutability. Is it worth waiting?
-
- Veeam ProPartner
- Posts: 59
- Liked: 40 times
- Joined: Jan 08, 2013 4:26 pm
- Full Name: Falk
- Location: Germany
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
I also took an "old" repository to V11 and configured the immutability settings afterwards.
If you are concerned about updating to V11, I would install the Linux server(s) with the correct settings and postpone hardening until the upgrade to V11 is complete.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Hardening is not that important for us. It's more about the performance of such a high-density server and what is supported in v10 compared to v11 with a Linux repo/proxy server.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Here are some numbers for Apollo 4510 with RHEL 8 and xfs + reflink for a single 28x16TB RAID60. Any hints regarding useful benchmarks appreciated.
Code: Select all
# fio --rw=readwrite --name=test --size=100G --direct=1 --bs=512k --numjobs=20
Run status group 0 (all jobs):
READ: bw=1527MiB/s (1602MB/s), 76.4MiB/s-348MiB/s (80.2MB/s-365MB/s), io=1000GiB (1074GB), run=146941-670315msec
WRITE: bw=1528MiB/s (1602MB/s), 76.3MiB/s-349MiB/s (80.0MB/s-366MB/s), io=1000GiB (1074GB), run=146941-670315msec
Disk stats (read/write):
sdb: ios=2047186/2048082, merge=119/120, ticks=7863530/1349428, in_queue=7637829, util=100.00%
Code: Select all
# fio --rw=write --name=test --size=100G --direct=1 --bs=512k --numjobs=20
Run status group 0 (all jobs):
WRITE: bw=2003MiB/s (2101MB/s), 100MiB/s-100MiB/s (105MB/s-105MB/s), io=2000GiB (2147GB), run=1021525-1022299msec
Disk stats (read/write):
sdb: ios=1/4095953, merge=0/239, ticks=0/20170621, in_queue=18120249, util=100.00%
Code: Select all
# fio --rw=read --name=test --size=100G --direct=1 --bs=512k --numjobs=20
Run status group 0 (all jobs):
READ: bw=1730MiB/s (1814MB/s), 86.5MiB/s-538MiB/s (90.7MB/s-564MB/s), io=2000GiB (2147GB), run=190303-1183552msec
Disk stats (read/write):
sdb: ios=4095969/6, merge=239/0, ticks=16619508/0, in_queue=14989359, util=100.00%
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
I'm not really sure what you mean by useful; personally, I find no benchmark particularly useful for predicting Veeam performance, if that's what you are looking for. The above numbers look pretty much in line with what I would expect from the hardware you have available.
I do have a personal favorite, though I know it's not a very popular choice, but I really like to run iozone in throughput mode with a parallel count similar to whatever task count I'm planning to use. I don't necessarily think the results are super interesting (it's not much different than any benchmark), but I've found it to be a good way to stress test the setup. If a system will run a highly parallel iozone throughput test for a few hours, it's probably pretty stable. I can't count the number of systems I've been able to completely crash with this test, but it's in the 100's after all these years. Something like:
Code: Select all
iozone -I -r 512k -t 8 -s 2g
That will start 8 threads (tasks), each using a 2GB file, so 16GB total, using O_DIRECT to bypass OS caching. Obviously for a server of your size you'd want something bigger: if you have 32 cores, maybe -t 32 and -s 4g, so that you push 128GB of data, or maybe even more. I personally still try to follow the 2x RAM rule, but these days that's getting tough because these boxes have so much RAM that those tests will run forever. The most interesting results are, IMO, the write throughput, read throughput and random reader throughput. Mixed workload isn't bad either, just to show how much your performance degrades with a mix of read/write at the same time (usually a LOT), but like I said, the most useful part to me is actually just that it works and that the performance numbers are roughly what I expect of the hardware I'm testing.
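As a hypothetical example of the scaled-up run Tom describes, sized for a 32-core box and leaning toward the 2x RAM rule, something like the sketch below; the thread count and per-thread file size are assumptions to adjust to your own core count and RAM:
Code: Select all
# illustrative scaled-up run, not a tested recommendation:
#   -I        use O_DIRECT to bypass the OS page cache
#   -r 512k   record size
#   -t 32     32 parallel threads, roughly one per planned task slot / core
#   -s 12g    12GB file per thread -> ~384GB total, aiming for ~2x RAM on a 192GB box
iozone -I -r 512k -t 32 -s 12g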
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Here are the iozone results https://pastebin.com/BC2NaHup. Nothing unexpected, it's nice that it does different tests in one run.
-
- VP, Product Management
- Posts: 6035
- Liked: 2860 times
- Joined: Jun 05, 2009 12:57 pm
- Full Name: Tom Sightler
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Yep, looks great, performance is perfectly in line with expectations for the hardware you have. You can pretty much assume 100MB/s per disk for sequential, somewhere around 35-50% of that for reverse/stride/random (although some controllers do better at recognizing and optimizing for reverse and stride reads), and around 20-25% for random writes and mixed workload.
Don't get me wrong, I love fio as well, but I just find iozone to be so easy, and I've used it for so long at this point that I can predict results and usually tell quickly if something seems unexpected. It does have some challenges when used with storage that can do compression/dedupe, so you have to be careful and set some extra parameters so that it writes data that can't be compressed/deduped, but otherwise it's my favorite smoke test benchmark to get a good, quick idea of what I can expect from the storage device in question.
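The exact iozone switches for non-compressible/non-dedupable buffers vary by version, so as a sketch, here is the same idea expressed with fio (already used earlier in this thread); the job name and values are illustrative:
Code: Select all
# sketch: write data that a compressing/deduplicating target cannot flatten,
# so the benchmark reflects the disks rather than the data reduction engine
fio --name=nocomp --rw=write --bs=512k --size=100G --numjobs=20 --direct=1 \
    --refill_buffers --buffer_compress_percentage=0 --dedupe_percentage=0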
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
I started the first Veeam backups today and saw some temperature warnings while the CPU was at 5%. Looks like some hiccup and not a real problem.
Code: Select all
# grep temperature /var/log/messages
May 12 19:07:11 sdeu2000 kernel: CPU46: Core temperature above threshold, cpu clock throttled (total events = 1)
May 12 19:07:11 sdeu2000 kernel: CPU98: Core temperature above threshold, cpu clock throttled (total events = 1)
....
May 12 19:07:11 sdeu2000 kernel: CPU95: Package temperature/speed normal
May 12 19:07:11 sdeu2000 kernel: CPU46: Core temperature/speed normal
May 12 19:07:11 sdeu2000 kernel: CPU102: Package temperature/speed normal
Code: Select all
                    CPU     %user     %nice   %system   %iowait    %steal     %idle
06:50:01 PM all 4.35 0.00 6.86 2.25 0.00 86.54
07:00:01 PM all 3.72 0.00 5.61 2.20 0.00 88.47
07:10:01 PM all 3.29 0.00 4.37 0.90 0.00 91.45
07:20:01 PM all 3.62 0.00 4.47 0.96 0.00 90.95
-
- Technology Partner
- Posts: 36
- Liked: 38 times
- Joined: Aug 21, 2017 3:27 pm
- Full Name: Federico Venier
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
The best test is running backup and restore. Playing with a beast like this, often the bottleneck is outside the server. In this case a synthetic benchmark is useful to predict how many MB/s the system would be able to move if we could feed it more data from the external infrastructure.
The synthetic test should be as close as possible to the production workload we want to simulate. Here I would use the same block size used by VBR, 1MB or 4MB, instead of 512KB (if I understood it correctly).
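For example, re-running one of the earlier fio tests with a 1MiB block size to follow that suggestion could look roughly like this (the job name is arbitrary; job count and size are simply carried over from the earlier commands, not tuned values):
Code: Select all
# same style as the earlier runs, but with a 1MiB block size to better resemble
# the block size suggested above
fio --rw=write --name=test-1m --size=100G --direct=1 --bs=1M --numjobs=20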
The tests above are on 1 XFS volume on 1 controller. Since the system has 2 controllers and another 28 active disks, I would stress them too.
An interesting test is to verify whether it is preferable to have 2 XFS volumes grouped by a SOBR, or 1 XFS file system spanning the 2 volumes (one from each controller) using LVM. On Windows I have seen 15% less backup speed with Dynamic Disks in stripe mode, but maybe Linux LVM works better. Anyway, once the backup speed is high enough for our goal, it is interesting to test the restore speed. Potentially a volume over all disks should give better random I/O. I say potentially because it depends on the queue depth of vPower NFS and of the application workload.
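A striped LVM layout across the two controller-backed devices could be sketched roughly as follows; the device names and stripe size are placeholders/assumptions, while reflink=1 is the XFS option Veeam's fast clone relies on:
Code: Select all
# sketch: one striped logical volume across both RAID60 block devices
pvcreate /dev/sdb /dev/sdc                       # placeholders for the two RAID60 LUNs
vgcreate vg_backup /dev/sdb /dev/sdc
lvcreate -n lv_backup -l 100%FREE -i 2 -I 256k vg_backup   # -i 2 stripes over both PVs
mkfs.xfs -b size=4096 -m reflink=1,crc=1 /dev/vg_backup/lv_backup
mkdir -p /mnt/backup && mount /dev/vg_backup/lv_backup /mnt/backup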
P.S. When making a real backup/restore test, it is important to use a dataset that produces a reasonable compression ratio. Reasonable is 2:1, or whatever is specific to your production mix.
Please, keep posting your results !
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
I cloned 5 of our jobs and used them for tests of the new Apollo servers. One was the backup target, the other the copy target. 4 of the 5 jobs have ~60 VMs and 20-30 TB of VM data, the last one just one large monster VM.
I was able to get 2 GB/s combined write speed to both RAID60 volumes. The problem is that the Apollo is currently connected only with 2x10GbE using Linux bonding mode 6 (40 GbE will be available later). Our current Windows proxies are sending the data from storage snapshots over the LAN to the backup Apollo. As I'm still on v10, I can't use the Apollo as a Linux proxy.
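For reference, bonding mode 6 (balance-alb) on a RHEL 8 box is typically set up along these lines with NetworkManager; the interface names are placeholders and the exact property syntax may differ slightly between nmcli versions:
Code: Select all
# sketch: two 10GbE ports in a balance-alb (mode 6) bond
nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=balance-alb,miimon=100"
nmcli con add type ethernet con-name bond0-port1 ifname ens1f0 master bond0   # ens1f0/ens1f1 are placeholders
nmcli con add type ethernet con-name bond0-port2 ifname ens1f1 master bond0
nmcli con up bond0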
I'm sure the throughput would be even higher with 40 GbE or direct backup from storage snapshot. Bottleneck for jobs was source or network.
One active full:
Code: Select all
# dstat -d -n 5
-dsk/total- -net/total-
read writ| recv send
2458B 1382M|1613M 17M
421k 1470M|1454M 17M
2014k 1795M|1481M 16M
3457k 793M|1492M 17M
11k 1660M|1415M 16M
506k 1156M|1492M 16M
2366k 1974M|1521M 17M
38k 766M|1523M 17M
235k 1958M|1600M 17M
10k 1348M|1443M 16M
18k 1395M|1392M 20M
2457B 880M|1458M 20M
2458B 1563M|1447M 23M
1638B 1562M|1441M 16M
2450k 206M|1469M 16M
12M 1895M|1517M 17M
29k 1510M|1454M 17M
412k 1380M|1360M 16M
384k 1420M|1632M 17M
819B 1469M|1565M 17M
0 1414M|1559M 17M
One inc:
- Parallel restore test of 3 VMs finished with ~150-200 MB/s per VM
- SureBackup job with 60 VMs is still running, but the current setup is suboptimal and, as others mentioned before, vPower NFS will probably be the bottleneck
- no meaningful performance analysis for copy jobs is possible, as bonding limits this even more (it is a 1-to-1 connection between 2 servers)
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
There is one thing I'm not yet decided on. We now have 2 servers that were planned as copy targets. We need new storage for backup jobs too, so we could go with 4 additional servers (as we need proxy resources too). We copy backups from DC1 to DC2, so we need 1 backup SOBR and 1 copy SOBR at each DC.
Option 1
- use 1 server at each location with 2 RAID60 volumes as a dedicated copy target. 1 dedicated server = 1 SOBR with 2 volumes/extents
- use 2 servers at each location with 2 RAID60 volumes each as dedicated backup targets. 2 dedicated servers = 1 SOBR with 4 volumes/extents
- this is for me the "cleaner" solution as backup and copy resources are strictly separated
- if the copy Apollos should be used as proxy for backups too, they would read the data from storage snapshot and then transfer it to the other Apollo via LAN. Not optimal, one more hop.
Option 2
- use one of the two volumes from all 3 servers at one location for backup and the other for copy. 3 servers = 2 SOBRs
- each server would have one backup and one copy volume
- this can get a bit confusing, as each server at one location is both a backup and a copy target
- resources would be used better, as all 3 hosts at one location would be proxies that read from storage snapshots to local disk, with no additional hop. Better performance, and for copies more redundancy, as 3 servers instead of 1 are used at each location
- the amount of storage needed for backup and copy is different, so we would have 2 RAID60 volumes of different sizes in each server - or waste a lot of storage
-
- Product Manager
- Posts: 14844
- Liked: 3086 times
- Joined: Sep 01, 2014 11:46 am
- Full Name: Hannes Kasparick
- Location: Austria
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Hello,
I'm glad to hear that you got servers and that the first tests look good so far.
Sounds like you feel better with option 1. Then option 1 is what I would recommend (I would also use that option, for the reasons you mention).
With performance... I remember two things about performance... "fast enough" and "too slow". As long as "fast enough" applies, everything should be okay.
Best regards,
Hannes
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
It's not so much about performance, more about redundancy. Just one server on each side for copies seems a bit optimistic. Even with a high support level, solving problems can take days, and if the OS has a problem, even longer. I think the approach with one volume of each server for backup and the other for copy is not too complicated, as long as the volumes and extents have proper names.
-
- Technology Partner
- Posts: 36
- Liked: 38 times
- Joined: Aug 21, 2017 3:27 pm
- Full Name: Federico Venier
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
Pirx, I like the idea of distributing the workload over 3 servers per site. If a server fails, the SOBR will distribute the workload to the remaining 2 extents and the degradation is kept to a minimum.
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
For backup we planned with 2+ servers anyway, but we did not think about using the same server as backup and copy repo.
Did I mention that I hate Visio...
-
- Veteran
- Posts: 599
- Liked: 87 times
- Joined: Dec 20, 2015 6:24 pm
- Contact:
Re: Veeam v11 - HPE Apollo 4510 test
I have to see if it makes sense to set proxy affinity for each server, so that there are no writes from proxy A to repo B via LAN on each side. But this would be a waste of CPU slots... once we have 40GbE this should not be an issue anymore.