kbr
Enthusiast
Posts: 25
Liked: never
Joined: Oct 09, 2020 7:36 am
Full Name: Karl
Contact:

Shared memory connection Errors during HPE Apollo Transform

Post by kbr »

Hi,

We have an environment running Veeam v10 on HPE Apollo 4200 nodes with a ReFS volume on a RAID6 set. We had to enlarge the RAID6 set to accommodate more data and went from ten 14 TB disks to twenty 14 TB disks, i.e. we added ten additional disks to the existing RAID6. This of course caused the RAID6 volume to start a transform action to re-stripe the data across the 20 disks. We first set the transform priority to High so it would get the transform out of the way quickly. We started Friday morning, and by Monday morning about 25% of the transform was done (so in total it would take about 11 days). But our backups no longer run correctly. Specific disks of specific VMs keep failing (the same VMs with the same disks) with the following errors:

Error: Shared memory connection has been forcibly closed by peer. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.
Error: Shared memory connection was closed. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.
Error: Shared memory connection was closed. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.
Error: Shared memory connection was closed. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.

We lowered the transform priority to the lowest possible setting, but that didn't clear the issue. So now we have to wait for the transform to complete, but in the meantime we are unable to complete all backups.
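
As a rough back-of-the-envelope estimate of how long the transform could take at this rate, using only the figures from the post above (roughly 25% done after about 3 days at High priority; these are assumptions, not measurements from the array), the sketch below lands close to the ~11 days mentioned; lowering the priority will only stretch it further:

```python
# Rough transform ETA estimate from the figures in the post above.
# Assumed values: ~25% complete after ~3 days (Friday morning -> Monday morning).
elapsed_days = 3.0      # time observed so far
fraction_done = 0.25    # reported transform progress

total_days = elapsed_days / fraction_done      # ~12 days end to end
remaining_days = total_days - elapsed_days     # ~9 more days at this rate

print(f"Estimated total transform time: {total_days:.0f} days")
print(f"Estimated time remaining:       {remaining_days:.0f} days")
```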

Does anybody have any useful insights? We have a support case open, and all they can tell us is that they see a performance bottleneck on the target (the Apollo), but we can't lower the transform priority any further. There's also no way to cancel it.
Gostev
Chief Product Officer
Posts: 31546
Liked: 6716 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by Gostev »

This got me scared for a moment because v11 had some massive changes in shared memory transport, but then I saw you're still on v10. Certainly no known issues with this engine there, so I would bet this is some hardware issue. v10 should be solid in your configuration (HPE Apollo-based Veeam appliances are very popular and super common).
CLDonohoe
Veeam Software
Posts: 36
Liked: 21 times
Joined: May 10, 2018 2:30 pm
Full Name: Christopher Donohoe
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by CLDonohoe »

Have you contacted HPE support regarding this performance? I wonder if HPE might advise you to raise the priority of the array transformation and just wait for it to complete. Not ideal, I agree, but they may consider this acceptable behavior while the RAID controller rewrites and rebalances data. Maybe I can get some useful feedback from HPE's field team.

What is the model of the RAID card in your 4200?
CLDonohoe
Veeam Software
Posts: 36
Liked: 21 times
Joined: May 10, 2018 2:30 pm
Full Name: Christopher Donohoe
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by CLDonohoe »

Immediate feedback from one of the HPE engineers suggests that your experience is expected. The better approach would have been to create a second logical drive, which would not have impacted performance on the existing logical drive, and then use both drives as extents of a SOBR.
FedericoV
Technology Partner
Posts: 35
Liked: 37 times
Joined: Aug 21, 2017 3:27 pm
Full Name: Federico Venier
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by FedericoV » 1 person likes this post

Yes, performance drops and the RAID reorganization is extremely slow.
My preferred best practice here would have been to create a new RAID6 volume and a new file system, and finally use SOBR to expand the Veeam backup repository.
This is a data-in-place expansion with no downtime. Furthermore, when you create the SOBR, VBR automatically updates the jobs, substituting the extent name with the SOBR name as the repository destination.

Have a look at this doc, at page 29. (it is for VBR v10): https://psnow.ext.hpe.com/doc/a50000150 ... -psnow-red

Forgive me if I go a little off topic, but I want to point out that even though disk failures are generally thought to be unrelated and independent, my real-life experience is different.
Sometimes disks do not fail independently, because there is a common cause we may not be aware of (overheating, vibration, spikes on the power line, a disk lot with a weakness, a firmware issue, ...).
For this reason, I'm not a fan of a RAID6 with 20 disks; I prefer to stay between 10 and 16. Don't get me wrong, it is fully supported, it is just my preference.
A RAID set has to survive disk failures. With 20 disks there is a higher chance that one disk fails, and since a rebuild takes about one day with 14 TB disks, there is a higher risk of a second failure during that window. A RAID6 with two failed disks is still alive, but much slower, and until the rebuild completes you are exposed (Murphy's law). A rough sketch of that exposure math is below.
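
A minimal back-of-the-envelope sketch of that risk, assuming independent failures, a 2% annualized failure rate per disk, and a one-day rebuild (all of these are illustrative assumptions, not vendor figures); correlated causes like the ones listed above would make the real numbers worse:

```python
# Back-of-the-envelope disk-failure exposure during a RAID6 rebuild.
# The AFR and rebuild time below are assumptions for illustration only.
afr = 0.02            # assumed annualized failure rate per disk (2%)
rebuild_days = 1.0    # assumed rebuild time for a 14 TB disk

def p_any_failure(disks: int, days: float) -> float:
    """Probability that at least one of `disks` independent drives
    fails within `days`, given the assumed AFR."""
    daily = 1 - (1 - afr) ** (1 / 365)
    return 1 - (1 - daily) ** (disks * days)

for n in (10, 16, 20):
    p_first = p_any_failure(n, 365)                # first failure within a year
    p_second = p_any_failure(n - 1, rebuild_days)  # second failure during the rebuild
    print(f"{n} disks: P(failure within a year) = {p_first:.1%}, "
          f"P(second failure during rebuild) = {p_second:.3%}")
```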
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by tsightler »

My preferred best practice here would have been to create a new RAID6 volume and a new file system, and finally use SOBR to expand the Veeam backup repository.
The only problem with the recommendation to add another extent is that it somewhat violates Veeam's best practice of one repo server = one SOBR extent, as this introduces operational challenges that may not be apparent at first. It's not so bad if you just go from one extent to two, but if you expand this way multiple times, the problems it introduces also multiply.

Regardless of vendor, I normally recommend accepting the short-term pain to avoid introducing long-term operational complexity, at least if you can afford to do so. Admittedly every customer situation is different, but I've watched too many customers try to use SOBR to address a short-term storage issue only to create significantly larger long-term issues.

In this case the transform is taking a lot longer than I would have anticipated, but perhaps that's due to expanding an existing RAID6. I've always recommended adding the new disks as a new RAID6 set and then striping with RAID60, although I'm not specifically familiar with the capabilities of the Apollo RAID controller here, or whether it can move from a single RAID6 set to a striped RAID60 set non-destructively. A rough capacity comparison of the two layouts is sketched below.
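
For a sense of the capacity trade-off between the two layouts (assuming 14 TB disks and ignoring file-system formatting overhead; these figures are illustrative, not from the thread), a wide RAID6 gives up two disks to parity while RAID60 built from two 10-disk RAID6 sets gives up four, but keeps each set narrower:

```python
# Usable-capacity comparison: one wide RAID6 vs. two RAID6 sets striped as RAID60.
# Assumes 20 x 14 TB disks and ignores file-system/formatting overhead.
disk_tb = 14
disks = 20

raid6_usable = (disks - 2) * disk_tb            # single RAID6: 2 parity disks total
raid60_usable = 2 * (disks // 2 - 2) * disk_tb  # 2 x RAID6(10), striped: 4 parity disks

print(f"Single RAID6 (20 disks):    {raid6_usable} TB usable")   # 252 TB
print(f"RAID60 (2 x 10-disk RAID6): {raid60_usable} TB usable")  # 224 TB
```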
kbr
Enthusiast
Posts: 25
Liked: never
Joined: Oct 09, 2020 7:36 am
Full Name: Karl
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by kbr »

Hi guys, thanks for the input. We discussed the SOBR option internally (we were not fully aware it would work). We asked HPE, and they told us the Apollo 4200 can do a 48 (!!!) disk RAID6 set, so expanding the array is not a problem in itself. Well, practice makes perfect: 11 days of failing backups (with the transform on the highest priority) is not an option, and even with the transform on the lowest priority it's not really working. So yes, an extra SOBR extent might have been the better option, but that option has passed now; we have to get the transform finished since we have no way to cancel it. I am interested, though, in tsightler's point that SOBR extents can also be a problem. Why is that?

Regarding the RAID controller: as far as I know (I can't check at the moment) it's an HPE Smart Array P816i-a controller.
tsightler
VP, Product Management
Posts: 6011
Liked: 2843 times
Joined: Jun 05, 2009 12:57 pm
Full Name: Tom Sightler
Contact:

Re: Shared memory connection Errors during HPE Apollo Transform

Post by tsightler » 2 people like this post

Yes, engineering maximums and practical best practices have a tendency to clash in the real world for sure. The manufacturer says my car can do ~150 MPH and accelerate from 0-60 in 4.8 seconds, but it's certainly not best practice to drive it that way all the time, although it is fun! It's kind of the same with 48 drives in a single RAID6 array, but without the fun part! :D

First, let me preface this by saying that I work almost exclusively with Veeam's enterprise customer base around the globe. Some of these customers have, quite literally, dozens of Apollo servers serving tens of PBs of data, and they perform exceptionally well, so I can say without question that the platform is very solid. None of my comments below are specific to Apollo; they are generic and apply regardless of platform.

SOBR is also a terrific technology, but like any technology it can be applied in ways that produce less than ideal results. I've seen environments where customers attempted to use SOBR to solve various storage limitations, especially storage consolidation, but that's not SOBR's strength. Can it be used for this? Sure, but only with limited success and some negative side effects. The strength of SOBR is its ability to aggregate lots of separate physical repositories into a single logical grouping, and it's pretty good at this. I've done whole sessions on "Veeam worst practices" back in the day, but I could probably do an entire session on "Veeam worst SOBR practices". Hmm, maybe I smell my next VeeamON session?

I'll try to summarize enough to give the idea:

1) For repos to work properly, task limits should be set appropriately, but these task limits are set per extent with SOBR. If I have a larger server that can handle, say, 64 tasks, but I have 4 extents, how do I allocate the tasks? Sure, I can put 16 on each, that's not so bad, but now, if placement decisions based on storage capacity mean that specific extents are full or hold fewer VMs, I can't use those task slots on the other extents, meaning I'm not using the full capacity of the server. Task slots are even more important for backup copy jobs and other functions.

2) It creates small "silos" of storage, each of which is performance limited, and it forces SOBR to make more storage placement decisions, giving it more opportunities to be wrong. Imagine a simple scenario where I have the option of 1 large 400GB extent or 4 smaller 100GB extents. SOBR will attempt to place backups on each of those extents based on slot availability, capacity, estimated backup size, etc. But every placement decision has a chance to be wrong, and smaller sizes simply leave less room for error. For example, let's assume those 4x 100GB extents run a full backup and SOBR made placement decisions and did a pretty good job of balancing across the repos; however, by bad luck 3-4 of the high change rate SQL servers ended up on Extent 1, while Extents 2-4 mostly got larger, low change rate SQL. Over time, due to the data locality preference required for block clone, those large incremental backups fill Extent 1 to 95% full, but Extents 2-4 are still 60% free. Eventually SOBR will be forced to put incrementals on Extents 2-4 because there's not enough space on Extent 1, but this breaks fast clone, so now the new full is built on Extent 2 using a full copy, taking longer and using even more space. If everything were a single extent, with all space consolidated in one place, this would be far easier (a rough simulation of this effect is sketched after this list).

3) Each individual extent is slower, limiting performance for large systems with many disks, for both backups and restores, especially instant restores, which are limited to the IOPS of the underlying storage.

4) It creates overhead in backup jobs. SOBR placement is not free: SOBR has to track used space and perform estimates to make informed placement decisions, and each of these takes time. It might not seem like much, but if you have a large number of extents (dozens) and a large number of VMs, it can add up to become an impactful part of the process.
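
To make point 2 concrete, here is a toy simulation. All sizes, change rates, and the "unlucky" initial placement are invented for illustration, and it does not model Veeam's actual placement algorithm; it only shows how small extents plus data locality can force spills (breaking fast clone) while a single extent of the same total capacity never does:

```python
# Toy illustration of point 2 above: capacity fragmentation across small SOBR
# extents vs. one large extent. Sizes, change rates, and the unlucky initial
# placement are made up; this is not Veeam's real placement logic.

EXTENT_GB = 100
RUNS = 10  # nightly incrementals after the initial full

# (chain name, full backup GB, incremental GB per run)
chains = [
    ("sql-hi-1", 20, 6), ("sql-hi-2", 20, 6), ("sql-hi-3", 20, 6),    # high change rate
    ("file-lo-1", 35, 1), ("file-lo-2", 35, 1), ("file-lo-3", 35, 1), # low change rate
]

# Unlucky-but-plausible initial placement of the fulls (per the scenario above):
# the three high-change chains all land on extent 0.
placement = {"sql-hi-1": 0, "sql-hi-2": 0, "sql-hi-3": 0,
             "file-lo-1": 1, "file-lo-2": 2, "file-lo-3": 3}

def simulate(num_extents, capacity_gb):
    used = [0.0] * num_extents
    spills = 0
    for name, full, _ in chains:
        used[placement[name] % num_extents] += full
    for _ in range(RUNS):
        for name, _, inc in chains:
            home = placement[name] % num_extents
            if used[home] + inc <= capacity_gb:
                used[home] += inc   # data locality kept, fast clone keeps working
            else:
                # forced onto another extent -> breaks fast clone for this chain
                spills += 1
                target = used.index(min(used))
                used[target] += inc
    return used, spills

for label, n, cap in [("4 x 100 GB extents", 4, EXTENT_GB),
                      ("1 x 400 GB extent  ", 1, 4 * EXTENT_GB)]:
    used, spills = simulate(n, cap)
    pretty = ", ".join(f"{u:.0f}" for u in used)
    print(f"{label}: used GB per extent = [{pretty}], spilled increments = {spills}")
```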

Of course there are some benefits as well, a smaller failure domain for example, and if your environment is never going to be more than 1-2 appliances and 4-8 SOBR extents, and you don't have monster VMs with high change rates, then it probably doesn't matter too much; SOBR will likely do a good job for you no matter what, and any issues are easily manageable. But if you are trying to build a resilient large-scale solution with the least amount of management overhead, using the minimum number of SOBR extents you can get away with is one of the keys. I strongly recommend one extent per physical repository server; more than that and I'd want to hear a very good justification.