[Long] AHV Proxy/Plugin Feature Requests & Complaints

Post by **HYF_JE** » Jan 25, 2023 12:34 am

Preamble

Howdy. First time poster.

Quick note before I continue on - I'm a bit unfamiliar with the Veeam scene and wasn't sure if this forum or the Veeam Community would be the best place to post but decided on here due to some (limited) familiarity. A good rule of thumb would be appreciated.

Before I get into the topic I should say something positive. The firewall rule documentation is pretty good - I was able to hand that to my network guys and they didn't need clarification on anything, so that's a good thing. The proxies are also easy/fast to update and that's always welcome. I also can't recall a time where a backup or restore failed due to the AHV proxy itself, so that inspires confidence.

I was also a lot more frustrated a few weeks ago with the AHV proxy but a competent and patient support rep helped me along and addressed the majority of my concerns. I've had some pretty lackluster Veeam support experiences as recently as 18 months ago so this was a pleasant surprise. This is my biggest compliment with respect to the AHV backup - support was there when I needed it.

Also with VBR12 (and maybe v4 of the AHV proxy+plugin?) right around the corner hopefully a lot of this is basically a "fixed, just wait" answer.

Also also I'm certain I can't be the first customer to bring up these problems but who knows maybe I'm the only one with the patience to type it up.

Main Complaint

TL;DR - The certificate trust between the AHV proxy and the Nutanix Prism Element cluster is too easy to break and way too hard to fix.

My story begins like most of them - with me making a change I didn't fully understand the knock-on effects of. For that I accept at least some blame for not doing enough research. I was experimenting with automation (via REST calls) against our Nutanix clusters. Big pre-requisite there though is that you need trusted certificates installed across your clusters. So I went through that process (certs coming from our ADCS Enterprise PKI) and got that all sorted out. Lovely, now the red text in my browser is gone and I can move on with automation without creating more tech debt.

That broke Veeam big time. In the VBR console itself this isn't a difficult issue to resolve - you can click-next through the cluster edit screen and accept the new certificate manually to at least get you through the near term. In my case I also didn't have FQDNs for the clusters prior to the certificate setup so Veeam was still pointing at IPs instead of FQDNs. Even so, it isn't too hard to manually fix up in the console and I don't have any recommendations for how to improve this feature. Additionally, from my basic testing it seems that VBR respects the CAs listed in the Windows trust store, so once you have things set up correctly this is really a non-issue.

What I do take issue with though is the proxies. It is basically impossible to re-create the certificate trust between the proxy and its cluster short of dropping into the shell on the proxies, enabling SSH, copying over your trusted CA, installing the CA, and then going through the entire proxy wizard process once more from VBR. Don't get me wrong - this isn't a lot of work when you know what to do but being able to get a prompt in the proxy's web interface saying "The cert changed, this is the new thumbprint/chain. Accept?" would be a game changer. At least let me get my backups going again so I can continue to achieve my RPOs while I fix up the underlying problem.
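
To make the "the cert changed, this is the new thumbprint" prompt concrete, here is a minimal Python sketch of how a thumbprint could be read from whatever certificate a host currently presents. It is only an illustration and not how the AHV proxy works internally: the function names are made up for this example, and port 9440 is assumed to be where Prism Element listens.

```python
import hashlib
import ssl


def thumbprint_from_pem(pem_cert: str) -> str:
    """Return the SHA-256 thumbprint of a PEM certificate as colon-separated hex."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_thumbprint(host: str, port: int = 9440) -> str:
    """Fetch the certificate a server presents, WITHOUT validating it, and fingerprint it.

    The default port is an assumption about where Prism Element listens.
    """
    pem = ssl.get_server_certificate((host, port))
    return thumbprint_from_pem(pem)
```

Run from a machine that can reach the cluster, `fetch_thumbprint("prism.example.internal")` (a placeholder host name) would return the value an admin could compare against the thumbprint shown in Prism before accepting it.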

There should also be a way to manage (Create/Replace/Update/Delete) the root CAs trusted by a proxy better. Dropping to a shell is not the long-term answer. I'd settle for a list in the web interface or an extra page in the proxy deploy/edit wizard to edit the CAs. By the way we have four clusters and four AHV proxies (soon to be six each). I can't imagine how larger Nutanix/Veeam customers would handle this short of some unsupported automation.

I need to ask - what was the plan for certificate renewal when these proxies were first designed/architected? The self-signed certs the Nutanix clusters come with are valid for 10 years. Yes - god forbid you are still running the exact same cluster in 10 years but still, what was the plan for when that cert naturally expires? Just let the backup jobs start to fail? Or was this a calculated risk by product development to say "Only a minority of customers use custom certificates, let's delay programming that logic until a future release." ? If that's the answer I can empathize but wow it feels rough to be on the receiving end of it.

Smaller Requests

OK here's a number of other requests I have to improve the AHV proxy/plugin but honestly a lot of these boil down to "Create parity with VMware backups."

Restore Logging - I don't remember in which cases exactly but I think some restore operations don't show up in a proxy's log (web interface "Events" specifically). I think it might be for an entire VM restore of a VMware backup to Nutanix. QA might want to double check.
VeeamZIP - Noticed two things today (sample size of one problem). I don't see a way to apply retention or encryption to a VeeamZIP backup in the VBR console when the VM is on an AHV platform. Second, I don't think the backup job showed as a running job or in the "Events" page of the web interface. QA might want to double check.
Proxy Domain Search/Suffix - It's silly to modify the /etc/hosts file in the year 2023 as the AHV 3.0 user guide says to do (Configuring Hostname Resolution). At the very least recommend that customers set up FQDNs properly and set up search domains in /etc/systemd/resolved.conf . It's better in every way. Better still, add this configuration to the web interface.
Proxy Multi-Homing - This probably isn't an officially supported configuration, but our AHV proxies all have two NICs. The first NIC is configured via the wizard as you'd expect - it has an address, default gateway, and DNS configured (though really the DNS is system-wide because Linux is sane compared to Windows...). That NIC is primarily for "management" traffic between VBR & AHV Proxy, DNS, Updates, SSH, web interface, etc. It's also on the same L2 & L3 network as the backup repos. We manually configure a second NIC which is on the exact same L2 & L3 network as the Nutanix cluster/CVMs. This second NIC doesn't have a default gateway assigned so there is no "route flapping". This cuts down the network traffic substantially (no crossing a gateway on its way to the Nutanix cluster) while still maintaining a good security posture. I'd love to see if this was an officially supported configuration and configurable through the web UI however.
VM Restore - When doing an entire VM restore (AHV to AHV) the VM is re-added to any previous data protection groups. That seems wrong to me or at least should be an option given to the user in the restore wizard. I have only restored VMs to the same cluster whence they came - I don't know what would happen if you restored the VM to a different cluster. Maybe it would incur an error as the data protection group wouldn't exist? I don't know. Worth asking QA to test.
Restore Point Archival - Similar to VMware backups, please make RP archival & retention supported sooner rather than later. Taking VeeamZIPs or exporting backups is a silly workaround.
Backup Job Scheduling - Similar to VMware backups, please allow the same backup options. In particular we have missed the "After this job" option from the VMware backup jobs.
Backup File Encryption - Similar to VMware backups, please allow backup file encryption of AHV backups.
Proxy Certificates - As long as we're on the topic of certificate woes, the proxies themselves have self-signed certificates. There must be a better way to handle that though there's definitely no one-size-fits-all solution here. I guess having different options would be nice. e.g. raw PEM uploads (yuck), LetsEncrypt, CSR generation, etc.
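
As a rough illustration of the Proxy Domain Search/Suffix item above, the following Python sketch renders a `[Resolve]` section using the `DNS=` and `Domains=` options that systemd-resolved reads from resolved.conf. The function name, domain names and addresses are hypothetical, and whether the proxy appliance tolerates this file being edited is an assumption, not something confirmed in this thread.

```python
def render_resolve_section(search_domains, dns_servers=()):
    """Render a [Resolve] section with search domains and optional DNS servers.

    Both options take space-separated lists in resolved.conf.
    """
    if not search_domains:
        raise ValueError("at least one search domain is required")
    lines = ["[Resolve]"]
    if dns_servers:
        lines.append("DNS=" + " ".join(dns_servers))
    lines.append("Domains=" + " ".join(search_domains))
    return "\n".join(lines) + "\n"


# Example values only; none of these come from a real environment.
print(render_resolve_section(["corp.example.com", "lab.example.com"], ["192.0.2.53"]))
```

With search domains in place, a short name such as a cluster host name would be resolved through DNS instead of a hand-maintained /etc/hosts entry.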

Conclusion

If nothing else, now I'm hooked and I'm keen to see how the product will improve over time. Feel free to ask follow-ups (or point out flaws in my thinking) and I'll do my best to respond.

Post by **HannesK** » Jan 26, 2023 9:33 am

Hello,
and welcome to the forums.

R&D forums are for situations where you like to talk with Product Management. Community is a customer community. Hope that helps

Main complaint, just to clarify: you changed the certificate on the Nutanix side to a valid certificate and that caused problems with us? Yes, we use the Windows trust store on Windows. I got lost, why the trust between AHV proxy and backup server broke. I remember, that's the case if one changes the Veeam internal certificate and one can fix it by going through the wizard again

Overall: I always recommend going with the self-signed certificate for Veeam internal communication. Here is why: if you give it an official CA certificate, keep in mind that VBR is a subordinate CA. That means the backup admin can now create valid certificates that are accepted by your whole domain, unless one uses extended key usage and the clients respect that option (browsers do, other software... maybe)

Encryption: is applied on repository level. Not on job level.

Multi-homing: that means you have a machine that can go around the firewall. I would send management traffic routed to avoid such "holes". But up to you.

VM restore: why is it wrong for you? For me it sounds logical to keep things "the same". Adding an option might be possible, but I cannot remember requests for that so far.

Restore point archival: what exactly is meant here? Move to archive tier? (AWS glacier, Azure archive?)

Scheduling: "run after" jobs are a bad practice in general. Which problem do you solve with that, that cannot be solved by scheduling the second job 1min later than the first job?

Restore logging: but you see everything in the VBR logs, right? if you remember which operation, then I will check after V12 / V4 is out.

Best regards,
Hannes

Post by **HYF_JE** » Jan 26, 2023 3:39 pm

HannesK wrote:R&D forums are for situations where you like to talk with Product Management. Community is a customer community. Hope that helps

That's what I needed, thanks!

HannesK wrote:Main complaint, just to clarify: you changed the certificate on the Nutanix side to a valid certificate and that caused problems with us? Yes, we use the Windows trust store on Windows. I got lost, why the trust between AHV proxy and backup server broke. I remember, that's the case if one changes the Veeam internal certificate and one can fix it by going through the wizard again

Maybe I made a typo in the original post but the trust I am referring to is the trust between the AHV Proxy and the Nutanix cluster, not the VBR server. More specifically, the AHV Proxy must be "pinning" the trust of the certificate from the Nutanix cluster during first setup. When/if the certificate on the Nutanix cluster changes (and can't be tied back to an already trusted root) then jobs begin to fail. That alone is not necessarily a problem or unexpected, but it would be nice to have a way to override the issue so that RPOs aren't missed.
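
The behavior described above can be sketched as a small decision function. This is a guess at the logic inferred from the symptoms, not the proxy's actual implementation: the enum, the function name and the idea of a "needs confirmation" state are all assumptions made for illustration.

```python
from enum import Enum


class TrustDecision(Enum):
    TRUSTED_PIN = "presented certificate matches the pinned thumbprint"
    TRUSTED_CA = "certificate changed but chains to an already trusted root"
    NEEDS_CONFIRMATION = "certificate changed and no trusted root vouches for it"


def _normalize(thumbprint: str) -> str:
    """Compare thumbprints regardless of colons or letter case."""
    return thumbprint.replace(":", "").lower()


def evaluate(pinned: str, presented: str, chains_to_trusted_root: bool) -> TrustDecision:
    """Decide how to treat the certificate a cluster presents (hypothetical logic)."""
    if _normalize(pinned) == _normalize(presented):
        return TrustDecision.TRUSTED_PIN
    if chains_to_trusted_root:
        return TrustDecision.TRUSTED_CA
    # This is the case where jobs fail today; the request is for a prompt
    # that lets an admin accept the new thumbprint instead.
    return TrustDecision.NEEDS_CONFIRMATION
```

In this sketch the third outcome is where an override would fit: surface the new thumbprint, let an admin accept it, and keep backups running while the CA trust is fixed properly.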

HannesK wrote:Overall: I always recommend going with the self-signed certificate for Veeam internal communication. Here is why: if you give it an official CA certificate, keep in mind that VBR is a subordinate CA. That means the backup admin can now create valid certificates that are accepted by your whole domain, unless one uses extended key usage and the clients respect that option (browsers do, other software... maybe)

This shouldn't apply to our scenario because as mentioned above, the trust between the AHV Proxy + the VBR wasn't broken.

HannesK wrote:Encryption: is applied on repository level. Not on job level.

So this is where my understanding of Veeam starts to get into territory I'm not super familiar with. We use SOBRs but from a quick review of the settings/options available when editing both individual extents and the SOBR as a whole, I don't see a way to set encryption for the individual extents. There is a way to set up encryption on a capacity (Object storage) repo but not on the performance repo (we use Linux/Immutable/XFS repos for the performance extents). I'll be the first to admit this could be PEBKAC.

HannesK wrote:VM restore: why is it wrong for you? For me it sounds logical to keep things "the same". Adding an option might be possible, but I cannot remember requests for that so far.

I was doing entire VM restore testing, so I was restoring VMs and then deleting them right after confirming the restores were OK. It's simply a minor annoyance to get an alert (from Nutanix) saying that the cluster failed to protect VMs that no longer exist. Again a simple toggle/checkbox would be appreciated but this is a really minor gripe.

HannesK wrote:Restore point archival: what exactly is meant here? Move to archive tier? (AWS glacier, Azure archive?)

I mean if you edit a VMware backup job, on the storage page of the wizard there's a "Keep certain full backups longer for archival purposes" option, e.g. keep 12 weekly full backups, keep 6 monthly backups, keep 5 yearly backups. There's no equivalent in the AHV backup jobs.
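
For readers unfamiliar with that kind of retention, here is a deliberately simplified Python sketch of the idea using the 12 weekly / 6 monthly / 5 yearly numbers from the example. It is not Veeam's actual selection algorithm: the function name is hypothetical, and the rule of keeping the newest full backup in each of the most recent periods is an assumption chosen for brevity.

```python
from datetime import date, timedelta


def gfs_keep(fulls, weekly=12, monthly=6, yearly=5):
    """Return the full-backup dates a simplified GFS-style policy would retain."""
    rules = (
        (lambda d: tuple(d.isocalendar())[:2], weekly),  # (ISO year, ISO week)
        (lambda d: (d.year, d.month), monthly),
        (lambda d: (d.year,), yearly),
    )
    keep = set()
    for period_of, count in rules:
        if count <= 0:
            continue
        newest = {}
        for d in sorted(fulls):
            newest[period_of(d)] = d  # later dates overwrite earlier ones
        # Keep the newest full from each of the `count` most recent periods.
        for period in sorted(newest)[-count:]:
            keep.add(newest[period])
    return keep


# Made-up data: one full backup every Sunday for 60 weeks.
fulls = [date(2023, 1, 1) + timedelta(days=7 * i) for i in range(60)]
print(len(gfs_keep(fulls)), "of", len(fulls), "full backups retained")
```

The point of the feature request is that this selection happens inside the job's own retention, without exporting backups or taking VeeamZIPs by hand.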

HannesK wrote:Scheduling: "run after" jobs are a bad practice in general. Which problem do you solve with that, that cannot be solved by scheduling the second job 1min later than the first job?

Honestly I don't think there's an exact problem I can highlight or justification for why the VMware jobs were set up this way (before my time). If it's not a recommended thing to do then just ignore this request.

HannesK wrote:Restore logging: but you see everything in the VBR logs, right? if you remember which operation, then I will check after V12 / V4 is out.

Correct everything shows in the VBR logs, it's just nice to see it sometimes from the proxy's perspective. I definitely don't recall the exact jobs off the top of my head but I will update this thread as I observe them again in the wild.

Thanks for your attention!

Post by **HannesK** » Jan 26, 2023 4:18 pm

thank you, I think I got it

HYF_JE wrote:the trust I am referring to is the trust between the AHV Proxy and the Nutanix cluster

Ah sorry, misunderstanding on my side. It fails, because the new certificate is not trusted, because the proxy does not know about the Windows CA. Going through the cluster wizard again should solve that. It should ask to confirm the new certificate. Did that not work?

Encryption settings are in the access permissions of the scale out repository (not the extents)

HYF_JE wrote:Simply a minor annoyance to get an alert (from Nutanix) saying that the cluster failed to protect VMs that no longer exist. Again a simple toggle/checkbox would be appreciated but this is a really minor gripe.

makes sense, yes. I will talk to my colleagues about that

"Keep certain full backups longer for archival purposes": that's in V4 / V12

Post by **HYF_JE** » Jan 26, 2023 8:00 pm

HannesK wrote:Going through the cluster wizard again should solve that. It should ask to confirm the new certificate. Did that not work?

Yes and no. Going through the cluster wizard will solve it so long as you've already done the steps I mentioned before (accessing the proxy, enabling SSH, copying over the CA certificate, installing the certificate). That's a lot of steps if you're unfamiliar with the proxies and how to fix this error. The wizard may at some point ask you to confirm a certificate, but that's only the certificate of the management web UI for the Veeam AHV Proxy not for the Nutanix cluster.

HannesK wrote:Encryption settings are in the access permissions of the scale out repository (not the extents)

A-ha! Perfect. Today I learned, thanks. I'll have to put this on the "follow up" list.

HannesK wrote:makes sense, yes. I will talk to my colleagues about that

Wonderful, thanks.

HannesK wrote:"Keep certain full backups longer for archival purposes": that's in V4 / V12

Wonderful news.

Post by **SilkBC74** » Sep 22, 2023 11:53 pm

HannesK wrote: ↑Jan 26, 2023 9:33 am Scheduling: "run after" jobs are a bad practice in general. Which problem do you solve with that, that cannot be solved by scheduling the second job 1min later than the first job?

I love this feature. It ensures that a particular backup job is not run until the first one completes

Post by **HannesK** » Sep 25, 2023 5:34 am

Hello Alan,
and welcome to the forums.

Okay, and which problem does it solve if a second backup job does not run if the first is still running?

Best regards,
Hannes

Post by **HYF_JE** » Sep 27, 2023 2:19 pm

I don't want to presume to speak for the previous commenter, but the ONE use case I could think of for the "run after" job scheduling goes like this:

Let there be an application that does not take well to snapshots with VSS integrations/VM stunning (I'm actually not sure how bad stunning is on AHV, if it exists at all). Let this application be set up in HA across multiple VMs.

Assume there is a desire to back up all these VMs around the same time to meet an RPO. But you can't back up *all* VMs simultaneously or else you risk stunning the application across all VMs that form the HA. Instead, having a string of jobs that run in series using something like "run after" logic is one way to get around this.

I admit, this is very niche.
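
The difference between chaining and "start the second job 1min later" can be shown with a small Python sketch. The job durations are invented and the functions are hypothetical; this models only start and end times, not how VBR actually schedules or throttles jobs.

```python
def series_schedule(durations):
    """Start each job when the previous one ends. Returns (start, end) in minutes."""
    windows, clock = [], 0
    for minutes in durations:
        windows.append((clock, clock + minutes))
        clock += minutes
    return windows


def offset_schedule(durations, offset=1):
    """Start job i at a fixed offset of i * `offset` minutes, whatever came before."""
    return [(i * offset, i * offset + minutes) for i, minutes in enumerate(durations)]


def has_overlap(windows):
    """True if any two job windows run at the same time."""
    latest_end = None
    for start, end in sorted(windows):
        if latest_end is not None and start < latest_end:
            return True
        latest_end = end if latest_end is None else max(latest_end, end)
    return False


durations = [20, 15, 30]  # made-up minutes per job, one job per HA member
print(series_schedule(durations), has_overlap(series_schedule(durations)))
print(offset_schedule(durations), has_overlap(offset_schedule(durations)))
```

With fixed offsets the second job starts while the first is still running, so two HA members could be snapshotted at once; chaining guarantees that never happens, at the cost of a longer overall backup window.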

Post by **HannesK** » Sep 27, 2023 2:39 pm

but a valid use case. thanks for the details!
