Howdy. First time poster.
Quick note before I continue on - I'm a bit unfamiliar with the Veeam scene and wasn't sure if this forum or the Veeam Community would be the best place to post but decided on here due to some (limited) familiarity. A good rule of thumb would be appreciated.
Before I get into the topic I should say something positive. The firewall rule documentation is pretty good - I was able to hand it to my network guys and they didn't need clarification on anything, so that's a good thing. The proxies are also easy/fast to update and that's always welcome. I also can't recall a time when a backup or restore failed due to the AHV proxy itself, so that inspires confidence.
I was also a lot more frustrated with the AHV proxy a few weeks ago, but a competent and patient support rep helped me along and addressed the majority of my concerns. I've had some pretty lackluster Veeam support experiences as recently as 18 months ago so this was a pleasant surprise. This is my biggest compliment with respect to the AHV backup - support was there when I needed it.
Also with VBR12 (and maybe v4 of the AHV proxy+plugin?) right around the corner hopefully a lot of this is basically a "fixed, just wait" answer.
Also also I'm certain I can't be the first customer to bring up these problems but who knows maybe I'm the only one with the patience to type it up.
Main Complaint
TL;DR - The certificate trust between the AHV proxy and the Nutanix Prism Element cluster is too easy to break and way too hard to fix.
My story begins like most of them - with me making a change whose knock-on effects I didn't fully understand. For that I accept at least some blame for not doing enough research. I was experimenting with automation (via REST calls) against our Nutanix clusters. The big prerequisite there is that you need trusted certificates installed across your clusters. So I went through that process (certs coming from our ADCS Enterprise PKI) and got that all sorted out. Lovely, now the red text in my browser is gone and I can move on with automation without creating more tech debt.
That broke Veeam big time. In the VBR console itself this isn't a difficult issue to resolve - you can click-next through the cluster edit screen and accept the new certificate manually to at least get you through the near term. In my case I also didn't have FQDNs for the clusters prior to the certificate setup so Veeam was still pointing at IPs instead of FQDNs. Even so, it isn't too hard to fix manually in the console and I don't have any recommendations for how to improve this feature. Additionally, from my basic testing it seems that VBR respects the CAs listed in the Windows trust store, so once you have things set up correctly this is really a non-issue.
What I do take issue with though is the proxies. It is basically impossible to re-create the certificate trust between the proxy and its cluster short of dropping into the shell on the proxies, enabling SSH, copying over your trusted CA, installing the CA, and then going through the entire proxy wizard process once more from VBR. Don't get me wrong - this isn't a lot of work when you know what to do but being able to get a prompt in the proxy's web interface saying "The cert changed, this is the new thumbprint/chain. Accept?" would be a game changer. At least let me get my backups going again so I can continue to achieve my RPOs while I fix up the underlying problem.
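To make the "this is the new thumbprint, accept?" idea concrete, here is a rough Python sketch of the check I have in mind. It is only an illustration, not how the proxy actually works: the function names are mine, the hostname in the comment is a placeholder, and I'm assuming Prism is reachable on port 9440.

```python
import hashlib
import ssl


def sha256_thumbprint(der_bytes: bytes) -> str:
    """Return the SHA-256 thumbprint of a DER-encoded cert as colon-separated hex."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_thumbprint(host: str, port: int = 9440) -> str:
    """Fetch whatever certificate the server presents (no validation) and fingerprint it."""
    # Port 9440 is an assumption for Prism; adjust for your environment.
    pem = ssl.get_server_certificate((host, port))
    return sha256_thumbprint(ssl.PEM_cert_to_DER_cert(pem))


def cert_changed(pinned_thumbprint: str, host: str, port: int = 9440) -> bool:
    """True when the presented cert no longer matches the thumbprint accepted earlier."""
    return fetch_thumbprint(host, port) != pinned_thumbprint


# Hypothetical usage (placeholder hostname):
#   if cert_changed(stored_thumbprint, "prism.example.com"):
#       show the new thumbprint and ask the admin to accept or reject it
```

The point is that detecting the change and showing the new thumbprint is cheap. Whether to trust it stays a human decision, the same as the click-through in the VBR console.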
There should also be a better way to manage (Create/Replace/Update/Delete) the root CAs trusted by a proxy. Dropping to a shell is not the long-term answer. I'd settle for a list in the web interface or an extra page in the proxy deploy/edit wizard to edit the CAs. By the way we have four clusters and four AHV proxies (soon to be six each). I can't imagine how larger Nutanix/Veeam customers would handle this short of some unsupported automation.
I need to ask - what was the plan for certificate renewal when these proxies were first designed/architected? The self-signed certs the Nutanix clusters come with are valid for 10 years. Yes - god forbid you are still running the exact same cluster in 10 years but still, what was the plan for when that cert naturally expires? Just let the backup jobs start to fail? Or was this a calculated risk by product development to say "Only a minority of customers use custom certificates, let's delay programming that logic until a future release." ? If that's the answer I can empathize but wow it feels rough to be on the receiving end of it.
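In the meantime, expiry is at least something you can watch for yourself. Below is a minimal Python sketch, assuming the cluster already presents a certificate that validates against a CA bundle you supply (the stock self-signed cert will fail validation here). The function names, the CA file argument, and port 9440 are my own assumptions, not anything from Veeam or Nutanix.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> int:
    """Whole days left before a cert's notAfter timestamp (negative once expired)."""
    expires = ssl.cert_time_to_seconds(not_after)  # e.g. "Jan  5 09:34:43 2018 GMT"
    current = time.time() if now is None else now
    return int((expires - current) // 86400)


def peer_not_after(host: str, port: int = 9440, cafile: Optional[str] = None) -> str:
    """Connect with validation enabled and return the peer cert's notAfter string."""
    # cafile would be your enterprise root CA in PEM form (assumption).
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Hypothetical usage (placeholder hostname and path):
#   remaining = days_until_expiry(peer_not_after("prism.example.com", cafile="root-ca.pem"))
#   if remaining < 30: raise an alert before the backup jobs start failing
```

A scheduled job running something like this per cluster would warn well ahead of time, which is roughly what I'd expect the proxy to do on its own.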
Smaller Requests
OK here's a number of other requests I have to improve the AHV proxy/plugin but honestly a lot of these boil down to "Create parity with VMware backups."
- Restore Logging - I don't remember in which cases exactly but I think some restore operations don't show up in a proxy's log (web interface "Events" specifically). I think it might be for an entire VM restore of a VMware backup to Nutanix. QA might want to double check.
- VeeamZIP - Noticed two things today (sample size of one problem). I don't see a way to apply retention or encryption to a VeeamZIP backup in the VBR console when the VM is on an AHV platform. Second, I don't think the backup job showed as a running job or in the "Events" page of the web interface. QA might want to double check.
- Proxy Domain Search/Suffix - It's silly to modify the /etc/hosts file in the year 2023 as the AHV 3.0 user guide says to do (Configuring Hostname Resolution). At the very least, recommend that customers set up FQDNs properly and configure search domains in /etc/systemd/resolved.conf. It's better in every way. Better still, add this configuration to the web interface.
- Proxy Multi-Homing - This probably isn't an officially supported configuration, but our AHV proxies all have two NICs. The first NIC is configured via the wizard as you'd expect - it has an address, default gateway, and DNS configured (though really the DNS is system-wide because Linux is sane compared to Windows...). That NIC is primarily for "management" traffic between VBR & AHV Proxy, DNS, Updates, SSH, web interface, etc. It's also on the same L2 & L3 network as the backup repos. We manually configure a second NIC which is on the exact same L2 & L3 network as the Nutanix cluster/CVMs. This second NIC doesn't have a default gateway assigned so there is no "route flapping". This cuts down the network traffic substantially (no crossing a gateway on its way to the Nutanix cluster) while still maintaining a good security posture. I'd love to see this become an officially supported configuration that can be set up through the web UI.
- VM Restore - When doing an entire VM restore (AHV to AHV) the VM is re-added to any previous data protection groups. That seems wrong to me or at least should be an option given to the user in the restore wizard. I have only restored VMs to the same cluster whence they came - I don't know what would happen if you restored the VM to a different cluster. Maybe it would incur an error as the data protection group wouldn't exist? I don't know. Worth asking QA to test.
- Restore Point Archival - Similar to VMware backups, please make RP archival & retention supported sooner rather than later. Taking VeeamZIPs or exporting backups is a silly workaround.
- Backup Job Scheduling - Similar to VMware backups, please allow the same backup options. In particular we have missed the "After this job" option from the VMware backup jobs.
- Backup File Encryption - Similar to VMware backups, please allow backup file encryption of AHV backups.
- Proxy Certificates - As long as we're on the topic of certificate woes, the proxies themselves have self-signed certificates. There must be a better way to handle that though there's definitely no one-size-fits-all solution here. I guess having different options would be nice. e.g. raw PEM uploads (yuck), LetsEncrypt, CSR generation, etc.
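To illustrate the multi-homing item above: the benefit comes from the CVMs being on-link for the second NIC, so that traffic never touches the default gateway on the first NIC. Here is a small Python sketch of that reasoning. All addresses and names are made up for the example and are not our real network.

```python
import ipaddress
from typing import Dict, Optional


def is_on_link(nic_cidr: str, target_ip: str) -> bool:
    """True when target_ip sits inside the NIC's own subnet, so no gateway hop is needed."""
    network = ipaddress.ip_interface(nic_cidr).network
    return ipaddress.ip_address(target_ip) in network


def nic_for(target_ip: str, nics: Dict[str, str]) -> Optional[str]:
    """Return the first NIC whose subnet contains target_ip, or None if it must be routed."""
    for name, cidr in nics.items():
        if is_on_link(cidr, target_ip):
            return name
    return None


# Hypothetical addressing: eth0 is the wizard-configured management NIC (has the
# default gateway), eth1 is the manually added NIC on the cluster/CVM network.
PROXY_NICS = {"eth0": "10.10.1.50/24", "eth1": "10.20.5.50/24"}

if __name__ == "__main__":
    for cvm in ["10.20.5.11", "10.20.5.12", "10.20.5.13"]:
        print(cvm, "->", nic_for(cvm, PROXY_NICS))
```

Anything `nic_for` returns `None` for would go out via the default gateway on the management NIC, which is what you'd want for everything except the cluster traffic.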
If nothing else, now I'm hooked and I'm keen to see how the product will improve over time. Feel free to ask follow-ups (or point out flaws in my thinking) and I'll do my best to respond.