Host-based backup of VMware vSphere VMs.
StevenMeier
Enthusiast
Posts: 85
Liked: 31 times
Joined: Apr 22, 2016 1:06 am
Full Name: Steven Meier

A Story of support woes and resolutions

Post by StevenMeier » 28 people like this post

This is a long story of a support issue (shortened here for brevity).
The issue has been going on since February 2019.
It has involved:
HPE Level 2 and 3
Veeam Level 2 and 3
Microsoft Level 3 debug team

The beginning...
We installed Veeam as a greenfield install to replace an old backup infrastructure.
This was a new role for me, and as I had extensive Veeam experience from previous roles I was given the task.
Scoping and purchasing the gear went well and we started the implementation.

Summary
2 datacenters at 2 different geographic locations.
At each site:
1 x HPE Apollo 4510 Gen10 server (60 x 12 TB disks, 128 GB RAM, 2 x 12-core Xeon processors, 2-port 16 Gb fibre card, 2 x 10 Gb Ethernet ports). OS is Windows Server 2019 Standard edition.
1 x HPE StoreOnce 5250 device for deduplication.
A single Veeam management server (VM) running Windows Server 2019 and SQL Server 2016 Standard edition.

Testing worked well, backups flew, and everything was working great during the bedding-in process.
We increased the load to between half and two thirds of our total production VM count.
About 2 weeks in, one of the physical servers hung... not blue-screened, just hung.
No RDP connectivity, and from the iLO no mouse or keyboard movement. No response at all.
We powered it off and back on... backups continued where they had left off.
Nothing in the Windows logs or the HPE hardware logs.
We logged a job with HPE... so over the course of the next month we checked and applied firmware, replaced parts, updated drivers and so on, all with help from HPE (great support and no issues with their help). We would go for periods with no issue and think each minor change had addressed it... but then one of the Apollos would hang again at one of the sites after a week or a few weeks, and we were back to square one.
As there were no blue screens, we decided that next time we would use the NMI option in the iLO to generate a memory dump file in Windows.
At this stage we could not force a hang or predict when one would happen; it was random. It might not happen for 3 weeks, then we would get 2 in 2 days.
The only common factor seemed to be that it always happened during backup copy jobs (large amounts of disk I/O).
We used tools to hammer the disk subsystem for hours on end and it never had an issue.
So over the course of the next few weeks, whenever either box hung we triggered the NMI and generated a dump file. HPE would look and could find nothing hardware related: no faults in the logs, the AHS logs, or the output of some other tools they got me to run.
By this stage I was also working with Veeam, as we only seemed to see this during a backup copy job. I was sharing info with HPE and Veeam, with no answers yet.
Veeam suggested we log a call with MS and ask them to look at the dump files we had collected.
So I did this.
(So I have HPE Level 2 and 3, Veeam Level 2 and the MS dump team... imagine the emails I was sending and receiving.)
After many emails we finally got the dump analysis team involved and sent them the files I had, plus new ones as they occurred.
At this stage both Veeam and I were confident it was not hardware. I asked HPE not to close the call just in case: MS were now involved, and if they found a hardware driver issue we could continue down that track. They said OK, no problem. (This is 4 or 5 months in now.)
MS spent weeks looking at the files and eventually came back with: "it's the VeeamAgent process", it's locking the CPUs for too long... blah blah blah (cutting it short for brevity).
Many emails went back and forth between Veeam, me and MS... I tended to agree with Veeam that it made no sense for the Veeam agent to be the cause.
I was very annoyed with MS at one point and had a phone conference with the team leader of the dump team and a manager. I explained that MS was not answering Veeam's questions in enough detail for Veeam to understand their reasoning and look into what MS perceived as the root cause... and to that point I could not make sense of MS's logic either.
My Veeam tech engaged Veeam Level 3, and they spent time going through all my dump files, as we did not seem to be getting any further with the MS dump team (the experts, we thought). We had not done this earlier because we believed MS would be best placed to do it. In every case Veeam found a file/process associated with the Windows Defender AV service.
When you have 24 hyper-threaded cores and a dump file shows 47 or 48 of the logical cores hanging on the same process... it makes you wonder (why did MS not report this?).
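As a rough illustration of that sanity check (not the tooling Veeam or MS actually used): assuming you have already pulled the top module on each logical CPU's stack out of a dump into a simple mapping, a few lines of Python can tally them. The function name, the mapping format and the module names below are all made up for the example.

```python
from collections import Counter


def dominant_module(top_frames):
    """Return (module, share) for the module seen on the most logical CPUs.

    top_frames is a hypothetical mapping of logical CPU number -> name of
    the module at the top of that CPU's stack, extracted from a dump by
    whatever debugger you use.
    """
    counts = Counter(top_frames.values())
    module, hits = counts.most_common(1)[0]
    return module, hits / len(top_frames)


# Made-up sample data: 2 sockets x 12 cores x 2 threads = 48 logical CPUs,
# with 47 of them sitting in the same (invented) module name.
LOGICAL_CPUS = 2 * 12 * 2
sample = {cpu: "av_filter_example" for cpu in range(LOGICAL_CPUS)}
sample[0] = "other_example"

module, share = dominant_module(sample)
print(f"{module} is on top of {share:.0%} of {LOGICAL_CPUS} logical CPUs")
```

If one module dominates nearly every CPU like this, it is at least worth asking the vendor about before accepting a different root cause.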
Before you ask: yes, everything Veeam requires to be excluded from AV was excluded, and we checked many, many times.
We disabled real-time AV... after 2 weeks we had a hang. So we removed it completely via add/remove roles and features. (7-month mark.)
It's been 4 months; CPU utilisation is lower and we have had no hangs. Veeam and I are still working on it and keeping a close eye on things. I have had words with MS, and while the job is open I have stated we will be asking for a credit if it turns out to be an MS issue.
While this issue is still ongoing (not closed yet), I must say Veeam Support has been fantastic. Clark, my Veeam guy, has been great and patient to work with. This has been a tricky issue, and HPE and Veeam have both gone the extra mile.
MS... well, unimpressed is the only word I have. When I said in a phone meeting that we believed the root cause was the AV process (long chat), their response was: "if you remove Veeam the server will work fine". My response was: "so the server has to sit there and do nothing for MS to accept blame if it crashes?" Basically their logic was that you could not use the server, and because it would then not crash, the server OS is not the cause.

When we did have hangs I would generate the dumps and restart the server(s), and the backups would continue from where they had stopped... never missed a beat from that point of view. Since the issue was so random, we went to production anyway and removed our old backups completely.
Even with a random issue ongoing, we are still far better off and better protected than we were with our old system.
My reason for writing this is partly to make folks aware that MS cannot be everything to everyone, so be careful. Their support was aimed more at pointing the finger, whereas HPE and Veeam just wanted info so they could take ownership and solve the issue... MS were just interested in closing the job and making sure nothing pointed at the OS (in my opinion).
It's been a learning experience for me. In 25 years in IT I have logged many support calls with vendors, not so many with MS... but the MS attitude is very old school: point the finger rather than work together to solve the issue.
Great experience with Veeam and HPE; completely unimpressed with MS.

The latest update: system stable, no hangs, no Windows AV installed, CPU utilisation low... there has been talk that the cause seems to be related to Windows AV and ReFS (we use ReFS volumes).
So folks, it has been interesting...
Cheers
Frosty
Expert
Posts: 201
Liked: 45 times
Joined: Dec 22, 2009 9:00 pm
Full Name: Stephen Frost

Re: A Story of support woes and resolutions

Post by Frosty »

Where's the LIKE button?! ;) Thanks for sharing.
iknowtech
Service Provider
Posts: 76
Liked: 19 times
Joined: Oct 26, 2017 1:11 am
Full Name: Jason Brantley

Re: A Story of support woes and resolutions

Post by iknowtech »

I have had way too many issues with the built-in Windows Defender on Windows Server 2016/2019. I remove it from all production servers immediately after installing the OS. At best your server is going to have performance issues; at worst you're going to have situations like your 7-month nightmare.
FedericoV
Technology Partner
Posts: 36
Liked: 38 times
Joined: Aug 21, 2017 3:27 pm
Full Name: Federico Venier

Re: A Story of support woes and resolutions

Post by FedericoV » 1 person likes this post

Steven, thanks for the great report!!!
In my HPE & Veeam integration lab (HPE internal), I have been working for months testing and optimizing the Apollo 4510 Gen10 and 4200 Gen10 for Veeam workloads.
I have seen the same lock-up you describe twice, once on the 4510 and once on the 4200, but, sadly, I have never been able to reproduce it. The feeling that there might be an unresolved instability hidden in my system never left my mind.
I'll include your finding in our white paper: "Reference Architecture for Apollo 4200 and 4510 Gen10" http://h20195.www2.hpe.com/V2/GetDocume ... 0000150enw
Cheers
Steve-nIP
Service Provider
Posts: 129
Liked: 59 times
Joined: Feb 06, 2018 10:08 am
Full Name: Steve

Re: A Story of support woes and resolutions

Post by Steve-nIP »

The Apollo 4510 has a hardware issue whereby a resettable fuse trips and cuts power to drives. I'm not sure whether your server has had the hardware fix yet (and it is a hardware fix), but it would be worth talking to HPE about it.
mweissen13
Enthusiast
Posts: 93
Liked: 54 times
Joined: Dec 28, 2017 3:22 pm
Full Name: Michael Weissenbacher

Re: A Story of support woes and resolutions

Post by mweissen13 »

Seems like I got lucky since the first thing I always do on all Veeam Servers is remove the Windows Defender Feature!
Great write-up and thanks for your insights into the support woes of MS.
Makes me wonder if Veeam shouldn't be (also) supporting Linux as a platform for B&R.
ejenner
Veteran
Posts: 636
Liked: 100 times
Joined: Mar 23, 2018 4:43 pm
Full Name: EJ
Location: London

Re: A Story of support woes and resolutions

Post by ejenner » 1 person likes this post

Another cause of hanging servers is the combination of ReFS and the System Center Configuration Manager service on Windows Server 2016 repository servers.

I discovered this once, then forgot about it. After a long period without any crashes, the service was automatically reinstalled on all servers and the crashes came back. I spent a few weeks hoping they were a one-off and trying to ignore them before I suddenly remembered the incompatibility. I removed the CCM service/agent and the problem was fixed... again.

Suffice to say, I've proved this at least twice at my site on all my repository servers.

Just thought that would be useful in case anyone finds this thread during searches, etc.
albertwt
Veteran
Posts: 942
Liked: 53 times
Joined: Nov 05, 2009 12:24 pm
Location: Sydney, NSW

Re: A Story of support woes and resolutions

Post by albertwt »

Thanks for sharing, @StevenMeier.

I have found that Microsoft Support has been consistently challenging and frustrating lately, given they have so many smaller departments each handling cases for specific Microsoft product features.

Logging a case will cost you $$$ on a per-incident basis.
--
/* Veeam software enthusiast user & supporter ! */