Our known v12.1 problems in a large Hyper-V environment

Post by **jotge** » May 03, 2024 12:29 pm

Hello,

unfortunately we have some problems after the upgrade to version v12.1 and wanted to share our experiences here. Perhaps there are other environments that are experiencing similar problems.

First, our environment:

to be backed up:
- 16 Hyper-V clusters, 105 Hyper-V hosts in total, 1400 VMs in total
- 7x SQL Failover Cluster (Windows Agents)
- 29x servers (Windows Agents)

used for the backup:
- 6x physical repository servers (4x of which are also tape servers)
- 1x virtual Veeam backup server
- 1x virtual Veeam SQL database server (MS SQL Server 2019 Standard)
- 1x virtual Veeam Enterprise Manager Server
- 1x virtual Veeam ONE Server
- 1x Tape Library with 4 FC-connected LTO-8 drives (M8-labeled tapes)
- 122 backup jobs (95x HyperV, 27x Windows)
- 95 transaction log backup jobs
- 4 backup copy jobs
- 4 backup to tape jobs (source: backup copy)

After upgrading our environment to Veeam v12.1, we have the following problems that did not occur before:

- A single backup job starts sporadically without any processing taking place. However, the process consumes a lot of CPU resources on the backup server (case # 07248027)
The high utilization is also noticeable when operating the backup server. The backup job must be stopped manually in the Windows Process Manager, it cannot be stopped in the VBR Console. These are different jobs, it is not always the same one.

- SQL Restore in a Failover Cluster Environment fails (case # 07243819)
When attempting to restore a database in the failover cluster environment, regardless of whether it is to the original location or redirected, the following error message appears: "The specified drive letter is incorrect." The process cannot be continued.

- Veeam ONE Missing data for the performance counters (case # 07137635)
In addition to a "normal" display of the performance counters for many objects, sometimes no data is displayed at all or data is only displayed as a "flat line". This results in different displays over time. The phenomenon can occur permanently from a certain point in time or be limited in time. Some counters are not displayed at all for certain objects, but are displayed for other objects in the same category.

- We also notice that our backup to tape jobs get stuck and just don't do anything anymore. (no Veeam case open yet, but very likely the next one)
The backup to tape jobs use the backup copy jobs as a source. There is a 1:1 relationship.
If the backup to tape jobs "hang", this triggers a chain reaction in which the backup copy jobs and some backup jobs also hang. Presumably a resource problem. The problem is usually solved by simply terminating the backup to tape jobs.
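
For anyone trying to identify the runaway job process before ending it, here is a minimal PowerShell sketch. It assumes the per-job worker process is `Veeam.Backup.Manager.exe` (the process name Support mentions later in this thread) and that it runs in an elevated session on the backup server. It only lists candidates and does not terminate anything.

```powershell
# Assumption: each running job is hosted by a Veeam.Backup.Manager.exe process
# on the backup server (process name taken from Support's reply in this thread).
$procs = Get-CimInstance -ClassName Win32_Process -Filter "Name = 'Veeam.Backup.Manager.exe'"

$report = foreach ($proc in $procs) {
    # Get-Process exposes accumulated CPU seconds; the process may have exited meanwhile.
    $live = Get-Process -Id $proc.ProcessId -ErrorAction SilentlyContinue
    [pscustomobject]@{
        ProcessId   = $proc.ProcessId
        Started     = $proc.CreationDate
        CpuSeconds  = if ($live) { [math]::Round($live.CPU, 1) } else { $null }
        CommandLine = $proc.CommandLine
    }
}

# Highest CPU consumers first; the command line may help map a process to a job.
$report | Sort-Object -Property CpuSeconds -Descending | Format-List
```

Comparing the `Started` time with the job's start time in the VBR console is one way to narrow down which process belongs to the stuck job.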

Has anyone had similar experiences, perhaps in a similar infrastructure environment?

We are already working with Veeam support and hope to have a solution soon, which I would also share here.

Have a nice day

Jan

Post by **david.domask** » May 03, 2024 12:52 pm

Hi Jan,

Thank you for the detailed write up, and sorry to hear that these unexpected behaviors are causing headaches.

For the first two cases, I can see Support needs a bit more time to review the logs and focus the plan of action further, so your patience is much appreciated; neither behavior immediately comes to mind as recognized/widespread, so let's allow the Support Engineers some time to review the logs.

For the 3rd case (07137635), I can see this is with our Advanced Support Team and they have escalated it internally, so it looks like we've got the appropriate resources assigned; hopefully we'll see an update soon.

The tape job issue also is not expected, and I recommend another case for that as well; can you confirm though, the tape job starts (shows as running) but does not appear to progress within the UI, or it starts processing and hangs at some point? Does it at all seem to align with the high CPU usage during other jobs you see by chance?

Post by **SnakeSK** » May 03, 2024 7:09 pm

Just logged in to tell you that the locking up is also affecting a large portion of our customers. We have had cases open since February with no resolution. People on Reddit are having similar experiences.

Post by **jotge** » May 06, 2024 7:58 am

Hello David,

thanks for your quick response.

I hope we don't experience the same situation as SnakeSK, at least in the case of 07137635 it is developing in that direction, because this case has been open for quite a long time without a solution. The fact is that we use this tool really intensively, not only the backup administrators but also the application owners, so this is not a really nice situation.

To your question. The tape job starts and at some point this dropout occurs. So it's not that no data is being copied at all, it seems to stop sporadically. I can't say whether there is a connection with increased CPU utilization, as the job runs for a very long time and I don't observe it at night. We could of course use Veeam ONE and use the performance counters for analysis, but unfortunately the tool is currently only of limited use (see Case # 07137635).

Regards

Jan

Post by **david.domask** » May 06, 2024 8:58 am

Hi Jan,

I think there's no need to monitor the CPU usage like that for the tape job just yet, and Support will be able to start the check with just the job logs; I know you have a few cases open right now, but it's best that the Support team take a look at the tape job behavior also.

As for 07137635, it looks like the issue was escalated to Veeam R&D, so let's wait to see the results of the investigation there; your patience is much appreciated, and the right resources are aligned on the case to understand the behavior more clearly.

@SnakeSK, can you DM me the case numbers or post them here? I'd like to review the previous cases a bit.

Post by **SnakeSK** » May 06, 2024 9:04 am

07222212 and 07156879 and 07078243

Post by **david.domask** » May 06, 2024 10:30 am

Hi SnakeSK,

Thanks for sharing the cases -- I can see the first two were reported to have been solved after some changes, but the issue returned, which brought you to the current case 07222212.

The case is currently with our Advanced Technical Support team, and as I understand it, the plan is to perform a debugging dump of the Veeam.Backup.Manager process to understand the hangs a bit more. I know you've been working on this over a few cases with a few "false victories", and I appreciate your continued patience and cooperation. The engineer will need a bit of time to review the newest provided information, so please continue working with the Support Team on this one and let's see the results of the log/dump review.

Post by **SnakeSK** » May 06, 2024 3:07 pm

No, the current case is 07222212; dumps have been provided, so we will see. The problems started for numerous customers with V12.1, so it's definitely version-specific, not environment-specific.

Post by **david.domask** » May 06, 2024 3:22 pm

Ah, sorry about that, just a copy/paste error. I'll edit my post to reflect it correctly, and still let's allow the engineer time to review the provided dumps.

Post by **SnakeSK** » May 06, 2024 8:15 pm

jotge wrote: May 03, 2024 12:29 pm
- Veeam ONE Missing data for the performance counters (case # 07137635)
In addition to a "normal" display of the performance counters for many objects, sometimes no data is displayed at all or data is only displayed as a "flat line". This results in different displays over time. The phenomenon can occur permanently from a certain point in time or be limited in time. Some counters are not displayed at all for certain objects, but are displayed for other objects in the same category.

Can you try this? This resolved our VOne performance monitoring problems.

For testing purposes, please change the performance data collection method in Veeam ONE from "perfmon" to WMI:

- On the Veeam ONE server machine, please open "regedit" and go to HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam ONE Monitor\Service

- Create or modify the following entry:

Name: HyperVCollectionType

Type: REG_DWORD

Value Data: 1

- Restart the "VeeamDCS" service.

Please wait for 20–30 minutes and check if the performance data is collected for VMs.

We have had a VOne case open since December (nearly half a year): when utilizing perfmon, some data was straight up missing or not collected at all, even after a reboot, the service could not be stopped, etc. Case nr 07051424.

The reg key above helped us big time.
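
The registry steps above can also be scripted. The following is a minimal PowerShell sketch, assuming an elevated session on the Veeam ONE server and the key path given in the steps; it is an illustration of the same change, not an official Veeam procedure, so check with Support before using it in production.

```powershell
# Assumption: key path as given in the steps above; run elevated on the Veeam ONE server.
$keyPath = 'HKLM:\SOFTWARE\Veeam\Veeam ONE Monitor\Service'

# Create the key only if it is missing (it would normally already exist).
if (-not (Test-Path -Path $keyPath)) {
    New-Item -Path $keyPath -Force | Out-Null
}

# Create or overwrite the DWORD value; 1 selects WMI collection according to the post.
New-ItemProperty -Path $keyPath -Name 'HyperVCollectionType' `
    -PropertyType DWord -Value 1 -Force | Out-Null

# Restart the data collection service named in the post so the setting is picked up.
Restart-Service -Name 'VeeamDCS'

# Show the resulting value for verification.
Get-ItemProperty -Path $keyPath -Name 'HyperVCollectionType'
```

As with the manual steps, allow 20–30 minutes after the service restart before judging whether performance data is being collected.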

Post by **jotge** » May 07, 2024 10:54 am

Hi SnakeSK,

thanks for this tip.

I have implemented it and, unlike all previous "attempts", Veeam ONE has been collecting data again since then.

However, we still have to observe this, because up to now we have always had temporary phases in which data was collected. What makes me a little optimistic, however, is the fact that disk data is being collected again, because it was no longer collected at all from the time of the upgrade.

I would also ask Veeam Support about this setting and whether there are any side effects.

Thank you and best regards

Jan

Post by **SnakeSK** » May 07, 2024 5:48 pm

We have been running this for 2 weeks with no problems, but test it properly. I really don't know what went wrong with the 12.1 release; today I sent another batch of dumps and logs on behalf of another customer whose backup job was locked for 12 hours.

//edit: I also had phases like you did: sometimes VOne monitored for several days, then only disk and network metrics were reported, sometimes nothing at all for several hours, and even VBR job monitoring did not work. After implementing this it has been a lot better.

Post by **tomtom94** » May 13, 2024 7:27 am

Hello!

We are also seeing a problem similar to case # 07248027 or the "backup to tape problem" of the original poster at one of our customers after upgrading to 12.1 a few months ago, but in a much smaller environment.
There are only 2 Hyper-V servers, a replication job and a few backup / backup copy jobs.

The replication job sometimes starts and does just plain nothing.
This blocks the normal backup jobs, and due to the lack of a timeout option it runs endlessly.
As a result, no error messages are sent via mail either, lulling you into a false sense of security.

However, unlike the original poster, I can stop the replication job normally without killing the task.
Stopping the job throws the following error:
Resource not ready: backup proxy
Processing finished with errors at xx.xx.2024 xx:xx:xx

I helped myself by adding a backup window that stops the replication job before the regular backup jobs.

This happens totally at random, and I was not able to confirm whether the job takes up a lot of CPU time.

No case open yet.

Best regards

Tom

Post by **Binje07** » May 13, 2024 9:47 am

Hi everyone.

Since we installed version 12.1 in January, we have had a huge number of errors with our Hyper-V backup and replication, on Hyper-V clusters and on standalone hosts.
The main error is jobs getting stuck for no reason.

A replication job starts and gets stuck while waiting for the proxy, even on an on-host proxy where nothing else is running.
If the replication is stuck, the backup job gets stuck too. We launch replication every hour, so we ended up creating a script which stops the replication if it's still running after 45 minutes.
We use Task Scheduler to launch PowerShell (C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe) with an argument pointing to the PS1 file (-File "C:\Script\yourscript.ps1"), which contains (Get-VBRJob -Name "YOURJOBNAME" | Stop-VBRJob).
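
A slightly more defensive version of that PS1 file might look like the sketch below. It assumes the script runs on the backup server itself with the Veeam PowerShell module available, and it keeps the placeholder job name from the post; the "45 minutes after start" timing is still handled by the Task Scheduler trigger, not by the script.

```powershell
# Sketch of the contents of C:\Script\yourscript.ps1 (path as used in the post).
# Assumption: the Veeam PowerShell module is installed on this machine and the
# session can reach the local backup server.
Import-Module Veeam.Backup.PowerShell -ErrorAction Stop

$jobName = 'YOURJOBNAME'   # placeholder from the post; replace with the real job name

$job = Get-VBRJob -Name $jobName
if ($null -eq $job) {
    # Fail loudly so the scheduled task shows a non-zero result.
    Write-Warning "Job '$jobName' was not found; nothing to stop."
    exit 1
}

# Request a stop, exactly as the one-liner in the post does.
$job | Stop-VBRJob
```

The non-zero exit code makes a renamed or deleted job visible in the Task Scheduler history instead of failing silently.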

We even have some backup jobs getting stuck (waiting for infrastructure/not obtaining a proxy), which sometimes forces us to reboot the whole VBR server.

After working with the support team for more than a month, they gave me a private patch to correct this behaviour (along with the creation of a registry key).
It seems to be working, but I only applied it last week on my critical VBR, so I prefer to be careful for the moment. I patched the other ones this morning because I got another stuck job and I'm fed up with having to connect on weekends to check if everything is all right.

I should add that I had to insist on getting a level 2 engineer because the level 1 engineer was a beginner and unable to provide a pertinent analysis. The quality of support has dropped over the last few months. I know it was a complex problem from the beginning, and I lost a lot of time sending logs.

My whole structure was impacted: 3 different VBR servers and 20 Hyper-V hosts, with no problem on VMware at all. We tried a lot of things and extracted a lot of logs before getting the right answer.

Here is the link to the patch:

[Moderator: removed]

Hope this helps.

Post by **david.domask** » May 13, 2024 9:56 am

Hi Binje07,

Thank you for sharing your experience and the solution -- I've removed the hotfix information from your post as hotfixes must only be utilized after the situation is reviewed by Veeam Support.

I'm glad to hear that you were able to get a resolution, though sorry to hear that the case was not to your satisfaction -- can you share the case number for review?

Post by **Binje07** » May 13, 2024 10:04 am

I perfectly understand your point of view, but tomtom94, for example, is affected by the same anomaly and may not want to wait three months for a patch. That's why I recommended making a backup first. This patch should be shared if it works.

Post by **david.domask** » May 13, 2024 10:09 am

Aha, understood, but there is a logic to having the environment reviewed before applying hotfixes.

So if possible, please share your case number as requested; it can be an additional point of information during research to confirm the behavior from the logging, and then act as the situation dictates.

Post by **Binje07** » May 13, 2024 11:37 am

Case n°07164826, but I just got a stuck replication, so it seems it's not solved.

Post by **SnakeSK** » May 13, 2024 6:03 pm

We had another deadlock throughout the weekend. Today another one at a different customer. Both replicas.

These cases date back to the 12.1 December release, when we had the first support case. It's been half a year and there is still no resolution?

Wouldn't this be a good time to do some code regression to bring reliability to acceptable levels?

Thank you

Post by **jotge** » May 14, 2024 6:02 am

Short update from me on "Veeam ONE Missing data for the performance counters (case # 07137635)":

So far, all performance data has been collected. As we have had phases in the past where data was collected over 23 days, I still need to monitor it to be sure. But as I mentioned, now that the disk data is being collected - which was not the case before - I'm still optimistic.

The other cases are still open, so far without a solution. Today I have a WebEx session with support about the SQL failover cluster restore.

Have a nice Day!

Post by **david.domask** » May 14, 2024 7:43 am

Hi Jan,

Glad to hear there was some progress here; indeed, monitoring is a good idea. Checking the case, I can see that there does seem to be some delay on the most recent update; I will ask the Support Team to respond on the most recent update from the internal investigation and on the new information you've shared.

(Hint: If you ever do encounter issues or unexpected delays on a case, use the Talk to a Manager button to reach out to Veeam Support Management and explain the concerns regarding the case.)

Post by **Binje07** » May 15, 2024 10:20 am

This morning, another patch with another key in regedit to try to solve the randomly stuck replicas or backup jobs.

Post by **jotge** » May 15, 2024 1:06 pm

Update for SQL Restore in a Failover Cluster Environment fails (case # 07243819)

Yesterday we found out from Veeam Support that the determination of the path of the SQL instance on the cluster node using the SQL statement

SELECT path_name FROM sys.dm_io_cluster_valid_path_names;

fails. The query also returns no result when run in SQL Server Management Studio.

So it does not seem to be a Veeam problem but a local SQL failover cluster problem.
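
To repeat that check outside of Veeam, the same query can be run from PowerShell. This is a minimal sketch assuming the `SqlServer` module (which provides `Invoke-Sqlcmd`) is installed and Windows authentication is sufficient; the instance name below is a hypothetical placeholder.

```powershell
# Assumption: the SqlServer module is installed; the instance name is a placeholder.
Import-Module SqlServer -ErrorAction Stop

$instance = 'SQLCLUSTER01\INSTANCE1'   # hypothetical failover cluster instance name
$query    = 'SELECT path_name FROM sys.dm_io_cluster_valid_path_names;'

# Wrap in @() so that zero, one, or many rows all yield an array with a Count.
$rows = @(Invoke-Sqlcmd -ServerInstance $instance -Query $query)

if ($rows.Count -eq 0) {
    # This is the situation described in the post: the DMV returns nothing.
    Write-Warning "No valid cluster paths returned by $instance."
} else {
    $rows | Format-Table -Property path_name
}
```

An empty result here reproduces the finding independently of Veeam, which helps when handing the problem to the SQL Server or cluster administrators.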

Post by **SnakeSK** » May 15, 2024 10:15 pm

Binje07 wrote: May 15, 2024 10:20 am This morning, another patch with another key in regedit to try to solve the randomly stuck replicas or backup jobs.

Nice, I got a generic reply that it is being investigated. No regkeys, no patch.

Post by **jotge** » Jun 17, 2024 6:39 am

Update for Veeam ONE Missing data for the performance counters (case # 07137635)

After almost 4 months of problem analysis, internal escalation at Veeam, and involvement of the R&D team, the request now is to install Windows updates and reboot ... with the justification "because that helped in a similar case"!

This is unacceptable for software that sees itself in the enterprise segment.

A monitoring and reporting solution is all the more important the larger the environment to be monitored!

Veeam introduces a new and, in my opinion, insufficiently tested functionality and cannot offer a solution in the event of a problem. This is something we cannot use.

For this reason, we have decided to no longer actively work on the case and to use Veeam ONE with the "old" data collection method, WMI. BTW, the hint on how to switch this came from the community and not from Veeam Support! Thanks again, SnakeSK.

Post by **jorgedlcruz** » Jun 17, 2024 10:39 am

Hello Jan,
Apologies for the experience you had with the case and escalation. I have been looking around, and there are not many details regarding the issues from Microsoft updates. What I can see is that the following helped another enterprise customer. Remember, we are talking about a support case in May similar to yours:
* The customer installed the “March Windows updates” on the hosts and rebooted, and the issue was solved. The customer mentioned that the issue specifically started after the Windows Server 2019 2023.09 com update.

Unfortunately, we do not have any more reports of a similar issue other than that customer's, which was resolved with those updates (we do not have details on the updates installed, the individual packages I mean). I agree that perhaps the suggestion to change to WMI could have come from Support, at least to see if that data was enough for you, etc.

We are eager to do better, across all departments. So if you are willing to troubleshoot further in the future and keep seeing the same error after all updates are installed, please open a support case and send the number via DM so we are aware.

Thank you so much

Post by **SnakeSK** » Jun 17, 2024 7:31 pm

We had updates installed, and the problems disappeared after switching to the old method. Even current VBR cases seem to point to Veeam not properly QAing, e.g., the Microsoft Secure server scenario we have enabled. Also Credential Guard, LSASS isolation/PPL, etc. These are valid and recommended security baselines that we have implemented in our infrastructure, and Veeam has no objections in its documentation against their usage.

This is not to bash anyone, but you really should slow down in implementing new untested features and calling it a day. Reminds me of the Windows Insider program, when everyone was running the test builds in VMs and the machines shat bricks when deployed to real hardware with real nonsynth drivers. Maybe you should just extend the v12 support, give yourself a break, and postpone the v13.

Even your support engineers are admitting that this is the buggiest release they have ever experienced, and I asked about 4 of them.

Slow down pls

Post by **Gostev** » Jun 17, 2024 7:45 pm

SnakeSK wrote: Jun 17, 2024 7:31 pm Even your support engineers are admitting that this is the buggiest release they have ever experienced, and I asked about 4 of them.

Well, there's the "personal opinion of particular individuals" metric and there's also an objective metric, which happens to tell the completely opposite story. Up to you which one to trust.

Personally, I trust the latter.

Post by **SnakeSK** » Jun 17, 2024 7:56 pm

Of course it's subjective, I never said it wasn't.

But as I said in my other thread (where I listed all the cases we have had since V12), the number of support cases we had to open is larger.

Post by **Gostev** » Jun 17, 2024 8:41 pm

There's for sure an element of luck as every environment is different. But in the bigger picture, for example the number of support cases per product download has been reducing for each release in all the years since we started tracking this metric.

Good point on the documentation, though; I have already asked that settings which are not specifically tested be called out there. As far as I know, most of the QA testing has always been done against vanilla Windows installs, since this is the only "predictable" Windows configuration to test against.
