Backup of enterprise applications (Microsoft stack, IBM Db2, MongoDB, Oracle, PostgreSQL, SAP)
tinto1970
Veeam Legend
Posts: 150
Liked: 45 times
Joined: Sep 26, 2013 8:40 am
Full Name: Alessandro T.
Location: Bologna, Italy

During an RMAN restore on a Test server, impact on Production server happened [#07773699]

Post by tinto1970 »

Good day all, today I experienced an issue while restoring an RMAN backup with Veeam Explorer:

We wanted to restore the WP DB on a test server, TST1. We started the process and it was working on TST1 (we were connected to TST1 via SSH and the datafiles were being written there by the restore process, as expected).

But we noticed that at the same time the production server ORA1 was very slow: the load was high (>50) and the network activity (as seen in VMware monitoring) was near zero.

We immediately stopped the restore process and shut TST1 down. After doing that, the network traffic on the production server ORA1 started again and the load decreased to a normal level.

The infrastructure is pretty good, and we don't think the cause was load or traffic congestion.
Do you have a possible explanation? Maybe the backup files were locked by the restore process and the log backup (running every 15 minutes on the production server ORA1) was "freezing" the production server?

Naturally, we should not test the restore again before we understand what happened, because the production server serves a very critical application with a lot of users in an important environment.

Thanks in advance
Alessandro aka Tinto | VMCE 2024 | Veeam Legend | VCP-DCV 2023 | vExpert 2025
blog.tinivelli.com
PetrM
Veeam Software
Posts: 3996
Liked: 686 times
Joined: Aug 28, 2013 8:23 am
Full Name: Petr Makarov
Location: Prague, Czech Republic

Re: During an RMAN restore on a Test server, impact on Production server happened [#07773699]

Post by PetrM »

Hi Alessandro,

It is a very interesting and, at the same time, very sophisticated technical issue. There are multiple reasons that could cause such behavior, and it would be best to continue the investigation with our support team.

At first glance, it looks like an infrastructure-related issue: high load on ORA1 is observed during data writes to TST1. Are these two VMs on the same datastore and connected to the same switch? Network traffic during the restore is apparently quite intensive and creates excessive load. Where is the backup repository located? These are rather rhetorical questions; I'm just trying to give you a hint for troubleshooting. I doubt that we can effectively troubleshoot this through forum posts.

Thanks!

Re: During an RMAN restore on a Test server, impact on Production server happened [#07773699]

Post by tinto1970 »

Thank you, Petr. Sure, I will continue working with the support team. My concern is that I cannot simply "retry and reproduce the issue" because of the high impact it could have on a super-critical production DB server.

The two VMs were on the same datastore and ESXi host. It's a two-host cluster, and sadly it's not possible to separate the test VM where the restore runs from the production VMs.
The backup repository is a Synology NAS. Everything is in the same datacenter.

I also guessed it could be excessive load, but it sounds so strange to me that the network traffic on the production server falls exactly to zero

https://ibb.co/b0cJSCg

during the restore, while the traffic on the test vm is high... why should all the resources go to the restore process?

https://ibb.co/tw0s1Tzj

Re: During an RMAN restore on a Test server, impact on Production server happened [#07773699]

Post by PetrM »

It's about network traffic, right? Maybe you can connect a test VM to another network and test, or somehow check QoS/VLAN priority, and so on.

Anyway, it looks like an environment-specific issue. I'm pretty sure you would see the same behavior if you copied a large file to the same VM from the NAS. So far, I don't see any correlation between the functioning of our Explorer and the issue, but I'll leave it to our support team to provide a technical summary.
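A minimal sketch of that large-file copy test, assuming the NAS share is already mounted on the test VM and Python 3 is available there. The function name `timed_copy` and the chunk size are illustrative choices, not part of any Veeam tooling; the idea is only to reproduce a sustained NAS-to-VM write outside of Veeam Explorer and see whether ORA1 reacts the same way.

```python
import os
import time


def timed_copy(src, dst, chunk_size=8 * 1024 * 1024):
    """Copy src to dst in fixed-size chunks.

    Returns (bytes_copied, elapsed_seconds, mib_per_second).
    Both paths are placeholders: src would be a large file on the
    NAS mount, dst a path on the test VM's local disk.
    """
    copied = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
        # Force the data to disk so the timing includes the real write.
        fout.flush()
        os.fsync(fout.fileno())
    elapsed = time.monotonic() - start
    rate = copied / (1024 * 1024) / elapsed if elapsed > 0 else 0.0
    return copied, elapsed, rate
```

Running it with a file of roughly the same size as the restored datafiles, while watching ORA1, would show whether plain storage and network traffic alone is enough to trigger the slowdown.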

Thanks!

Re: During an RMAN restore on a Test server, impact on Production server happened [#07773699]

Post by tinto1970 »

Oh well, this afternoon I performed a full restore of the TST1 VM, reading from the same repository as the RMAN backup and writing to the same host/datastore (in NBD mode), with no issues for the ORA1 VM.

Maybe tomorrow we'll perform another test, and before starting I will disable the RMAN plugin backup of the WP DB (which saves logs every 15 minutes): it's the only thing I can try in order to stop any "interaction" between the Veeam server/proxies and the ORA1 VM during the restore operation on TST1. I have no other ideas, and I'm also struggling to send the logs to the support engineer because of the customer's network restrictions :-\
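
For a test like that, a small sampler on the production side could record load and network throughput together, so the "load goes up, traffic drops to zero" pattern is captured with timestamps. This is only a sketch under assumptions: it assumes ORA1 is a Linux guest exposing `/proc/loadavg` and `/proc/net/dev`, and the function names and default interface name are hypothetical.

```python
import time


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg content."""
    return float(text.split()[0])


def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def sample(interval=5.0, iface="eth0", count=12):
    """Print load average and per-interval RX/TX byte deltas.

    iface is an assumption; use the interface name of the actual VM.
    """
    prev = None
    for _ in range(count):
        with open("/proc/loadavg") as f:
            load = parse_loadavg(f.read())
        with open("/proc/net/dev") as f:
            rx, tx = parse_net_dev(f.read())[iface]
        if prev is not None:
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} load={load:.2f} "
                  f"rx_delta={rx - prev[0]} tx_delta={tx - prev[1]}")
        prev = (rx, tx)
        time.sleep(interval)
```

Calling `sample(interval=5.0, iface="ens192", count=120)` (interface name hypothetical) just before the restore starts would give about ten minutes of readings to attach to the support case.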