Comprehensive data protection for all workloads
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

v8 slow incremental backup merge: call for support cases

Post by Gostev »

If you are experiencing slow backup file merge when your job processes retention policy ("merging oldest incremental into full backup" in the job log), please post the following information below. We will be updating each post inline with the cause and resolution, once those are confirmed by support.

1. Support case ID.
2. Backup repository type (for raw storage, include amount of spindles).
3. Backup job type: Is this a primary backup job in forever incremental backup mode, or a Backup Copy job.
4. Total job size: VM count and total size.
5. Processed incremental size, and time it takes to complete the merge (one latest example).

Please do not post unrelated issues or comments (they will be removed); only the information above.

Thanks!
manhil
Lurker
Posts: 2
Liked: never
Joined: Mar 12, 2015 7:50 am
Full Name: Manfred Hiller
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by manhil »

Hello,
here is a quick summary of our situation:

1. Support case 00823373

2. Source and target repositories are HP DL180 G6 servers, each with an Intel Xeon E5520 quad-core CPU (2.27 GHz, Hyper-Threading enabled) and 24 GB RAM.
  • 12 direct-attached 3 TB 7k SAS disks in a RAID 5.
  • OS is Windows Server 2012.
  • The production site and the DR site are linked through a 1 Gbit WAN link; network packet times are typically less than 2 ms. Data is transferred through WAN accelerators on the production and DR sites; both WAN accelerators are physical servers with SSD disks.
3. Job type: Backup Copy

4. Job size: total size 7.2 TB, VM count 42. The size of the last .vbk file at the backup copy destination is 2.2 TB.

5. The merge has now been running for 73 hours. The Veeam console is sitting at 19%. Progress is visible in the log file Agent.{Our Jobname}.Target.log, but it is incredibly slow: currently at 15%, about 4.5 hours per percentage point.

Further information:
  • We upgraded Veeam B&R from v7 to v8 including Patch 1 (version 8.0.0.917).
  • The job settings were not modified after the upgrade from v7 to v8. The Backup Copy job is configured to keep 60 restore points and 4 quarterly backups, with VM retention after 30 days. The health check runs monthly on the last Wednesday. Inline data deduplication is enabled; encryption is disabled.
  • I know the disks in the array are not super fast, but disk performance is IMO not the problem: Perfmon shows me less than 1 MB/s of throughput and <= 5 IOPS, and the queue length is always 0.000.
  • The veeamagent.exe process on the target repository is consuming 100% of one CPU core all the time; all other CPU cores are idle.
[Gostev] Update 1

According to the logs, the process that takes a long time in your case is not the merge, but the so-called "overbuild". Overbuild has to happen before the merge can even start. It is required when the VIB to be processed by the merge contains an incomplete incremental restore point of a VM (for example, the incremental restore point was so large that its transfer could not complete within one backup copy interval, and so it continued into the next file, VIB2). In that case, the VM's data from VIB1 needs to be moved into VIB2 before VIB1 can be merged into the VBK; otherwise the merge would "break" the good, complete restore point stored in the VBK. The ordering is sketched below.
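To make that ordering concrete, here is a minimal conceptual sketch in Python. The data structures and names are hypothetical illustrations of the idea only, not Veeam's actual code or backup file format.

# Minimal conceptual sketch (hypothetical structures, not Veeam's code or file
# format): why overbuild must run before the oldest increment can be merged.
# Each backup file is modeled as {vm_name: {"blocks": [...], "complete": bool}}.

def retire_oldest_increment(vbk, vib_chain):
    """Fold the oldest VIB into the VBK without breaking any restore point."""
    vib1 = vib_chain[0]
    vib2 = vib_chain[1] if len(vib_chain) > 1 else None

    for vm, point in list(vib1.items()):
        if not point["complete"] and vib2 is not None:
            # Overbuild: this VM's restore point in VIB1 is incomplete because
            # its transfer spilled over into VIB2. Move the VIB1 blocks into
            # VIB2 first; merging them into the VBK now would corrupt the last
            # complete restore point of this VM kept in the VBK.
            target = vib2.setdefault(vm, {"blocks": [], "complete": True})
            target["blocks"] = point["blocks"] + target["blocks"]
            del vib1[vm]

    # Only now is it safe to merge what remains of VIB1 into the full backup.
    for vm, point in vib1.items():
        vbk.setdefault(vm, {"blocks": [], "complete": True})
        vbk[vm]["blocks"].extend(point["blocks"])
    vib_chain.pop(0)

# Toy example: a hypothetical VM "sql01" did not finish within one backup copy interval.
vbk = {"sql01": {"blocks": ["base"], "complete": True}}
chain = [
    {"sql01": {"blocks": ["part1"], "complete": False}},  # VIB1, incomplete point
    {"sql01": {"blocks": ["part2"], "complete": True}},   # VIB2, continuation
]
retire_oldest_increment(vbk, chain)
print(vbk)    # the full backup still holds only the complete restore point
print(chain)  # VIB2 now carries part1 + part2; VIB1 has been retired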

Research into why the overbuild process takes so long is underway.

I've also noted that the job is using the smallest block size (256 KB), which by itself means 4 times slower backup file processing than with the default settings. This is not important here, though, as it would not change the processing speed between v7 and v8 - so this is just an FYI (rough math on the block counts below).
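To put that in perspective, here is a back-of-the-envelope block count comparison, assuming the default settings correspond to 1 MB blocks (an assumption consistent with the "4 times" figure above):

# Back-of-the-envelope only: the 256 KB block size used here means ~4x more
# blocks to track and move than the assumed 1 MB default.
default_block = 1024 * 1024     # assumed default block size, 1 MB
small_block = 256 * 1024        # block size used in this job, 256 KB
vbk_size = 2.2 * 1024**4        # the 2.2 TB full backup file from this case

blocks_default = vbk_size / default_block
blocks_small = vbk_size / small_block
print(f"{blocks_default:,.0f} blocks at 1 MB vs {blocks_small:,.0f} at 256 KB "
      f"({blocks_small / blocks_default:.0f}x more per-block work)")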

[Gostev] Update 2

We are making good progress on researching the overbuild performance issue, thanks to the OP's great help. A special data mover version with deep performance logging shows that the data mover spends very little time on both reading and writing, so the issue is not storage speed in this case. A new data mover version will be created to add performance logging to the area not yet covered, which is where most of the time seems to be wasted.

Unrelated to this, but still a useful data point: we've been doing some very large scale merge performance testing with Veeam Endpoint Backup RC (which shares the backup file processing engine with B&R), and are seeing about 50 MB/s merge performance on raw storage with 30 spindles of 2 TB NL-SAS hard drives in RAID-6 with a 128 KB stripe size (256 KB would have been even better).
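For context, this is rough arithmetic only (my own illustration, not a benchmark): the wall-clock time that ~50 MB/s of merge throughput implies for some incremental sizes mentioned in this thread.

# Rough arithmetic only: wall-clock merge time implied by ~50 MB/s throughput.
merge_rate = 50 * 1024**2            # ~50 MB/s, as observed in the lab test above
for inc_gb in (50, 200, 1200):       # sample incremental sizes (GB) from this thread
    seconds = inc_gb * 1024**3 / merge_rate
    print(f"{inc_gb:>5} GB incremental -> ~{seconds / 3600:.1f} h to merge")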

[Gostev] Update 3

We keep working on researching this issue. So far, we have been unable to reproduce similar data mover behavior in any of our labs, which slows down the troubleshooting.

[Gostev] Update 4

We have finally found the root cause, and determined the set of conditions required to reproduce it.
The issue:
- Is unrelated to backup merge performance (it sits in the overbuild process; see the explanation in Update 1)
- Is not specific to v8 (it has existed in the product since Backup Copy was introduced)
- Requires certain data patterns across the incremental backup files participating in the overbuild process, along with large incremental backup sizes (hundreds of GB), and gets significantly worse when using small block sizes (as is the case here)
We will now be investigating how and when we can address this issue (issue ID 46072).


[Gostev] RESEARCHED
Issue is unrelated to backup merge performance, and is not specific to v8.
mbrinkho
Enthusiast
Posts: 28
Liked: 2 times
Joined: Jan 03, 2014 5:14 pm
Full Name: M Brinks
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by mbrinkho »

1. Support case ID - # 00816437

2. Backup repository type. Storage is DA to the Veeam server - (8) 7.2k 2TB SAS drives in a RAID-5 on a PERC 5/i controller. (Additional details of the server: Server 2012 R2, 16GB RAM, (2) Quad-core Xeon X5355 CPUs @ 2.66GHz.)

3. Job-Type. Primary backup job in forever incremental mode.

4. Job-Size. Approximately 140 VMs, 10TB. The full backup file size on disk is around 5TB.

5. Processed incremental size, and time it takes to complete the merge (one latest example). - This is a bit tough to be accurate on but I would say the incrementals are around 200GB and it has been taking anywhere between 18 and 25+ hours for the "Merging oldest incremental backup into full backup file" task to complete.

[Gostev] Update 1
Support promised to update me tomorrow, as they need to go through the logs first.

[Gostev] Update 2
Through the debug log review, this was confirmed to be caused by the slow metadata handling issue that I referenced in the earlier thread. For reference, it is bug ID 44864 on the support tracker, and a hot fix is already available.


[Gostev] RESOLVED
Issue is unrelated to backup merge performance, but is specific to v8 (metadata handling issue).
deduplicat3d
Expert
Posts: 114
Liked: 12 times
Joined: Nov 04, 2011 8:21 pm
Full Name: Corey
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by deduplicat3d »

1. Support case ID.

00832572

2. Backup repository type (for raw storage, include amount of spindles).

Dell R720XD, 12 x 7200 RPM disks in a local RAID 10

3. Backup job type: Is this a primary backup job in forever incremental backup mode, or a Backup Copy job.

Primary forever incremental backup job

4. Total job size: VM count and total size.

Two different jobs: one is 4 TB with 30 VMs, and the other is 2 TB with 3 VMs.

5. Processed incremental size, and time it takes to complete the merge (one latest example)

The job with 30 VMs produces about 15-30 GB per incremental, and the one with 3 VMs about 5-10 GB per incremental. I switched the jobs to forever incremental after upgrading to v8 and did an active full. I believe this is the first time the jobs have hit the retention period and are attempting to merge, so I don't have a baseline for how long it should take. One job has been sitting for 30 hours and the other for 50 hours on "Merging oldest incremental" with no indication of progress.

[Gostev] Update 1
According to the available logs, the actual merge process has not started yet for either job (the job manager process seems to be hanging). Support will work with you to get additional logs and investigate the manager process issue, but at first sight it does not look like this issue has anything to do with the merge.

[Gostev] Update 2
No real updates yet - we are trying to obtain a manager process memory dump to determine why it is hanging (there have been some issues uploading the memory dump file to us).

[Gostev] Update 3
Memory dump received and investigated. The issue was found to be caused by the job manager process hanging while attempting to resolve non-existent restore points during merge initialization. This issue will impact primary jobs using other backup modes in a similar manner. The fix is to run a small SQL script against the configuration database.


[Gostev] RESOLVED
Issue is unrelated to backup merge performance, and is not specific to v8.
JimmyO
Enthusiast
Posts: 55
Liked: 9 times
Joined: Apr 27, 2014 8:19 pm
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by JimmyO »

1. Support ID 00848381

2. Backup repository type: DAS, HP DL380Gen8, (23) 7.2k 4TB SAS drives in a RAID-5 on a P822 controller. (Server 2012 R2, 128GB RAM, (2) 8-core Xeon E5-2650)

3. Backup job type: Primary backup job in forever incremental backup mode.

4. Total job size: 155 VMs, 16 TB processed, 3.5 TB read, 1.2 TB transferred

5. Processed incremental size, and time it takes to complete the merge: 1.2 TB, 45 hours

[Gostev] Update 1
According to the logs, everything works well (no delays due to known issues are observed). This makes a storage performance issue possible, so we will be collecting a performance debug log and diskspd results as the next step.
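For readers unfamiliar with diskspd, here is an illustrative sketch of what such a test run might look like. The tool path, file location, and parameters below are assumptions for demonstration, not the exact test support will request.

# Illustrative only: a rough diskspd run against the repository volume.
import subprocess

cmd = [
    r"C:\Tools\diskspd.exe",            # hypothetical diskspd location
    "-c20G",     # create a 20 GB test file
    "-b512K",    # 512 KB I/O size, in the ballpark of backup file block sizes
    "-d120",     # run for 120 seconds
    "-o8",       # 8 outstanding I/Os per thread
    "-t4",       # 4 worker threads
    "-w50",      # 50% writes (a merge both reads and writes backup files)
    "-r",        # random I/O, like a non-initial merge
    "-Sh",       # disable software caching and hardware write caching
    "-L",        # capture latency statistics
    r"E:\Backups\diskspd-test.dat",     # hypothetical file on the repository volume
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)   # review the throughput and IOPS totals in the report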


[Gostev] Update 2
The performance debug log gave us a good idea of the combination of factors required to cause this issue:
1. Very large backup files.
2. A very large number of parallel tasks used to create those files.
3. Job age (lots of transforms have already happened, and each transform slightly reduces the speed of the following ones).
We are now working on a private hot fix that should improve the situation.
tkeith
Enthusiast
Posts: 32
Liked: 17 times
Joined: Jan 09, 2015 4:49 pm
Full Name: Keith Thiessen
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by tkeith »

1. Support case ID: 00849378

2. Backup repository type: Red Hat VM with NFS volumes hosted on a NetApp

3. Backup job type: Primary backup job in forever incremental backup mode

4. Total job size: 44VMs - 7.4TB (3.4TB used)

5. Processed incremental size, and time it takes to complete the merge:
50-70 GB incremental size; the last job took 6.5 hours to merge.

[Gostev] Update 1
According to the logs, the actual merge takes only half of these 6.5 hours and looks to be happening at a fair speed (considering that NetApp is commonly reported to not "like" the Veeam I/O pattern with the default settings). Of the roughly 3 hours wasted, about half is spent on the metadata handling issue mentioned earlier, so the same hot fix will help. The rest of the time is lost doing legitimate things that take longer than expected, so this will be investigated further. If no issues are found with the infrastructure, we will investigate some potential optimizations of this logic for future updates.


[Gostev] RESOLVED PARTIALLY
First part of the issue is unrelated to backup merge performance, but is specific to v8 (metadata handling issue).
Second part is subject to further research and potential optimizations of this newly introduced backup mode (I will add any significant updates here).
Update: We've identified some possible performance enhancements that should potentially accelerate all backup modes, and will be testing those.
Update: Private build was tested with the customer, showing further reduction of job time by 40 min. The code was checked into the Update 2 branch.
Josh_O
Lurker
Posts: 1
Liked: never
Joined: Mar 24, 2015 7:43 pm
Full Name: Josh Oates
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by Josh_O »

1. Support case ID: 00852175

2. Backup repository type: CIFS Share from AWS

3. Backup job type: Primary backup job in forever incremental backup mode - 5 restore points

4. Total job size: 2VMs - 515 GB

5. Processed incremental size, and time it takes to complete the merge:
355 GB was processed in the last job, which took 2 hours to merge. The current merge has been running for over 22 hours.

[Gostev] Hi, you did not include information on the storage behind the CIFS share (spindle count); please post it or PM me. Depending on the raw storage IOPS capability, this merge performance could be normal/expected. Note: the first (initial) merge cannot be used as a reference point for the following merges, because unlike the following ones, it does sequential I/O.
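As a rough illustration of why the spindle count matters here, a back-of-the-envelope estimate of merge time from raw random IOPS could look like this (all figures below are assumptions; substitute your own):

# Rough sanity check only (all figures are assumptions; substitute your own):
# estimating merge time from raw storage IOPS for a random-I/O (non-initial) merge.
spindles = 6                  # example spindle count behind the CIFS share
iops_per_disk = 75            # rough random IOPS for a 7.2k NL-SAS/SATA spindle
block_size = 1024 * 1024      # assumed 1 MB backup file block size
incremental = 355 * 1024**3   # the 355 GB incremental from this case

raw_iops = spindles * iops_per_disk
# A non-initial merge roughly issues one random read (from the VIB) plus one
# random write (into the VBK) per block, so halve the IOPS budget; RAID write
# penalties and CIFS/network latency would reduce it further.
blocks_per_second = raw_iops / 2
hours = incremental / block_size / blocks_per_second / 3600
print(f"~{hours:.1f} h expected for a {incremental / 1024**3:.0f} GB merge")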

[Gostev] Update 1
Have not heard back on my storage query. Waiting for the customer to provide performance logs, which should particularly show storage performance capabilities.


[Gostev] Update 2
Putting this one on hold (no response from the customer).
ed8707
Lurker
Posts: 2
Liked: 1 time
Joined: Dec 12, 2014 10:22 pm
Full Name: Ed Laura
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by ed8707 »

1. Support case ID: 00855176
2. Backup repository type (for raw storage, include amount of spindles).
- Qnap TS-669 Pro
- 6 x 7200 RPM SATA drives in a RAID 5. 1 of the 6 is a hot spare.
- CIFS share
- Backup proxy mode is Network; we switched from hot-add. There is no major difference between the two in terms of processing and moving the data, but there is a significant difference (~2 hours) in not having the overhead that comes with hot-add when adding/removing the VMDKs to/from the proxy VM.
3. Backup job type: Is this a primary backup job in forever incremental backup mode, or a Backup Copy job.
- Primary job to local qnap nas.
- Using Forever incremental job type.
4. Total job size: VM count and total size.
- 26 VMs (3 of those are templates)
- 7 TB of data is processed from the source, the .vbk file is 2.9 TB on the repository, and nightly VIBs range from 50 GB to 250 GB; on average they are closer to the 50-100 GB range.
5. Processed incremental size, and time it takes to complete the merge (one latest example).
Comparing before and after applying the hot fix: these are the past two nights of backups and merges, one before the hot fix and one after. The "before" run also used hot-add mode, vs. NBD for the "after" run.

BEFORE
- Processed: 7.0 TB
- Read: 164.9 GB
- Transferred: 45.9 GB
Duration: 12:10:04
Merge Duration: 8:47:09

AFTER
- Processed: 7.2 TB
- Read: 386.2 GB
- Transferred: 53.1 GB
Duration: 8:52:55
Merge Duration: 6:12:01

It appears the job is running in ~4 hours less time; only about 2.5 hours of that is saved by the hot fix, and the rest is a result of switching from hot-add to NBD.

[Gostev] Update 1
Waiting for the customer to provide debug logs.


[Gostev] Update 2
Debug logs reviewed, exact same situation as 00849378 above.


[Gostev] RESEARCHED
Issue is unrelated to backup merge performance, but is specific to v8 (metadata handling issue).
eschek
Novice
Posts: 4
Liked: never
Joined: Dec 23, 2014 11:32 am
Full Name: Stephan Liebner
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by eschek »

1. Support case ID.
00867702

2. Backup repository type (for raw storage, include amount of spindles).
NetApp FAS8020 with SMB/CIFS
64 disks for this volume

3. Backup job type: Is this a primary backup job in forever incremental backup mode, or a Backup Copy job.
forever incremental

4. Total job size: VM count and total size.
VMs: 49
Total size: 11.2 TB

5. Processed incremental size, and time it takes to complete the merge (one latest example).
Incrementals are between 80 GB and 180 GB.

The merge takes between 8 and 13 hours.

[Gostev] Update 1
Debug logs are incomplete (taken during the merge), but based on the available ones it is the exact same situation as 00849378 above.


[Gostev] RESEARCHED
Issue is unrelated to backup merge performance, but is specific to v8 (metadata handling issue).
Gostev
Chief Product Officer
Posts: 31460
Liked: 6648 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland
Contact:

Re: v8 slow incremental backup merge: call for support cases

Post by Gostev » 2 people like this post

Thanks to everyone who submitted their support cases and worked with our support team to provide the additional information required for troubleshooting.

Just as a reminder, the main goal of this topic was to openly prove the statement that v8 does NOT reduce merge engine performance compared to v7, and that cases where such a performance reduction is observed are caused by the known v8 issue with application metadata handling or by other, unrelated issues.

After processing these randomly submitted support cases above:

1. We have confirmed that the actual performance of the merge engine itself did not change between v7 and v8. Those customers who did observe a slow merge were all impacted either by the known application metadata handling issue in v8, or by issues that existed in the product before v8.

2. We have found and confirmed an overbuild engine performance issue; however, it is not specific to v8 and existed in the previous version as well.

3. We have determined one optimization that will improve the performance of primary backup jobs using any backup mode (including the v8 forever forward incremental mode). This will make the jobs complete faster than they did with v7.

4. We have determined an additional optimization that will positively impact both the forever forward incremental backup mode and Backup Copy jobs. The biggest improvement will be seen on backup repositories with poor IOPS capacity. However, this optimization will require a 64-bit OS on the backup repository server, and will slightly increase its RAM requirements (to accommodate the newly introduced cache).

All of these bug fixes and optimizations were implemented in the Update 2 code branch, but are also available through support as hot fixes.

We will finish researching the remaining support case above, but there is no need to post more submissions at this point. If you are experiencing the performance issue described in the first post, feel free to open a support case to get the applicable hot fix.

Thanks!