
Linux Backup Repository woes

Post by pirx »

Hi,

I'm setting up our first Linux-based backup repository, on an Apollo 4510 server (Veeam 10). Throughput is pretty good (~1.5 GB/s over the LAN), but jobs fail with several kinds of errors. Before I open a case, I wanted to know whether these are known issues with an easy workaround. It all seems to be load-related, as it only happens when multiple backups are running, yet there is not much CPU load and no errors in the usual Linux logs.

#1 At the time, backup jobs were writing active fulls at 1.5 GB/s over the LAN to the server, and the CPU was 90% idle (52 cores). Nothing interesting in /var/log/messages.
[15.05.2021 15:19:27] <326> Error Failed to upload file D:\Veeam\Backup\VeeamAgent64 to /tmp/VeeamAgent0bc9a8bd-ebd8-44b8-a373-44510aefd89f
[15.05.2021 15:19:27] <326> Error Failed to find terminal prompt: timeout occurred (60 sec) (System.Exception)
#2 Sometimes the extents are just gone. I don't see any warning in Linux; from the Linux server's point of view, the device was present the whole time.
backup: 15.05.2021 16:40:19 :: Error: DE-WOP-B01-E01-Test extent is offline.
copy: 15.05.2021 16:29:11 :: Error: Some extents storing required backup files are offline

#3 There also seems to be an occasional problem with the password (sudo), but as this is the same server and the same job, just another task, it can't be a general permissions problem.
15.05.2021 15:37:34 :: Error: Permission denied (password).
#4 Connection attempts fail sometimes.
12.05.2021 16:50:37 :: Processing SDET2509 Error: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
#5 /tmp and the Veeam user's home directory are not cleaned up.

4-7 GB are still sitting in each of those directories; /tmp was only a 4 GB partition at the beginning, and I had to expand it to 14 GB to finish a job without errors. Shouldn't Veeam clean up its own mess?

-rwxrwxrwx. 1 root root 62964808 May 12 16:45 VeeamAgent655b0e07-2e34-4899-8087-a76ed7a69971
.....

-rwxrwxrwx. 1 xxxxxx xxxxx 62964808 May 12 16:51 79bfc3e6-a1bb-4f44-881c-1625b4f7509b
...

Post by soncscy »

Do you have hardened or "classic" repositories with the old SSH connections? I wonder whether you need to update sshd_config with the attributes from the following KB article:

https://www.veeam.com/kb2985

It's not your exact error, but if we consider that sshd is simply dropping connections, suddenly a lot of this starts to make more sense. It also explains how you can have a load-related issue without the Apollo itself being under load: it's sshd that struggles, not the machine itself.
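(For readers landing here: the KB deals with sshd connection limits. As a hedged sketch only — the exact directives and values are in the KB itself — the tuning looks roughly like this in /etc/ssh/sshd_config:)

Code: Select all

# /etc/ssh/sshd_config -- illustrative values only; use the ones from kb2985
MaxSessions 100          # sessions allowed per SSH connection (OpenSSH default: 10)
MaxStartups 100:30:200   # unauthenticated connection throttling (OpenSSH default: 10:30:100)

# then reload the daemon, e.g.:
#   systemctl restart sshd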

Post by Gostev »

#1 seems normal, as the target data mover does not do much data processing (unlike the source data mover running on the backup proxy), so why would it need CPU cycles?

The rest sounds like environment-specific issues to me, based on the "sometimes". The load that backup generates tends to trigger such issues in different components.

AFAIR, V10 had a registry key for Linux repositories intended for the biggest Cloud Connect service providers (with optimizations for huge numbers of tasks), such as not re-uploading the data mover for each task.

I should mention, however, that since V11 has a completely different architecture, it will probably be a big waste of your time to work with support on polishing your V10 Linux deployment. The new architecture, with persistent data movers and no SSH usage, removes whole chunks of functionality that were causing scalability issues before.

Post by pirx »

soncscy wrote: May 16, 2021 7:58 pm Do you have hardened or "classic" repositories with the old SSH connections? [...]
I had already found this KB and applied the settings; the active fulls have now been running for 2 hours without problems. But I also reduced the number of concurrent tasks from 52 to 26, and I changed the RX and TX buffers as well. The problem is that I can't retry active fulls very often, as they put extra load on the production storage.
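(For reference, ring-buffer changes like the RX/TX tweak mentioned above are typically made with ethtool; the interface name and sizes below are illustrative assumptions, not the values actually used here:)

Code: Select all

# Illustrative only -- interface name and ring sizes are assumptions.
# Show the current and maximum RX/TX ring buffer sizes:
ethtool -g eth0
# Raise both rings (persist the change via your distro's network config):
ethtool -G eth0 rx 4096 tx 4096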

Post by pirx »

Gostev wrote: May 16, 2021 7:58 pm #1 seems normal as the target data mover does not do much data processing [...]
Well, it would only be a waste of time if we could simply update to 11. But we are still on v10, and it has to run stable; updating is not an option until U1 is released. That was also the recommendation from support, as we'll have to use SMB shares for quite some time. But it's good to know that it might be smoother in V11 :)

Post by Gostev »

I can't promise V11 does not have other issues, though ;) However, by 11a all of that will be polished anyway.

Post by pirx »

Regarding the cleanup... I've now added the LinAgentFolder registry key and pointed it to /opt/veeam. Veeam is now creating the files there, but it's still not cleaning everything up; 1.8 GB is still left. Is this something Veeam should remove automatically, or do I have to run a cron task to delete it periodically? I just don't want to run into failed jobs because the disk is filling up (like /tmp with its 4 GB before). And if I do run a cron task, when is it safe to remove old Veeam files? I don't want to run rm while the files are still needed and in use. (A sketch of such a cleanup follows the listing below.)
# ls -l /opt/veeam/
total 136
drwxr-xr-x. 2 root root 4096 May 17 03:02 VeeamAgent065f89b9-a2d7-4e55-bdcd-3c014332b1ce.data
drwxr-xr-x. 2 root root 4096 May 17 01:52 VeeamAgent08cba237-f4f6-4d06-a4f2-85e92d038125.data
drwxr-xr-x. 2 root root 4096 May 17 03:38 VeeamAgent10b7eba3-bbbb-46f1-bc66-51a376c86426.data
...
/opt/veeam/VeeamAgentf1da1f46-a755-447f-9564-65d7667795e6.data:
total 52732
-rw-r--r--. 1 root root 30408 May 17 02:00 libacl.so.1
-rw-r--r--. 1 root root 17608 May 17 02:00 libattr.so.1
-rw-r--r--. 1 root root 128256 May 17 02:00 libblkid.so.1
-rw-r--r--. 1 root root 217624 May 17 02:00 libfuse.so
-rw-r--r--. 1 root root 1238480 May 17 02:00 libNtfsLib.so
-rw-r--r--. 1 root root 15720 May 17 02:00 libuuid.so.1
-rwxr-xr-x. 1 root root 52333296 May 17 02:00 veeamagent
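(A hedged sketch of the cron cleanup discussed above, assuming /opt/veeam as the redirected folder; the 2-day age threshold is an assumption, and what is actually safe to delete is exactly the open question here, so this only guards against removing entries a running process still has open:)

Code: Select all

#!/bin/sh
# Hypothetical cleanup -- not a Veeam-provided mechanism.
# Removes VeeamAgent* leftovers older than 2 days that no process still has open.
for d in /opt/veeam/VeeamAgent*; do
    [ -e "$d" ] || continue
    # skip entries modified within the last 2 days
    [ -n "$(find "$d" -maxdepth 0 -mtime +2)" ] || continue
    # skip entries that a running process still holds open (fuser exits 0 if in use)
    if ! fuser -s "$d" "$d"/* 2>/dev/null; then
        rm -rf -- "$d"
    fi
done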

Post by Gostev »

Yes, normally they should be removed. Maybe it is a side effect of using the redirection? Please let our support engineers investigate this through the logs. Thanks!

Post by pirx »

After some quiet days, Veeam is again filling up /opt/veeam.

Code: Select all

Error: cp: error writing '/opt/veeam/VeeamAgent5415ec14-560a-42e3-85ab-bd3d3d01397b': No space left on device 	
Maximum retry count reached (5 out of 5) 	
Failed to process SDET2301 (0 B) at 22.05.2021 05:59:25
[root@sdeu2001 ~]# df -h /dev/mapper/vgroot-lvopt
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vgroot-lvopt 20G 2.0G 17G 11% /opt
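(For anyone hitting the same thing: until the root cause is found, a minimal cron-driven free-space guard can at least warn before jobs start failing. The 80% threshold and the /opt mount point are assumptions:)

Code: Select all

#!/bin/sh
# Minimal free-space guard for the agent folder -- threshold is an assumption.
MOUNT=/opt
LIMIT=80   # percent used
USED=$(df -P "$MOUNT" | awk 'NR==2 {gsub(/%/,""); print $5}')
if [ "$USED" -ge "$LIMIT" ]; then
    echo "WARNING: $MOUNT is ${USED}% full" | logger -t veeam-space-check
fi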