To put the issues into context, we run B&R in a small, simple environment: one Hyper-V host with about 15 VMs backing up (forward incremental with synthetic fulls) to a Linux repo, with secondary backup to a single tape drive. We also run a weekly Health Check and Surebackup.
I’ve spent 35 years maintaining Windows servers and networks. I’m the sort of person who uses technical support as the very last option after using any other available resources to understand and fix an issue myself.
In the first six months of using B&R I have had to open four technical support cases as a result of significant issues with the product. (I have another issue that needs resolving, but it is comparatively minor and I don’t have the time or energy to pursue it at the moment!)
#03470315 – Surebackup (Opened March 19, 2019, Closed June 14, 2019, workaround in place awaiting fixes to be incorporated in v10)
I initially thought this was a problem with configuring our Surebackup lab, but it actually turned out to be an issue with the Veeam agent: “Exception of type 'Veeam.Backup.AgentProvider.AgentClosedException' was thrown.”
I spent an enormous amount of time on this because the support agent couldn’t reproduce the problem. Initially we had to build a ‘proper’ Linux repo (at the time we were using a Synology NAS, which is unsupported) to show the issue was still present and reproducible. Because the support agent couldn’t reproduce the issue, I was convinced that it was something in our environment and spent many days checking logs and drivers, changing hardware and even completely reinstalling operating systems (including the host OS).
I found a workaround myself, and it was as a result of this that the support agent discovered the issue was Hyper-V specific and that they had been trying to reproduce the problem in a VMware environment. To say I wasn’t very happy about the time I had wasted (because the agent hadn’t tried to reproduce the issue in the same environment as ours) would be an understatement!
Once the issue was reproduced it was referred to R&D. To quote the agent, “Turned out there is more to our SSH connections than just 1 fix. There are quite a few changes planned for the future and those will be rolled out over multiple updates. The plan as of now is to have them all ready by the time Update 5 comes around.” Presumably, this now means v10.
#03535461 – Tape Backup (Closed, workaround in place, no satisfactory resolution)
Following advice on your forum that Active Fulls were not necessary with B&R (veeam-backup-replication-f2/the-need-fo ... 13-60.html) we changed our backups to incremental with synthetic full. After making this change the secondary tape backup 'failed' with a warning - "Backup file xxxxxxxx2019-04-27T021539_F718.vbk will be excluded from the list of files to backup because it is unavailable"
After working on the ticket for a while, I ended up finding two workarounds for this myself, the second being the same workaround as for ticket #03470315.
I mentioned this to the agent working on #03470315 and he forwarded the details of #03535461 to the R&D team, but I have no idea of the outcome.
The agent on this ticket responded, “As far as the issue related to this case is now over and your tape jobs are running as expected, I suggest to archive this case. Further investigation will be processed within the other case on a higher tier and the assigned engineer will find the root cause and identify the best solution for you.”
I agreed that the case could be closed. However, although the workaround for the two cases was the same, I have no idea whether the cause was the same or what, if anything, is being done to properly resolve the issue.
#03743171 - Backups keep failing with "Failed to send command Error: An existing connection was forcibly closed by the remote host" (Opened Sept. 1, 2019, Problem identified, unhappy with resolution)
We had an opportunity to repurpose our (unsupported) Synology NAS, so we decided to replace it with a (supported) CentOS-based Linux repository. We then started getting intermittent failures of our incremental backups (about one in three) during the creation of the Synthetic Full.
From my time spent on the earlier Surebackup issue I was aware that there are two SSH libraries built into B&R, but I had checked and found that, since changing to the Linux repo, I had been unable to get the newer Renci library to work. We agreed that this should be resolved first, and in the end I found the solution myself. Renci uses SFTP for file transfer, not SCP (which the older Granados library uses). As SFTP wasn’t installed on the repo, Renci wouldn’t work. SFTP wasn’t listed as a requirement for a Linux repo in your KB https://www.veeam.com/kb2216. I suggested they update it, and this has since been done.
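For anyone wanting to check their own repo for the SFTP requirement, the following is a minimal Python sketch. It assumes the repo runs OpenSSH and that SFTP is enabled through a `Subsystem sftp` line in the server config (typically `/etc/ssh/sshd_config`). The function name `has_sftp_subsystem` is my own invention and has nothing to do with B&R. It is a simplification: it ignores `Include` files and only inspects the config text, so it does not prove the sftp-server binary itself is present.

```python
def has_sftp_subsystem(sshd_config_text: str) -> bool:
    """Return True if the config text has an uncommented 'Subsystem sftp ...' line.

    Hypothetical helper for illustration; not part of any Veeam tooling.
    """
    for raw_line in sshd_config_text.splitlines():
        # Drop anything after '#' so commented-out entries are not counted.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        # Keywords in sshd_config are case-insensitive; the subsystem name is not.
        if len(parts) >= 3 and parts[0].lower() == "subsystem" and parts[1] == "sftp":
            return True
    return False
```

To use it, read the config file on the repo (for example `open("/etc/ssh/sshd_config").read()`, path assumed) and pass the text to the function.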
Unfortunately, using the Renci library didn’t resolve the issue, so we got to the stage of producing Wireshark dumps to identify it. It was when I apologised to the agent about the time on the Linux repo (and therefore the timestamps in the dumps) being slightly out (1-2 minutes) that they realised this might be (and indeed was) the issue. I had originally set up and tested NTP on the repo, but on further investigation I discovered that it wasn’t starting when the server was restarted (and our repo is only on for three hours a night, so it restarts daily).
I now have a resolution but am not entirely satisfied. The agent wants to close the case, but several questions remain unanswered. Why does a small time difference cause this problem? Why does it only happen intermittently? What is a ‘satisfactory’ time difference? Is this a bug, or is time synchronisation a requirement? Has this been referred to R&D? Also, going back to your KB article https://www.veeam.com/kb2216, there is no mention of NTP or the need for time synchronisation.
Furthermore, the result of this issue is that the incremental chain is damaged to the point that no further backups can be added to it, so an Active Full is required to continue backing up (it seems restores can still be made, but I have only tried a couple of small file restores and have not tested this properly). With this in mind, if time synchronisation is a requirement (rather than a bug), then B&R should check that the systems are time synchronised before starting a backup. It should at least flag a warning, and ideally it should prevent the backup from running at all, as running it results in a damaged chain.
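To illustrate the kind of pre-backup check I have in mind, here is a rough Python sketch. It is not how B&R works; the function name and both thresholds are hypothetical, and the thresholds in particular are guesses, since what counts as a ‘satisfactory’ time difference is exactly what I have not been told.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: the real acceptable skew is unknown to me.
WARN_AFTER = timedelta(seconds=5)
BLOCK_AFTER = timedelta(seconds=60)


def classify_clock_skew(server_utc, repo_utc,
                        warn_after=WARN_AFTER, block_after=BLOCK_AFTER):
    """Return 'ok', 'warn' or 'block' based on the absolute clock difference."""
    skew = abs(server_utc - repo_utc)
    if skew >= block_after:
        return "block"  # refuse to start the job rather than risk the chain
    if skew >= warn_after:
        return "warn"   # run the job but flag the drift
    return "ok"


# Stand-in values: a repo clock that is 90 seconds behind the backup server.
backup_server_now = datetime.now(timezone.utc)
repo_now = backup_server_now - timedelta(seconds=90)
print(classify_clock_skew(backup_server_now, repo_now))  # prints "block"
```

In practice the repo’s time would have to be fetched from the repo itself, for example by running `date -u +%s` over the existing SSH connection, and that is left out here.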
I won’t have the problem again because I know the cause and solution but how many other people are going to waste time with it as an issue?
#03806448 – “Attempt to read past the end of the SSH data stream” when running Active Fulls (Opened Oct. 10, 2019, referred to R&D on that date, “They still working on the issue, however they have no solution yet.”)
Because of the problem above, I have had to repeatedly run Active Fulls. As a result of this I have discovered that I intermittently get the above issue. This is not due to time synchronisation issues as it has occurred since resolving the NTP issue on the repo.
After examining the Wireshark dumps the issue was referred to R&D. I don’t know what they found but we have had no resolution yet.
In summary, I have had the product for six months using it in a small simple environment. In that time I have had four separate significant issues (three preventing me from backing up).
#03470315 – bugs in SSH in Hyper-V environment.
#03535461 – possibly related to #03470315, possibly separate bug. Will it be fixed?
#03743171 – bug or requirement? If a requirement, appears to be undocumented and not enforced by B&R software.
#03806448 – based on initial response, appears to be a bug.
Don’t get me wrong, I really like the product (when it’s working). The technical support is always approachable, helpful and friendly. The forums are invaluable (I spend quite a bit of free time just browsing them to learn more about the product).
However, at a conservative estimate, I have spent at least a month troubleshooting these issues. That’s $20,000 of billable time or, to look at it a different way, a month of my life I won’t get back! I know software is never bug free and there will always be a need to spend time sorting out issues, but this seems disproportionately high for a mature product deployed in a small, simple environment.
I read about your new developments that will never be relevant to me and just sigh and think to myself, “I just want what I have already got to work properly”! The impression I get (it may not be true) is that your tech support agents may be reluctant to involve R&D with an issue, and that as long as there is a working resolution to an issue, that is the end of the matter, rather than the issue being properly resolved. This isn’t good for the product or for the use of agents’ time (because unresolved issues keep coming up) and most importantly (