lolbebis
Enthusiast
Posts: 26
Liked: 5 times
Joined: Feb 26, 2020 9:33 am
Full Name: Mattias Jacobsson

Performance problem with SOBR tiering to COS

Post by lolbebis »

Hi,
I am having a performance problem with tiering to S3 storage (IBM COS on-prem).
The offloading takes a very long time and sometimes returns an error.
I am also seeing lots of status code 500 responses on the object storage.

I have tried setting "S3ConcurrentTaskLimit" to 8, but that does not seem to help; also, the tiering is running way more than 8 VMs simultaneously.
Image
Image

Case number:
#04119837
veremin
Product Manager
Posts: 20284
Liked: 2258 times
Joined: Oct 26, 2012 3:28 pm
Full Name: Vladimir Eremin

Post by veremin »

Your support engineer is about to analyze the logs provided. Let's see what he suggests after that. Thanks!
oleg.feoktistov
Veeam Software
Posts: 1918
Liked: 636 times
Joined: Sep 25, 2019 10:32 am
Full Name: Oleg Feoktistov

Post by oleg.feoktistov »

Hi Mattias,

In addition, by any chance, do you have read/write data rates limited for the backup repositories you use as performance extents in your SOBR? Or is any network traffic throttled?
It's just occurred to me that the session report you shared reflects Throttling as a bottleneck, so those might be possible reasons for the offload performance issues.

Thanks,
Oleg

Post by lolbebis »

Hi,
We have throttling enabled on the performance extents, set to 350 MB/s.
The thing is, the object storage gets overloaded and starts timing out.
IBM says that writing to the COS at 500 MB/s should not be a problem.

But right now I am seeing lots of timeouts from the COS accessors, and Veeam is tiering 122 VMs at the same time?
At least the percentage is ticking on 122 different VMs.

Post by oleg.feoktistov »

Then let's see what support will come up with. Please also share an outcome on the case here. Thanks!

Post by lolbebis » 1 person likes this post

The issue seems to be resolved now; no 500 errors in the last 24 hours.
The problem was related to one of the Slicestors in IBM COS, so not a Veeam problem.
Gostev
Chief Product Officer
Posts: 31559
Liked: 6724 times
Joined: Jan 01, 2006 1:01 am
Location: Baar, Switzerland

Post by Gostev »

Great, and thanks for coming back to share what was the issue.

Post by lolbebis »

We are still having problems; the issue seems to be that Veeam is offloading too fast for the object storage to keep up.
Is there some way to throttle the SOBR tiering?
We have tried S3ConcurrentTaskLimit = 2, but that does not seem to help in this case.

Post by Gostev »

You can configure Network Traffic Rules in Veeam to limit the bandwidth available to the offload process.

Post by lolbebis » 1 person likes this post

I chose to access the S3 storage through a gateway virtual machine (instead of the Veeam server), and I throttled the VM.
That also worked.

Post by lolbebis »

Still having problems when too many SOBR jobs run at the same time, even if the megabytes per second are limited.
The object storage gets hammered with requests.

Is there any way to limit the SOBR jobs to run just one at a time?

Post by Gostev » 1 person likes this post

Honestly, I would rather look at changing your object storage. There are too many other alternatives that "just work" for our customers.

Honestly, if a few requests per second "hammer" your object storage, then it's probably completely overloaded. With that in mind, stop worrying about backup for a moment, and just think about having to deal with this very issue at the restore time. I mean, you already spent 3 weeks trying to make it work reliably, and still did not succeed... will you have 4 weeks when you need to do a restore?

Also, let's be pragmatic here: even if you do eventually make backup work reliably by severely limiting its throughput, how long will restore from object storage take in the same conditions? It's not like the restore process "hammers" object storage any differently than backup.

Post by lolbebis »

You would think that a new object storage for $500k would be able to handle backups.
According to IBM developers, Veeam is causing over 5000 requests every second to the object storage; that is more than a few requests.

But you are right, we will change the object storage or the backup software.

Post by Gostev »

Well, my previous comment was assuming you do have S3ConcurrentTaskLimit set to 2, and yet the storage STILL can't keep up with those few requests per second. But I don't really see how Veeam can be causing 5000 requests per second with this setting enabled.

Perhaps this registry key is simply not set correctly? It should be created on the gateway server associated with the object storage. Perhaps you should ask support to check that this key is set correctly; its usage should be reflected in the debug logs, which will also show all S3 API calls Veeam is issuing.
Andreas Neufert
VP, Product Management
Posts: 6748
Liked: 1408 times
Joined: May 04, 2011 8:36 am
Full Name: Andreas Neufert
Location: Germany

Post by Andreas Neufert »

5000 requests per second is something around 2.5 GB/s, which is more than a single 10GbE link can handle.

The screenshot above shows that it is roughly 180 IO/s or, better said, 180 PUT requests per second.

Based on this difference, something looks off from a configuration perspective.
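
As a rough sanity check of these figures, the sketch below converts a request rate into the throughput it would imply. It assumes an average object size of about 512 KB, which is only what 2.5 GB/s divided by 5000 requests per second works out to; the function names are illustrative and are not part of any Veeam or IBM tool.

```python
# Assumption: ~512 KB average object size, back-derived from the figures
# in this post (2.5 GB/s / 5000 requests/s). Real object sizes vary with
# backup block size and compression.
ASSUMED_OBJECT_BYTES = 512 * 1024


def implied_throughput_mb_s(requests_per_second, object_bytes=ASSUMED_OBJECT_BYTES):
    """Return the data rate in MB/s (10**6 bytes) implied by a PUT rate."""
    return requests_per_second * object_bytes / 1_000_000


def link_capacity_mb_s(gigabits_per_second):
    """Raw line rate of a network link in MB/s, ignoring protocol overhead."""
    return gigabits_per_second * 1000 / 8


if __name__ == "__main__":
    for rate in (5000, 180):
        print(f"{rate} req/s -> {implied_throughput_mb_s(rate):.0f} MB/s")
    print(f"10GbE link -> {link_capacity_mb_s(10):.0f} MB/s")
```

Under that assumption, 5000 requests per second would exceed the raw capacity of one 10GbE link, while 180 requests per second stays far below it.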
ronnmartin61
Veeam Software
Posts: 441
Liked: 131 times
Joined: Mar 07, 2016 3:55 pm
Full Name: Ronn Martin

Post by ronnmartin61 » 1 person likes this post

@lolbebis we conducted joint testing for COS in the IBM labs and observed between 9500 - 12000 PUT operations / 5 minutes, or 16 - 62 PUT operations / second, running 3 concurrent tasks at around 140 MB/s. So yes, 5000/second is far in excess of what I'd expect based on the joint testing we've done. Can you re-check settings per Gostev's directions?

Post by Andreas Neufert »

Please allow me to add that the numbers Ronn shared above are from lab tests generating 4x more PUT requests than normal. This is done on purpose to see if the storage can cope with the large number of requests. We achieve this by having the test bed use a non-default block size for backups (the smallest we support in the product). In other words, you should expect to see 4 times fewer operations per second in your environment.

Post by lolbebis »

Well, it has actually been about 9 months that we have been trying to get it to work :D not 3 weeks.
But we have a case with IBM regarding this: TS003656104

I can see in the logs that the value S3ConcurrentTaskLimit=2 is working; the tiering is doing about 50-200 MB/s with that setting (throttled in a virtual machine).
Without the virtual machine limit it is tiering at 600 MB/s, and then we are oversubscribing the disks in the COS.
But 200 MB/s should not be a problem, and we have 6x10GbE on the COS (3 sites, 2x10GbE per site).

We don't need better performance than 200 MB/s.
We have a flash tier for disaster recovery and only need to tier older backups.

But it seems like it is the delete operations during repository cleanup that overload the system.
And once that happens, PUT operations also stop working.

But now we wait for IBM to dig in the logs.
But I realise that it probably is some configuration issue or that we are not using the object storage the way it's supposed to be used.

Post by Gostev » 2 people like this post

I double-checked with the developer behind this functionality, and he confirmed that S3ConcurrentTaskLimit is used for DELETE operations as well. In other words, there will be no more than this number of outstanding DELETE requests at any given time.
lolbebis wrote (May 13, 2020 11:40 am): "But it seems like it is the delete operations during repository cleanup that overload the system. And once that happens, PUT operations also stop working."
This makes me suspect the issue is not with the number of requests per second, which is very low - but rather with the number of objects deleted. A single bulk DELETE command requests the deletion of thousands of objects at once. And certain object storage designs really struggle with delete processing due to architectural limitations. This seems to be specific to object storage using Cassandra on the back end, if I remember correctly. We've seen similar issues with Cloudian in the early days, and they were able to address this relatively quickly on their end.

Post by lolbebis »

Thanks! I will forward that information into the IBM case.

Post by Gostev » 1 person likes this post

The same developer also just dug out the S3MultiObjectDeleteLimit debug key that he once used to research a similar issue, and apparently it was never removed - so it should still work. This key limits the number of objects in a single bulk delete request, so please try playing with it. But don't go too low, as otherwise your retention processing may start lagging behind backups.

Post by lolbebis »

Thanks, I am trying now with S3MultiObjectDeleteLimit=250.
With S3ConcurrentTaskLimit=2, that would mean it will delete 2x250 files at a time, I guess.
It is usually during weekends, when backup chains are sealed, that there will be lots of deletes.
And I also hope that IBM will find something.
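
To picture how the two keys might interact, here is a minimal sketch that splits a list of object keys into bulk-delete batches and estimates how many objects are in flight at once. The names `batch_keys` and `max_objects_in_flight` are hypothetical, and the model (one bulk request per concurrent task) is an assumption based on the description in this thread, not Veeam's actual implementation.

```python
from typing import Iterator, List, Sequence


def batch_keys(keys: Sequence[str], delete_limit: int) -> Iterator[List[str]]:
    """Yield successive batches of at most delete_limit keys each."""
    if delete_limit < 1:
        raise ValueError("delete_limit must be at least 1")
    for start in range(0, len(keys), delete_limit):
        yield list(keys[start:start + delete_limit])


def max_objects_in_flight(concurrent_task_limit: int, delete_limit: int) -> int:
    """Upper bound on objects being deleted at once, assuming each
    concurrent task carries one bulk delete request (an assumption)."""
    return concurrent_task_limit * delete_limit


if __name__ == "__main__":
    keys = [f"object-{i}" for i in range(1100)]  # made-up key names
    batches = list(batch_keys(keys, delete_limit=250))
    print(len(batches), [len(b) for b in batches])
    print(max_objects_in_flight(2, 250))
```

A lower limit means more, smaller requests for the same number of objects, which is one way to read the warning that going too low could let retention processing fall behind.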

Post by lolbebis »

I am not sure I understand what the S3ConcurrentTaskLimit does.
When I start a SOBR-tiering job there are lots of servers tiering at the same time. On the screenshot I just made there are about 55 servers being sent to S3 at the same time, at least if you look at the percentage counter on each server.
The only way to limit the number of servers tiering at the same time seems to be to lower the number of concurrent tasks for each backup repository?
So what does S3ConcurrentTaskLimit do?
Thanks.


Image

Post by veremin » 2 people like this post

It controls the number of HTTPS requests that one offload process can open for each disk in a backup. By default, it opens 64 connections.

This is how offload process works:
  • Offload process looks for available repository slots
  • If available, it consumes one slot per offloaded disk
  • Offload process then opens 64 connections (by default) to object storage for each disk
    • 1 connection corresponds to 1 TCP socket (IP:Port) and is used to execute 1 S3 request (PUT, GET, DELETE, etc.) at a time
  • Each connection is kept open for a predefined time period so that additional processing tasks can reuse it
Example:
  • Repository with 5 available slots
  • Backups of 20 VMs with 1 disk each to offload
  • S3ConcurrentTaskLimit is set to 2 (reduced from default 64)
  • Offload process starts and grabs all available repository slots (while it needs 20 slots for 20 disks, only 5 are available)
  • Offload process opens 5 slots * 2 connections limit = 10 total connections to object storage
So, as you can see, the registry key controls a different setting and does not affect the number of disks offloaded.
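
The example above can be restated as a small calculation. This is only a sketch of the arithmetic described in this post; `total_connections` is a hypothetical helper, not a Veeam API.

```python
def total_connections(available_slots: int, disks_to_offload: int,
                      concurrent_task_limit: int = 64) -> int:
    """Connections opened to object storage under the model in this post:
    one repository slot per offloaded disk, and up to
    concurrent_task_limit connections per occupied slot."""
    active_disks = min(available_slots, disks_to_offload)
    return active_disks * concurrent_task_limit


if __name__ == "__main__":
    # The worked example: 5 slots, 20 single-disk VMs, limit lowered to 2.
    print(total_connections(5, 20, 2))
    # Same repository with the default limit of 64.
    print(total_connections(5, 20))
```

In this model the number of disks offloaded at once is capped by the repository slots, so the repository's concurrent task setting, rather than the registry key, is what limits how many disks are offloaded in parallel.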

Thanks!

Post by lolbebis » 1 person likes this post

Thanks for a very good explanation!
So I guess that means that the available slots on the backup repository will impact the number of connections and requests to the S3 storage.

If I tier 100 VMs x 2 connections = 200 connections,
that would be about the same number of requests as tiering 3 VMs with 64 connections = 192 connections.

This object storage is choking at a very low number of connections.

Post by veremin »

lolbebis wrote: "So I guess that means that the available slots on the backup repository will impact the number of connections and requests to the S3 storage."
Yep, you got it right.

Post by lolbebis » 1 person likes this post

Hi,
Just a quick update on this.
IBM found a bug that caused a deadlock in the object storage.
I have updated the object storage to 3.15.0.44 and performance seems to be much better now (the fix was implemented a few versions prior to that one).
Only tested a few days, but so far so good.

Post by veremin »

Thanks, Mattias, for sharing your findings with us; keep us updated on how the situation continues!