-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Performance problem with SOBR tiering to COS
Hi,
I am having a performance problem with tiering to S3 storage (IBM COS on-prem).
The offloading takes a very long time and sometimes returns errors.
I am also seeing lots of status code 500 errors on the object storage.
I have tried setting "S3ConcurrentTaskLimit" to 8, but that does not seem to help; the tiering is also running way more than 8 VMs simultaneously.
Case number:
#04119837
-
- Product Manager
- Posts: 20406
- Liked: 2298 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Performance problem with SOBR tiering to COS
Your support engineer is about to analyze the logs provided. Let's see what he suggests after that. Thanks!
-
- Veeam Software
- Posts: 2010
- Liked: 670 times
- Joined: Sep 25, 2019 10:32 am
- Full Name: Oleg Feoktistov
- Contact:
Re: Performance problem with SOBR tiering to COS
Hi Mattias,
In addition, by any chance, do you have read/write data rates limited for the backup repositories you use as performance extents in your SOBR? Or is any network traffic throttled?
It just occurred to me that the session report you shared shows Throttling as the bottleneck, so those might be possible reasons for the offload performance issues.
Thanks,
Oleg
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Hi,
We have throttling enabled on the performance extents, set to 350 MB/s.
The thing is, the object storage gets overloaded and starts timing out.
IBM says that writing to the COS at 500 MB/s should not be a problem.
But right now I am seeing lots of timeouts from the COS Accesser nodes, and Veeam is tiering 122 VMs at the same time?
At least the percentage counter is ticking on 122 different VMs.
-
- Veeam Software
- Posts: 2010
- Liked: 670 times
- Joined: Sep 25, 2019 10:32 am
- Full Name: Oleg Feoktistov
- Contact:
Re: Performance problem with SOBR tiering to COS
Then let's see what support will come up with. Please also share an outcome on the case here. Thanks!
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
The issue seems to be resolved now: no 500 errors in the last 24 hours.
The problem was related to one of the Slicestor nodes in IBM COS, so not a Veeam problem.
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Performance problem with SOBR tiering to COS
Great, and thanks for coming back to share what the issue was.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
We are still having problems; the issue seems to be that Veeam is offloading too fast for the object storage to keep up.
Is there some way to throttle the SOBR tiering?
We have tried S3ConcurrentTaskLimit = 2, but that does not seem to help in this case.
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Performance problem with SOBR tiering to COS
You can configure Network Traffic Rules in Veeam to limit the bandwidth available to the offload process.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
I chose to access the S3 storage through a gateway virtual machine (instead of the Veeam server), and I throttled that VM.
That also worked.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Still having problems when too many SOBR jobs run at the same time, even with the megabytes per second limited.
The object storage gets hammered with requests.
Is there any way to limit the SOBR jobs to run just one at a time?
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Performance problem with SOBR tiering to COS
Honestly, I would rather look at changing your object storage. There are too many other alternatives that "just work" for our customers.
Honestly, if a few requests per second "hammer" your object storage, then it's probably completely overloaded. With that in mind, stop worrying about backup for a moment, and just think about having to deal with this very issue at restore time. I mean, you already spent 3 weeks trying to make it work reliably, and still did not succeed... will you have 4 weeks when you need to do a restore?
Also, let's be pragmatic here: even if you do eventually make backup work reliably by severely limiting its throughput, how long will restore from object storage take in the same conditions? It's not like the restore process "hammers" object storage any differently than backup.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
You would think that a new object storage for $500k would be able to handle backups.
According to IBM developers, Veeam is causing over 5000 requests every second to the object storage; that is more than a few requests.
But you are right, we will change the object storage or the backup software.
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Performance problem with SOBR tiering to COS
Well, my previous comment was assuming you do have S3ConcurrentTaskLimit set to 2, and yet the storage STILL can't keep up with those few requests per second. But I don't really see how Veeam can be causing 5000 requests per second with this setting enabled.
Perhaps this registry key is simply not set correctly? It should be created on the gateway server associated with the object storage. Perhaps you should ask support to check that this key is set correctly; its usage should be reflected in the debug logs. And they will also show all S3 API calls Veeam is issuing.
-
- VP, Product Management
- Posts: 7077
- Liked: 1510 times
- Joined: May 04, 2011 8:36 am
- Full Name: Andreas Neufert
- Location: Germany
- Contact:
Re: Performance problem with SOBR tiering to COS
5000 requests per second is something around 2.5 GB/s, which is more than a single 10 GbE link can handle.
The screenshot above shows roughly 180 IO/s, or rather 180 PUT requests per second.
Based on this difference, something looks off from a configuration perspective.
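As a quick sanity check of Andreas's arithmetic, 5000 PUT/s implies roughly 2.5 GB/s only if each object is around 512 KB; that object size is an assumption here (the actual size depends on job block size and compression), and the numbers are just the figures quoted in this thread:

```python
# Sanity check: 5000 PUT requests/s at an assumed ~512 KB average
# object size implies the throughput figure quoted above.
requests_per_sec = 5000
object_size_bytes = 512 * 1024              # assumed average object size

throughput_gbs = requests_per_sec * object_size_bytes / 1e9
print(f"Implied throughput: {throughput_gbs:.2f} GB/s")   # ~2.62 GB/s

# A single 10 GbE link carries at most ~1.25 GB/s of raw bandwidth:
link_limit_gbs = 10e9 / 8 / 1e9
print(f"10 GbE raw limit:   {link_limit_gbs:.2f} GB/s")
```

So the claimed request rate would saturate more than two 10 GbE links, which supports the suspicion that something in the configuration or measurement is off.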
-
- Veeam Software
- Posts: 538
- Liked: 192 times
- Joined: Mar 07, 2016 3:55 pm
- Full Name: Ronn Martin
- Contact:
Re: Performance problem with SOBR tiering to COS
@lolbebis we conducted joint testing for COS in the IBM labs and observed between 9500 and 12000 PUT operations per 5 minutes, or 16 to 62 PUT operations per second, running 3 concurrent tasks at around 140 MB/s. So yes, 5000/second is far in excess of what I'd expect based on the joint testing we've done. Can you re-check the settings per Gostev's directions?
-
- VP, Product Management
- Posts: 7077
- Liked: 1510 times
- Joined: May 04, 2011 8:36 am
- Full Name: Andreas Neufert
- Location: Germany
- Contact:
Re: Performance problem with SOBR tiering to COS
Please allow me to add that the numbers Ronn shared above are from lab tests generating 4x more PUT requests than normal. This is done on purpose to see if the storage can cope with the large number of requests. We achieve this by having the test bed using non-default block size for backups (the smallest we support in the product). In other words, you should expect to see 4 times less operations per second in your environment.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Well, it has actually been about 9 months we have been trying to get it to work, not 3 weeks.
But we have a case with IBM regarding this: TS003656104
I can see in the logs that the value S3ConcurrentTaskLimit=2 is working; the tiering is doing about 50-200 MB/s with that setting (throttled in a virtual machine).
Without the virtual machine limit it is tiering at 600 MB/s, and then we are oversubscribing the disks in the COS.
But 200 MB/s should not be a problem, and we have 6x10 GbE on the COS (3 sites, 2x10 GbE per site).
We don't need better performance than 200 MB/s.
We have a flash tier for disaster recovery and only need to tier older backups.
But it seems like it is the delete operations during repository cleanup that overload the system.
And once that happens, PUT operations also stop working.
But now we wait for IBM to dig into the logs.
But I realise that it is probably some configuration issue, or that we are not using the object storage the way it's supposed to be used.
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Performance problem with SOBR tiering to COS
I double-checked with the developer behind this functionality, and he confirmed that S3ConcurrentTaskLimit is used for DELETE operations as well. In other words, there will be no more than this number of outstanding DELETE requests at any given time.
This makes me suspect the issue is not with the number of requests per second, which is very low, but rather with the number of objects deleted. A single bulk DELETE command requests the deletion of thousands of objects at once. And certain object storage designs really struggle with delete processing due to architectural limitations. This seems to be specific to object storage using Cassandra on the back end, if I remember correctly. We've seen similar issues with Cloudian in the early days, and they were able to address this relatively quickly on their end.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Thanks! I will forward that information into the IBM case.
-
- Chief Product Officer
- Posts: 31806
- Liked: 7300 times
- Joined: Jan 01, 2006 1:01 am
- Location: Baar, Switzerland
- Contact:
Re: Performance problem with SOBR tiering to COS
The same developer also just dug out the S3MultiObjectDeleteLimit debug key that he used to research a similar issue once, and apparently it was never removed, so it should still work. This key limits the number of objects in a single bulk delete request, so please try playing with it. But don't go too low, as otherwise your retention processing may start lagging behind backups.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Thanks, I am trying now with S3MultiObjectDeleteLimit=250.
With S3ConcurrentTaskLimit=2, that would mean it will delete 2x250 objects at a time, I guess.
It is usually during weekends, when backup chains are sealed, that there are lots of deletes.
And I also hope that IBM will find something.
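The "2x250 at a time" guess above can be written out as a rough model. Note this is only the reasoning used in the thread; how the two registry keys actually interact inside Veeam is an assumption here, not documented behavior:

```python
# Rough model of the delete load implied by the two registry keys
# discussed above (the interaction is an assumption, per the thread).
s3_concurrent_task_limit = 2         # max outstanding S3 requests
s3_multi_object_delete_limit = 250   # max objects per bulk DELETE request

max_objects_in_flight = s3_concurrent_task_limit * s3_multi_object_delete_limit
print(max_objects_in_flight)  # 500 objects being deleted at a time
```

Lowering either key shrinks the burst the storage back end has to absorb, at the cost of slower retention processing.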
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
I am not sure I understand what the S3ConcurrentTaskLimit does.
When I start a SOBR tiering job there are lots of servers tiering at the same time. On the screenshot I just made there are about 55 servers being sent to S3 at the same time, at least if you look at the percentage counter on each server.
The only way to limit the number of servers tiering at the same time seems to be to lower the number of concurrent tasks on each backup repository?
So what does S3ConcurrentTaskLimit do?
Thanks.
-
- Product Manager
- Posts: 20406
- Liked: 2298 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Performance problem with SOBR tiering to COS
It controls the number of HTTPS connections that one offload process can open for each disk in a backup. By default, it opens 64 connections.
This is how offload process works:
- Offload process looks for available repository slots
- If available, it consumes one slot per offloaded disk
- Offload process then opens 64 connections (by default) to object storage for each disk
- 1 connection corresponds to 1 TCP socket (IP:Port) and is used to execute 1 S3 request (PUT, GET, DELETE, etc.) at a time
- Each connection is kept open for a predefined time period so that additional processing tasks can reuse it
For example, assume:
- Repository with 5 available slots
- Backups of 20 VMs with 1 disk each to offload
- S3ConcurrentTaskLimit is set to 2 (reduced from default 64)
- Offload process starts and grabs all available repository slots (while it needs 20 slots for 20 disks, only 5 are available)
- Offload process opens 5 slots * 2 connections limit = 10 total connections to object storage
Thanks!
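The worked example above can be sketched as a small calculation. The function name and structure here are illustrative only; the model (slots consumed per disk, connections per disk capped by S3ConcurrentTaskLimit) is taken from Vladimir's description:

```python
# Illustrative model of the offload connection count described above:
# slots are consumed one per offloaded disk, and each active disk
# opens up to S3ConcurrentTaskLimit connections to object storage.
def offload_connections(disks_to_offload, repo_slots, s3_task_limit):
    # Only as many disks run concurrently as there are repository slots.
    active_disks = min(disks_to_offload, repo_slots)
    return active_disks * s3_task_limit

# 20 single-disk VMs, 5 repository slots, S3ConcurrentTaskLimit=2:
print(offload_connections(20, 5, 2))    # 10 connections

# Same repository with the default limit of 64:
print(offload_connections(20, 5, 64))   # 320 connections
```

This also shows why lowering S3ConcurrentTaskLimit alone may not be enough: the repository task limit multiplies it.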
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Thanks for a very good explanation!
So I guess that means that the available slots on the backup repository will impact the number of connections and requests to the S3 storage.
If I tier 100 VMs x 2 connections = 200 connections.
That would be about the same number of connections as tiering 3 VMs with 64 connections each = 192 connections.
This object storage is choking at a very low number of connections.
-
- Product Manager
- Posts: 20406
- Liked: 2298 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Performance problem with SOBR tiering to COS
Yep, you got it right.
-
- Enthusiast
- Posts: 26
- Liked: 5 times
- Joined: Feb 26, 2020 9:33 am
- Full Name: Mattias Jacobsson
- Contact:
Re: Performance problem with SOBR tiering to COS
Hi,
Just a quick update on this.
IBM found a bug that caused a deadlock in the object storage.
I have updated the object storage to 3.15.0.44 and performance seems to be much better now (the fix was implemented a few versions prior to that one).
Only tested for a few days, but so far so good.
-
- Product Manager
- Posts: 20406
- Liked: 2298 times
- Joined: Oct 26, 2012 3:28 pm
- Full Name: Vladimir Eremin
- Contact:
Re: Performance problem with SOBR tiering to COS
Thanks, Mattias, for sharing your findings with us; keep us updated on how the situation develops!