Error: Warning stuck buffer_pool buffer with IO workload #8521
Comments
@shirady can you please take a look?
Hi @rkomandu,
Additional details: at this moment I'm looking for a level-1 printing from here: noobaa-core/src/util/buffer_utils.js, line 205 in e1bf29e.
As you can see from the "stuck" error, it comes on a GET request (in NSFS). We want to see the values of the buffer pool at that point; we know that it started with 0:
This printing comes from here: noobaa-core/src/util/buffer_utils.js, lines 217 to 223 in e1bf29e.
This warning might be printed only after a 2-minute wait, as configured here: line 789 in e1bf29e.
BTW, although it is a warning, I'm not sure why it is printed as an ERROR.
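For readers following along, here is a minimal sketch of the pattern described above, assuming a much-simplified pool; the real logic lives in src/util/buffer_utils.js, and the class and field names below (everything except the NSFS_BUF_POOL_WARNING_TIMEOUT value) are illustrative, not the actual noobaa API:

```js
'use strict';
// Simplified sketch of a buffer pool that warns when a caller is stuck
// waiting for a free buffer. Illustrative only; not the noobaa implementation.
// Note: the current code emits this message at ERROR level, hence the
// "CONSOLE:: Error" prefix even though it is only a warning.

const WARNING_TIMEOUT_MS = 2 * 60 * 1000; // mirrors config.NSFS_BUF_POOL_WARNING_TIMEOUT

class SketchBuffersPool {
    constructor({ buf_size, max_buffers }) {
        this.buf_size = buf_size;
        this.max_buffers = max_buffers;
        this.allocated = 0;
        this.free = [];
        this.waiters = [];
    }

    async get_buffer() {
        // Warn (but keep waiting) if this call has not been served in time.
        const warn_timer = setTimeout(
            () => console.warn('Warning stuck buffer_pool buffer',
                { free: this.free.length, waiting: this.waiters.length }),
            WARNING_TIMEOUT_MS);
        try {
            if (this.free.length) return this.free.pop();
            if (this.allocated < this.max_buffers) {
                this.allocated += 1;
                return Buffer.allocUnsafe(this.buf_size);
            }
            // Pool exhausted: wait until another request releases a buffer.
            return await new Promise(resolve => this.waiters.push(resolve));
        } finally {
            clearTimeout(warn_timer);
        }
    }

    release_buffer(buffer) {
        const waiter = this.waiters.shift();
        if (waiter) waiter(buffer);
        else this.free.push(buffer);
    }
}

module.exports = { SketchBuffersPool };
```

In this shape, a GET that cannot obtain a buffer within the timeout triggers the printout but still completes once a buffer is released.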
Let's see if we can get your cluster working for the run. DEBUGLEVEL: all. Will update once the run completes.
Note: high-level thought, this bug might be recreated with load on the system. I am saying this because issue 8524 with the versioning PUT method didn't run into this error when run for 3 hours (ran on Wednesday).
Ran the 90-minute Warp GET op run and didn't run into the buffer_pool error:

./warp get --insecure --duration 90m --host :6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-get-8521-15nov 2>&1 | tee /tmp/newbucket-warp-get-255M-8521-15nov.log

[root@gui0 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui1 log]# zgrep "stuck" noobaa.log-20241115.gz
Please try to check from the code-flow perspective; as mentioned, it could also be related to load on the system.
Hi,
noobaa-core/src/util/buffer_utils.js, line 205 in e1bf29e.
@rkomandu, I'm planning to check other things and will update you here about it.
Additional information: I ran:
The outputs:
Hi,
I set a shorter timeout (30 milliseconds instead of 2 minutes):

- config.NSFS_BUF_POOL_WARNING_TIMEOUT = 2 * 60 * 1000;
+ config.NSFS_BUF_POOL_WARNING_TIMEOUT = 30; //SDSD

I reduced the size:

      size: config.NSFS_BUF_SIZE_L,
-     sem_size: config.NSFS_BUF_POOL_MEM_LIMIT_L,
+     sem_size: 16777216, //SDSD

I added a printing to see when a buffer is allocated:

      } else {
+         console.log('SDSD in buffer allocation');
          buffer = this.buffer_alloc(this.buf_size);
      }

Steps:
After the code changes mentioned above, the operation completed:
I could see the request-timeout error logs twice:
Before the mentioned printings, I saw that those cases were when a buffer was allocated.
This issue was not reproduced (we both tried), and it is probably something specific to a run on a specific node state. As suggested, we can change the console log printing from ERROR to WARN.

Additional information (easier to read), following the above: create a situation where there is no available buffer:

      size: config.NSFS_BUF_SIZE_L,
-     sem_size: config.NSFS_BUF_POOL_MEM_LIMIT_L,
+     sem_size: 1, //SDSD

This would result in an error.
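To make the sem_size experiment concrete, here is a hedged, self-contained sketch (simplified stand-ins, not the actual noobaa semaphore or pool classes): a semaphore budget of 1 byte can never cover a large-buffer request, so the caller waits until the warning timeout fires.

```js
'use strict';
// Sketch of why sem_size: 1 starves the pool: the semaphore tracks a byte
// budget, so a request for a large buffer can never be granted and the
// warning timer fires. Illustrative only; names are not the noobaa API.

class SketchSemaphore {
    constructor(size) { this.free = size; this.waiters = []; }
    async wait(n) {
        if (n <= this.free) { this.free -= n; return; }
        await new Promise(resolve => this.waiters.push({ n, resolve }));
    }
    release(n) {
        this.free += n;
        while (this.waiters.length && this.waiters[0].n <= this.free) {
            const w = this.waiters.shift();
            this.free -= w.n;
            w.resolve();
        }
    }
}

async function main() {
    const BUF_SIZE = 8 * 1024 * 1024;    // stands in for config.NSFS_BUF_SIZE_L
    const sem = new SketchSemaphore(1);  // stands in for sem_size: 1
    const warn_timer = setTimeout(
        () => console.error('Warning stuck buffer_pool buffer'), 30); // 30 ms, as in the test above
    await sem.wait(BUF_SIZE);            // never granted, so the warning prints
    clearTimeout(warn_timer);
}

main();
```

Running it prints the warning after roughly 30 ms and then exits, since nothing ever releases enough budget to satisfy the request.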
@shirady, what is the next step to do with this issue?
@romayalon, I would need to check if we can improve the current behavior of buffer pools.
@shirady, ran into this problem (occurrence in noobaa.log) when running Warp and trying to perform suspend/resume (HA functionality). The noobaa stage RPM that Romy generated for the other issue (8577), noobaa-core-5.17.1-20241211.el9.ppc64le, is used.
Copied the noobaa.log file of the gui1 node into the Box folder: https://ibm.box.com/s/47osyg44y7ph2o31bd8gj1vzt8ok1xc2
Hi,
The maximum buffers length I noticed is 30. Could you please share what the resources are for the noobaa service on this node?
@shirady
This indicates the Noobaa service has been running for 2d 21h 7min, etc.
I will update you that I plan to add log printing related to the duration of the actions around the flow of the timeout message, and after 30 milliseconds there is a printing of the warning.
Note: the new logs you attached are using RPM:
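As a rough illustration of that plan (a sketch only; buffers_pool and dbg are placeholders for whichever pool and logger objects the real flow uses), the duration could be measured around the wait itself:

```js
'use strict';
// Sketch: measure how long a get_buffer() call actually waited and log the
// duration next to the existing warning, so slow waits can be correlated
// with load. Placeholder objects; not the exact noobaa code.

async function get_buffer_with_timing(buffers_pool, dbg) {
    const start = process.hrtime.bigint();
    const buffer = await buffers_pool.get_buffer();
    const waited_ms = Number(process.hrtime.bigint() - start) / 1e6;
    if (waited_ms > 1000) {
        dbg.warn('buffer_pool get_buffer waited', waited_ms.toFixed(1), 'ms');
    }
    return buffer;
}

module.exports = { get_buffer_with_timing };
```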
@shirady, in the noobaa log a grep for Memory was performed; there was no other way to get the memory at the timestamp when the length value was at 30.
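If grepping the log is the only way to see memory after the fact, one hedged alternative (an assumption, not existing noobaa behavior) is to sample the endpoint process memory on an interval so the value at any timestamp can be read back later:

```js
'use strict';
// Sketch: periodically log process memory so the value at a given timestamp
// (e.g. when buffers length hits 30) can be looked up afterwards.
// Interval and log format are assumptions, not existing noobaa behavior.

const MB = 1024 * 1024;
setInterval(() => {
    const mem = process.memoryUsage();
    console.log('memory sample', new Date().toISOString(),
        'rss_mb', (mem.rss / MB).toFixed(1),
        'heap_used_mb', (mem.heapUsed / MB).toFixed(1),
        'external_mb', (mem.external / MB).toFixed(1));
}, 60 * 1000).unref();
```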
Environment info
noobaa-20241104 (5.17.1) - standalone noobaa
Actual behavior
./warp get --insecure --duration 60m --host .com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket warp-get-bucket-reg 2>&1| tee /tmp/warp-get-11nov2024.log
Observed the following in the log (the system was concurrently running a long versioning test in another directory as well):
No errors were observed on the client node. I am saying around 03:54 because the GPFS daemon had started back on one node (out of the 2 protocol nodes) where RR-DNS is configured; the IO continued to run when the HA happened previously. So the above message is not related to HA (will attach the logs).
Default endpoint forks in the system, with 2 CES S3 nodes each having 1 CES IP assigned.
Expected behavior
Are we expected to get these ERRORS, as posted above?
CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer
Steps to reproduce
Run Warp as shown above; it occurred on a system that had, I would say, a medium workload.
More information - Screenshots / Logs / Other output
Will update once the logs are uploaded: https://ibm.ent.box.com/folder/293508364523