Error: Warning stuck buffer_pool buffer with IO workload #8521

rkomandu opened this issue Nov 12, 2024 · 14 comments
rkomandu commented Nov 12, 2024

Environment info

  • NooBaa Version: noobaa-20241104 (5.17.1)
  • Platform: standalone NooBaa

Actual behavior

  1. Ran the Warp IO workload of "get"

./warp get --insecure --duration 60m --host .com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket warp-get-bucket-reg 2>&1| tee /tmp/warp-get-11nov2024.log

Observed the following in the log (the system was concurrently running a long versioning test in another directory as well):

Nov 11 03:54:17 node-gui0 [3532906]: [nsfs/3532906] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
grep "Error: Warning stuck buffer_pool buffer" noobaa.log | grep "03:54" |wc -l
30

No errors were observed on the client node. I am pointing at around 03:54 because the GPFS daemon had started back on one node (out of the 2 protocol nodes) where the RR-DNS is configured, and the I/O continued to run when the HA event happened earlier. So the message above is not related to HA (will attach the logs).


Warp analyze --> ignore the 10 "read tcp" errors, as they are expected while the HA is happening (i.e., GPFS is starting):

./warp analyze --analyze.op GET --analyze.v  warp-get-2024-11-11[030028]-5qrv.csv.zst --debug
7550 operations loaded... Done!

----------------------------------------
Operation: GET (5050). Ran 1h0m4s. Size: 256000000 bytes. Concurrency: 20.
Errors: 15
First Errors:
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59180->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59164->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59270->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59244->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59256->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59258->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59178->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59166->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59222->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59234->9.42.93.99:6443: read: connection reset by peer


Requests considered: 5014:
 * Avg: 14.302s, 50%: 17.271s, 90%: 21.121s, 99%: 23.583s, Fastest: 1.486s, Slowest: 1m54.959s, StdDev: 9.26s
 * TTFB: Avg: 141ms, Best: 11ms, 25th: 55ms, Median: 73ms, 75th: 93ms, 90th: 118ms, 99th: 216ms, Worst: 1m35.183s StdDev: 2.096s
 * First Access: Avg: 17.059s, 50%: 19.136s, 90%: 21.355s, 99%: 23.466s, Fastest: 1.486s, Slowest: 1m52.859s, StdDev: 8.407s
 * First Access TTFB: Avg: 152ms, Best: 21ms, 25th: 59ms, Median: 78ms, 75th: 101ms, 90th: 130ms, 99th: 335ms, Worst: 1m33.717s StdDev: 2.18s
 * Last Access: Avg: 10.997s, 50%: 7.742s, 90%: 20.381s, 99%: 22.633s, Fastest: 1.486s, Slowest: 1m52.228s, StdDev: 6.788s
 * Last Access TTFB: Avg: 91ms, Best: 16ms, 25th: 52ms, Median: 69ms, 75th: 86ms, 90th: 108ms, 99th: 192ms, Worst: 36.283s StdDev: 782ms

Throughput:
* Average: 340.77 MiB/s, 1.40 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 678.2MiB/s, 2.78 obj/s (15s, starting 03:57:01 EST)
 * 50% Median: 248.9MiB/s, 1.02 obj/s (15s, starting 03:26:16 EST)
 * Slowest: 43.6MiB/s, 0.18 obj/s (15s, starting 03:14:16 EST)

Default endpoint forks in the system, with 2 CES S3 nodes and 1 CES IP assigned to each.

Expected behavior

Are we expected to get these errors, as posted above?

CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer

Steps to reproduce

Run warp as shown above; it occurred on a system under what I would call a medium workload.

More information - Screenshots / Logs / Other output

will update once the logs are uploaded https://ibm.ent.box.com/folder/293508364523

rkomandu added the NS-FS label Nov 12, 2024
@romayalon (Contributor)

@shirady can you please take a look?

shirady self-assigned this Nov 14, 2024

shirady commented Nov 14, 2024

Hi @rkomandu,
Could you please reproduce and provide logs with a higher debug level?
I had issues with the GPFS machine and couldn't reproduce a whole test run.

Additional Details

At this moment I'm looking for a level-1 printing from here:

dbg.log1('BufferPool.get_buffer: sem value', this.sem._value, 'waiting_value', this.sem._waiting_value, 'buffers length', this.buffers.length);

As you can see from the stuck-buffer error stack:

[ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25) at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46) at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27) at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25) at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19) at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)

It comes on a GET request; in NSFS it is read_object_stream, after we try get_buffer (getting a buffer from memory to execute the read operation).

We want to see the values of sem._waiting_value and the trend: does it keep rising? Does it increase and decrease during the test? etc.

We know that it started with 0:

[3532906]: [nsfs/3532906] [L0] core.sdk.namespace_fs:: NamespaceFS: buffers_pool [ BufferPool.get_buffer: sem value: 2097152 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 33554432 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 425721856 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 3825205248 waiting_value: 0 buffers length: 0 ]
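For reference, here is a minimal Node.js sketch (not part of noobaa-core, just a helper under the assumption that the level-1 log-line format is the one shown above) to pull the waiting_value trend out of a noobaa.log:

// Hypothetical helper: print sem value / waiting_value / buffers length from each
// matching "BufferPool.get_buffer: sem value ..." line, so the trend over the run is visible.
'use strict';
const fs = require('fs');
const readline = require('readline');

async function waiting_value_trend(log_path) {
    const rl = readline.createInterface({ input: fs.createReadStream(log_path) });
    const re = /sem value (\d+) waiting_value (\d+) buffers length (\d+)/;
    for await (const line of rl) {
        const m = re.exec(line);
        if (m) console.log(`sem=${m[1]} waiting=${m[2]} buffers=${m[3]}`);
    }
}

waiting_value_trend(process.argv[2] || 'noobaa.log').catch(console.error);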

This printing comes from here:

if (this.warning_timeout) {
    // schedule a warning that fires if the buffer is not released within warning_timeout
    const err = new Error('Warning stuck buffer_pool buffer');
    warning_timer = setTimeout(() => {
        console.error(err.stack);
    }, this.warning_timeout);
    // unref() so the timer does not keep the process alive
    warning_timer.unref();
}

This warning is printed after a 2-minute wait, as configured here:
config.NSFS_BUF_POOL_WARNING_TIMEOUT = 2 * 60 * 1000;

BTW, although it is a warning, I'm not sure why it is printed with console.error and not console.warn; I can try to suggest a code change.
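A minimal sketch of that suggested change in buffer_utils.js (based on the snippet above; not a committed change):

if (this.warning_timeout) {
    const err = new Error('Warning stuck buffer_pool buffer');
    warning_timer = setTimeout(() => {
        // print as a warning rather than an error, since the buffer may still be released later
        console.warn(err.stack);
    }, this.warning_timeout);
    warning_timer.unref();
}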


rkomandu commented Nov 15, 2024

@shirady

> I had issues with the GPFS machine and couldn't reproduce a whole test run.

Let's see if we can get your cluster working for the run.
For now I have kickstarted a 90-minute run, with the debug level set to all:

DEBUGLEVEL : all
ENABLEMD5 : true

Will update once the run completes.

Note: As a high-level thought, this bug might be recreated with load on the system. I am saying this because issue 8524 with the versioning PUT method did not run into this error when run for 3 hours (ran on Wednesday).

@rkomandu (Collaborator, Author)

@shirady

Ran the Warp get op for 90 minutes and did not run into the buffer_pool error.

./warp get --insecure --duration 90m --host :6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-get-8521-15nov 2>&1| tee /tmp/newbucket-warp-get-255M-8521-15nov.log

[root@gui0 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui0 log]# grep "stuck" noobaa.log

[root@gui1 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui1 log]# grep "stuck" noobaa.log

@rkomandu (Collaborator, Author)

Please try to check from the code-flow perspective; as mentioned, it could also be related to load on the system.


shirady commented Nov 20, 2024

Hi,
I will share that I ran the warp twice and didn't reproduce it.
I didn't find the printing "Error: Warning stuck buffer_pool buffer" (or any "stuck" in the logs).
I ran it with a high debug level and didn't find a place where the waiting_value is not 0, from this output:

dbg.log1('BufferPool.get_buffer: sem value', this.sem._value, 'waiting_value', this.sem._waiting_value, 'buffers length', this.buffers.length);

@rkomandu, I'm planning to check other things; I will update you here about it.

Additional information:

I ran:
./warp get --host=<IP-address-of-1-node> --access-key=<> --secret-key=<> --obj.size=256M --duration=60m --bucket=<bucket-name> --objects 1500 --insecure --tls (after creating the bucket).
I had to set the --objects due to space limits on the machine.

The outputs:

  1. Without changing the NSFS_CALCULATE_MD5
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 367.05 MiB/s, 1.50 obj/s

Throughput, split into 192 x 5s:
 * Fastest: 389.2MiB/s, 1.59 obj/s
 * 50% Median: 368.5MiB/s, 1.51 obj/s
 * Slowest: 325.8MiB/s, 1.33 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 849.39 MiB/s, 3.48 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 896.0MiB/s, 3.67 obj/s
 * 50% Median: 851.6MiB/s, 3.49 obj/s
 * Slowest: 774.0MiB/s, 3.17 obj/s

  2. Changing the NSFS_CALCULATE_MD5 to true
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 161.74 MiB/s, 0.66 obj/s

Throughput, split into 141 x 15s:
 * Fastest: 171.4MiB/s, 0.70 obj/s
 * 50% Median: 162.1MiB/s, 0.66 obj/s
 * Slowest: 148.0MiB/s, 0.61 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 821.38 MiB/s, 3.36 obj/s

Throughput, split into 238 x 15s:
 * Fastest: 853.9MiB/s, 3.50 obj/s
 * 50% Median: 823.2MiB/s, 3.37 obj/s
 * Slowest: 741.7MiB/s, 3.04 obj/s


shirady commented Dec 4, 2024

Hi,
I would add that I tried to reproduce the error occurrence by adding code changes (to force the warning to be printed):

I set a shorter timeout (30 milliseconds instead of 2 minutes).
In config.js:

- config.NSFS_BUF_POOL_WARNING_TIMEOUT = 2 * 60 * 1000;
+ config.NSFS_BUF_POOL_WARNING_TIMEOUT = 30; //SDSD

I reduced the sem_size (the new size represents 16 MiB, twice NSFS_BUF_SIZE_L = 8388608 = 8 * 1024 * 1024 = 8 MiB; the original NSFS_BUF_POOL_MEM_LIMIT_L is 3825205248 = 3648 MiB, while the steps below run on a 256 MB object).
In namespace_fs.js:

    size: config.NSFS_BUF_SIZE_L,
- sem_size: config.NSFS_BUF_POOL_MEM_LIMIT_L,
+ sem_size: 16777216, //SDSD

I added a printing to see when a buffer is allocated:
In buffer_utils.js:

        } else {
+            console.log('SDSD in buffer allocation');
            buffer = this.buffer_alloc(this.buf_size);
        }

Steps:
Before the code changes:

  1. Create an account with the CLI: sudo node src/cmd/manage_nsfs account add --name <account-name> --new_buckets_path /tmp/nsfs_root1 --access_key <access-key> --secret_key <secret-key> --uid <uid> --gid <gid>
    Note: before creating the account, give permission to the new_buckets_path: chmod 777 /tmp/nsfs_root1
  2. Start the NSFS server with: sudo node src/cmd/nsfs --debug 5
    Note: I changed config.NSFS_CHECK_BUCKET_BOUNDARIES = false; //SDSD because I'm using /tmp/ and not /private/tmp/.
  3. Create the alias for the S3 service: alias nc-user-1-s3='AWS_ACCESS_KEY_ID=<access-key> AWS_SECRET_ACCESS_KEY=<secret-key> aws --no-verify-ssl --endpoint-url https://localhost:6443'
  4. Check the connection to the endpoint and try to list the buckets (should be empty): nc-user-1-s3 s3 ls; echo $?
  5. Add a bucket to the account using the AWS CLI: nc-user-1-s3 s3 mb s3://bucket-buf (bucket-buf is the bucket name in this example)
  6. Create the content for the body, 256 MB in size: dd if=/dev/urandom of=256MB_file bs=1M count=256
  7. Put the object: nc-user-1-s3 s3api put-object --bucket bucket-buf --key 256MB_file --body ./256MB_file

After the code changes mentioned above:
  8. Restart the server (press Ctrl+C and run sudo node src/cmd/nsfs --debug 5; optionally redirect to a file: sudo node src/cmd/nsfs --debug 5 > logs_get_object_stream_4_code_changes.txt 2>&1)
  9. Get the object: nc-user-1-s3 s3api get-object --bucket bucket-buf --key 256MB_file output_256MB_file

The operation completed:

Dec-4 16:30:53.142 [Upgrade/37314] [L1] core.util.http_utils:: HTTP REPLY RAW GET /bucket-buf/256MB_file

I could see the warning-timeout error logged twice:

Dec-4 16:30:52.875 [Upgrade/37314] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
Dec-4 16:30:52.880 [Upgrade/37314] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer

Right before the mentioned printings, I saw that those cases were when a buffer was allocated:

Dec-4 16:30:52.844 [Upgrade/37314] [LOG] CONSOLE:: SDSD in buffer allocation
Dec-4 16:30:52.849 [Upgrade/37314] [LOG] CONSOLE:: SDSD in buffer allocation

This issue was not reproduced (we both tried), and it is probably specific to a run in a particular node state.
If, in the future, we find a certain pattern, we can change the values in the config.js file.

As mentioned above, I can suggest changing the console log printing from ERROR to WARN.


Additional Information:

  1. Without code changes, for the mentioned GET of the 256 MB object file:
    Printing of sorted_buf_sizes:

{ size: 4096, sem_size: 2097152 }
{ size: 65536, sem_size: 33554432 },
{ size: 1048576, sem_size: 425721856 },
{ size: 8388608, sem_size: 3825205248 }

easier to read:
size: NSFS_BUF_SIZE_XS = 4 * 1024 (4 KiB), sem_size: NSFS_BUF_POOL_MEM_LIMIT_XS = 2 MiB
size: NSFS_BUF_SIZE_S = 64 * 1024 (64 KiB), sem_size: NSFS_BUF_POOL_MEM_LIMIT_S = 32 MiB
size: NSFS_BUF_SIZE_M = 1 * 1024 * 1024 (1 MiB), sem_size: NSFS_BUF_POOL_MEM_LIMIT_M = 406 MiB
size: NSFS_BUF_SIZE_L = 8 * 1024 * 1024 (8 MiB), sem_size: NSFS_BUF_POOL_MEM_LIMIT_L = 3648 MiB

core.util.buffer_utils:: BufferPool.get_buffer: sem value 3825205248 waiting_value 0 buffers length

Following the pos printings, we can see it reads 8 MiB each time (32 times in total), each time acquiring the NSFS_BUF_SIZE_L semaphore.
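As a quick sanity check of that count (a worked example, assuming the 256 MiB file created with dd bs=1M count=256 in the steps above):

// Worked example: how many NSFS_BUF_SIZE_L reads a full 256 MiB object takes.
const NSFS_BUF_SIZE_L = 8 * 1024 * 1024;    // 8388608 bytes (8 MiB)
const object_size = 256 * 1024 * 1024;      // 268435456 bytes (256 MiB)
console.log(object_size / NSFS_BUF_SIZE_L); // 32 -> matches the 32 semaphore acquisitions seen in the log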

  2. With code changes to create an error:

Create a situation where there is no sem_size available.
In namespace_fs.js:

    size: config.NSFS_BUF_SIZE_L,
-  sem_size: config.NSFS_BUF_POOL_MEM_LIMIT_L,
+ sem_size: 1, //SDSD

This would result in a Semaphore Timeout error:

[ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>SlowDown</Code><Message>Reduce your request rate.</Message><Resource>/bucket-buf/256MB_file</Resource><RequestId>m40yp3na-30ei4o-h11</RequestId></Error> GET /bucket-buf/256MB_file {"host":"localhost:6443","accept-encoding":"identity","user-agent":"aws-cli/2.17.11 md/awscrt#0.20.11 ua/2.0 os/macos#24.1.0 md/arch#arm64 lang/python#3.11.10 md/pyimpl#CPython cfg/retry-mode#standard md/installer#source md/prompt#off md/command#s3api.get-object","x-amz-date":"20241128T065657Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorization":"AWS4-HMAC-SHA256 Credential=Dwertyuiopasdfg11001/20241128/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=fe2f4f7014234a7b57486c39a3a5ce2ee1663f9b26caa7803df54ed87fe13c9d"} Error: Semaphore Timeout
    at Object.new_error_code (/Users/shiradymnik/SourceCode/noobaa-core/src/util/error_utils.js:16:26)
    at Semaphore._on_timeout (/Users/shiradymnik/SourceCode/noobaa-core/src/util/semaphore.js:256:37)
    at Timeout.<anonymous> (/Users/shiradymnik/SourceCode/noobaa-core/src/util/semaphore.js:252:53)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7)
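For illustration only (this is not NooBaa's semaphore.js, just a self-contained sketch of the principle): with a sem_size of 1, an 8 MiB buffer request can never be granted, so the waiter times out and the S3 layer surfaces it as SlowDown.

// Hypothetical illustration: an acquire-with-timeout that can never be granted
// because the requested units exceed the semaphore capacity, so it rejects on timeout.
function acquire(capacity, requested_units, timeout_ms) {
    return new Promise((resolve, reject) => {
        if (requested_units <= capacity) return resolve();
        setTimeout(() => reject(new Error('Semaphore Timeout')), timeout_ms);
    });
}

acquire(/* sem_size */ 1, /* NSFS_BUF_SIZE_L */ 8 * 1024 * 1024, 1000)
    .catch(err => console.error('surfaced as S3 SlowDown by s3_rest:', err.message));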

@romayalon (Contributor)

@shirady What is the next step for this issue?


shirady commented Dec 8, 2024

@romayalon, I need to check whether we can improve the current behavior of the buffer pools.
I asked a question about it in the original buffer pool issue, and if we come up with ideas we might have a discussion about it.


rkomandu commented Dec 16, 2024

@shirady, I ran into this problem (occurrences in noobaa.log) when running Warp and performing suspend/resume (HA functionality).

The noobaa stage rpm that Romy generated for another issue (8577), noobaa-core-5.17.1-20241211.el9.ppc64le, is used.
rpm location: https://noobaa-core-rpms.s3.us-east-1.amazonaws.com/noobaa-core-5.17.1-20241211-stage_5_17_2.el9.ppc64le.rpm

./warp mixed --insecure --duration 120m --host <ces-ip>:6443 --access-key X0oJKaTmq5I2hGUGhqzd --secret-key uyYY9OA5syqZ+LsIAiv4f+Dn2mcqlr4kLzpvJ/sQ --tls --obj.size 128M --disable-multipart --bucket newbucket-ha-di-efix-warp-sus-res-16dec   2>&1| tee /tmp/newbucket-ha-di-efix-warp-sus-res-16dec.log
warp: <ERROR> download error: read tcp 9.42.124.184:58094->9.42.93.100:6443: read: connection reset by peer
warp: <ERROR> download error: read tcp 9.42.124.184:58100->9.42.93.100:6443: read: connection reset by peer

Dec 16 00:14:28 node-gui1 [3201480]: [nsfs/3201480]    [L1] core.util.buffer_utils:: BufferPool.get_buffer: sem value 3816816640 waiting_value 0 buffers length 1
Dec 16 00:14:34 node-gui1 [3201480]: [nsfs/3201480]    [L1] core.server.system_services.stats_aggregator:: standalon_update_nsfs_stats. nsfs_stats = { fs_workers_stats: { fileread: { count: 8, error_count: 0, min_time: 682, max_time: 936, sum_time: 6137 } }, io_stats: { read_count: 0, read_bytes: 67108864 } }
Dec 16 00:14:38 node-gui1 [3201481]: [nsfs/3201481]    [L0] core.sdk.endpoint_stats_collector:: bucket stats - newbucket-ha-di-efix-warp-sus-res-16dec application/octet-stream : { read_count: 2 }
Dec 16 00:14:38 node-gui1 [3201481]: [nsfs/3201481]    [L0] core.sdk.endpoint_stats_collector:: namespace stats - undefined : { read_count: 2, read_bytes: 256000000 }
Dec 16 00:14:40 node-gui1 [3201480]: [nsfs/3201480]    [L0] core.sdk.endpoint_stats_collector:: bucket stats - newbucket-ha-di-efix-warp-sus-res-16dec application/octet-stream : { read_count: 2 }
Dec 16 00:14:40 node-gui1 [3201480]: [nsfs/3201480]    [L0] core.sdk.endpoint_stats_collector:: namespace stats - undefined : { read_count: 2, read_bytes: 256000000 }


Dec 16 00:16:27 node-gui1 [3201481]: [nsfs/3201481] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
Dec 16 00:16:28 gpfs-p10-gui1 [3201480]: [nsfs/3201480] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
Dec 16 00:16:28 node-gui1 [3201481]: [nsfs/3201481] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
Dec 16 00:16:28 node-gui1 [3201480]: [nsfs/3201480] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)


Copied the noobaa.log file of the gui1 node into the Box folder: https://ibm.box.com/s/47osyg44y7ph2o31bd8gj1vzt8ok1xc2


shirady commented Dec 17, 2024

Hi,
@rkomandu thanks for the information.
I noticed in the logs that we have these printings:

Dec 16 00:51:02 gpfs-p10-gui1 [3201481]: [nsfs/3201481]    [L1] core.util.buffer_utils:: BufferPool.get_buffer: sem value 3674210304 waiting_value 0 buffers length 30

The max buffers length I noticed is 30.
We know that for a 128 MiB object every iteration reads 8 MiB and uses the NSFS_BUF_SIZE_L buffer size, which means that at this point it uses 8 * 30 = 240 MiB.

Could you please share the resources for the noobaa service on this node? (systemctl status noobaa)
I want to see if it might be related in some way.

@rkomandu (Collaborator, Author)

@shirady
I have restarted the service multiple times while testing another defect (HA feature, suspend/resume), so capturing the state at the Dec 16 00:51 timestamp is not possible now. The noobaa resources used are shown below.


2024-12-16_03:05:39.371-0500: [I] mmcesop: isServiceRunning: rc=3, response=Redirecting to /bin/systemctl status noobaa.service
○ noobaa.service - The NooBaa service.
     Loaded: loaded (/usr/lib/systemd/system/noobaa.service; enabled; preset: disabled)
     Active: inactive (dead) since Mon 2024-12-16 03:05:35 EST; 3s ago
   Duration: 2d 21h 7min 25.210s
    Process: 3200676 ExecStart=/usr/local/noobaa-core/bin/node /usr/local/noobaa-core/src/cmd/nsfs.js (code=killed, signal=TERM)
    Process: 3980362 ExecStop=/bin/kill $MAINPID (code=exited, status=0/SUCCESS)
   Main PID: 3200676 (code=killed, signal=TERM)
        CPU: 6h 58min 23.820s

This indicates the NooBaa service had been running for 2d 21h 7min before it was stopped.

In noobaa.log
Dec 16 00:50:58 node-gui1 node[3769297]: [/3769297]    [L2] core.util.os_utils:: promise exec systemctl status noobaa | grep Memory  false
..
Dec 16 00:51:11 node-gui1 node[3769733]: [/3769733]    [L2] core.util.os_utils:: promise exec systemctl status noobaa | grep Memory  false


shirady commented Dec 17, 2024

@rkomandu,

  • Could you edit your comment and add the "Memory" printing as well?
  • I tried to search for the maximum buffer length value in the logs that I received. If there is a timestamp you suggest focusing on, please let me know.

I will update you that I plan to add log printing related to the duration of the actions around the flow of the timeout message.
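For example, a minimal sketch of the kind of timing I have in mind (a hypothetical wrapper for illustration, not the actual code change): measure how long the pool takes to hand out a buffer and warn when it crosses a threshold.

// Hypothetical timing wrapper around an async buffer acquisition (illustration only).
async function timed_get_buffer(buffers_pool, warn_ms = 1000) {
    const start = Date.now();
    const res = await buffers_pool.get_buffer(); // assumed async acquire, as in BuffersPool.get_buffer above
    const waited = Date.now() - start;
    if (waited > warn_ms) {
        console.warn(`get_buffer waited ${waited} ms before a buffer was available`);
    }
    return res;
}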
My current assumptions:

  1. It might be related to the buffer allocation. In the example above, we can see that there was the printing of

Dec-4 16:30:52.844 [Upgrade/37314] [LOG] CONSOLE:: SDSD in buffer allocation

and after 30 milliseconds

config.NSFS_BUF_POOL_WARNING_TIMEOUT = 30; //SDSD

there is a printing of

Dec-4 16:30:52.875 [Upgrade/37314] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer

  2. It is something that was slow during the GET (reading the object), although I tried to search for the string "took too long" (grep "took too long") and the maximum I noticed was "687.859 ms" in "FileFsync" (probably related to PUT) and not higher than 200 ms for "FileRead".
    • Reminder: the timeout is 2 minutes, so currently I don't have evidence in the logs for this assumption.
      config.NSFS_BUF_POOL_WARNING_TIMEOUT = 2 * 60 * 1000;

Note: the new logs you attached are using RPM:
https://noobaa-core-rpms.s3.us-east-1.amazonaws.com/noobaa-core-5.17.1-20241211-stage_5_17_2.el9.ppc64le.rpm
And they include PR #8521.

@rkomandu (Collaborator, Author)

@shirady, the Memory grep was performed in the noobaa log; there was no other way to get the memory at the timestamp when the buffers length value was 30.
