Assertion fails in uv__io_poll() in aix.c after IBM XL C++ Runtime upgrade #3465

Closed
laurencehook opened this issue Feb 8, 2022 · 13 comments

@laurencehook

  • Version: 1.42.0
  • Platform: AIX 7200-05-03-2148

The IBM App Connect Enterprise v11/12 integration product on AIX is built using the IBM XL C++ v16.1.0.3 compiler.
It embeds Node.js v14.x.x (most recently 14.18.1 and 14.18.3), which has been built using the GCC v6 compiler.
Therefore, running IBM App Connect Enterprise on AIX requires:

  • XLC 16.1.0.3 runtime or above
  • gcc runtime at v6 or above

We have found that if the IBM XL C++ v17.1 runtime libraries are present on the system, IBM App Connect Enterprise consistently fails to start and the following assertion error is reported to stderr:

Assertion failed: __EX, file ../deps/uv/src/unix/aix.c, line 297

which maps to the following line in method uv__io_poll():

assert((unsigned) pc.fd < loop->nwatchers);

If I simply add a printf statement to report these values, we see something like:
pc.fd, nwatchers: 1532713819, 30

If I add many more printf statements to try to debug it, the problem does not occur.

At the time of the assertion, two threads are in method uv__io_poll():

__assert_c99 [/usr/lib/libpthreads.a(shr_xpg5_64.o)]
uv__io_poll [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
uv_run [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
_ZN4node16NodeMainInstance3RunEv [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
_ZN4node5StartEiPPc [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
_ZN13NodejsManager19startAndMonitorNodeEv [bipbroker]

and

uv__io_poll [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
uv_run [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
ZZN4node23WorkerThreadsTaskRunner20DelayedTaskScheduler5StartEvENUlPvE_4_FUNES2 [/var/opt/ace-11.0.0.15/server/lib/libnode.83.a]
_pthread_body [/usr/lib/libpthreads.a(shr_xpg5_64.o)]

The current workaround is to uninstall the IBM XL C++ v17.1 runtime libraries and use the v16.1.0.8 runtime libraries. The issue was raised with the IBM XL C++ compiler team, but they first wanted the owners of aix.c to investigate the assertion.

@richardlau
Contributor

cc @libuv/aix

@richardlau richardlau added the aix label Feb 8, 2022
@vtjnash
Member

vtjnash commented Feb 8, 2022

Looks like a Node.js bug. libuv does not use (or support) threads in that manner.

@vtjnash
Member

vtjnash commented Feb 8, 2022

Can you run with any sanitizer tools, such as ASAN or TSAN?

@mhdawson
Contributor

@laurencehook is there a way to recreate this using the AIX binaries that are available on https://nodejs.org/en/download/ ?

@mhdawson
Contributor

Or possibly a simplified C program using libuv that also recreates it?

@mhdawson
Contributor

The assert at line 297 has been that way for 8 years, so it's nothing new/recent in the code.

The structure used in the assert is struct poll_ctl pc;

Doc for that structure in AIX - https://www.ibm.com/docs/en/aix/7.2?topic=files-pollseth-file

@laurencehook
Author

It was suggested that we try rebuilding Node.js with a more recent GCC compiler version. This recommendation happened to coincide with discovering that a local AIX build environment for Node.js had been "broken" since upgrading the AIX level to 7.2 TL5 SP3, due to the following issue:

https://community.ibm.com/community/user/power/communities/community-home/digestviewer/viewthread?GroupId=6211&MessageKey=7af2f62f-f23f-40cb-aa87-048625dad735&CommunityKey=10c1d831-47ee-4d92-a138-b03f7896f7c9&tab=digestviewer

Upgrading to GCC v8.3.0.6 resolved the build issue due to the struct sigset_t conflict AND appears to have resolved the runtime assertion failure in uv__io_poll().

More testing will be needed once we have upgraded our product build environment for AIX with this later GCC compiler, but it seems promising.

@mhdawson
Contributor

@laurencehook thanks for the update and good to hear.

@laurencehook
Author

Unfortunately, after upgrading our product AIX build machines to GCC v8.3.0.6, the assert failure in uv__io_poll() still occurs. The local sandbox build from last month seemed ok, but the problem still exists with the output from our product build machine after the GCC upgrade.

@gireeshpunathil
Contributor

Had a quick look. 1532713819 == 0x5B5B5B5B.

This pattern is typically an eye-catcher for an invalid block. That means pc was either not allocated, or was overwritten.

Can you run with export MALLOCDEBUG=catch_overflow on the terminal prior to the run and see if we get any useful info?

@laurencehook
Author

Thanks for the suggestion Gireesh. I did try with 'MALLOCDEBUG=catch_overflow,validate_ptrs' set, and it did appear to blow up consistently when trying to use the 'events' ptr.

Further debugging, printing hex values for some of the variables, appeared to show that the pollfd 'events' array returned by pollset_poll() was fine: it returns one valid entry. However, the 'for' loop counter variable 'i' becomes corrupted, and 'i' is used to advance the 'events' ptr to the next array entry. So pollset_poll() returns 1 event and nfds = 1. The first time through the loop everything appears fine, but when 'i' is next tested it holds a garbage value that evaluates to a large negative number. That is still less than nfds (= 1), so the loop continues and uses 'i' as the next offset into 'events'. That is when the assert typically fails.

It was suggested that we disable gcc compiler optimization. With optimization disabled, or at optimization level 1, our application starts ok.
With optimization set to level 2 or 3, it fails to start. If we compile just the aix.c source file at optimization level 1, and the remaining code at optimization level 3, the application seems to start ok.
This is our best workaround at the moment.

@stale

stale bot commented Jun 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@bnoordhuis
Member

It sounds like we're dealing with a compiler bug here and not something libuv can fix so I'll go ahead and close this out. Let me know if there is reason to reopen.

@bnoordhuis bnoordhuis closed this as not planned Nov 12, 2022