Segfaults with 3.1.0rc2 #2458

Closed
DanielFEvans opened this issue Apr 30, 2020 · 14 comments
Labels
awaiting_feedback Awaiting feedback from reporter

Comments

@DanielFEvans
Contributor

Having compiled GDAL 3.1.0rc2 into a Python wheel, and attempted to run our software tests, I am seeing occasional (and, annoyingly, unpredictable) segmentation faults.

The backtrace output from one such failure is included below; the trace always seems to contain a single mention of libproj followed by a list of GDAL calls.

Since the trace points to PROJ, I've tried building with PROJ v6.2.1 (the version used for my previous GDAL 3.0.4 build) and v6.3.2, but both show the issue.

This could indicate a compilation issue on my side, but I'm wondering if there has been a change in the GDAL interface to PROJ that might be causing problems?

*** Error in `/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/bin/python3.6': free(): invalid pointer: 0x00007f0e600db440 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x7f0efa50a299]
/lib64/libstdc++.so.6(_ZNSs6assignERKSs+0x9e)[0x7f0ebddd907e]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/./libproj-4c5f0a23.so.15.2.1(proj_context_get_database_path+0x7e)[0x7f0eb750cace]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(+0x4e694f)[0x7f0eb926394f]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(_Z20OSRGetProjTLSContextv+0x9)[0x7f0eb9263b29]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(_ZNK19OGRSpatialReference11exportToWktEPPcPKPKc+0x6a)[0x7f0eb92b996a]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(+0x5c066d)[0x7f0eb933d66d]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(_Z20GDALCloneTransformerPv+0x62)[0x7f0eb9342c22]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(+0x5c5d8e)[0x7f0eb9342d8e]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(+0x5c1adc)[0x7f0eb933eadc]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(_Z20GDALCloneTransformerPv+0x44)[0x7f0eb9342c04]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(+0x5ca2e9)[0x7f0eb93472e9]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(_ZN19CPLWorkerThreadPool20WorkerThreadFunctionEPv+0x19)[0x7f0eb9d1bba9]
/home/jbanorthwest.co.uk/danielevans/venvs/farmcat3/lib64/python3.6/site-packages/osgeo/../GDAL.libs/libgdal-34960b89.so.26.1.0(+0xf36e2a)[0x7f0eb9cb3e2a]
/lib64/libpthread.so.0(+0x7ea5)[0x7f0efaf67ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f0efa5878dd]

Operating system

Scientific Linux 7.6

GDAL version and provenance

3.1.0rc2, compiled locally via scripts at https://github.com/DanielFEvans/gdalmanylinux/tree/gdal_3.1.0

@rouault
Member

rouault commented Apr 30, 2020

but I'm wondering if there has been a change in the GDAL interface to PROJ that might be causing problems?

There have been a few. 095bc42 comes to mind, but there are probably others.
As your build uses an old Linux toolchain, I'm wondering whether there might be an issue with TLS on it. Or perhaps this is a bug on the GDAL side. But without a way to reproduce it, it's hard to tell.

@DanielFEvans
Contributor Author

I'll keep poking around to try to pin down any pattern in which tests are crashing.

@DanielFEvans
Contributor Author

DanielFEvans commented May 1, 2020

I reverted to the version of the build script which gave me a working 3.0.4 build, and only changed the GDAL version to 3.1.0rc2. I've not experienced any error since going back to that earlier script.

That suggests to me that it's the version of a specific dependency, or the presence/absence of a configuration flag that causes the issue - and (hopefully) not a general issue with GDAL. I'll keep working through the changes made, though it takes a while to rebuild each time.

@rouault rouault added the awaiting_feedback Awaiting feedback from reporter label May 9, 2020
@rouault
Member

rouault commented May 9, 2020

Any update?

@DanielFEvans
Contributor Author

I have still very occasionally seen segfaults, but have had no luck tracking them down. They seem less common with the original build dependencies, but are still there.

I'm also entertaining the possibility that they're in our Python code - as you know, unexpected segfaults are a 'feature' of the Python bindings if explicit references aren't kept to every object in a chain. It could be that GDAL 3.1.0 is exposing some issues/race conditions that already existed, but weren't causing problems.
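
For anyone unfamiliar with that 'feature', here is a minimal illustration of the reference-keeping pitfall being referred to. It is a long-standing gotcha of the GDAL Python bindings rather than anything specific to this report, and "example.tif" is a placeholder:

```python
from osgeo import gdal

# Risky: no reference to the Dataset is kept, so it may be garbage-collected
# while the Band is still in use; touching the Band afterwards can segfault.
band = gdal.Open("example.tif").GetRasterBand(1)

# Safer: keep the parent object alive for as long as its children are used.
ds = gdal.Open("example.tif")
band = ds.GetRasterBand(1)
data = band.ReadAsArray()
band = None
ds = None
```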

@rouault
Member

rouault commented May 11, 2020

It could be that GDAL 3.1.0 is exposing some issues/race conditions that already existed

I doubt it.

From your stack trace, is it possible that your Python code uses Python multiprocessing forking? The crash likely occurs in the proj_context_get_database_path() call added in 095bc42 to fix #2221. Could you try to revert it and see if it makes a difference?
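
For illustration, a hypothetical sketch (not a confirmed reproducer) of the pattern being asked about: fork-based multiprocessing children driving GDAL's threaded warper, which is the code path visible in the backtrace (CPLWorkerThreadPool, GDALCloneTransformer, exportToWkt, OSRGetProjTLSContext). File names and the EPSG code are placeholders.

```python
import multiprocessing as mp
from osgeo import gdal

gdal.UseExceptions()

def warp_one(args):
    idx, path = args
    # multithread / NUM_THREADS make the warper clone transformers in worker
    # threads, each of which needs its own per-thread PROJ context.
    out = gdal.Warp("/vsimem/out_%d.tif" % idx, path,
                    dstSRS="EPSG:3857",
                    multithread=True,
                    warpOptions=["NUM_THREADS=4"])
    return out is not None

if __name__ == "__main__":
    gdal.Open("input_0.tif")  # touch GDAL/PROJ state in the parent
    with mp.get_context("fork").Pool(2) as pool:
        print(pool.map(warp_one, list(enumerate(["input_0.tif", "input_1.tif"]))))
```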

@DanielFEvans
Contributor Author

Yes, we do use multiprocessing - I'll try reverting that change and see what happens.

If it matters at all, the method we use to create child processes is spawning, rather than forking. We've found that libraries such as rtree/libspatialindex don't play well with forking, due to the shared references that result.
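
For reference, a minimal sketch of the spawn-based pattern described above; the worker body and the EPSG codes are illustrative only:

```python
import multiprocessing as mp
from osgeo import osr

def to_wkt(epsg):
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(epsg)
    return srs.ExportToWkt()

if __name__ == "__main__":
    # Spawned children start fresh interpreters, so no GDAL/PROJ state is
    # inherited from the parent, unlike with fork().
    mp.set_start_method("spawn")
    with mp.Pool(4) as pool:
        print(pool.map(to_wkt, [4326, 3857, 27700]))
```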

@rouault
Member

rouault commented May 11, 2020

to create child processes is spawning, rather than forking

Spawning should be fine. But if the code path I underlined above is triggered, it should be following a fork(), not a spawn.

@DanielFEvans
Contributor Author

As yet, I've not had much luck with the revert. Simply doing git revert 095bc4 and compiling resulted in a version that exits Python with an unexplained RuntimeError every time a spatial reference is used or handled (no error code or message is shown). I suspect that means there are other, subsequent code changes in v3.1.0 that also need reverting or editing, but I don't see anything obvious in the files changed (not that I'm too confident I'd spot 'obvious' C++ issues).

I have tried recompiling the SWIG Python bindings (using SWIG 4.0.1) as part of the compilation - i.e. make veryclean and make generate - but that didn't help.

I've pushed the code as it is after reversion to: https://github.com/DanielFEvans/gdal/tree/revert_fork_change

Any pointers on where I might need to go hunting for the problem?

@rouault
Member

rouault commented May 14, 2020

Normally just reverting should be fine. This should have no impact on the Python bindings, and you wouldn't need to regenerate them. Are you sure your build works fine (basic use of the Python bindings) on the stock 3.1 release?

@DanielFEvans
Contributor Author

Are you sure your build works fine (basic use of the Python bindings) on the stock 3.1 release?

Always good to ask those questions. It turns out I'd not updated a couple of paths to account for building from a git clone, rather than the release source archive.

So far, the indication is that reverting that change has stopped the segfaults - running our software test suite repeatedly for an hour worked, while trying the same with the previous build resulted in a segfault after about half an hour. However, since the problems are intermittent, I'm not completely convinced yet, and will keep an eye out for any further issues.

@rouault
Member

rouault commented May 18, 2020

Could you possibly do a CXXFLAGS="-g -fsanitize=address" build of PROJ and GDAL with 095bc42 applied, so there are better diagnostics when it crashes?

@DanielFEvans
Contributor Author

FYI, it seems that this problem no longer exists as of GDAL 3.1.2.

Apologies that I never got back to you on -fsanitize=address - I never worked out how to get a fully working build using it.

@rouault
Member

rouault commented Jul 27, 2020

I suspect this was fixed per #2746

@rouault rouault closed this as completed Jul 27, 2020