HWLOC 2.4 and HWLOC 2.5 are binary incompatible #477

Open
ivankochin opened this issue Jul 2, 2021 · 14 comments

@ivankochin

What version of hwloc are you using?

  • HWLOC 2.4
  • HWLOC 2.5

Which operating system and hardware are you running on?

Windows

Details of the problem

HWLOC 2.5 breaks binary backward compatibility with HWLOC 2.4.

To reproduce the issue, compile this simple HWLOC example against HWLOC 2.4:

#include <hwloc.h>

int main() {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
}

Then run it with the DLL from HWLOC 2.5.

You will see the following error:
[Image: screenshot of the error message]

@bgoglin
Contributor

bgoglin commented Jul 2, 2021

We had one possibly related report last year with hwloc 2.2 but we never understood what was going on. https://www.mail-archive.com/hwloc-users@lists.open-mpi.org/msg01565.html

There shouldn't be any ABI break above since topology is just a pointer that is filled by init() and then used by load(). Things would break if init() from one hwloc version was used with load() from another version (but even this should work with hwloc 2.4 and 2.5).

Can you check the return value of init() in case it failed? The wrong location reported, 0x000000000000000d0, makes me think that the topology pointer is NULL.
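The two points above (the handle is only a pointer filled by init(), and load() must not be called if init() failed) can be sketched as follows. This is a minimal illustration only: the `demo_*` names are hypothetical stand-ins, not hwloc functions, and the null check inside `demo_topology_load` is an assumption of the sketch; a real library may simply crash on a null handle.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical opaque handle; the caller only ever holds a pointer to it,
// so the caller's binary does not depend on the struct layout.
struct demo_topology { int loaded; };
typedef demo_topology* demo_topology_t;

// Stand-in for init(): fills the caller's pointer, returns 0 on success.
int demo_topology_init(demo_topology_t* out) {
    if (out == nullptr) return -1;
    *out = new demo_topology{0};
    return 0;
}

// Stand-in for load(): uses the handle produced by init().
int demo_topology_load(demo_topology_t t) {
    if (t == nullptr) return -1;  // assumption of this sketch only
    t->loaded = 1;
    return 0;
}

void demo_topology_destroy(demo_topology_t t) { delete t; }

// Caller-side pattern: check init() before passing the handle on.
int init_and_load() {
    demo_topology_t topology = nullptr;
    if (demo_topology_init(&topology) != 0) {
        std::fprintf(stderr, "init failed, not calling load\n");
        return -1;
    }
    assert(topology != nullptr);
    int rc = demo_topology_load(topology);
    demo_topology_destroy(topology);
    return rc;
}
```

With the real library, the equivalent check would be on the return value of hwloc_topology_init() before calling hwloc_topology_load().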

@ivankochin
Author

I have updated the example as follows:

#include <hwloc.h>
#include <iostream>

int main() {
    hwloc_topology_t topology{ nullptr };
    auto init_result = hwloc_topology_init(&topology);
    std::cout << "init result: " << init_result << " topology now is: " << topology << std::endl;
    auto load_result = hwloc_topology_load(topology);
    std::cout << "load result: " << load_result << " topology now is: " << topology << std::endl;
}

Now I get a new error, this time from hwloc_topology_init():
[Image: screenshot of the error message]

@ivankochin
Author

ivankochin commented Jul 2, 2021

I also ran the same experiments with the following HWLOC versions: 2.0, 2.1, 2.2, 2.3, 2.4, 2.5

It seems that all of these versions are incompatible with each other. However, the issue reproduces only on Windows; on Linux everything seems to work fine (although I did not check all versions there).

@bgoglin
Contributor

bgoglin commented Jul 5, 2021

I can reproduce the issue when building a program inside an MSVC project using hwloc 2.5 headers/libs [1], and then running it from either the hwloc 2.4 zipball bin directory (fails) or 2.5 (works). When building with cygwin, I don't see any issue. Unfortunately, I don't get anything useful from dumpbin, ldd, objdump, or nm on the binary, hence I can't check whether msvc hardwired some symbol address, etc.

I am not sure I am doing this as expected. Here's my config:
I put this in the C/C++ command-line config option:
/I"C:\Users\goglin\Downloads\hwloc-win64-build-2.5.0\include"
and this in the Linker command-line option:
"C:\Users\goglin\Downloads\hwloc-win64-build-2.5.0\lib\libhwloc.lib" /LIBPATH:"C:\Users\goglin\Downloads\hwloc-win64-build-2.5.0\bin"
Then I copy one libhwloc-15.dll inside the Project exe file directory, and launch the program from MSVC.

@ivankochin
Author

The case you describe above is about forward binary compatibility, but my case is about backward binary compatibility: the application should be compiled with HWLOC 2.4 and then run with the HWLOC 2.5 DLL.

I do it as follows:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Professional\VC\Auxiliary\Build\vcvarsall.bat" amd64
set INCLUDE=<path_to_hwloc_2_4>\include;%INCLUDE%
set LIB=<path_to_hwloc_2_4>\lib;%LIB%
cl K:\kivan\cpp\hwloc_2_4_2_5_incompatibility.cpp libhwloc.lib /EHsc /DEBUG /Zi /Od
copy <path_to_hwloc_2_5_binaries>\libhwloc-15.dll .
hwloc_2_4_2_5_incompatibility.exe

Also, this looks like an entry point mapping issue. To demonstrate it, consider the following example:

    std::cout << "version is: " << hwloc_get_api_version() << std::endl;
    hwloc_topology_t topology{ nullptr };
    auto init_result = hwloc_topology_init(&topology);
    std::cout << "init result: " << init_result << " topology now is: " << topology << std::endl;
    auto load_result = hwloc_topology_load(topology);
    std::cout << "load result: " << load_result << " topology now is: " << topology << std::endl;

When it runs with matching DLL and header versions, the return code is 0 and the output is "version is: 132096".

The disassembly of the hwloc_get_api_version() function in that case:

000000006EC45C50  mov         eax,20400h  
000000006EC45C55  ret  
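For reference, the printed value 132096 and the constant 20400h in the disassembly are the same number. A small sketch of the arithmetic, assuming the API version is packed as (X<<16)+(Y<<8)+Z for release X.Y.Z; the helper names below are hypothetical and not part of hwloc:

```cpp
#include <cassert>

// Unpack a version number stored as (major << 16) + (minor << 8) + patch.
constexpr unsigned version_major(unsigned v) { return v >> 16; }
constexpr unsigned version_minor(unsigned v) { return (v >> 8) & 0xffu; }

// 132096 in decimal is 0x20400 in hexadecimal: 2 * 65536 + 4 * 256.
static_assert(132096u == 0x20400u, "same value in decimal and hex");
static_assert(version_major(0x20400u) == 2u, "major field is 2");
static_assert(version_minor(0x20400u) == 4u, "minor field is 4");

// Hypothetical helper: do two packed versions share the same major number?
constexpr bool same_major(unsigned a, unsigned b) {
    return version_major(a) == version_major(b);
}

// A program could run a check like this at startup, comparing the value
// compiled in from the headers with the value reported by the library.
void require_same_major(unsigned built_against, unsigned running_with) {
    assert(same_major(built_against, running_with) &&
           "header/library major version mismatch");
}
```

Such a startup check only helps when the version call itself resolves correctly, which is not what happens in the failing case in this thread.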

But if I use the DLL from HWLOC 2.5, it crashes during the hwloc_get_api_version() call. The disassembly of the function in that case:

000000006EC48010  push        r15  
000000006EC48012  push        r14  
000000006EC48014  push        r13  
000000006EC48016  push        r12  
000000006EC48018  push        rbx  
000000006EC48019  sub         rsp,20h  
000000006EC4801D  xor         r8d,r8d  
000000006EC48020  xor         r12d,r12d  
000000006EC48023  mov         r14,rdx  
000000006EC48026  xor         edx,edx  
000000006EC48028  mov         r15,rcx  
000000006EC4802B  call        000000006EC4C0F0

I don't know why it behaves this way, but the disassembly is different, so it seems we are simply calling the wrong address. My assumption is therefore that the root cause is an incorrect entry point mapping.
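To make the "wrong address" hypothesis concrete, here is a toy model. It is not a diagnosis: the thread does not establish the mechanism, and binding by table position is only one way a call could land on the wrong function. The export tables are made up, and `hypothetical_new_function` is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy "export table": slot i holds the name of the i-th exported function.
using ExportTable = std::vector<std::string>;

// Made-up table for the library the program was linked against.
ExportTable old_table() {
    return {"hwloc_get_api_version", "hwloc_topology_init", "hwloc_topology_load"};
}

// Made-up table for a newer library: one extra entry shifts every slot.
ExportTable new_table() {
    return {"hypothetical_new_function", "hwloc_get_api_version",
            "hwloc_topology_init", "hwloc_topology_load"};
}

// Slot of a name in the table seen at link time.
std::size_t index_at_link_time(const ExportTable& dll, const std::string& name) {
    for (std::size_t i = 0; i < dll.size(); ++i)
        if (dll[i] == name) return i;
    return dll.size();
}

// Binding by position: the caller only remembers a slot number.
std::string call_by_index(const ExportTable& dll, std::size_t index) {
    return index < dll.size() ? dll[index] : "<out of range>";
}

// Binding by name: the caller searches the table it is given at run time.
std::string call_by_name(const ExportTable& dll, const std::string& name) {
    for (const auto& entry : dll)
        if (entry == name) return entry;
    return "<not found>";
}

void demonstrate() {
    const std::size_t slot = index_at_link_time(old_table(), "hwloc_get_api_version");
    // Same library at link time and run time: the slot is still correct.
    assert(call_by_index(old_table(), slot) == "hwloc_get_api_version");
    // Newer library at run time: the remembered slot now holds another function.
    assert(call_by_index(new_table(), slot) == "hypothetical_new_function");
    // A lookup by name is unaffected by the shift.
    assert(call_by_name(new_table(), "hwloc_get_api_version") == "hwloc_get_api_version");
}
```

In this model the position-based call reaches a different function once the table changes, which would match the symptom of a completely different function body at the called address.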

@bgoglin
Contributor

bgoglin commented Jul 7, 2021

It looks like cygwin fails too if I link with libhwloc.lib explicitly instead of passing -L../lib -lhwloc to gcc. Do you happen to know what libhwloc.lib is compared to the DLL?

@ivankochin
Author

No, I don't. Could you please clarify the question?

@bgoglin
Contributor

bgoglin commented Jul 27, 2021

It looks like linking against libhwloc.lib raises the issue (what MSVC does, and cygwin can be forced to do), while linking against libhwloc.dll doesn't (cygwin uses that one by default). But I don't know what libhwloc.lib is, at least compared to libhwloc.dll hence I don't know why linking against the former would cause an error and not the latter.

@ivankochin
Author

I hope I understand your question correctly, so I will try to provide some helpful information. A .lib file on Windows can be one of two things: a static library, or an import library generated when the DLL is built, which only contains the interfaces needed to load the DLL at run time.

HWLOC is a dynamic library, so in our case libhwloc.lib should be an import library for loading libhwloc.dll at run time. That appears to be the case, since libhwloc.lib is only 70 KiB on Windows while the full DLL is 2 MiB.

But libhwloc.lib seems to be broken, because libhwloc.dll works incorrectly after it is loaded.

I found some of this information here. I hope I am not giving you wrong information.

@ivankochin
Author

Hello, I have news regarding this problem. It seems that the root cause is in the build system used to build the HWLOC packages on Windows. As I understand it, the Windows packages are built using some Linux-like environment such as MinGW or Cygwin, am I right?

I make this assumption because I built several HWLOC versions using hwloc.sln from the contrib/windows folder, and in that case there is no compatibility break. Could you please check this workaround on your side to confirm that it works?

@bgoglin
Contributor

bgoglin commented Dec 3, 2021

Yes prebuilt zipballs are built using MSYS2 and MinGW (we support cygwin but it's only used in the CI). But this may change in the future because the recently-added CMake support makes things muuuuuch easier (and it generates a hwloc.sln that seems much better, at least not outdated). I'll see if I can test the compatibility across MSVC-built libs.

@bgoglin
Contributor

bgoglin commented Dec 3, 2021

I can confirm I still get the issue with our prebuilt 2.4/2.5 libraries but no incompatibility between 2.4/2.5 built from hwloc.sln

@ivankochin
Author

Did I understand correctly that you are planning to change the way of building the packages to solve this issue?

@bgoglin
Contributor

bgoglin commented Dec 3, 2021

I don't know yet. In the past, building was MSYS/MinGW only. This environment isn't easy to install, and building is very long. That's why we provided pre-built binaries in ZIPs. Then cygwin support was added and made things slightly easier, but cygwin has some other drawbacks. CMake clearly changes things. Building is easy and fast, and many Windows developers already have CMake installed.

Things that may happen in the future, from most likely to least likely:

  1. Drop the MSVC solution under contrib/windows and tell people to use CMake to generate it. This is likely to happen because contrib/windows is sort of obsolete; we often got complaints that it's not very good/flexible.
  2. Stop shipping binary ZIPs and tell people to build with CMake. Depends on whether users want them. We could also look at distributing binaries built with CMake.
  3. Drop support for MinGW and/or Cygwin. They are easy to maintain so far.
