Generated IR is not portable #1104

Open
ptillet opened this issue Jan 27, 2023 · 13 comments
@ptillet
Collaborator

ptillet commented Jan 27, 2023

The generated IR embeds an absolute path to the libdevice bitcode. This is not portable. I'm not sure how to fix this; maybe the libdevice location could be passed as an argument at compile time rather than baked into the IR?

ptillet added the bug label Jan 27, 2023
ptillet changed the title from "Generated TTGIR is not portable" to "Generated IR is not portable" Jan 27, 2023
@Jokeren
Contributor

Jokeren commented Jan 27, 2023

Yeah, that sounds like a good solution.

@yuguo68
Contributor

yuguo68 commented Feb 6, 2023

We hit a libdevice.10.bc not found compile/runtime error from https://github.com/openai/triton/blob/972b761390c45bcb0d0091741067603b2ab2ed42/lib/Target/LLVMIR/LLVMIRTranslation.cpp#L126-L133.

Can we mitigate this via an env variable for the libdevice.10.bc path?

@Jokeren
Contributor

Jokeren commented Feb 6, 2023

Just curious, if this path does not exist, what's the actual path you get?

@ptillet
Collaborator Author

ptillet commented Feb 6, 2023

I'm definitely surprised. Unless you share the IR (but not the cubin) between different machines, you shouldn't run into this issue.

@yuguo68
Contributor

yuguo68 commented Feb 6, 2023

I believe it is related to our build system and possibly remote execution. The path is correct on the local machine, but the file fails to load at runtime.

@ptillet
Collaborator Author

ptillet commented Feb 6, 2023

This is strange indeed. Does that mean that the TTGIR is generated on one machine, but then converted to PTX on another machine? Can I see the full stack trace of the error?

@yuguo68
Contributor

yuguo68 commented Feb 6, 2023

Unfortunately, the stack trace contains only one line, complaining about a missing libdevice.10.bc. _build(name, src, srcdir) is executed remotely. https://github.com/openai/triton/blob/a13ddf08e2984d3d734f242e4bea589c3fbbbca4/python/triton/compiler.py#L1366

And using an env variable for the path does work.

@yuguo68
Contributor

yuguo68 commented Feb 6, 2023

The error should come from kernel compilation. I tried it again; the full error output is:

Failed to load triton/python/triton/language/libdevice.10.bcTranslate to LLVM IR failedLLVM ERROR: Failed to translate TritonGPU to LLVM IR.

It looks like there is no separator between the individual messages.

@ptillet
Collaborator Author

ptillet commented Feb 6, 2023

I don't think an environment variable would be a good workaround. The fundamental issue here seems to be that the path is a relative path, which is bound to create problems.

Maybe something better would be for the frontend to copy libdevice.10.bc into the cache directory and have the IR contain the absolute path to libdevice.10.bc in the cache. Still not ideal, but that should be more flexible than the current approach without requiring the use of an environment variable.
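The cache-directory idea above can be sketched roughly as follows. This is only an illustration of the proposal, not Triton's actual frontend code: the function name `stage_libdevice` and both of its parameters are hypothetical, and the "copy only when the size differs" check is a simplifying assumption.

```python
import shutil
from pathlib import Path


def stage_libdevice(libdevice_src, cache_dir):
    """Copy the libdevice bitcode into cache_dir and return its absolute path.

    Hypothetical helper: the name and parameters are assumptions made for
    illustration and are not part of Triton.
    """
    src = Path(libdevice_src)
    if not src.is_file():
        raise FileNotFoundError(f"libdevice bitcode not found: {src}")

    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    dst = cache / src.name

    # Copy only when the cached copy is missing or differs in size
    # (a cheap staleness check; a content hash would be stricter).
    if not dst.exists() or dst.stat().st_size != src.stat().st_size:
        shutil.copyfile(src, dst)

    # The absolute path inside the cache is what the IR would then refer to.
    return str(dst.resolve())
```

With this approach the IR still carries an absolute path, so it stays valid only on machines that see the same cache directory.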

@yuguo68
Contributor

yuguo68 commented Feb 6, 2023

Maybe our use case is a bit special, since we may compile kernels on remote machines that do not have the Triton source code. We already use https://github.com/openai/triton/blob/a13ddf08e2984d3d734f242e4bea589c3fbbbca4/python/triton/compiler.py#L1066 for the ptxas path.

@ptillet
Collaborator Author

ptillet commented Feb 6, 2023

I see. OK then, feel free to submit a PR that adds a TRITON_LIBDEVICE_PATH env var for now.
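A minimal sketch of how such an override might behave, assuming the variable holds the full path to the bitcode file. Only the name TRITON_LIBDEVICE_PATH comes from this thread; the function `resolve_libdevice_path`, its parameters, and the fallback-to-default behavior are hypothetical and may not match what an eventual PR implements.

```python
import os
from pathlib import Path


def resolve_libdevice_path(default_path, env=None):
    """Return the libdevice bitcode path, preferring TRITON_LIBDEVICE_PATH.

    Hypothetical helper: assumes the variable points at the .bc file itself,
    and falls back to default_path when it is unset or empty.
    """
    env = os.environ if env is None else env
    override = env.get("TRITON_LIBDEVICE_PATH")
    candidate = Path(override) if override else Path(default_path)

    # Fail early with the offending path in the message.
    if not candidate.is_file():
        raise FileNotFoundError(f"Failed to load {candidate}")

    return str(candidate.resolve())
```

A remote build worker without the Triton sources could then export TRITON_LIBDEVICE_PATH before compiling kernels, much as is already done for the ptxas path.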

@malfet
Collaborator

malfet commented Feb 11, 2023

How about using dladdr to get the path to the shared object and then looking for the library relative to it?

@malfet
Collaborator

malfet commented Feb 11, 2023

Hopefully fixed by #1176.


4 participants