
Update pinned numpy in github action #3974

Draft · wants to merge 6 commits into main

Conversation

tarang-jain (Contributor)

Pin numpy version to < 2 in github action

@tarang-jain (Contributor, Author) commented Oct 19, 2024

@asadoughi I am surprised that the seg fault in the RAFT builds appeared so suddenly. My guess was a numpy version mismatch: the conda envs for RAFT 24.06 have numpy<2, which is why I pinned numpy=1.26.4. Running valgrind on the torch tests, I see this:

...==1912667== Conditional jump or move depends on uninitialised value(s)
==1912667==    at 0x1270B55B: at::native::_to_copy(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1341BD55: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12B12C28: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x13264C8B: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12C23375: at::_ops::_to_copy::call(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1270969E: at::native::to(at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x135F2CF3: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12DA753C: at::_ops::to_dtype::call(at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12087C84: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x120899F5: at::TensorIteratorBase::build(at::TensorIteratorConfig&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1208AC24: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1239341E: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==  Uninitialised value was created by a stack allocation
==1912667==    at 0x12087320: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)

which makes me wonder whether downgrading torch might help. Please let me know if you have any suggestions. This exact same action was working earlier, right? If the GitHub action itself was unchanged, this is probably a version-compatibility issue between some of the packages, since the action does not pin versions for any of them.
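
A quick way to test the numpy hypothesis before touching the torch version would be a small sanity check run in the same environment as the torch tests. This is only a sketch (not part of this PR), with the < 2 bound mirroring the proposed pin:

# Hypothetical sanity check, not part of the workflow: fail fast if the env
# drifted to numpy >= 2, then exercise the numpy -> torch dtype-conversion
# path that the valgrind trace points at.
import numpy as np
import torch

major = int(np.__version__.split(".")[0])
assert major < 2, f"expected numpy < 2, got {np.__version__}"

x = np.random.rand(4, 8).astype("float32")
t = torch.from_numpy(x)                       # shares memory with the numpy buffer
assert torch.allclose(t.double().float(), t)  # goes through at::_ops::to_dtype / _to_copy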

@asadoughi (Contributor)

We can look into version pinning for all packages involved in the RAFT CI. Do you have a known-compatible torch version for RAFT 24.06? More generally, is there a published compatibility matrix for each version of RAFT?
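
If we do pin everything, one option (purely a sketch, with hypothetical pins) is a small guard script the workflow runs before the tests, so a drifted dependency resolution shows up as an explicit failure rather than a seg fault:

# Hypothetical CI guard; the pins below are placeholders and would need to
# come from the RAFT 24.06 compatibility matrix.
from importlib.metadata import version

EXPECTED = {
    "numpy": "1.26.4",
    # "torch": ...,  # add once a known-compatible version is confirmed
}

for name, pinned in EXPECTED.items():
    installed = version(name)
    if installed != pinned:
        raise SystemExit(f"{name}: expected {pinned}, found {installed}")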

facebook-github-bot pushed a commit that referenced this pull request Oct 22, 2024
Summary:
Related to testing in #3974

Based on comparing the logs of two runs:
- failing: https://github.com/facebookresearch/faiss/actions/runs/11409771344/job/31751246207
- passing: https://github.com/facebookresearch/faiss/actions/runs/11368781432/job/31625550227

Pull Request resolved: #3980

Reviewed By: junjieqi

Differential Revision: D64778154

Pulled By: asadoughi

fbshipit-source-id: f4e53fed3850f3e0f391015c0349ee14da68330a