Skip to content

Misc fixes and docs update #195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 9, 2021
Merged

Conversation

rnett
Copy link
Contributor

@rnett rnett commented Jan 26, 2021

Fixes #168 (makes ndarray and framework respect -DskipTests), gitignores bazel config files, and adds a CONTRIBUTING.md file as discussed in #176 (comment). Also adds a link to the gitter channel.

If there's anything else you think should be added to CONTRIBUTING.md let me know, these are all of the tips that I'm aware of.

@rnett
Copy link
Contributor Author

rnett commented Jan 30, 2021

To add when I'm not building:

Currently, do to build time constraints, the GPU binaries only support compute capacities 3.5 and 7.0.  To use with un-supported GPUs, change the value [here](tensorflow-core/tensorflow-core-api/build.sh#L27) build the binaries yourself.

To build for GPU, pass `-Djavacpp.platform.extension=-gpu` to maven.  Tensorflow's configure script can be used to set options (like cuda locations) for the native build.  Make sure the `TF_CUDA_COMPUTE_CAPABILITIES`  value in your `.tf_configure.bazelrc` matches the value set in `build.sh`.

Copy link
Collaborator

@karllessard karllessard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rnett , that's valuable documentation that was definitely missing

rnett added 5 commits February 1, 2021 21:51
Signed-off-by: Ryan Nett <rnett@calpoly.edu>
Signed-off-by: Ryan Nett <rnett@calpoly.edu>
Signed-off-by: Ryan Nett <rnett@calpoly.edu>
Signed-off-by: Ryan Nett <rnett@calpoly.edu>
Signed-off-by: Ryan Nett <rnett@calpoly.edu>
@rnett rnett force-pushed the rn_fixes_and_docs branch from 5f37408 to 95c32e4 Compare February 2, 2021 06:18
@rnett rnett requested a review from karllessard February 2, 2021 06:18
Signed-off-by: Ryan Nett <rnett@calpoly.edu>
@rnett
Copy link
Contributor Author

rnett commented Feb 2, 2021

I changed build.sh to use a set TF_CUDA_COMPUTE_CAPABILITIES by default as per #200, @karllessard can you add the CI build tag just in case?

Signed-off-by: Ryan Nett <rnett@calpoly.edu>
@karllessard
Copy link
Collaborator

@rnett , the GPU build failed with that change:

Cuda Configuration Error: Invalid compute capability: compute_'35
INFO: Elapsed time: 15.494s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

https://github.com/tensorflow/java/runs/1826080574?check_suite_focus=true

Signed-off-by: Ryan Nett <rnett@calpoly.edu>
@rnett
Copy link
Contributor Author

rnett commented Feb 4, 2021

It's the single quotes, I think. Should be fixed.

@karllessard
Copy link
Collaborator

@rnett , failed again for a very awkward reason this time. I can simply retrigger a build, sometimes things get wrong on GitHub Action hosted servers but I'll you validate first:

Analyzing: 6 targets (180 packages loaded, 9563 targets configured)
Analyzing: 6 targets (199 packages loaded, 14836 targets configured)
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/7e825abd5704ce28b166f9463d4bd304348fd2a9.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
INFO: Analyzed 6 targets (204 packages loaded, 18538 targets configured).
INFO: Found 6 targets...
[0 / 1,314] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[2,727 / 6,848] Creating runfiles tree bazel-out/k8-opt-exec-50AE0418/bin/external/org_tensorflow/tensorflow/cc/ops/array_ops_gen_cc.runfiles; 0s local ... (2 actions, 1 running)
ERROR: /root/.cache/bazel/_bazel_root/47b4c4ffab8583ddc67d4f1b3159b1ed/external/nccl_archive/BUILD.bazel:53:1: C++ compilation of rule '@nccl_archive//:device_lib' failed (Exit 1)
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/limits: In static member function 'static constexpr long double std::numeric_limits<long double>::denorm_min()':
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/limits:1799:65: error: expected ')' before numeric constant
       denorm_min() _GLIBCXX_USE_NOEXCEPT { return __LDBL_DENORM_MIN__; }
                                                                 ^~~
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/limits:1799:68: error: cannot convert 'double (*)(const char*) throw ()' to 'long double' in return
       denorm_min() _GLIBCXX_USE_NOEXCEPT { return __LDBL_DENORM_MIN__; }

https://github.com/tensorflow/java/runs/1827956486?check_suite_focus=true

@rnett
Copy link
Contributor Author

rnett commented Feb 4, 2021

Yeah that doesn't sound like anything I changed.

@karllessard
Copy link
Collaborator

You are right, master is also failing now for the same reasons... (sigh)... @saudet , do you think it is a wrong GCC installation or something like that?

@saudet
Copy link
Contributor

saudet commented Feb 5, 2021

@karllessard Tried upgrading to GCC 8 and that didn't change anything, so probably not an issue with GCC:
https://github.com/saudet/tensorflow-java/actions/runs/541859636

@saudet
Copy link
Contributor

saudet commented Feb 7, 2021

@karllessard I'm not having any more luck with TensorFlow 2.4.1 and CUDA 11.0 either, so it's not something that used to be broken for sure. It actually seems like something recently broke. It's obviously downloading something somewhere for CUDA that isn't pinned to a specific version, and that got upgraded unintentionally somewhere upstream, which broke TF's build, at least on CentOS. I assume this is going to get fixed upstream, eventually...

@karllessard
Copy link
Collaborator

Thanks @saudet , maybe we can go back to this container with Cuda we used to have before we start installing it manually? It stopped working as well but might have been fixed upstream since? Also that shortened the build time for Linux, which seems less stable these days.

@saudet
Copy link
Contributor

saudet commented Feb 7, 2021 via email

@karllessard
Copy link
Collaborator

@rnett , were you able at least to build your version with GPU support on your machine? It might take awhile before this issue gets resolved by itself, I'll merge this if you confirm that it worked on your side.

@rnett
Copy link
Contributor Author

rnett commented Feb 9, 2021

Yeah, although I had TF_CUDA_COMPUTE_CAPABILITIES set in my bazelrc so I'm not 100% it was tested. I'll run one tonight or tomorrow without it and let you know.

@rnett
Copy link
Contributor Author

rnett commented Feb 9, 2021

Ok, it works locally.

@karllessard karllessard merged commit 52bbff3 into tensorflow:master Feb 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Match test configuration for ndarray and framework to core
3 participants