Misc fixes and docs update #195

rnett · 2021-01-26T00:00:47Z

Fixes #168 (makes ndarray and framework respect -DskipTests), gitignores bazel config files, and adds a CONTRIBUTING.md file as discussed in #176 (comment). Also adds a link to the gitter channel.

If there's anything else you think should be added to CONTRIBUTING.md let me know, these are all of the tips that I'm aware of.

rnett · 2021-01-30T04:21:46Z

To add when I'm not building:

Currently, do to build time constraints, the GPU binaries only support compute capacities 3.5 and 7.0.  To use with un-supported GPUs, change the value [here](tensorflow-core/tensorflow-core-api/build.sh#L27) build the binaries yourself.

To build for GPU, pass `-Djavacpp.platform.extension=-gpu` to maven.  Tensorflow's configure script can be used to set options (like cuda locations) for the native build.  Make sure the `TF_CUDA_COMPUTE_CAPABILITIES`  value in your `.tf_configure.bazelrc` matches the value set in `build.sh`.

karllessard

Thank you @rnett , that's valuable documentation that was definitely missing

CONTRIBUTING.md

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

rnett · 2021-02-02T20:00:53Z

I changed build.sh to use a set TF_CUDA_COMPUTE_CAPABILITIES by default as per #200, @karllessard can you add the CI build tag just in case?

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

karllessard · 2021-02-04T03:55:16Z

@rnett , the GPU build failed with that change:

Cuda Configuration Error: Invalid compute capability: compute_'35
INFO: Elapsed time: 15.494s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

https://github.com/tensorflow/java/runs/1826080574?check_suite_focus=true

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

rnett · 2021-02-04T06:10:29Z

It's the single quotes, I think. Should be fixed.

karllessard · 2021-02-04T13:52:08Z

@rnett , failed again for a very awkward reason this time. I can simply retrigger a build, sometimes things get wrong on GitHub Action hosted servers but I'll you validate first:

Analyzing: 6 targets (180 packages loaded, 9563 targets configured)
Analyzing: 6 targets (199 packages loaded, 14836 targets configured)
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/llvm/llvm-project/archive/7e825abd5704ce28b166f9463d4bd304348fd2a9.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
INFO: Analyzed 6 targets (204 packages loaded, 18538 targets configured).
INFO: Found 6 targets...
[0 / 1,314] [Prepa] BazelWorkspaceStatusAction stable-status.txt
[2,727 / 6,848] Creating runfiles tree bazel-out/k8-opt-exec-50AE0418/bin/external/org_tensorflow/tensorflow/cc/ops/array_ops_gen_cc.runfiles; 0s local ... (2 actions, 1 running)
ERROR: /root/.cache/bazel/_bazel_root/47b4c4ffab8583ddc67d4f1b3159b1ed/external/nccl_archive/BUILD.bazel:53:1: C++ compilation of rule '@nccl_archive//:device_lib' failed (Exit 1)
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/limits: In static member function 'static constexpr long double std::numeric_limits<long double>::denorm_min()':
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/limits:1799:65: error: expected ')' before numeric constant
       denorm_min() _GLIBCXX_USE_NOEXCEPT { return __LDBL_DENORM_MIN__; }
                                                                 ^~~
/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/limits:1799:68: error: cannot convert 'double (*)(const char*) throw ()' to 'long double' in return
       denorm_min() _GLIBCXX_USE_NOEXCEPT { return __LDBL_DENORM_MIN__; }

https://github.com/tensorflow/java/runs/1827956486?check_suite_focus=true

rnett · 2021-02-04T21:47:54Z

Yeah that doesn't sound like anything I changed.

karllessard · 2021-02-05T03:23:41Z

You are right, master is also failing now for the same reasons... (sigh)... @saudet , do you think it is a wrong GCC installation or something like that?

saudet · 2021-02-05T23:32:08Z

@karllessard Tried upgrading to GCC 8 and that didn't change anything, so probably not an issue with GCC:
https://github.com/saudet/tensorflow-java/actions/runs/541859636

saudet · 2021-02-07T01:00:53Z

@karllessard I'm not having any more luck with TensorFlow 2.4.1 and CUDA 11.0 either, so it's not something that used to be broken for sure. It actually seems like something recently broke. It's obviously downloading something somewhere for CUDA that isn't pinned to a specific version, and that got upgraded unintentionally somewhere upstream, which broke TF's build, at least on CentOS. I assume this is going to get fixed upstream, eventually...

karllessard · 2021-02-07T02:44:33Z

Thanks @saudet , maybe we can go back to this container with Cuda we used to have before we start installing it manually? It stopped working as well but might have been fixed upstream since? Also that shortened the build time for Linux, which seems less stable these days.

saudet · 2021-02-07T02:57:24Z

Using the container isn't going to change anything about CUDA. That's not the issue here.

karllessard · 2021-02-08T13:57:00Z

@rnett , were you able at least to build your version with GPU support on your machine? It might take awhile before this issue gets resolved by itself, I'll merge this if you confirm that it worked on your side.

rnett · 2021-02-09T00:39:54Z

Yeah, although I had TF_CUDA_COMPUTE_CAPABILITIES set in my bazelrc so I'm not 100% it was tested. I'll run one tonight or tomorrow without it and let you know.

rnett · 2021-02-09T04:29:19Z

Ok, it works locally.

karllessard requested changes Jan 30, 2021

View reviewed changes

CONTRIBUTING.md Outdated Show resolved Hide resolved

CONTRIBUTING.md Outdated Show resolved Hide resolved

CONTRIBUTING.md Outdated Show resolved Hide resolved

CONTRIBUTING.md Outdated Show resolved Hide resolved

rnett added 5 commits February 1, 2021 21:51

Don't force test runs in ndarray and framework

e191599

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

gitignore bazel config files

010753b

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

Add CONTRIBUTING.md

9188d29

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

Add note about code generation

1663cf9

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

Updates

95c32e4

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

rnett force-pushed the rn_fixes_and_docs branch from 5f37408 to 95c32e4 Compare February 2, 2021 06:18

rnett requested a review from karllessard February 2, 2021 06:18

Use set TF_CUDA_COMPUTE_CAPABILITIES by default

46e67b6

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

karllessard added the CI build label Feb 3, 2021

mikaelhg mentioned this pull request Feb 3, 2021

README should include an example of a build command with explicit component paths #206

Open

Add dedicated native builds section

c8c6ec1

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

Fix quoting

0038265

Signed-off-by: Ryan Nett <rnett@calpoly.edu>

karllessard approved these changes Feb 9, 2021

View reviewed changes

karllessard merged commit 52bbff3 into tensorflow:master Feb 9, 2021

Misc fixes and docs update #195

Misc fixes and docs update #195

Uh oh!

Conversation

rnett commented Jan 26, 2021

Uh oh!

rnett commented Jan 30, 2021

Uh oh!

karllessard left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rnett commented Feb 2, 2021

Uh oh!

karllessard commented Feb 4, 2021

Uh oh!

rnett commented Feb 4, 2021

Uh oh!

karllessard commented Feb 4, 2021

Uh oh!

rnett commented Feb 4, 2021

Uh oh!

karllessard commented Feb 5, 2021

Uh oh!

saudet commented Feb 5, 2021

Uh oh!

saudet commented Feb 7, 2021

Uh oh!

karllessard commented Feb 7, 2021

Uh oh!

saudet commented Feb 7, 2021 via email

Uh oh!

karllessard commented Feb 8, 2021

Uh oh!

rnett commented Feb 9, 2021

Uh oh!

rnett commented Feb 9, 2021

Uh oh!

Uh oh!