[TASK] Issue finding tensorflow during Install RAVEN libraries for Mac M2 #2158

Closed
10 tasks done
yoshiurr-INL opened this issue Jul 27, 2023 · 33 comments
Labels: priority_critical, task, under-discussion

Comments

@yoshiurr-INL
Collaborator

yoshiurr-INL commented Jul 27, 2023


Under Discussion Topic

Machine Specification
Equipment: MacBook Pro
OS: Ventura 13.5
Processor: Apple M2 Max
Screenshot 2023-07-27 at 12 28 32 PM

Summary of the topic to be discussed with the development team
While installing the RAVEN libraries with "--install", pip cannot find a tensorflow version that satisfies the requirement tensorflow==2.10.*
Screenshot 2023-07-27 at 12 19 29 PM
Screenshot 2023-07-27 at 12 19 58 PM
Screenshot 2023-07-27 at 12 20 41 PM

When trying to use "--mamba" instead, the installation process does not start.
Screenshot 2023-07-27 at 12 21 11 PM

Describe the solution you'd like to be implemented
Identify whether this issue is common for Mac systems.
Identify whether this issue is common for M1 and M2 chips.
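
To help sort reports by platform, each affected machine could record what its Python interpreter sees. This is a minimal sketch using only the standard platform module; the helper name is_apple_silicon is hypothetical and not part of RAVEN.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True when the interpreter runs natively on an arm64 (M1/M2) Mac.

    Hypothetical helper, not part of RAVEN. The arguments default to the
    values reported by the running interpreter and exist mainly for testing.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    # Print the raw values too, so they can be pasted into a report.
    print(platform.system(), platform.machine(), is_apple_silicon())
```

An x86_64 Python running under Rosetta reports x86_64, so this reflects the interpreter's architecture rather than the hardware itself.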

Describe alternatives you've considered
Possibly installing tensorflow through conda instead of pip?


For Change Control Board: Issue Review

This review should occur before any development is performed as a response to this issue.

  • 1. Is it tagged with a type: defect or task?
  • 2. Is it tagged with a priority: critical, normal or minor?
  • 3. If it will impact requirements or requirements tests, is it tagged with requirements?
  • 4. If it is a defect, can it cause wrong results for users? If so an email needs to be sent to the users.
  • 5. Is a rationale provided? (Such as explaining why the improvement is needed or why current code is wrong.)

For Change Control Board: Issue Closure

This review should occur when the issue is imminently going to be closed.

  • 1. If the issue is a defect, is the defect fixed?
  • 2. If the issue is a defect, is the defect tested for in the regression test system? (If not explain why not.)
  • 3. If the issue can impact users, has an email to the users group been written (the email should specify if the defect impacts stable or master)?
  • 4. If the issue is a defect, does it impact the latest release branch? If yes, is there any issue tagged with release (create if needed)?
  • 5. If the issue is being closed without a pull request, has an explanation of why it is being closed been provided?
yoshiurr-INL added the under-discussion label on Jul 27, 2023
@joshua-cogliati-inl
Contributor

Hm, if you change the line in the dependencies.xml from:
<tensorflow source="pip" os='mac,linux'>2.10</tensorflow>
to
<tensorflow os='mac,linux'>2.10</tensorflow>
does it install?

(Note that we do not currently have automated testing on arm64)
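
For anyone who would rather script that edit than make it by hand, here is a minimal sketch using Python's standard xml.etree.ElementTree. The SAMPLE fragment, the <dependencies> root element, and the function name drop_pip_source are assumptions for illustration; the real dependencies.xml has many more entries and may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the relevant part of dependencies.xml.
# The root element name is an assumption made for this example.
SAMPLE = """<dependencies>
  <main>
    <tensorflow source="pip" os="mac,linux">2.10</tensorflow>
    <tensorflow source="pip" os="windows">2.10</tensorflow>
  </main>
</dependencies>"""


def drop_pip_source(xml_text, package="tensorflow", os_name="mac"):
    """Remove source="pip" from entries of `package` that apply to `os_name`."""
    root = ET.fromstring(xml_text)
    for node in root.iter(package):
        # The os attribute is a comma-separated list such as "mac,linux".
        if os_name in node.get("os", "").split(","):
            node.attrib.pop("source", None)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(drop_pip_source(SAMPLE))
```

ElementTree discards XML comments when parsing with its default settings, so for the real file a manual edit may be the safer route.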

@wanghy-anl
Contributor

@joshua-cogliati-inl Joshua, I found the identical issue on my M1 MacBook Pro 13 inch (OS: Ventura 13.5; Processor: Apple M1), just like Ramon experienced.

I tried to edit the dependencies.xml as you suggested, and the conda environment can be established by ./scripts/establish_conda_env.sh --install.

However, after ./build_raven and ./run_tests -j4, 23 tests are marked as "Diff" or "Failed". See the attached log file.

Haoyu
log_run_test_j4_20230802.log

@joshua-cogliati-inl
Contributor

Okay, so we can install it if we switch tensorflow back to conda-forge, but it fails some tests. I think the correct solution for this is probably to switch to a newer version of tensorflow.

@wanghy-anl
Contributor

Thanks Joshua. Let me know if you have any candidate versions in your mind. I can test on my M1 machine (it's idle recently)

@joshua-cogliati-inl
Contributor

Tensorflow 2.12 and 2.13 might be worth trying.

@joshua-cogliati-inl
Contributor

I started testing tensorflow 2.12 in #2138 but we need a few updates for it.

@wanghy-anl
Contributor

@joshua-cogliati-inl, here are the results:
Using 2.12 (I changed line 49 of dependencies.xml to <tensorflow os='mac,linux'>2.12</tensorflow>): the conda environment can be established, but there are 14 Failed tests and 16 Diff tests; see the log below.
log_run_test_j4_tensorflow_2_12_2023AUG03.log

Using 2.13 (only available through pip; I changed line 49 of dependencies.xml to <tensorflow source="pip" os='mac,linux'>2.13</tensorflow>): the conda environment can be established, but there are 673 Failed tests; see the log below.
log_run_test_j4_tensorflow_2_13_2023AUG03.log

@joshua-cogliati-inl
Contributor

Hm, for 2.13, something is being done incorrectly:

ImportError: Failed to import grpc on Apple Silicon. On Apple Silicon machines, try `pip uninstall grpcio; conda install grpcio`. Check out https://docs.ray.io/en/master/ray-overview/installation.html#m1-mac-apple-silicon-support for more details.

@wanghy-anl
Contributor

Is there anything we can do within raven's establish_conda_env.sh script?

@joshua-cogliati-inl
Contributor

It might be worth adding 'grpcio' as a conda dependency and see if that solves it.
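
A quick way to check whether a change like that helped, before running the full test suite, is to try the import directly inside the activated environment. This is a minimal sketch; try_import is a hypothetical helper name, and "grpc" is the module that the grpcio package provides.

```python
import importlib


def try_import(module_name):
    """Return (True, "") if the module imports, else (False, error message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as err:
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    ok, message = try_import("grpc")
    print("grpc import ok" if ok else f"grpc import failed: {message}")
```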

@joshua-cogliati-inl
Contributor

Otherwise, yes, we might need to modify establish_conda_env.sh

@wanghy-anl
Contributor

I added <grpcio/> to dependencies.xml, and the conda environment can be established, but there are still 14 Failed and 16 Diff tests. See the attached dependencies.xml and log.
dependencies_and_log_2023AUG04.zip

@joshua-cogliati-inl
Contributor

> I added the <grpcio/> to dependencies.xml, and the conda environment can be established, but 14 failed and 16 diff tests. See the dependencies.xml and log attached.

It looks like many of the Diff and Failed tests are caused by the tensorflow update, so that is probably the first thing we need to fix.

@wanghy-anl
Contributor

Joshua, let me know when you need to test the fix. I can do the test on M1 chip.

joshua-cogliati-inl changed the title from "[UNDER-DISCUSSION] Issue finding tensorflow during Install RAVEN libraries" to "[UNDER-DISCUSSION] Issue finding tensorflow during Install RAVEN libraries for Mac M2" on Aug 28, 2023
@joshua-cogliati-inl
Contributor

joshua-cogliati-inl commented Aug 28, 2023

For future reference, these are the changes made to dependencies.xml compared to current devel. (scipy was actually updated by a change in devel, so we probably do not need to downgrade scipy; smt was also added in devel.)

--- dependencies.xml	2023-08-28 10:20:41.567497521 -0600
+++ /tmp/.fr-NTKHA2/dependencies.xml	2023-08-04 08:39:21.000000000 -0600
@@ -37,7 +37,7 @@
   <main>
     <h5py/>
     <numpy>1.22</numpy>
-    <scipy>1.9</scipy>
+    <scipy>1.7</scipy>
     <scikit-learn>1.0</scikit-learn>
     <pandas/>
     <!-- Note most versions of xarray work, but some (such as 0.20) don't -->
@@ -46,8 +46,9 @@
     <matplotlib>3.5</matplotlib>
     <statsmodels>0.13</statsmodels>
     <cloudpickle>2.2</cloudpickle>
-    <tensorflow source="pip" os='mac,linux'>2.10</tensorflow>
-    <tensorflow source="pip" os='windows'>2.10</tensorflow>
+    <tensorflow source="pip" os='mac,linux'>2.13</tensorflow>
+    <tensorflow source="pip" os='windows'>2.13</tensorflow>
+    <grpcio/>
     <!-- conda is really slow on windows if the version is not specified.-->
     <python skip_check='True' os='windows'>3.8</python>
     <python skip_check='True' os='mac,linux'>3</python>
@@ -70,7 +71,6 @@
     <!-- redis is needed by ray, but on windows, this seems to need to be explicitly stated -->
     <redis source="pip" os='windows'/>
     <imageio source="pip">2.22</imageio>
-    <smt/>
     <line_profiler optional='True'/>
     <!-- <ete3 optional='True'/> -->
     <pywavelets optional='True'>1.1</pywavelets>

@wanghy-anl
Contributor

Joshua, is this dependencies.xml in any branch? I can give it a try if you can point me to the correct branch.

@joshua-cogliati-inl
Contributor

> Joshua, is this dependencies.xml in any branch? I can give it a try if you can point me to the correct branch.

I just used the dependencies.xml file you included in your zip file, and I also just updated #2138 to use 2.13 instead of 2.12.

@wanghy-anl
Contributor

Thanks, I will wait until #2138 gets merged and then test it on M1 chip.

@joshua-cogliati-inl
Contributor

> Joshua, is this dependencies.xml in any branch? I can give it a try if you can point me to the correct branch.

It is on my joshua-cogliati-inl:tensorflow_212 branch, which #2138 uses. It would be useful to know if it fixes things on the M1 chip.

@wanghy-anl
Contributor

> It is on my joshua-cogliati-inl:tensorflow_212 branch that #2138 uses, it would be useful to know if it fixes things on the M1 chip.

Thanks, Joshua. Let me give it a try on the M1 chip tonight or tomorrow. I will attach the log file here.

@joshua-cogliati-inl
Contributor

FYI: If anyone uses the diff for the dependencies.xml, do not remove smt since that will cause newer versions of RAVEN to fail.

@joshua-cogliati-inl
Contributor

joshua-cogliati-inl commented Sep 6, 2023

On further investigation, smt does not seem to be available for macos amd64: https://pypi.org/project/smt/#files
so we probably do need to change <smt/> to <smt optional='True'/> and put the imports that use smt into try/except blocks.
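
As a rough illustration of what guarding an optional import could look like, here is a minimal sketch. The names HAVE_SMT and require_smt are hypothetical and are not taken from the RAVEN code base; the actual change may be organized quite differently.

```python
# Guarded import: the module still loads when the optional package is missing.
try:
    import smt  # optional dependency
    HAVE_SMT = True
except ImportError:
    smt = None
    HAVE_SMT = False


def require_smt():
    """Return the smt module, or raise a clear error at the point of use."""
    if not HAVE_SMT:
        raise RuntimeError(
            "This feature needs the optional 'smt' package, which is not installed."
        )
    return smt
```

Deferring the error to the point of use keeps the rest of the framework importable on platforms where smt cannot be installed, while features that need it still fail with a readable message.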

@wanghy-anl
Contributor

> FYI: If anyone uses the diff for the dependencies.xml, do not remove smt since that will cause newer versions of RAVEN to fail.

Josh, you were correct. I deleted <smt/> in the attached dependencies_a.xml and 694 tests failed on M1 chip. See attached Log_Sep05_2023_a.log.
So I re-added <smt source='pip'/> in the attached dependencies_b.xml and it runs better. 19 tests failed. See attached Log_Sep05_2023_b.log.
Sep_5_2022_Trials.zip

@joshua-cogliati-inl
Contributor

Some errors I saw:

File ".../raven/ravenframework/Optimizers/acquisitionFunctions/AcquisitionFunction.py", line 138, in conductAcquisition
  res = sciopt.differential_evolution(optFunc, bounds=self._bounds, polish=self._polish, maxiter=self._maxiter, tol=self._tol,
TypeError: differential_evolution() got an unexpected keyword argument 'vectorized'

File ".../python3.10/site-packages/netCDF4/__init__.py", line 3, in <module>
  from ._netCDF4 import
ImportError: dlopen(.../python3.10/site-packages/netCDF4/_netCDF4.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_nc_close'

libc++abi: terminating due to uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::overflow_error>>: Error in function ibeta_derivative<e>(e,e,e): Overflow Error

Also, a bunch of diffs.

I think it is worth trying netcdf 1.6 to see if that fixes the netcdf errors.
I think the floating-point hardware must behave a bit differently, which is causing the overflow error and some of the diffs.
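
On the first error: the traceback says the installed differential_evolution does not accept 'vectorized'. As far as I know that keyword was added in SciPy 1.9, so the scipy 1.7 pin in that dependencies.xml is a plausible cause, though I have not verified it on this machine. A generic way to check whether a callable accepts a keyword is sketched below with the standard inspect module; accepts_keyword and call_with_supported are hypothetical names, not RAVEN or SciPy APIs.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    for param in inspect.signature(func).parameters.values():
        # A **kwargs catch-all accepts any keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        if param.name == name and param.kind in named_kinds:
            return True
    return False


def call_with_supported(func, *args, **kwargs):
    """Call `func`, dropping keyword arguments it does not accept."""
    supported = {k: v for k, v in kwargs.items() if accepts_keyword(func, k)}
    return func(*args, **supported)
```

Dropping arguments silently can hide real configuration problems, so pinning a SciPy version that has the keyword is probably the cleaner fix; the check is mostly useful for diagnosing which version is installed.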

@wangcj05
Collaborator

wangcj05 commented Sep 7, 2023 via email

@joshua-cogliati-inl
Contributor

So apparently the remaining errors are:

FAILED:
Diff tests/framework/redundantInputs
Diff tests/framework/NDGridProbabilityWeightValue
Diff tests/framework/CodeInterfaceTests/CobraTF/test3
Diff tests/framework/pca_sparseGridCollocation/polyCorrelation
Diff tests/framework/PostProcessors/LimitSurface/testLimitSurfaceIntegralPPWithBoundingError
Diff tests/framework/Optimizers/GeneticAlgorithms/simionescuConstrainedInvLin
Diff tests/framework/Samplers/SparseGrid/normal
Failed tests/framework/Samplers/SparseGrid/betanorm
Failed tests/framework/Samplers/SparseGrid/beta
Diff tests/framework/Samplers/SparseGrid/triangular
Diff tests/framework/pca_adaptive_sgc/test_adaptive_sgc_poly_pca_analytic

PASSED: 778
SKIPPED: 93
FAILED: 11

I think a lot of those come from differences in how arm64 and amd64 handle floating-point numbers. (From what I have seen online, I think basic arithmetic (+, -, *, /) gives the same results, but conversions from floating point to integer and back can differ, and functions like sin can eventually give different results.)
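
To illustrate why small platform differences show up as Diff results, and how a tolerance-based comparison absorbs them, here is a minimal sketch using math.isclose. The function name compare_values and the tolerance values are assumptions chosen for the example; they are not the settings used by RAVEN's test harness.

```python
import math


def compare_values(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Return the indices where two equally long sequences differ beyond tolerance."""
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    return [
        i
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    gold = [1.0, 2.0, 0.0]
    new = [1.0, 2.0 + 1e-13, 1e-13]  # tiny last-digit differences
    print(compare_values(gold, new))
```

The abs_tol term matters for values near zero, where a purely relative tolerance would flag any nonzero difference.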

@wangcj05
Collaborator

wangcj05 commented Sep 11, 2023 via email

@alfoa
Collaborator

alfoa commented Sep 25, 2023

Just FYI: (on M2, I had to download and "pip install" smt directly from https://github.com/SMTorg/SMT)

@joshua-cogliati-inl
Contributor

@alfoa Yes, we are discussing smt at: #2138 (comment)

wangcj05 changed the title from "[UNDER-DISCUSSION] Issue finding tensorflow during Install RAVEN libraries for Mac M2" to "[TASK] Issue finding tensorflow during Install RAVEN libraries for Mac M2" on Sep 29, 2023
wangcj05 added the priority_critical and task labels on Sep 29, 2023
@wangcj05
Collaborator

This issue is partly addressed by PR #2138

@joshua-cogliati-inl
Contributor

joshua-cogliati-inl commented Nov 10, 2023

It looks like #2201 fixed the beta Sampler problems:

(49/69) Success(  2.87sec)tests/framework/Samplers/SparseGrid/beta
(50/69) Success(  2.90sec)tests/framework/Samplers/SparseGrid/betanorm

Update:
And for that matter all the RAVEN tests currently pass on Mac OS amd64:

PASSED: 794
SKIPPED: 95
FAILED: 0
 ... RAVEN tests passed successfully.

@wangcj05
Collaborator

wangcj05 commented Nov 10, 2023 via email

@wangcj05
Collaborator

It seems this issue has been resolved.
