
Update ML AI Plugin To Support Keras 3 #1223

Merged
ksbeattie merged 23 commits into CCSI-Toolset:master from bpaul4:update-keras on Jul 2, 2024

Conversation

bpaul4
Contributor

@bpaul4 bpaul4 commented May 21, 2024

Fixes/Addresses:

Updates syntax in the ML AI Plugin examples, methods, and tests to support Keras 3, which replaces the standard H5 (.h5) file format with a new Keras (.keras) file format and deprecates the legacy SavedModel (folder) model type to an inference-only TFSM loading path. SavedModel files can still be loaded, but they cannot include a custom layer that is detectable by FOQUS.
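
For illustration, a minimal Keras 3 sketch of the two loading paths described above (the file and directory names are placeholders, not paths from the FOQUS repository):

# Illustrative sketch only; "my_model.keras" and "my_savedmodel_dir" are placeholders.
import keras

# Native Keras 3 format: loads the full model, including registered custom layers.
model = keras.saving.load_model("my_model.keras")

# Legacy SavedModel folder: inference-only via TFSMLayer, so a custom layer
# attached to the original model is not detectable after loading.
tfsm_layer = keras.layers.TFSMLayer("my_savedmodel_dir", call_endpoint="serving_default")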

Summary/Motivation:

Changes proposed in this PR:

  • Reorganize ML AI example file structure into supported Keras, deprecated (legacy) Keras, and non-Keras trainer examples
  • Update examples to Keras 3 syntax and regenerate example files (files for non-Keras trainers were regenerated as well to ensure compatibility with the latest versions)
  • Update node method structure and tests to confirm compatibility with Keras 3 (latest TensorFlow, which is currently 2.16.1)

Legal Acknowledgement

By contributing to this software project, I agree to the following terms and conditions for my contribution:

  1. I agree my contributions are submitted under the copyright and license terms described in the LICENSE.md file at the top level of this directory.
  2. I represent I am authorized to make the contributions and grant the license. If my employer has rights to intellectual property that includes these contributions, I represent that I have received permission to make contributions and grant the required license on behalf of that employer.

@bpaul4 bpaul4 added the ML/AI Machine Learning/Artificial Intelligence label May 21, 2024
@bpaul4 bpaul4 self-assigned this May 21, 2024
@bpaul4 bpaul4 linked an issue May 21, 2024 that may be closed by this pull request
@ksbeattie ksbeattie added the Priority:High High Priority Issue or PR label May 28, 2024
@bpaul4 bpaul4 marked this pull request as ready for review June 4, 2024 17:50
@bpaul4 bpaul4 changed the title from "[WIP] Update ML AI Plugin To Support Keras 3" to "Update ML AI Plugin To Support Keras 3" Jun 4, 2024
@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 4, 2024

@bpaul4 to enable the ML/AI tests on versions of Python other than 3.10, you can try to add items to the job matrix include section:

    name: pytest (py${{ matrix.python-version }}/${{ matrix.os }})
    needs: [spell-check, code-formatting, pylint]
    runs-on: ${{ matrix.os-version }}
    strategy:
      fail-fast: false
      matrix:
        python-version:
          - '3.8'
          - '3.9'
          - '3.10'
          - '3.11'
          - '3.12'
        os:
          - linux
          - win64
          - macos-x86_64
          - macos
        include:
          - os: macos-x86_64
            os-version: macos-13
          - os: macos
            os-version: macos-14
          - os: linux
            os-version: ubuntu-20.04
          - os: win64
            os-version: windows-2019
          - python-version: '3.10'  # avoid uploading coverage for full matrix
            use_coverage: true
          - python-version: '3.10'  # this is to avoid installing optional dependencies in all environments
            optional-dependencies: -r requirements-mlai.txt

        include:
          - os: macos-x86_64
            os-version: macos-13
          - os: macos
            os-version: macos-14
          - os: linux
            os-version: ubuntu-20.04
          - os: win64
            os-version: windows-2019
          - python-version: '3.10'  # avoid uploading coverage for full matrix
            use_coverage: true
          - python-version: '3.10'  # this is to avoid installing optional dependencies in all environments
            optional-dependencies: -r requirements-mlai.txt
+         - python-version: '3.9'
+           optional-dependencies: -r requirements-mlai.txt
+         - python-version: '3.11'
+           optional-dependencies: -r requirements-mlai.txt

@bpaul4
Contributor Author

bpaul4 commented Jun 5, 2024

@lbianchi-lbl per the log, it appears that the "savedmodel" zip file is copied over and extracted to the test directory, but is not picked up by the model_files fixture in the test module itself. It picks up all other models and all Python files, but not the "savedmodel" folder. I'm currently investigating why, as this doesn't occur locally, only in CI runs.

[screenshot of the CI test log]

@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 5, 2024

@bpaul4 thanks for looking into this. When running locally, have you tried running the tests in a freshly created directory? e.g.

mkdir testing-foqus-1223
cd testing-foqus-1223
git clone https://github.com/bpaul4/FOQUS
cd FOQUS
git switch update-keras
# activate environment
# run tests

One key difference between local development and the CI environment is that the latter always starts from a "clean slate", which might be one of the factors that results in the different behavior that we're seeing.

@bpaul4
Contributor Author

bpaul4 commented Jun 5, 2024

@lbianchi-lbl I did some more debugging, and it looks like in the fixture below

@pytest.fixture(scope="session")
def model_files(
    foqus_ml_ai_models_dir: Path,
    install_ml_ai_model_files,
    suffixes: Tuple[str] = (".py", ".keras", ".h5", ".json", ".pt", ".pkl"),
) -> List[Path]:
    paths = []
    for path in sorted(foqus_ml_ai_models_dir.glob("*")):
        if all(
            [
                ((path.is_file() and path.suffix in suffixes) or path.is_dir()),
                path.stat().st_size > 0,
                path.name != "__init__.py",
            ]
        ):
            paths.append(path)
    return paths

the "savedmodel" folder is picked up when path.stat().st_size > 0 and path.name != "__init__.py" are excluded. Assuming the first one is the cause, is there a reason it would exclude the folder and is it safe to remove this condition?

Update: just saw your prior comment, I haven't tried that but can try that next.

@lbianchi-lbl
Contributor

> the "savedmodel" folder is picked up when path.stat().st_size > 0 and path.name != "__init__.py" are excluded. Assuming the first one is the cause, is there a reason it would exclude the folder and is it safe to remove this condition?

Great point @bpaul4. If path points to a directory, path.stat().st_size should be 4096 and thus greater than zero. However, I've just checked on both Linux and Windows and, while this is true on Linux, on Windows the same snippet returns 0 🙃 I think you can go ahead and remove that condition and push the changes, and we can check if that resolves the failure.
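
As a minimal sketch of what removing that condition could look like (assuming the fix is simply to drop the size check from the filter; the actual change pushed to the PR may differ), the loop body of the fixture quoted above would become:

# Sketch only: the st_size check is dropped, since on Windows
# Path.stat().st_size is 0 for directories, which silently
# filtered out the "savedmodel" folder.
for path in sorted(foqus_ml_ai_models_dir.glob("*")):
    if all(
        [
            ((path.is_file() and path.suffix in suffixes) or path.is_dir()),
            path.name != "__init__.py",
        ]
    ):
        paths.append(path)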

@bpaul4
Contributor Author

bpaul4 commented Jun 5, 2024

@lbianchi-lbl everything is green, so I think it's ready for your review per your availability. Thank you for your suggestions!

The changes within the folder examples/other_files/ML_AI_Plugin/ are mostly files being moved around. I moved old TensorFlow files into a "deprecated" folder, regenerated all of the models with the latest packages (both TensorFlow and non-TensorFlow models), and moved the current models into "supported Keras" and "other" folders. The only big change is that the example without a custom layer is now also the mea column model, as the autothermal reformer model couldn't readily be regenerated.

Review comment on .github/workflows/checks.yml (outdated, resolved)
@lbianchi-lbl
Contributor

@bpaul4 comparing the GHA logs for a passing test (above, Windows/Python 3.10) and failing test (below, macOS/Python 3.10), it looks like something between TestImports::test_import_tensorflow_success and TestPymodelMLAI::test_build_and_run_as_expected_1 is hanging indefinitely on macOS, causing the GHA step to time out after the built-in limit of 6 hours:

[screenshots of the GHA logs for the passing Windows/Python 3.10 run and the failing macOS/Python 3.10 run]

Do you have any idea why this is happening? What type of test is TestPymodelMLAI::test_build_and_run_as_expected_1? (e.g. is it a GUI test?)

Since this seems to be happening consistently for all macOS runners (both Intel and Apple Silicon), I wonder if it's possible for someone with a macOS machine to try to reproduce this locally.

@bpaul4
Contributor Author

bpaul4 commented Jun 12, 2024

@lbianchi-lbl thanks for investigating further. The TestPymodelMLAI::test_build_and_run_as_expected_ tests just call the ML AI model class, which is done directly as shown below to avoid calling GUI commands:

    def test_build_and_run_as_expected_1(self, example_1):
        # test that the loaded models run with no issues without modifications
        # as in subsequent tests, an alias is created to preserve the fixture
        # no custom layer, so set keras_has_custom_layer to False
        pytest.importorskip("tensorflow", reason="tensorflow not installed")
        test_pymodel = pymodel_ml_ai(
            example_1, trainer="keras", keras_has_custom_layer=False
        )
        test_pymodel.run()

It can't be due to just importing one of the machine learning packages or executing the models, as the foqus_lib/gui/tests/test_ml_ai.py::TestMLAIPluginFlowsheetRun tests and foqus_lib/unit_test/node_test.py::testImports tests are passing.

However, I see that

foqus_lib/gui/tests/test_surrogate.py::TestFrame::test_run_surrogate[tutorial_files/Flowsheets/Tutorial_4/Simple_flow.foqus-keras_nn-tensorflow.keras] FAILED [ 27%]
foqus_lib/gui/tests/test_surrogate.py::TestFrame::test_run_surrogate[tutorial_files/Flowsheets/Tutorial_4/Simple_flow.foqus-pytorch_nn-torch.nn] FAILED [ 28%]
foqus_lib/gui/tests/test_surrogate.py::TestFrame::test_run_surrogate[tutorial_files/Flowsheets/Tutorial_4/Simple_flow.foqus-scikit_nn-sklearn.neural_network] FAILED [ 29%]

appears on the macOS runs but not in Windows or Linux, and this may be a more useful error to look into and see what's happening.

@lbianchi-lbl
Contributor

> However, I see that [the three test_surrogate failures listed above] appear on the macOS runs but not in Windows or Linux, and this may be a more useful error to look into and see what's happening.

Ah, great point, I didn't notice that, good catch. We're not getting more information about those failures because of the timeout, but I agree that it's worth looking into in more detail. If you're OK with it, I'll push a change to the CI workflow file so that pytest exits after 3 failures, which should let us see the stack traces for those tests.

@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 12, 2024

@bpaul4 it looks like the failure in the ML/AI surrogate plugin is due to a timeout being reached by pytest-qt while waiting for the successful completion of the run() method. So it seems there might be some general ML/AI issue at play, since we're seeing problems in both the Pymodel and surrogate plugin parts of the FOQUS codebase.
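
For context, a rough sketch of the kind of pytest-qt wait that hits such a timeout (the function and attribute names here are hypothetical, not the actual FOQUS test code):

# Hypothetical illustration: qtbot.waitUntil polls a condition and raises a
# timeout error if it never becomes true, which is what a hanging run() produces.
def test_run_surrogate(qtbot, main_window):
    main_window.start_surrogate_run()        # hypothetical: kicks off the surrogate run
    qtbot.waitUntil(
        lambda: main_window.run_finished,     # hypothetical completion flag
        timeout=300_000,                      # milliseconds
    )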

Unfortunately, I'm not sure we can get much more information from CI runs, since (at least for the ML/AI surrogate) the output is redirected to the "pseudo-console" within the GUI so that it's accessible to the user without having to check the logs. So I think the next step would be for someone to run this locally and check visually and/or take a screenshot.

Is there any macOS-using generous soul who'd be willing to check? @ksbeattie @boverhof @kbuma @franflame It should be enough to check out the code in this branch, install the ML/AI dependencies, and run the tests. I can write a complete set of commands but it'll have to wait a bit.

UPDATE the following should work:

cd "$(mktemp -d)"
git clone https://github.com/CCSI-Toolset/FOQUS.git && cd FOQUS
git fetch origin pull/1223/merge:foqus-1223
git switch foqus-1223
conda create -c conda-forge -c CCSI-Toolset -n test-foqus-1223 python=3.10 psuade-lite=1.9 --yes
conda activate test-foqus-1223
pip install -r requirements-dev.txt -r requirements-mlai.txt

# run all tests, exiting after 3 failures
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir --maxfail=3

# only run GUI tests for the ML/AI surrogate plugins
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k test_surrogate

# only run the Pymodel non-GUI tests
pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k node_test

@franflame
Contributor

I was able to use the above commands to set up a clean environment and did some manual testing. At least locally, pytorch_nn and scikit_nn are both working fine with no issues; only keras_nn doesn't complete as expected. I traced through keras_nn's execution by printing to the GUI log, and it appears that model.fit() is where it gets stuck.
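
For anyone who wants to check outside of FOQUS, a minimal standalone snippet (illustrative, not part of the test suite) that exercises model.fit() on the installed TensorFlow:

# Standalone check (illustrative): does model.fit() hang on this machine?
import numpy as np
import tensorflow as tf

x = np.random.rand(100, 6).astype("float32")
y = np.random.rand(100, 2).astype("float32")

model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(6,)),
        tf.keras.layers.Dense(20, activation="relu"),
        tf.keras.layers.Dense(2),
    ]
)
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=2, verbose=2)  # on affected macOS setups this call never returns
print("fit completed")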

@franflame
Contributor

franflame commented Jun 19, 2024

@bpaul4 @lbianchi-lbl Here's what I got from running the automated tests locally:

pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir --maxfail=3
Exits after failing the tests for keras_nn, pytorch_nn, and scikit_nn.

pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k test_surrogate
Exits with three tests failing (timeout).

pytest foqus_lib/ --pyargs foqus_lib --verbose --basetemp=./pytest-temp-dir -k node_test
Gets stuck on foqus_lib/unit_test/node_test.py::TestPymodelMLAI::test_build_and_run_as_expected_1 and doesn't seem to have a timeout(?); I had to manually exit after it was stuck for 25+ minutes.

Side note: from watching the GUI tests execute, the pytorch_nn and scikit_nn tests fail because the keras_nn test is still being run, and a new process for pytorch_nn or scikit_nn can't be started when keras_nn isn't finished. This is comparable to what I observed from manual testing, described in my previous comment.

@lbianchi-lbl
Contributor

lbianchi-lbl commented Jun 25, 2024

  • macOS tests are failing with TensorFlow 2.16.1, which is the version required for the changes in this PR to work correctly
  • The failure seems to be caused by the execution hanging indefinitely when the .fit() method is called on any TF model
  • After some discussion, we're leaning toward skipping or xfailing the tests that are currently failing for macOS, and hope that a fix will be released upstream to make macOS work again with tensorflow>2.16.1
  • @lbianchi-lbl will push to this PR with the changes needed to skip/xfail these tests (one possible form of the marker is sketched below)
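
A minimal sketch of one way such a platform-conditional skip/xfail could look (illustrative only; the actual marker pushed to the PR may differ):

# Illustrative only: mark the hanging tests as expected failures on macOS,
# with run=False so pytest does not execute (and hang on) the test body.
import sys
import pytest

mac_tf_hang = pytest.mark.xfail(
    sys.platform == "darwin",
    reason="model.fit() hangs indefinitely on macOS with TensorFlow 2.16.1",
    run=False,
)

@mac_tf_hang
def test_build_and_run_as_expected_1(example_1):
    ...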


codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 30.23256% with 30 lines in your changes missing coverage. Please review.

Project coverage is 37.88%. Comparing base (8da0631) to head (1531097).
Report is 20 commits behind head on master.

Files with missing lines              Patch %   Lines
foqus_lib/framework/graph/node.py     28.57%    22 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1223      +/-   ##
==========================================
- Coverage   38.54%   37.88%   -0.66%     
==========================================
  Files         164      164              
  Lines       37032    37048      +16     
  Branches     6132     6153      +21     
==========================================
- Hits        14274    14036     -238     
- Misses      21619    21835     +216     
- Partials     1139     1177      +38     


@ksbeattie ksbeattie merged commit 06fe4b7 into CCSI-Toolset:master Jul 2, 2024
29 of 31 checks passed
Labels
ML/AI Machine Learning/Artificial Intelligence Priority:High High Priority Issue or PR
Development

Successfully merging this pull request may close these issues.

Resolve Keras File Format Issues
4 participants