OS-Climate - Establish minimum versions of tools / packages for Dev Cluster #234

Open
65 tasks
MichaelTiemannOSC opened this issue Nov 16, 2022 · 14 comments

Comments

@MichaelTiemannOSC
Contributor

MichaelTiemannOSC commented Nov 16, 2022

xref #98

High-level question: if we use conda as our base installation system, users can install shell-level packages such as ghostscript without needing to beg for special install help. With pip/pipenv, we are entirely limited to Python. What, really, is the best choice here?

  • python packages for data pipeline ingestion

    • pandas 2.2.1
    • numpy 1.26.2
    • SQLAlchemy 2.0.29 (released; pipelines need updating, but it works with dbt>=1.4.9)
    • Trino Python Client 0.328.0
    • Pydantic 2.7.1 (requires pydantic_core 2.18.2)
    • Pint 0.23 (with uncertainties support)
    • Pint-Pandas 0.5 (0.6rc0 with uncertainties support)
    • python-pachyderm 7.6.0 (pinned to protobuf < 4.0.0)
    • multidict 6.0.2 (Badly-behaved python-pachyderm looks for this version rather than current latest 6.0.4)
    • pyyaml 5.4.1
    • uncertainties 3.1.7 (3.1.8? with uncertainties support)
    • boto3 (latest version)
    • osc-ingest-tools (>=0.5.2; requires some fixes to work with SQLAlchemy 2.0.x)
    • python-dotenv (latest version)
    • pyarrow (>=9.0.0)
    • fastparquet (?)
    • iam-units (latest version)
    • openscm-units (latest version)
    • pycountry (latest version)
    • openpyxl (>=3.1.2)
    • dash (latest version)
    • dash_bootstrap_components (latest version)
    • multiprocess (latest version--needed for ITR_UI.py)
    • diskcache (latest version--needed for ITR_UI.py)
    • matplotlib (>=3.7.2)
    • ydata_profiling (latest version; "pandas-profiling[notebook]" is going to be deprecated April 1st)
    • pygithub (latest version)
    • xlrd (latest version)
    • camelot-py or camelot-fork
    • opencv-python: agree on the correct one of the four install options (see https://pypi.org/project/opencv-python/)
  • rpm/yum packages

    • ghostscript (needed for camelot, but check RCE security problem)
    • tkinter (needed for camelot)
  • [dev-packages]

    • flake8 = "*"
    • coverage = "*"
    • Sphinx = "*"
    • What else is needed for general pre-commit consistency (black formatting?) and CI/CD happiness (pip audit, setup tools, pytest unit tests, etc)?
    • Can we protect AWS keys from finding their way to GitHub via a pre-commit process that forbids committing credentials.env files and/or any files that have well-known secrets labels or content? If users cannot commit bad changes, they cannot push bad commits.
  • ODH sub-packages

    • jupyterlab 4.0 (see https://blog.jupyter.org/jupyterlab-4-0-is-here-388d05e03442)
    • Trino 445 (several Hive/Iceberg and also BigQuery fixes since currently installed 398 version)
    • TileDB 2.18.5 updates from C++17 to C++20, with lots of breaking changes. Check with @joemoorhouse how TileDB may play into the 2023 PhysRisk platform.
    • Elyra 0.2.X (with fix for Custom Elyra Notebooks) vs. 0.3.13 or greater?
    • Superset 4.0.0
    • Python 3.10 (need CI/CD for 3.11 and 3.12; dataclasses.dataclass in 3.9 doesn't support slots=True)
    • Pachyderm 2.4 (?, does this belong here or in software libraries bundle below?)
  • Finalize list of required software libraries, packages for dev cluster (these are best guesses as of 2024-04-24)

    • openmetadata-ingestion 1.3.x
    • OpenShift X.X
    • Fybrik X.X
    • Inception 25.4 (?)
    • Datasette?
    • dbt-core 1.4.x (1.8.0b3 was just released, but 1.4.x is needed due to Pachyderm's protobuf pin)
    • dbt-trino (depends on dbt-core)
    • trino-utils 0.3.0 (companion to dbt-utils 1.0.0)
    • dbt-utils 1.0.0 (these are installed via dbt, not PyPI or conda)
  • Baseline a default Jupyter Notebook (update required libraries, remove unnecessary config info, etc., version 7 expected in July 2023: 7.0 Release Plan jupyter/notebook#6307)

  • Ensure documentation is accurate for data ingestion pipeline processes

    • creation of pipelines for new data sources
    • updates to pipelines for existing data sources (remove, add, change)
    • creation of new metadata
    • updates to existing metadata (remove, add, change)
  • Update the OS-Climate Data Commons Developer Guide
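
On the pre-commit question in the [dev-packages] item: a minimal sketch of what such a check could look like, assuming a plain Python script run from a git pre-commit hook. The file name set, the regular expressions, and the function names (`find_problems`, `staged_files`, `main`) are illustrative assumptions, not an existing OS-Climate tool. The sketch reads the working-tree copy of each staged file, which is a simplification.

```python
"""Hypothetical pre-commit check: reject credential files and AWS-style keys."""
import re
import subprocess
import sys
from pathlib import Path

# Assumption: file names that should never be committed.
FORBIDDEN_NAMES = {"credentials.env"}

# Assumption: patterns that suggest a secret. AWS access key IDs are
# 20 characters long and commonly start with AKIA or ASIA.
SECRET_PATTERNS = [
    re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    re.compile(r"(?i)aws_secret_access_key\s*[=:]\s*\S+"),
]


def find_problems(path, text):
    """Return a list of human-readable problems for one file."""
    problems = []
    if Path(path).name in FORBIDDEN_NAMES:
        problems.append(f"{path}: forbidden file name")
    for pattern in SECRET_PATTERNS:
        if pattern.search(text):
            problems.append(f"{path}: matches secret pattern {pattern.pattern!r}")
    return problems


def staged_files():
    """List files staged for commit (added, copied, or modified)."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def main():
    """Scan staged files; return 1 (block the commit) if anything is found."""
    problems = []
    for path in staged_files():
        try:
            # Simplification: reads the working-tree copy, not the staged blob.
            text = Path(path).read_text(errors="ignore")
        except OSError:
            text = ""
        problems.extend(find_problems(path, text))
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

A hook script would end with `sys.exit(main())` so that a non-zero return aborts the commit. A maintained secret scanner is likely a better long-term choice than hand-rolled patterns; this only shows the shape of the check.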

@HeatherAck
Contributor

HeatherAck commented Nov 28, 2022

remove highlander. Categorize and group, note installed version. Create a standard config for notebooks (default for users)

  • Architecture diagram of interdependencies/layers: call out dependencies and inheritance, @redmikhail to share starting point from Humair
    • [ ] Phase 1: Open Metadata, dbt
  • Python-related
  • Individual packages not under ODH: e.g. fybrik
  • ODH-related: (go back to default operator config - @redmikhail to check with @HumairAK to see what problems will occur if we leverage default)
    • Need to determine validation methodology and who will perform UAT (e.g. Trino connectors)
    • Test cases for each upgrade (e.g. smoke test demo with expected results) (Heather to create new issues for each workstream)
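
One possible shape for a "note installed version" smoke test, as a minimal sketch: it compares installed package versions against a table of minimums and reports each result. The entries in `MINIMUM_VERSIONS` are copied from the checklist in the issue description as an illustration; the function names are hypothetical. The version parser only looks at the leading numeric part (so `0.6rc0` is treated as `0.6`), which is a deliberate simplification; `packaging.version.Version` would handle pre-releases properly.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Illustrative subset of the checklist; adjust as the list changes.
MINIMUM_VERSIONS = {
    "pandas": "2.2.1",
    "numpy": "1.26.2",
    "SQLAlchemy": "2.0.29",
    "pyarrow": "9.0.0",
    "openpyxl": "3.1.2",
    "matplotlib": "3.7.2",
}


def parse_release(text):
    """Turn '2.0.29' or '0.6rc0' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """True if installed >= minimum, padding short versions with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums, lookup=version):
    """Return {package: (installed_or_None, minimum, ok)} for each entry."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        ok = installed is not None and meets_minimum(installed, minimum)
        report[name] = (installed, minimum, ok)
    return report
```

Calling `check_environment(MINIMUM_VERSIONS)` in a default notebook image gives a quick pass/fail table per package; the `lookup` parameter exists only so the check can be exercised without the packages installed.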

@HeatherAck
Contributor

Main reasons for separation: capacity concerns and limited functionality; how ODH treats and updates subcomponents needed updating. Need to understand where ODH is going, and treat Trino and Superset separately, as this is also being done within the ODH community.

@redmikhail to consolidate approvers - create separate team to perform PR merge; regain control

@HeatherAck
Contributor

HeatherAck commented Dec 19, 2022

@redmikhail to update list this week

Meeting held with Marcel. Operate First is shrinking; consider using a managed service from Red Hat / AWS (OpenShift specifically, with SREs). The stable cluster is a better candidate for that, but not all services are covered under the managed service (e.g. GPU usage split).

In January meeting,

  • Need to meet and discuss pros/cons on moving to ODH as primary implementation (so can get the latest Jupyter release) with overlays for specific packages (e.g. Open Metadata) versus Operate First (where certain releases lag behind latest ones).
  • Need plan for OpenShift operational support.
  • Need to understand any custom configurations required by OS-C
  • Need to understand definition of stable cluster (what is support model for applications as well as cluster itself)

Complexity of dealing with a platform (OS-Climate) on top of a platform (ODH) on top of a platform (Operate First)

@HeatherAck to schedule meeting week of 9-Jan to align on pros/cons and discuss path forward.

@MichaelTiemannOSC
Contributor Author

As an open source software project, OS-Climate provides the raw materials for users to contribute to and/or fork project elements as they see fit. If users have their own ideas about what it means to run the Data Commons within their own local environment, it should be those users doing the legwork of what that actually means, and committing the resources necessary to push patches they want to see into the upstream source code (which OS-Climate should review and potentially accept). But I don't think the OS-Climate project should try too hard to imagine and prototype those use cases itself. Rather it should help guide users to do that work for themselves.

@HeatherAck
Contributor

Need to determine pace and sizing / priorities for each element on 10-Jan, consistent developer process/implementation

@MichaelTiemannOSC
Contributor Author

I've been updating version numbers, but I want to call attention to Open Metadata release 0.13.1.3 (Jan 9th), which provides important fixes vs. 0.13.1.

@HeatherAck changed the title from "OS-Climate Dev instance cluster" to "OS-Climate - Establish minimum versions of tools / packages for Dev Cluster" on Jan 30, 2023
@HeatherAck
Contributor

@redmikhail @ryanaslett @MightyNerdEric @erikerlandson to focus on upgrading system level software (see ODH sub packages above: e.g. Trino, Jupyter, Python)

@HeatherAck
Contributor

@caldeirav - do we need to use Elyra pipeline features?
@erikerlandson / @redmikhail - to define the standard developer notebook software configs (libraries) needed. Heather to schedule a meeting with @erikerlandson; Guillaume Moutier may be able to provide support

@MichaelTiemannOSC
Contributor Author

The OpenMetadata team has been busily walking their versions forward. 0.13.2.1 was just released (see https://github.com/open-metadata/OpenMetadata/releases for info about 0.13.2).

@HeatherAck
Contributor

@ryanaslett to start investigating latest ODH version compared to installed. Prep work - manifest storage, look at operate first manifest. (week of 6-Feb)

@HeatherAck
Contributor

@ryanaslett - trying to define which ODH contributions OS-C is going to make, but still want to move forward with the core component upgrade. There is no easy way to upgrade: need to install from scratch and migrate functionality (e.g., notebooks) and access/authentication. May require different packages, such as Superset. Start with CL1: eliminate the old ODH and install the new one. Get feedback on the new version / verify it is functioning as expected.

Open question for @caldeirav: will ODH be a stable version that we will use long term? Please confirm that the Red Hat team will contribute to ODH going forward.

@HeatherAck
Contributor

Note: fork it to OS-Climate (not Operate First). Keep as a separate repo. See if ODH supports SQLAlchemy 2.0

@HeatherAck
Contributor

@ryanaslett reviewed Superset: no dependencies, only APIs. Will review the ODH components and figure out the core offering as part of ODH. Recommended next steps: (1) install new core ODH on CL1 with tier 0 components (MUST HAVE); (2) bring over JupyterHub images (verify that authentication works); (3) bring over other components after review (Trino, Open Metadata)

Projects
Status: Backlog
Development

No branches or pull requests

5 participants