OS-Climate - Establish the base software versions and tools for Stable instance cluster #98

Closed
redmikhail opened this issue Oct 31, 2021 · 18 comments

@redmikhail

redmikhail commented Oct 31, 2021

As the OS-Climate environment system administrator, I need an environment where I can test configuration changes and new application versions without interrupting the work of data scientists.

Proposal:

Provision a new OpenShift cluster for the OS-Climate project that will become the production environment, while keeping the existing cluster at scaled-down capacity for testing new frameworks, components, and configuration changes.

Some of the proposed requirements:

  • Should include separate infrastructure nodes to host monitoring/alerting and authentication workloads
  • The OCP cluster needs to be provisioned in availability zones that provide GPU-enabled EC2 instances (p3 instances)
  • Nodes hosting Jupyter notebooks should have a taint to prevent any other workload from being scheduled on these nodes. Notebooks that require GPU resources will need a toleration for GPU-enabled taints
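
To illustrate the taint/toleration requirement, here is a minimal Python sketch of how the scheduler decides whether a pod may land on a tainted node. It is a simplified model of the Kubernetes matching rules, not the scheduler's actual code, and the taint keys and values (`workload=jupyter-notebook`, `gpu=true`) are hypothetical names chosen for illustration, not ones agreed on in this issue.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Taint effects that keep a pod off a node unless the pod tolerates the taint.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"        # "Equal" or "Exists"
    value: str = ""
    effect: Optional[str] = None   # None means "tolerate any effect"


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint (simplified rules)."""
    if toleration.effect is not None and toleration.effect != taint.effect:
        return False
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value


def can_schedule(node_taints: Iterable[Taint],
                 pod_tolerations: Iterable[Toleration]) -> bool:
    """A pod fits only if every blocking taint on the node is tolerated."""
    tolerations = list(pod_tolerations)
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in node_taints
        if taint.effect in BLOCKING_EFFECTS
    )


# Hypothetical taint keys and values, for illustration only.
NOTEBOOK_TAINT = Taint("workload", "jupyter-notebook", "NoSchedule")
GPU_TAINT = Taint("gpu", "true", "NoSchedule")

notebook_node = [NOTEBOOK_TAINT]
gpu_notebook_node = [NOTEBOOK_TAINT, GPU_TAINT]

notebook_pod = [Toleration("workload", "Equal", "jupyter-notebook", "NoSchedule")]
gpu_notebook_pod = notebook_pod + [Toleration("gpu", "Exists")]
monitoring_pod = []  # no tolerations, so it stays off notebook nodes

if __name__ == "__main__":
    print(can_schedule(notebook_node, monitoring_pod))
    print(can_schedule(notebook_node, notebook_pod))
    print(can_schedule(gpu_notebook_node, notebook_pod))
    print(can_schedule(gpu_notebook_node, gpu_notebook_pod))
```

Note that a toleration only permits scheduling onto a tainted node; it does not attract the pod there. Notebook pods would likely also need a node selector or node affinity so that they actually end up on the notebook nodes.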
@erikerlandson
Contributor

cc @MichaelTiemannOSC @caldeirav

@durandom

durandom commented Dec 1, 2021

What's the actual purpose here? From what I read, we want a test env and to declare the existing one stable?

@redmikhail
Author

@durandom We want to have a test/staging environment and a stable/production environment for data scientists. The new cluster will become stable/production and the existing cluster will become dev/staging. We will probably need to rebuild the dev cluster eventually to reflect any changes in the configuration and processes used to provision the production environment.

@caldeirav
Contributor

@redmikhail I would actually recommend having two new clusters for DEV and PROD, leaving the existing MVP cluster running until we confirm that the new clusters work (starting with DEV), and retiring it once confirmed. The reason is that we want a clean transition to a new environment (with a new address), including starting with clean storage volumes. This was discussed in our OS-C call yesterday and would make it easier to manage the change without impacting developers who are still on the existing cluster.

@redmikhail
Author

@caldeirav Sorry, I was under the impression that we wanted to re-use the existing cluster as dev. I will add it to the tasks.

@caldeirav
Contributor

@redmikhail We would like to move forward with the creation of the production cluster now, before upgrading the Trino component on DEV (i.e. initially the two clusters will be in sync, then we will start validating changes in Dev before promoting them to Prod). Do we have the required resources to create Prod? Please let us know if there are any dependencies.

@eoriorda changed the title from "OS-Climate production cluster" to "OS-Climate Stable instance cluster" on Apr 11, 2022
@eoriorda

Calculate the quota increase and look at needs. Airbus will decide whether to move based on capacity, requirements, and access control. Airbus is still looking at the components and how they are deployed; this is internal work right now, and they are not ready to run models. When they are ready to run multiple models, they will need to move to stable. Decisions on the "to-be" stable environment are still being made: is it for exclusive use by Airbus, or is it shared? The target is to create a stable environment for Data Commons. The Airbus use case for the stable environment is to be determined. (Mikhail's update)

@eoriorda

Discuss with Heather how to transition this to LF team. Eileen to discuss with Heather.

@HeatherAck
Contributor

Hi @eoriorda, we should include @rynofinn and @MightyNerdEric in the planning. If the Red Hat team can provide documentation on how the environment is set up, we can determine what knowledge transfer is needed to support the transition.

@eoriorda

@HeatherAck Develop a plan for the stable instance environment.
When will Airbus be fully operational? PRR would be next up to migrate to the stable instance.

@HeatherAck
Contributor

@HeatherAck to follow up with Matthieu; dependent on Superset and tool versions

@HeatherAck
Contributor

Task on hold until we have stable Superset/tool versions; we need a final snapshot from Airbus as well. Next step: create a plan and checklist along with criteria. Heather to schedule a meeting with @caldeirav, @MichaelTiemannOSC, @redmikhail and @sostrades-matthieu-meaux

@HeatherAck
Contributor

CL2 will serve as the baseline for the stable cluster, but it is still too early to do so: we need libraries included in the notebook (pre-reqs for Pachyderm), info removed (old version of Python), open metadata issues resolved, Trino upgrades, etc. @HeatherAck to create a pre-req checklist for the stable cluster

@HeatherAck
Contributor

  • Finalize the list of required software libraries and packages for the stable cluster
    - [ ] OpenMetadata X.X
    - [ ] Trino XXX
    - [ ] OpenShift X.X
    - [ ] Pachyderm X.X
    - [ ] Fybrik X.X
    - [ ] Superset X.X
    - [ ] Inception X.X
    - [ ] Python X.X
    - [ ] Datasette?
    - [ ] DBT X.X
    - [ ] Pandas X.X
    - [ ] SQLAlchemy X.X
    - [ ] Trinodb X.X
    - [ ]
  • Upgrade Trino to XXX (XXX addresses the Google BigQuery issue)
  • Baseline a default Jupyter Notebook (update required libraries, remove unnecessary config info, etc.)
  • Ensure documentation is accurate for data ingestion pipeline processes
    - [ ] creation of pipelines for new data sources
    - [ ] updates to pipelines for existing data sources (remove, add, change)
    - [ ] creation of new metadata
    - [ ] updates to existing metadata (remove, add, change)
  • Update the OS-Climate Data Commons Developer Guide
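
As a possible aid for filling in the X.X placeholders above, here is a minimal sketch that reports which versions of the Python-side libraries are installed in a notebook image. The distribution names in `BASELINE_PACKAGES` are assumptions about how the checklist items map to PyPI packages and would need to be confirmed; the non-Python components (OpenShift, Pachyderm, Trino server, etc.) are out of scope for this check.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed PyPI distribution names for the Python items on the checklist.
# These mappings are hypothetical and should be verified before use.
BASELINE_PACKAGES = ["pandas", "SQLAlchemy", "trino", "dbt-core"]


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(BASELINE_PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the current default notebook image would give a starting snapshot to compare against whatever versions are finally agreed on for the stable cluster.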

@HeatherAck
Contributor

@HeatherAck to work with @erikerlandson to finalize list

@HeatherAck
Contributor

@HeatherAck to follow up with @erikerlandson on versions, etc

@HeatherAck
Contributor

see also #234

@HeatherAck changed the title from "OS-Climate Stable instance cluster" to "OS-Climate - Establish the base software versions and tools for Stable instance cluster" on Jan 30, 2023
@HeatherAck
Contributor

Closing as a duplicate of #234
