Cloudify reV #335

Closed
MRossol opened this issue Oct 20, 2021 · 3 comments
Labels
feature New feature or request

MRossol commented Oct 20, 2021

Why this feature is necessary:
Enable the reV team and general public to run reV at scale in the cloud using OEDI datasets as inputs.

A possible solution is:
Current cloud readiness of reV's modules:

  • Gen: Cloud ready using HSDS and WTK/NSRDB data in OEDI (see the HSDS read sketch after this list)
  • Econ: Cloud ready using HSDS and WTK/NSRDB data in OEDI
  • Aggregation: Needs exclusion files to be loaded into HSDS; the plan is to load the FY20 and FY21 final inclusion layers into OEDI before 10/29. Complex aggregation schemes would require all exclusion layers to be loaded into HSDS/OEDI.
  • Supply Curve: For internal runs we can use the transmission (xmission) tables we've generated. For the public we need to determine:
    a) whether they can use our xmission tables, or whether we can "anonymize" our xmission tables for public use; otherwise
    b) we need to create a lower-fidelity, open-source set of xmission tables
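
For reference, "cloud ready using HSDS" means the resource reads go through the HSDS service instead of a local POSIX file. Below is a minimal sketch with h5pyd; the OEDI domain path and scale-factor attribute follow the public NREL hsds-examples conventions and are assumptions here, not reV internals.

```python
# Minimal sketch: read NSRDB data from OEDI through HSDS with h5pyd.
# Assumes the HSDS endpoint/credentials are configured (e.g. via ~/.hscfg)
# and that the domain path and scale-factor attribute match the OEDI layout.
import h5pyd

with h5pyd.File("/nrel/nsrdb/v3/nsrdb_2013.h5", mode="r") as f:
    meta = f["meta"][0:100]           # site metadata (lat/lon, state, ...)
    ghi = f["ghi"][:, 0:100]          # GHI time series for the first 100 sites
    ghi = ghi / f["ghi"].attrs["psm_scale_factor"]  # un-scale to W/m^2
```

The WTK domains can be read the same way for Gen wind runs; only the domain path and dataset names change.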

Potential performance issues:

  • HSDS currently has very limited throughput in terms of the number of concurrent requests, which will be an issue for generation. Fortunately, we hope to ameliorate this by moving HSDS to Lambda. That should be done by the end of FY21 Q1 and should enable nearly infinite scalability in terms of parallel requests. To take full advantage of it, it could be useful to implement parallel gets for list slices in rex (rex#111); a sketch of the idea follows this list.
  • Using gen/econ .h5 files in aggregation and supply curve. Two solutions:
    1. Move the files from S3 to local storage (not needed with LFS below)
    2. Implement cloud_fs/s3fs in rex: add the ability to use cloud_fs/s3fs along with HSDS (rex#112); see the s3fs sketch after this list
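
On the throughput point, the rex#111 idea is roughly to split one large list-slice request into many smaller requests and issue them concurrently, so the Lambda-backed HSDS can fan out. A hedged sketch of that pattern is below; the domain path, dataset name, and chunk size are illustrative, and this is not rex's actual implementation.

```python
# Hedged sketch of "parallel gets for list slices" (rex#111): chunk a large
# site-index request and fetch the chunks concurrently over HSDS.
from concurrent.futures import ThreadPoolExecutor

import h5pyd
import numpy as np


def parallel_get(domain, dset, sites, chunk_size=500, max_workers=8):
    """Fetch dset[:, sites] as parallel chunked HSDS requests."""
    chunks = [sites[i:i + chunk_size] for i in range(0, len(sites), chunk_size)]

    def _get(chunk):
        # one File handle per request keeps the workers independent
        with h5pyd.File(domain, mode="r") as f:
            return f[dset][:, chunk]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        out = list(pool.map(_get, chunks))

    return np.hstack(out)


# e.g. 100m windspeed for a scattered set of WTK sites (illustrative domain):
# data = parallel_get("/nrel/wtk/conus/wtk_conus_2013.h5", "windspeed_100m",
#                     list(range(0, 50000, 10)))
```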
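
For option 2 (rex#112), the basic pattern for reading a gen/econ .h5 straight out of S3 with s3fs looks like the sketch below; the bucket, key, and dataset name are hypothetical placeholders, and this is not the cloud_fs API.

```python
# Hedged sketch: open a reV gen/econ output on S3 without copying it locally,
# by handing an s3fs file-like object to h5py.
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=False)  # uses the caller's AWS credentials
with fs.open("s3://my-rev-bucket/outputs/rev_gen_2013.h5", "rb") as s3_obj:
    with h5py.File(s3_obj, mode="r") as f:
        cf_mean = f["cf_mean"][...]  # e.g. mean capacity factor per site
```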

Cloud hardware / submission ideas:

  • The simplest solution would be to use AWS's HPC-like infrastructure: either ParallelCluster or Batch.
    • Pros:
      • Can use the SLURM infrastructure in reV/rex
      • Has a Lustre filesystem (LFS) attached that syncs to S3 to support writing outputs
    • Cons:
      • More complicated/expensive than using a single EC2 instance
  • A more flexible solution would be to dockerize reV and integrate cloud_fs to handle the transfer of output files from compute storage to S3 (see the output-sync sketch after this list).
    • Pros:
      • Runnable on nearly any hardware (EC2, ECS, Kubernetes, Lambda)
      • Likely cheaper
    • Cons:
      • Need to consider attached storage for writing outputs
      • Would need a third party solution to launch complicated (multi-node/job) runs
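
For the dockerized option, the output-transfer piece could be as small as a post-run sync from the container's scratch space to S3. The sketch below uses boto3 as a stand-in for cloud_fs; the bucket, prefix, and output directory are hypothetical placeholders.

```python
# Hedged sketch: push reV outputs from local/ephemeral storage to S3 after a
# containerized run finishes. boto3 stands in for cloud_fs here.
from pathlib import Path

import boto3


def push_outputs(out_dir, bucket, prefix):
    """Upload all .h5 outputs from a local run directory to S3."""
    s3 = boto3.client("s3")
    for path in Path(out_dir).glob("*.h5"):
        s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")


# e.g. at the end of the container entrypoint, after the reV module runs:
# push_outputs("/tmp/rev_outputs", "my-rev-bucket", "runs/fy22/gen")
```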

Charge code
reV

Urgency / Timeframe
FY22 design Doc

@grantbuster commented:

Notes and stuff:
Configure AWS HPC: https://www.hpcworkshops.com/03-hpc-aws-parallelcluster-workshop/04-configure-pc.html
SSH into the login node and squeue access: https://www.hpcworkshops.com/03-hpc-aws-parallelcluster-workshop/07-logon-pc.html
sbatch commands just like Eagle: https://www.hpcworkshops.com/03-hpc-aws-parallelcluster-workshop/08-run-1stjob.html
High-performance file storage and transfer to S3 (not even really sure this is necessary if reading from S3/HSDS?): https://www.hpcworkshops.com/04-amazon-fsx-for-lustre.html

In this lab, the cluster has 0 compute nodes when starting and maximum size set to 8 instances. AWS ParallelCluster will grow and shrink between the min and max limits based on the cluster utilization and job queue backlog.

A GP2 Amazon EBS volume will be attached to the head-node then shared through NFS to be mounted by the compute nodes on /shared. It is generally a good location to store applications or scripts. Keep in mind that the /home directory is shared on NFS as well.

SLURM will be used as a job scheduler

@grantbuster commented:

PR #339

@grantbuster commented:

Implemented here: https://github.com/NREL/reV/tree/main/examples/aws_pcluster

Still needs work to upload data for exclusions / transmission costs, but there are no technical barriers to that.
