Meeting Notes April 2020

Tim Randles edited this page Apr 29, 2020 · 13 revisions

Attendees - Rusty Davis (rstyd), Pat Grubel (pagrubel), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

Agenda

PR Review

  1. NONE

Issue Review

  1. #146 - Pat has a fix: check for the expected filenames and error out on anything else.
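
A minimal sketch of the kind of check described for #146. The allow-list of extensions and the function name `validate_container_filename` are assumptions for illustration; they are not taken from the BEE code.

```python
from pathlib import PurePath

# Hypothetical allow-list; the extensions BEE actually accepts are an assumption.
EXPECTED_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".sqfs")


def validate_container_filename(path: str) -> str:
    """Return the container name (file name minus extension) or raise ValueError."""
    name = PurePath(path).name
    for suffix in EXPECTED_SUFFIXES:
        # Require a non-empty stem so a bare ".tar.gz" is rejected too.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"unexpected container filename: {name!r}")
```

Failing loudly on anything outside the list keeps the worker from guessing how to unpack an unknown archive type.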

Discussion (ToDo?)

  1. Quincy Wofford starting May 18
  2. Steven starting June 15
  3. Things we need to define
    • (TM) Abstract interface to HPC resource managers (worker) (Slurm, LSF, Torque, PBS, etc.)
    • (TM) Abstract interface to container runtimes
      • (TM) Does TM call container interface to build body of job script and then pass that to the resource manager interface, or does the resource manager interface (worker) call the container interface when building a job script?
    • (WfM/TM) How to run via WSGI (Web Server Gateway Interface), usually Nginx or uWSGI
    • (WfM) Abstract interface to placement engine
    • (WfM) Resource monitor - what is it, how does it work
  4. Add ability for parser to handle return codes (https://www.commonwl.org/v1.1/CommandLineTool.html#Execution)
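
For item 4, the CWL CommandLineTool fields `successCodes`, `temporaryFailCodes`, and `permanentFailCodes` decide how an exit code is interpreted. The sketch below assumes the tool description has already been parsed into a plain dict; the function name `classify_exit_code` is hypothetical.

```python
from typing import Any, Mapping


def classify_exit_code(tool: Mapping[str, Any], exit_code: int) -> str:
    """Map a process exit code to a CWL-style status string."""
    # Explicit code lists on the tool take precedence, checked in this order.
    if exit_code in (tool.get("successCodes") or []):
        return "success"
    if exit_code in (tool.get("temporaryFailCodes") or []):
        return "temporaryFailure"
    if exit_code in (tool.get("permanentFailCodes") or []):
        return "permanentFailure"
    # Fallback when the code is not listed: zero is success, anything else fails.
    return "success" if exit_code == 0 else "permanentFailure"
```

A task status reporter could call this once per finished command and decide whether to retry (temporary failure) or stop the workflow (permanent failure).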

Around the room

  • Pat
    • container filename extensions
    • container runtime options
  • Rusty
    • pytest continuing
    • drafting document on possible issues and mitigation strategies
      • primarily between WfM and TM
  • Jake
    • BEEStart
      • looking at logging
  • Tim

Attendees - Pat Grubel (pagrubel), Qiang Guan, Al McPherson (mcpherson), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

Agenda

PR Review

  1. #150 (trandles-lanl) - Beeconfig 149
    • pagrubel will test and comment

Issue Review

  1. #139 (trandles-lanl) - BEEStart: A script to start BEE components

Discussion (ToDo?)

  1. NONE

Around the room

  • Pat
    • back on BEE this week
  • Al
    • cwltool investigations
    • waiting on VASP container from trandles
  • Qiang
    • scheduler interface
    • #143 (rstyd) - Write spec for Workflow Manager / Placement Engine communication
    • #98 (rstyd) - Develop Bee Scheduler / Placement Engine
  • Jake
    • talk to trandles-lanl about BEEStart
  • Tim

NO MEETING


Attendees - Rusty Davis (rstyd), Pat Grubel (pagrubel), Qiang Guan, Al McPherson (mcpherson), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

Agenda

PR Review

  1. NONE

Issue Review

  1. NONE

Discussion (ToDo?)

  1. FY21 ECP activities off to L3 for approval, can add more later
  2. See what OSC is using for a resource manager (qguan)
  3. Time to define logging standard for BEE (running of the system, not necessarily workflow-specific stuff)
  4. Start list of what goes into the graph database (e.g. job script as metadata on task node)
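
As a starting point for the logging standard in item 3, one option is a single helper that every component calls, so the format and logger naming stay consistent. This is only a sketch built on the standard `logging` module; the `bee.` logger prefix, the format string, and the name `get_bee_logger` are assumptions.

```python
import logging
import sys

# Hypothetical shared format: timestamp, component logger name, level, message.
LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s %(message)s"


def get_bee_logger(component: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger named 'bee.<component>' with the shared format attached."""
    logger = logging.getLogger(f"bee.{component}")
    # Attach the handler only once, even if the helper is called repeatedly.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.setLevel(level)
    # Avoid duplicate lines if the root logger is also configured.
    logger.propagate = False
    return logger
```

For example, `get_bee_logger("task_manager").info("job submitted")` would emit one line on stderr tagged with `bee.task_manager`.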

Around the room

  • Pat
    • will fix container name extension issue #146
    • CWL questions about container runtime options #148
    • Add debug in TaskManager code
  • Al
    • working on database refactor
    • waiting on trandles-lanl to get VASP container then will write CWL for Sven's workflow
    • thinking about parser strategy (maybe a hack of cwltool)
    • will email someone about cwltool
  • Qiang
    • almost finished paper describing scheduling algorithms
    • continue discussion of tasks for Jake - container integration (discuss Wednesday)
  • Jake
    • looking at issue #124 (task status reporting)
    • job-building/script-building to test individual commands in job for success/failure
  • Tim
    • wrapping up basic BEEStart to push to repo
    • planning activities with Qiang
  • Rusty
    • wrapping up pytest activities
    • refine REST APIs

Attendees - Rusty Davis (rstyd), Pat Grubel (pagrubel), Qiang Guan, Al McPherson (mcpherson), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

PR Review

  1. #143 (pagrubel) - Fix slurm unit tests
    • rstyd and trandles-lanl will run tests to confirm, if pass then approve

Issue Review

  1. #144 (trandles-lanl, mcpherson) - Create VASP Charliecloud container
    • mcpherson to review past emails with srudin and comment on issue

Discussion

  1. BEE docker image - jtronge
    • README.md on Mattermost chat describing use
    • works at Kent
    • Fedora image
    • trandles-lanl, pagrubel, mcpherson will test it, provide feedback to jtronge
  2. WoWoHa - pagrubel
    • WoWoHa 2020 cancelled
    • will be a weekly "summer seminar series" June - August 2020
    • BEE will give a talk
  3. pushing to master and public BEE repo - pagrubel
    • getting closer to public release
    • need to define criteria for first release (documentation, workflow limitations/supported CWL, etc.)
    • trandles-lanl will create milestone issue for first public release - target end of FY
  4. trandles-lanl will create issue for supporting MPI applications using Charliecloud and BEE

Around the room

  • Rusty
  • Pat
    • jtronge test PR #143
    • Issue #124 - jtronge discuss with pagrubel
  • Al
    • working on database refactor
    • chasing down VASP stuff
  • Qiang
    • tasks for jtronge
    • discuss FY activities with trandles-lanl
  • Jake
  • Tim
    • push BEEStart ASAP and let others hack on it

Attendees - Pat Grubel (pagrubel), Qiang Guan, Al McPherson (mcpherson), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

Agenda

PR Review

  1. NONE

Issue Review

  1. NONE

Discussion

  1. Thoughts on FY21 cloud milestone
    • using ORNL or Chameleon cloud for target platform

Around the room

  • Pat
    • working to get pyslurm tests running
    • using DockerRequirement from CWL
  • Al
    • getting on darwin and fog
  • Qiang & Jake
    • got examples running that were in milestone documentation
    • Jake will document his scripts and dockerfiles for setting up their test environment
    • Jake will get things running on group server
    • Qiang to send thoughts on FY21 cloud milestone
  • Tim
    • continue working on BEEStart script

Attendees - Rusty Davis (rstyd), Pat Grubel (pagrubel), Al McPherson (mcpherson), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

Agenda

  1. discuss CWL and container support
  2. TaskManager design for modular support of container runtimes and resource managers
  3. discuss proposed FY21 ECP P6 Activities
    • BEE- FY21 P6-1 Develop the ability to archive, clone, and re-run workflows (start 10/01/20, due 3/31/21)
    • BEE- FY21 P6-2 Run BEE jobs on private cloud infrastructure (due 9/30/21)
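
Agenda item 2 (modular TaskManager support for container runtimes and resource managers) could be approached with two small abstract interfaces. The sketch below shows one possible arrangement, in which the TaskManager asks the container runtime for the job body and hands it to the resource manager. All class and function names are hypothetical; only the `ch-run IMAGE -- CMD` form and the `#SBATCH --job-name` directive come from Charliecloud and Slurm themselves.

```python
import shlex
from abc import ABC, abstractmethod
from typing import List


class ContainerRuntime(ABC):
    """Hypothetical interface: turn a command into a containerized command line."""

    @abstractmethod
    def wrap_command(self, image: str, command: List[str]) -> str: ...


class ResourceManager(ABC):
    """Hypothetical interface: wrap a job body in a scheduler-specific script."""

    @abstractmethod
    def build_script(self, job_name: str, body: str) -> str: ...


class CharliecloudRuntime(ContainerRuntime):
    def wrap_command(self, image: str, command: List[str]) -> str:
        quoted = " ".join(shlex.quote(part) for part in command)
        return f"ch-run {shlex.quote(image)} -- {quoted}"


class SlurmManager(ResourceManager):
    def build_script(self, job_name: str, body: str) -> str:
        return "\n".join(["#!/bin/bash", f"#SBATCH --job-name={job_name}", body, ""])


def build_job_script(runtime: ContainerRuntime, manager: ResourceManager,
                     job_name: str, image: str, command: List[str]) -> str:
    """The TaskManager composes the two interfaces; neither knows about the other."""
    body = runtime.wrap_command(image, command)
    return manager.build_script(job_name, body)
```

Keeping the composition in the TaskManager means a new runtime or scheduler only has to implement one method. The alternative, in which the worker calls the container interface itself, would couple each resource manager class to the runtime API.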

PR Review

  1. NONE

Issue Review

  1. NONE

Discussion

  1. APPROVED April 6, 2020 meeting notes
  2. TaskManager discussion mostly shelved for now, revisit next week
  3. CWL support for containers
  4. FY21 ECP Activities are documented at
    • Tim starting on design document for the activities

Around the room

  • Jake
    • neo4j issues (Task already exists)
    • Rusty knows how to fix it
    • close to being able to run test workflows
  • Rusty
    • starting test work
    • looking at pytest for integration testing
    • maybe pexpect for client testing
    • Flask has some testing framework (Jake)
    • BEE should start a document of what CWL is supported by project
  • Pat
    • question for Rusty about passing Task object to worker from TaskManager
    • will need to think about how to pass things around when there's more data (requirements and hints)
  • Al
    • refactoring database and building new API to it
      • no way to version python APIs
      • API changes only affect WorkflowManager
    • next use case CWL example
      • maybe BLAST workflow again
      • keep scope of parsing to HPC use cases, not "generic everything CWL"
      • Do srudin VASP workflow (parameter study) #66

Action Items

  1. Tim - get VASP containers that work with Charliecloud (Power9, x86_64)

Attendees - Rusty Davis (rstyd), Pat Grubel (pagrubel), Qiang Guan (guanxyz), Tim Randles (trandles-lanl), Jake Tronge (jtronge)

PR Review

  1. #138 APPROVED (trandles-lanl) - Use bee.conf to configure listen ports for BEEWorkflowManager and BEETaskManager
    • Pat approves of merging this PR, but into master instead of develop. The rationale is the functionality is simple and enables everyone to do development work at the same time on the same system.
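
To illustrate the idea behind #138, the sketch below reads per-component listen ports from an INI-style file with the standard `configparser` module, so each developer on a shared system can choose their own ports. The section names, the `listen_port` key, and the sample values are assumptions, not the actual bee.conf layout.

```python
import configparser

# Hypothetical bee.conf contents; real section and key names may differ.
SAMPLE_CONF = """
[workflow_manager]
listen_port = 5000

[task_manager]
listen_port = 5050
"""


def read_listen_port(conf_text: str, section: str, default: int) -> int:
    """Return the configured port for a component, or the default if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback covers both a missing section and a missing key.
    return parser.getint(section, "listen_port", fallback=default)
```

In practice the text would come from each user's own config file rather than an inline string.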

Issue Review

  1. #137 (pagrubel) - Slurm worker to properly check DockerRequirement
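
One way a worker might check for a DockerRequirement is sketched below. It assumes the task carries CWL `requirements` and `hints` as plain Python data, in either the list form (entries with a `class` key) or the map form (keyed by class name), and it handles only `dockerPull`. The function names are hypothetical.

```python
from typing import Any, Dict, Optional


def _find_class(entries: Any, class_name: str) -> Optional[Dict[str, Any]]:
    """Look up a requirement by class in either CWL list form or map form."""
    if isinstance(entries, dict):
        return entries.get(class_name)
    for entry in entries or []:
        if entry.get("class") == class_name:
            return entry
    return None


def docker_image(task: Dict[str, Any]) -> Optional[str]:
    """Return the dockerPull image for a task, or None if no container is requested."""
    # requirements take precedence over hints.
    req = _find_class(task.get("requirements"), "DockerRequirement")
    if req is None:
        req = _find_class(task.get("hints"), "DockerRequirement")
    if req is None:
        return None
    image = req.get("dockerPull")
    if not image:
        # This sketch does not handle dockerLoad, dockerFile, or dockerImport.
        raise ValueError("DockerRequirement without dockerPull is not handled here")
    return image
```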

Discussion

  1. extending CWL for other container runtimes (rstyd)
    • discuss on Wednesday
    • guanxyz had some ideas
  2. next ECP milestones up on wiki

Around the room

  • Jake
    • got a test environment set up at KSU
    • initial problems with PySlurm due to having a too-new Slurm installed
  • Rusty
    • working on unittest and CI tests for client/WorkflowManager
    • not a lot of time for BEE this week (very understandable, everyone prioritized BEE the past 2 weeks (trandles-lanl))
  • Pat
    • unittest for TaskManager
    • issue #137 above
    • not much time for BEE this week
  • Tim
    • issue #139 planning to discuss on Wednesday
    • ECP milestone housekeeping