zoom 20201127

Zoom meeting 27th November

20201127 16:00

Nigel, Stelios, Dave

Nigel has developed a working distributed ML example using a random forest classifier.

This is a good example of the kind of thing our users will want to do. It complements the HDBscan example because this one is distributed and HDBscan is single node.
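
For reference, a minimal sketch of what such a distributed random forest classifier might look like in a PySpark notebook; the file path, column names and parameters are illustrative placeholders, not Nigel's actual notebook:

```python
# Minimal PySpark sketch of a distributed random forest classifier.
# Path, column names and parameters are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-example").getOrCreate()

# Hypothetical Parquet table of sources with a precomputed integer label column.
df = spark.read.parquet("/data/gaia/training_sample.parquet")

feature_cols = ["parallax", "pmra", "pmdec", "phot_g_mean_mag"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(df).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)

# Training is distributed across the Spark executors on the cluster.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)

predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print("test accuracy:", evaluator.evaluate(predictions))
```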

Two ways to take this further:

  1. Deploy a larger cluster, enabling Nigel to work with the full-size dataset.
  2. Experiment with scaling the cluster size (cpu, memory, disc) to find the minimum resources needed to work with the full dataset, completing the process in roughly 10 minutes.

The immediate priority is accessible disc space for Nigel to import and process new datasets.

  1. Blocked by issue #227, "Ceph shares not visible from Openstack 'test' project".
  2. Fixed by PR #228, "20201117 zrq hadoop yarn".
  3. This is blocking work to prepare for Gaia EDR3, which will be available next Thursday.

Current systems rely on manually entered usernames and passwords in the Zeppelin config. Next stage of work is to integrate Zeppelin and Drupal user accounts to provide on-demand account creation with editable properties.

  1. Integrate Zeppelin and Drupal user accounts
  2. Integrate Drupal and IRIS IAM OAuth accounts
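
For context, the manual approach corresponds to static entries in Zeppelin's Shiro configuration (conf/shiro.ini); a minimal sketch with placeholder usernames, passwords and roles:

```ini
# conf/shiro.ini -- accounts are maintained by hand, one line per user.
# Names, passwords and roles below are placeholders for illustration only.
[users]
nigel = changeme, user
stelios = changeme, admin

[roles]
admin = *
user = *
```

The account integration work aims to replace this static list with accounts created on demand from the Drupal side, with editable properties.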

Targets for public release:

  1. Suggestion from Nick Walton - run a small invite-only workshop in Q1 2021, working interactively with users to solve issues as they develop their notebooks.
  2. Suggestion from Nigel - public release at the National Astronomy Meeting in July 2021.

Nigel reported on a meeting with a colleague from ESAC who has developed a Java library that can read Gaia GBIN files into Spark, making the bulk Gaia data available to ML algorithms. It was developed for the Gaia validation team on a bare-metal Spark deployment.
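
The notes do not name the library or its API, so the sketch below is purely hypothetical: it assumes the reader is packaged as a Spark data source registered under a "gbin" format name, which may not match the real interface.

```python
# Hypothetical usage sketch: reading Gaia GBIN files into a Spark DataFrame
# through a custom data source. The JAR path and the "gbin" format name are
# assumptions for illustration, not the ESAC library's real API.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gbin-import")
    # a custom reader would typically be shipped as a JAR on the classpath
    .config("spark.jars", "/opt/libs/gbin-reader.jar")
    .getOrCreate()
)

sources = spark.read.format("gbin").load("/data/gaia/gbin/")
sources.printSchema()
print("rows:", sources.count())
```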

Tasks for the next week:

  1. stv - Merge PRs #225 and #228 to bring the separate copies of the Ansible Zeppelin-Hadoop-Yarn deployment together into a shared version.
  2. stv - Delete everything from the gaia_dev Openstack project and deploy a system large enough for Nigel to work with, including SSH access to a shared directory for importing new data.
  3. stv - Delete everything from the gaia_prod Openstack project and use it to experiment with notebook 2FRPC4BFS, finding the minimum resources needed to handle the full dataset in ~10 minutes.
  4. nch - Waiting for the new cluster on gaia_dev to import additional datasets and prepare for Gaia EDR3.
  5. zrq - Using gaia_test to experiment with integrating Zeppelin and Drupal user accounts and IRIS IAM OAuth.