
Minimum Viable Product for our First Release


Minimum Viable Product - First Release (December 2020)

See issue #130

The following describes what we define as an MVP (Minimum Viable Product) for our first release, which is due in December 2020.

The main purposes of defining this are that it will help us:

  • Ensure that we deliver a working service by the set deadline
  • Minimize distractions that may move critical tasks to later dates
  • Deliver our service early, so that it can be exercised by our test users and we can discover possible usability issues; early beta-testing feedback may alter later tasks and goals.
  • Decide on a critical path, meaning a set of tasks that will lead to our MVP. Anything not on our critical path can still be a target that we want to achieve, but can be considered an enhancement to the MVP, rather than a task that is required for us to deliver the basic working service.

In defining an MVP for our service, we need to make sure we separate what we "want" it to be able to do from what we "need" it to be able to do.

Ideally the final service will be much more than what we define here as minimal, but the path to get there starts with building that minimal service and then iteratively adding functionality and features, both user-facing and service-supporting ones.

We define our MVP as a working service that is reliable, usable, properly engineered and documented, and easy to redeploy.


MVP Definition:

Our MVP is an operational, reasonably scalable and stable Spark cluster with a Zeppelin user interface connected to it, which allows users to utilize our cluster resources to access and query Gaia data using notebooks.

User Interface Features:

The following describes the type of actions a user will be able to take when using our service:

  • Register to gain access
  • Log in using OAuth or service-specific credentials
  • Spawn a "base/minimal" Spark + Zeppelin service. (This will be done using a custom Service portal)
  • Book a Spark + Zeppelin service of a specific shape/size for a timeframe (using a calendar-like interface)
  • Select from an existing set of example notebooks that demonstrate how to access and explore Gaia data using PySpark
  • Run Python / PySpark / Spark notebooks that access, process & filter Gaia data in a distributed way (see the sketch after this list)
  • Visualize results using dynamic tables & plots
  • Download query data (CSV, ...)
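
As an illustration of the kind of notebook cell this implies, here is a minimal PySpark sketch. The table name gaia_dr2.gaia_source, the column selection and the output path are assumptions about how the data will be exposed, not a confirmed schema; the spark session is assumed to be the one provided by the Zeppelin interpreter.

```python
# Minimal sketch of a user notebook cell, assuming the Gaia DR2 source table is
# exposed to Spark under the hypothetical name "gaia_dr2.gaia_source" and that a
# SparkSession called `spark` is already provided by the Zeppelin interpreter.

# Filter bright sources with good parallaxes, in a distributed way.
bright = (
    spark.table("gaia_dr2.gaia_source")
         .filter("phot_g_mean_mag < 12 AND parallax_over_error > 10")
         .select("source_id", "ra", "dec", "parallax", "phot_g_mean_mag")
)

# Pull a small sample back for dynamic tables / plots in the notebook.
bright.limit(1000).toPandas().head()

# "Download query data": write the filtered result out as CSV (path is illustrative).
bright.write.mode("overwrite").csv("/user/scratch/bright_sources", header=True)
```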

Non-Functional Features

  • All standard features that come with a Zeppelin install (latest version) will be available
  • Our service will provide a set of Python libraries that are still to be determined (e.g. pandas, hdbscan, matplotlib)
  • The base image will be available to all users without booking. Available != Guaranteed.
  • If no resources are available, the user will be shown a notification such as "You are currently at queue position 15", or alternatively simply a friendly message saying that there are no available resources (a sketch of this message logic follows this list)
    • the message text implies a queueing system [dm]
  • Reserved services are guaranteed for the given timeframe (with the exception of service failures, which should be handled gracefully)
    • 'reserved services' implies a booking system to reserve them [dm]
  • Users can move data from intermediate to persistent storage.
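
As a purely illustrative sketch of the queue notification mentioned above: the queueing/booking system does not exist yet, so the function name and its input below are hypothetical, not an agreed design.

```python
# Hypothetical sketch of the user-facing resource message; `queue_position`
# would come from whatever queueing/booking system we eventually build.
from typing import Optional

def resource_message(queue_position: Optional[int]) -> str:
    """Return the friendly message shown to a user about resource availability."""
    if queue_position is None:
        return "Sorry, there are no resources available right now. Please try again later."
    if queue_position == 0:
        return "Your Spark + Zeppelin service is ready."
    return f"All resources are in use. You are currently at queue position {queue_position}."
```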

Here we expand on some of the non-functional requirements for our MVP such as storage, performance and scalability:

Persistent Storage

User notebooks will be stored permanently (unless deleted) in a persistent data store that is still to be determined. User accounts will be assigned a quota for persistent storage that cannot be exceeded. Intermediate data produced when running individual cells in Zeppelin will be stored temporarily for a given amount of time. When this intermediate data is cleared, the service will also clear and reset the notebook status, so that we avoid any issues with missing intermediate Spark/Zeppelin data.
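
A minimal sketch of the quota check this implies, assuming notebooks are kept on a plain filesystem with one directory per user; the layout and the 10 GB figure are placeholders, since the real data store and quota are still to be decided.

```python
# Sketch of a per-user persistent storage quota check. The 10 GB quota and the
# one-directory-per-user layout are placeholder assumptions, not agreed design.
from pathlib import Path

QUOTA_BYTES = 10 * 1024**3  # placeholder: 10 GB per user

def usage_bytes(user_dir: Path) -> int:
    """Total size of everything in a user's persistent storage area."""
    return sum(f.stat().st_size for f in user_dir.rglob("*") if f.is_file())

def can_store(user_dir: Path, new_object_bytes: int) -> bool:
    """True if saving another object would keep the user within quota."""
    return usage_bytes(user_dir) + new_object_bytes <= QUOTA_BYTES
```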

Scalability

The service will support concurrent Spark jobs from multiple users (number to be decided; 10?) through a resource booking system, a queue, or potentially a combination of the two, i.e. allow any user to run a query on a minimal resource allocation (a small cluster), while also allowing registered users to book resources for a given timeframe (see the configuration sketch below). If the service fails, it should fail gracefully; for example, when no resources are available, this should be explained to the user in a friendly manner. As the volume of science data, and potentially the number of users, is expected to grow, we should build the system so that it scales horizontally by adding worker nodes as much as possible.

  • "allow users to run immediately" doesn't require a queue, just a single click short cut to book resources and use them now [dm]
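
One way to express "a minimal resource allocation per user" is through Spark's standard dynamic-allocation settings, as in the sketch below; the specific numbers (executor counts, cores, memory) are placeholders rather than agreed limits.

```python
# Sketch of per-user Spark settings for a shared cluster, using standard Spark
# configuration properties; the numeric limits are placeholders to be agreed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("gaia-notebook-session")
        .config("spark.dynamicAllocation.enabled", "true")    # grow/shrink with load
        .config("spark.dynamicAllocation.minExecutors", "1")
        .config("spark.dynamicAllocation.maxExecutors", "4")  # cap one user's share
        .config("spark.shuffle.service.enabled", "true")      # needed for dynamic allocation
        .config("spark.executor.cores", "2")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
)
```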

Performance

The performance of our service will depend on each particular Spark job, but we will define a set of example notebooks as benchmark tests, each with a maximum runtime, and will expect the production service to run all of them below this maximum value, regardless of the number of concurrent users.
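
A sketch of how such a benchmark check could look; the query and the 300-second limit are placeholders until the benchmark notebooks and their targets are defined.

```python
# Sketch of one benchmark run: time a representative query and compare it to an
# agreed maximum runtime. The limit below is a placeholder, not an agreed target.
import time

MAX_RUNTIME_SECONDS = 300  # placeholder target for this benchmark

def run_benchmark(spark, query: str, max_seconds: float = MAX_RUNTIME_SECONDS) -> float:
    """Run one benchmark query and report whether it met its runtime target."""
    start = time.monotonic()
    spark.sql(query).count()  # force full execution of the query
    elapsed = time.monotonic() - start
    status = "OK" if elapsed <= max_seconds else "TOO SLOW"
    print(f"{status}: {elapsed:.1f}s (limit {max_seconds:.0f}s)")
    return elapsed
```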

Data:

We will be storing and providing access to Gaia EDR3 and Gaia DR2. Access to this data will be reasonably fast, depending on the type of query and the selection criteria.


Features & Targets not included in our MVP

  • Github Integration (Import from/Export to/Notebook version control)
  • Scripted Access to Cluster
  • Groups & Sharing
  • Full Optimization & Optimal Performance. (We can still have an MVP without finding the optimal performance settings for our cluster setup)
  • Others..

Tasks not covered in MVP

In this page we do not cover the set of tasks and deliverables that describe how we build and deploy this service. For example, our current project plan is to build an easily deployable and configurable service, deployed on OpenStack & Kubernetes using scripts and configuration files, together with a continuous integration system that runs our benchmark queries in an automated way. These tasks may be required for us to achieve MVP goals such as scalability and performance, but they are not part of our definition of the MVP; they define how we achieve it, not what we will produce. It is still important, however, that we decide which of these are on the critical path that leads to our MVP, and that we complete them in the right order.

Questions to be decided

As this is a first draft of an MVP, some of the above may be subject to discussion and features mentioned may be non-critical. Some questions that we need to think about:

Can we have a working service that meets our target goals without:

  • A guarantee of resource availability? We can have a working service without guaranteed availability
    • What does that mean exactly? [dm]
  • Allowing any registered user to spawn a notebook and run Spark jobs without reserving ahead of time? No, we can't have a working service without that
    • double negative means we do need this? [dm]
  • A reservation system? To be determined. Not required if we can provide an alternative way of queueing jobs and feeding this information back to the user. We may still build one early on as a prototype to evaluate
    • We still need a detailed description of the user experience the queuing system would provide [dm]
  • OAuth integration? Not required for the MVP; we would still like it if all goes well and the MVP is finished early.
  • A guarantee of performance (i.e. example jobs may run longer than our expected maximum)? No guarantee
  • Allowing automatic registration and access (i.e. we manually create accounts)? Manually setting up accounts would be sufficient

What would meet our users' expectations in terms of:

  • Performance (e.g. a Spark job that filters DR2 on a set of parameters and returns 20 million rows should return within 5 minutes)? No performance targets will be set
  • How our service behaves with incorrect Spark usage (e.g. trying to fetch all of DR2 into the driver node)? The system should be as fault-tolerant as possible; incorrect Spark usage from one job shouldn't bring down the whole system (see the configuration sketch after this list)
  • How difficult is it to start creating jobs immediately? Registered users should be able to start creating jobs immediately
  • How quickly after they log in should we have a cluster available? Immediately
  • How long would they be willing to wait for the results of a job? Overnight is a reasonable wait
    • Do these requirements make sense together? [dm]
    • "Immediately start creating jobs"
    • "Overnight is a reasonable wait for results"
  • Would they expect us to support Scala Spark as much as PySpark (i.e. installing supporting libraries)? PySpark is all we are promising for an MVP
  • How much data do they want to store? 10 GB? Needs to be further discussed
  • How and in what formats would they like to be able to download data? Some method of getting data off the system is required, but the format is less important
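
As one concrete example of the fault-tolerance point above, Spark's standard spark.driver.maxResultSize property caps how much data a single collect() can pull back to the driver, so an oversized fetch fails in that one job rather than exhausting the driver for everyone; the 2g limit below is a placeholder value.

```python
# Example guard against "fetch all of DR2 into the driver": cap the size of
# results returned to the driver. The 2g limit is a placeholder value.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("gaia-notebook-session")
        .config("spark.driver.maxResultSize", "2g")
        .getOrCreate()
)
```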

Update:

After a meeting on the 19th of June, here are some updates & answers to the above questions based on our discussions:

  • We will not provide access to anonymous users.
  • We will limit the number of concurrent notebook users from the top-level landing page to, e.g. 10 users (as appropriate to the OpenStack resource allocation).
  • We need account levels (once you have registered you are level 0; power users get a higher account level)
  • Once we have multiple users, we will need something like a queueing system or a reservation system.
  • We need to define a minimum set of libraries and their versions for our MVP
  • If we can provide it, an informative message showing that your Spark job is in a queue would be sufficient
    • Is this inside a Zeppelin notebook, implementing a queue for Spark jobs? [dm]
  • If a new user is just getting familiar with the system, they will be willing to wait (even overnight) as long as they can see that their Spark job is in a queue.
  • Running a Spark job and having it wait with a running status for X+ hours (X ~= 5) without feedback would not be sufficient. If the feedback is that the job is in a queue, then it would be OK.
  • Functionality > immediacy
    • This implies we need to build a list of the science functionality [dm]
  • They might want to extract up to 100 GB or more off the system
  • We should provide 10 GB at a minimum per user for scratch space
  • Communal scratch space that is cleared should be fine for MVP
  • Look at Gavip for anonymous vs level 1 accounts
  • Some method of getting data off the system is required