
Config 2.0 unified storage description #30

Open
xyhuang opened this issue Jul 23, 2021 · 1 comment

Comments


xyhuang commented Jul 23, 2021

This proposes a unified storage description for config 2.0.

Today, MLCube relies on a simple "file path" approach to describe the inputs and outputs of its tasks. However, for many platforms, such as Kubernetes, a single file path is not sufficient: they either have complex storage backends or use their own layer of storage abstraction, which does not use "paths" to refer to locations in the data storage. This proposal aims to address the problem by providing a unified way of describing storage that covers both local file systems and more complex storage solutions.

A storage backend can be described in the platform section of the config, which is supplied by the user at run time. The storage description consists of two main parts: a name that is used as a reference in the tasks' I/O paths, and a platform-specific spec that provides the details of the storage backend on the target platform, so that the runner can use it to find the right location of the data.
We do not change the "path"-like descriptions of task inputs/outputs, in order to keep them simple; however, we do introduce a "variable"-like component as part of the path, so that the "variable" serves as a reference to the corresponding storage backend and the rest of the path is treated as a path relative to that storage.
The most straightforward example of such a "variable" is "$WORKSPACE", which is currently used to refer to a specific directory in the local file system. With this proposal, "$WORKSPACE", or any "$CUSTOM_NAME" defined by the user, can refer to an arbitrary storage backend as specified in the platform section.
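The variable-prefix resolution described above could be sketched as follows. This is an illustrative sketch only, not an existing MLCube API: the `resolve_path` helper and the dictionary-shaped spec layout are assumptions for this example.

```python
# Hypothetical sketch of how a runner could resolve a "$STORAGE_NAME/relative/path"
# value against the storage backends declared in the platform section.
# resolve_path and the spec layout are illustrative, not part of MLCube.

def resolve_path(value: str, storages: dict):
    """Split a "$NAME/rel/path" value into (storage spec, relative path)."""
    if not value.startswith("$"):
        # Plain path: fall back to the default workspace storage.
        return storages["WORKSPACE"], value
    name, _, rel = value[1:].partition("/")
    if name not in storages:
        raise KeyError(f"Unknown storage backend: {name}")
    return storages[name], rel

storages = {
    "WORKSPACE": {"local": {"path": "./workspace"}},
    "NFS_DATA": {"nfs": {"host": "127.0.0.1", "port": 2049, "path": "some/nfs/path"}},
}
spec, rel = resolve_path("$NFS_DATA/data", storages)
# spec is the NFS backend description; rel is "data"
```

The runner would then combine the backend-specific spec with the relative path in whatever way the target platform requires.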

Since the detailed spec of the storage lives in the platform part, it can be decoupled from the shared MLCube config and appear only in the user's config. This also means that how the spec of a given storage backend is written is something agreed between a user and a runner, and is not relevant to the MLCube publisher.
While we do not have to standardize those specs, we may provide guidelines/examples for popular platforms so that a convention emerges among runner implementors.

The following is an example of how a storage backend can be defined; notice the specs in the platform section and how they are referenced in the tasks section. Notice also that if we name a storage "WORKSPACE", we can redirect the default workspace to the specified storage backend without changing the values in the task I/Os.

name: example-mlcube
platform:
  storage:
  - name: K8S_DATA
    spec:
      kubernetes:
        pvc_name: my-pvc
  - name: NFS_DATA
    spec:
      nfs:
        host: 127.0.0.1
        port: 2049
        path: some/nfs/path
container:
  image: mlcommons/mnist:0.0.1
  build_context: "mnist"
  build_file: "Dockerfile"
tasks:
  download:
    io:
    - {name: data_dir, type: directory, io: output, default: $NFS_DATA/data}
    - {name: log_dir, type: directory, io: output, default: $NFS_DATA/logs}
  train:
    io:
    - {name: data_dir, type: directory, io: input, default: $K8S_DATA/data}
    - {name: parameters_file, type: file, io: input, default: $K8S_DATA/parameters/default.parameters.yaml}
    - {name: log_dir, type: directory, io: output, default: $K8S_DATA/logs}
    - {name: model_dir, type: directory, io: output, default: $K8S_DATA/model}
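As one concrete illustration of what a runner might do with the K8S_DATA spec above, a Kubernetes runner could translate the pvc_name into pod volume definitions. The `persistentVolumeClaim`/`claimName` fields are real Kubernetes pod-spec fields, but the function name and the `/mnt/...` mount point are assumptions made for this sketch.

```python
# Illustrative sketch (not part of MLCube) of how a Kubernetes runner could
# turn the K8S_DATA spec into a pod-spec volume plus a container volumeMount.

def pvc_volume(storage_name: str, pvc_name: str):
    """Build the pod-spec 'volumes' entry and the container 'volumeMounts'
    entry for a PersistentVolumeClaim-backed storage."""
    # Kubernetes object names must be DNS-1123 compliant (lowercase, hyphens).
    k8s_name = storage_name.lower().replace("_", "-")
    volume = {
        "name": k8s_name,
        "persistentVolumeClaim": {"claimName": pvc_name},
    }
    mount = {
        "name": k8s_name,
        "mountPath": f"/mnt/{k8s_name}",  # assumed mount point for this sketch
    }
    return volume, mount

vol, mnt = pvc_volume("K8S_DATA", "my-pvc")
# A "$K8S_DATA/data" task path would then resolve to "/mnt/k8s-data/data"
# inside the container.
```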
@sergey-serebryakov (Contributor) commented

@xyhuang @bitfort @dfeddema @davidjurado

This really looks doable. A couple of comments:

  1. The more I think about the value format (e.g. ${STORAGE_NAME}/RELATIVE_PATH, as in $NFS_DATA/data), the more convinced I am that it may be confusing. Is NFS_DATA an internal variable, an environment variable, or some other identifier (as in our case)?
  2. Should the storage section be a list or a dictionary?

Regarding the first item: since the values here are identifiers for either directories or files, can we partially adopt the URI approach? We could introduce a scheme named storage that refers to the MLCube-supported storages listed in the (most likely user) configuration file.
Here is an example (I use a dictionary instead of a list just to show an alternative approach):

name: example-mlcube
platform:
  storage:
    K8S_DATA:
      spec:
        kubernetes:
          pvc_name: my-pvc
    NFS_DATA: 
      spec:
        nfs:
          host: 127.0.0.1
          port: 2049
          path: some/nfs/path
    workspace:
      spec:
        local:
          path: ${runtime.root}/workspace
    tmp:
      spec:
        local:
          path: ${oc.env:TMP}/mlcube/workspace/${name}
    home: 
      spec:
        local:
          path: ${oc.env:HOME}/.mlcube/workspace/${name}
container:
  image: mlcommons/mnist:0.0.1
  build_context: "mnist"
  build_file: "Dockerfile"
tasks:
  download:
    io:
    - {name: data_dir, type: directory, io: output, default: "storage:NFS_DATA/data"}
    - {name: log_dir, type: directory, io: output, default: "storage:NFS_DATA/logs"}
  train:
    io:
    - {name: data_dir, type: directory, io: input, default: "storage:K8S_DATA/data"}
    - {name: parameters_file, type: file, io: input, default: "storage:K8S_DATA/parameters/default.parameters.yaml"}
    - {name: log_dir, type: directory, io: output, default: "storage:K8S_DATA/logs"}
    - {name: model_dir, type: directory, io: output, default: "storage:K8S_DATA/model"}
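The proposed "storage:" scheme parses cleanly with standard URI tooling. A minimal sketch, assuming the `parse_storage_uri` helper name and the fallback-to-workspace behavior (neither is an existing MLCube API):

```python
from urllib.parse import urlparse

# Sketch of parsing the proposed "storage:" URI scheme.
# parse_storage_uri and the default-storage fallback are assumptions
# made for this example, not an existing MLCube API.

def parse_storage_uri(value: str, default_storage: str = "workspace"):
    """Return (storage name, relative path) for a task I/O value."""
    parts = urlparse(value)
    if parts.scheme != "storage":
        # Plain relative path: resolve against the default workspace storage.
        return default_storage, value
    name, _, rel = parts.path.partition("/")
    return name, rel

print(parse_storage_uri("storage:NFS_DATA/data"))  # ('NFS_DATA', 'data')
print(parse_storage_uri("storage:home"))           # ('home', '')
print(parse_storage_uri("data"))                   # ('workspace', 'data')
```

An explicit scheme also removes the ambiguity raised in item 1: a storage: prefix can never be mistaken for an environment variable.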

We can also use this with the --workspace CLI argument. If users specify only relative paths (which are relative to the workspace root by default), such as data, logs, model, ..., then we can do something like:

mlcube run ... --workspace=storage:home

to keep data in user's home directory.
