Entry-points and API for "Loaders" Need Models

We have an entry-point for the text loader. Its public API type corresponds to an interface, `ILearningPipelineLoader`. This interface, and the entry-point wrapped by the text loader's implementation of it, have two major problems:

First, and most seriously, the output type does not contain the loader's model. Note that in the code both `IDataTransform` and `IDataLoader` implement `ICanSaveModel`. This is critical: for transforms, it lets us apply the same transforms to novel `IDataView`s, and for loaders, it lets us apply the same loader to new input files.

Currently, what we seem to do is *respecify* the loader from scratch. The problems with this are not yet obvious in the current ML.NET codebase, I suppose, because the only loader introduced so far is the `TextLoader`, and the trainable behavior of that loader is limited. As we introduce more loaders (e.g., a loader to read data in SVM-light format and close variants of that format), the problems with this approach will become more obvious.

Second, it accepts as its input type an `IFileHandle`, as opposed to what data loaders *actually* accept in much of the runtime, an `IMultiStreamSource`. (This has also come up in issue #60, which complained, among other things, about lack of multi-file support, implications for Parquet loaders, etc.) Not only are multiple-file scenarios impacted, but so are scenarios where you want to load *no files at all* during loader specification. (To give an example of when this is important, since it's not obvious: distributed applications will first specify a shared model, made naturally out of untrained loaders and transforms, save the model, then distribute it to worker nodes so we are sure all workers have the same model.)

Both of these architectural problems are on display in [PR 106](https://github.com/dotnet/machinelearning/pull/106), where we are introducing something that absolutely is **not a loader at all**: it loads no data and exports no model, yet it somehow still manages to conform to that interface seemingly with no problems, and we are naming it a "loader" despite the fact that it reads nothing.

For this I'd suggest the following rather simple fixes:

1. We "standardize" runtime entry-points for loaders by having them more closely resemble our existing practices for transforms, that is, they output both data and model. The most natural thing to do is introduce a common output type akin to what we already have for transform entry-points (`CommonOutputs.ITransformOutput`), presumably named `ILoaderOutput`.

2. This output type would include a `Model`, presumably of type `ILoaderModel`, that resembles `ITransformModel`. This interface would also have an `Apply` method, but it would take an `IMultiStreamSource` as input instead of an `IDataView`.

3. While we are at it, change the input type for existing entry-points for the text loader to use `IMultiStreamSource` instead of `IFileHandle`, thus partially addressing the other part of the issue in #60. We should also *probably* introduce something into `CommonInputs`, similar to what happens for transforms.
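
To make the three suggestions a bit more concrete, here is a rough sketch of the shapes I have in mind. Everything in it is an assumption for illustration: `IDataView` and `IMultiStreamSource` are cut-down stand-ins for the real runtime interfaces so that the snippet compiles on its own, `ILoaderOutput` and `ILoaderModel` are only the proposed names, and the `Toy*` and `FileSet` types are hypothetical. The real `Apply` would likely also need a host environment argument, which is omitted here.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins (assumptions): heavily simplified versions of the runtime types.
public interface IDataView { }

public interface IMultiStreamSource
{
    int Count { get; }                 // may be zero: "no files at all"
    string GetPathOrNull(int index);
}

// Proposed (hypothetical): the loader analogue of ITransformModel.
public interface ILoaderModel
{
    // Applies the same loader specification to a new set of files.
    IDataView Apply(IMultiStreamSource files);
}

// Proposed (hypothetical): the loader analogue of the transform output.
public interface ILoaderOutput
{
    IDataView OutputData { get; }
    ILoaderModel Model { get; }
}

// Toy implementations, only here to show how the pieces fit together.
public sealed class FileSet : IMultiStreamSource
{
    private readonly string[] _paths;
    public FileSet(params string[] paths) { _paths = paths; }
    public int Count => _paths.Length;
    public string GetPathOrNull(int index) => _paths[index];
}

public sealed class ToyDataView : IDataView
{
    public char Separator { get; }
    public IReadOnlyList<string> Paths { get; }
    public ToyDataView(char separator, IReadOnlyList<string> paths)
    {
        Separator = separator;
        Paths = paths;
    }
}

public sealed class ToyLoaderModel : ILoaderModel
{
    private readonly char _separator;  // the loader's "specification"
    public ToyLoaderModel(char separator) { _separator = separator; }

    public IDataView Apply(IMultiStreamSource files)
    {
        var paths = new List<string>();
        for (int i = 0; i < files.Count; i++)
            paths.Add(files.GetPathOrNull(i));
        return new ToyDataView(_separator, paths);
    }
}

public sealed class ToyLoaderOutput : ILoaderOutput
{
    public IDataView OutputData { get; }
    public ILoaderModel Model { get; }
    public ToyLoaderOutput(IDataView data, ILoaderModel model)
    {
        OutputData = data;
        Model = model;
    }
}

public static class ToyLoaderEntryPoint
{
    // The entry-point takes a multi-stream source and returns data AND model.
    public static ILoaderOutput Load(char separator, IMultiStreamSource files)
    {
        var model = new ToyLoaderModel(separator);
        return new ToyLoaderOutput(model.Apply(files), model);
    }
}
```

With something along these lines, the distributed scenario becomes straightforward: call `ToyLoaderEntryPoint.Load('\t', new FileSet())` with no files to specify the loader, keep the returned `Model`, and later call `Model.Apply(new FileSet("a.tsv", "b.tsv"))` on each worker without respecifying anything.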
