Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entry-points and API for "Loaders" Need Models #119

Closed
TomFinley opened this issue May 10, 2018 · 2 comments
Closed

Entry-points and API for "Loaders" Need Models #119

TomFinley opened this issue May 10, 2018 · 2 comments
Labels
enhancement New feature or request up-for-grabs A good issue to fix if you are trying to contribute to the project

Comments

@TomFinley
Copy link
Contributor

We have an entry-point for text loader. The public API type for this corresponds to an interface, ILearningPipelineLoader. This interface and the entry-point wrapped by the implementation of this interface for the text loader has two major problems:

First, and most seriously, the output type does not contain the loader's model. We note that in the code, both IDataTransform and IDataLoader implement the ICanSaveModel. This is critical for, in the case of transforms, applying the same transforms to novel IDataViews, and for loaders, applying the same loader to new input files.

Currently, what we seem to do is to respecify the loader from scratch. The problems with this are not yet obvious in the current ML.NET codebase, I suppose, because the only loader that is introduced so-far is the TextLoader, and the trainable behavior of that loader is limited. As we introduce more loaders, (E.g., a loader to read data in SVM-light format, and close variants of that format), the problems of this approach will be more obvious.

Second, it accepts as its input type an IFileHandle, as opposed to what data-loaders actually accept in much of the runtime, an IMultiStreamSource. (This has come up also in issue #60, which complained among other things about lack of multi-file support, implications for Parquet loaders, etc.) Not only are multiple file scenarios impacted, but also scenarios where you want to load no files at all during loader specification are impacted as well. (To give an example of when this is important, since it's not obvious: distributed applications will first specify a shared model, made naturally out of untrained loaders and transforms, save the model, then distribute them to worker nodes so we are sure all workers have the same models.)

Both of these architectural problems are on display in PR 106, when we are introducing something that absolutely is not a loader at all, that is, it loads no data and exports no model, but somehow still manages to conform to that interface seemingly with no problems, and we are naming a "loader" despite the fact that it reads nothing.

For this I'd suggest the following rather simple fixes:

  1. We "standardize" runtime entrypoints for loaders by having them more closely resemble our existing practices for transforms, that is, they output both data, and model. The most natural thing to do is introduce a common output type akin to what we already have for tranform entrypoints (CommonOutputs.ITransformOutput), except ILoaderOutput, presumably.

  2. This output type would include a Model, presumably of type ILoaderModel, that resembles ITransformModel. This interface would also have Apply, but instead of over IDataView would be IMultiStreamSource.

  3. While we are at it, change the input type for existing entry points for the text loader to use IMultiStreamSource instead of IFileHandle, thus partially addressing the other part of the issue of =Doesn't support partitioned directories. #60. Also probably introducing something into CommonInputs, similar to what happens for transforms.

@TomFinley TomFinley changed the title Entry-points and API for "Loaders" Needs Models Entry-points and API for "Loaders" Need Models May 10, 2018
@shauheen shauheen added the enhancement New feature or request label May 11, 2018
@shauheen shauheen added this to the 0518 milestone May 11, 2018
@shauheen shauheen modified the milestones: 0518, 0618 May 22, 2018
@shauheen shauheen added the up-for-grabs A good issue to fix if you are trying to contribute to the project label May 22, 2018
@shauheen shauheen modified the milestones: 0618, 0718 Jul 5, 2018
@shauheen shauheen removed this from the 0718 milestone Aug 4, 2018
@eerhardt
Copy link
Member

@TomFinley - this issue discusses a lot of the "old API" interfaces. Is this still an issue in the "new API"? If not, can this issue be closed?

@codemzs
Copy link
Member

codemzs commented Jun 30, 2019

With the new API in place this issue does not seem relevant.

@codemzs codemzs closed this as completed Jun 30, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request up-for-grabs A good issue to fix if you are trying to contribute to the project
Projects
None yet
Development

No branches or pull requests

4 participants