Entry-points and API for "Loaders" Need Models #119
Labels: enhancement, up-for-grabs
We have an entry-point for the text loader. The public API type for it corresponds to an interface, `ILearningPipelineLoader`. This interface, and the entry-point wrapped by its implementation for the text loader, have two major problems:

- First, and most seriously, the output type does not contain the loader's model. Note that in the code, both `IDataTransform` and `IDataLoader` implement `ICanSaveModel`. This is critical: for transforms, it lets us apply the same transforms to novel `IDataView`s, and for loaders, it lets us apply the same loader to new input files.

  Currently, what we seem to do is respecify the loader from scratch. The problems with this are not yet obvious in the current ML.NET codebase, I suppose, because the only loader introduced so far is the `TextLoader`, and the trainable behavior of that loader is limited. As we introduce more loaders (e.g., a loader to read data in SVM-light format, and close variants of that format), the problems of this approach will become more obvious.

- Second, it accepts as its input type an `IFileHandle`, as opposed to what data loaders actually accept in much of the runtime, an `IMultiStreamSource`. (This also came up in issue #60, which complained among other things about lack of multi-file support, implications for Parquet loaders, etc.) Not only are multiple-file scenarios impacted, but so are scenarios where you want to load no files at all during loader specification. (To give an example of when this is important, since it's not obvious: distributed applications will first specify a shared model, made naturally out of untrained loaders and transforms, save the model, then distribute it to worker nodes so we are sure all workers have the same model.)

Both of these architectural problems are on display in PR 106, where we are introducing something that is absolutely not a loader at all (it loads no data and exports no model), yet it still manages to conform to that interface seemingly with no problems, and we are naming it a "loader" despite the fact that it reads nothing.
For this I'd suggest the following rather simple fixes:
- We "standardize" runtime entry-points for loaders by having them more closely resemble our existing practices for transforms, that is, they output both data and a model. The most natural thing to do is introduce a common output type akin to what we already have for transform entry-points (`CommonOutputs.ITransformOutput`), except `ILoaderOutput`, presumably.
- This output type would include a `Model`, presumably of type `ILoaderModel`, that resembles `ITransformModel`. This interface would also have `Apply`, but instead of being over `IDataView` it would be over `IMultiStreamSource`.
- While we are at it, change the input type for the existing text loader entry-points to use `IMultiStreamSource` instead of `IFileHandle`, thus partially addressing the other part of the issue "Doesn't support partitioned directories" (#60). Also, probably introduce something into `CommonInputs`, similar to what happens for transforms.
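
To make the suggestion concrete, here is a rough sketch of what the proposed shapes might look like. It is not a definitive design. `ILoaderOutput`, `ILoaderModel`, and `Apply` are the names proposed above; everything else is an assumption made for illustration: the `OutputData` property name, the stand-in `IMultiStreamSource` and `IDataView` interfaces (cut down so the sketch compiles on its own, not the real runtime definitions), and the hypothetical `FileList`, `RowsView`, `ToyTextLoaderModel`, `LoaderOutput`, and `ToyLoaderEntryPoint` types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-ins for the runtime types named in this issue. These are NOT the real
// ML.NET definitions; they carry just enough shape for the sketch to compile.
public interface IMultiStreamSource
{
    int Count { get; }                  // may be 0: no files at specification time
    string GetPathOrNull(int index);
}

public interface IDataView
{
    IEnumerable<string[]> Rows { get; }
}

// Proposed above: a loader model, resembling ITransformModel, whose Apply
// takes an IMultiStreamSource rather than an IDataView.
public interface ILoaderModel
{
    IDataView Apply(IMultiStreamSource files);
}

// Proposed above: a common output for loader entry-points carrying both the
// data and the model. The property name OutputData is an assumption.
public interface ILoaderOutput
{
    IDataView OutputData { get; }
    ILoaderModel Model { get; }
}

// Hypothetical helpers, for illustration only.
public sealed class FileList : IMultiStreamSource
{
    private readonly string[] _paths;
    public FileList(params string[] paths) { _paths = paths; }
    public int Count => _paths.Length;
    public string GetPathOrNull(int index) => _paths[index];
}

public sealed class RowsView : IDataView
{
    public RowsView(IEnumerable<string[]> rows) { Rows = rows.ToList(); }
    public IEnumerable<string[]> Rows { get; }
}

// Hypothetical toy loader model: splits every line of every file on a separator.
// readLines says how to read one path (for example, File.ReadLines).
public sealed class ToyTextLoaderModel : ILoaderModel
{
    private readonly char _separator;
    private readonly Func<string, IEnumerable<string>> _readLines;

    public ToyTextLoaderModel(char separator, Func<string, IEnumerable<string>> readLines)
    {
        _separator = separator;
        _readLines = readLines;
    }

    public IDataView Apply(IMultiStreamSource files)
    {
        var rows = new List<string[]>();
        for (int i = 0; i < files.Count; i++)
            foreach (var line in _readLines(files.GetPathOrNull(i)))
                rows.Add(line.Split(_separator));
        return new RowsView(rows);
    }
}

public sealed class LoaderOutput : ILoaderOutput
{
    public LoaderOutput(IDataView data, ILoaderModel model) { OutputData = data; Model = model; }
    public IDataView OutputData { get; }
    public ILoaderModel Model { get; }
}

public static class ToyLoaderEntryPoint
{
    // Hypothetical entry-point: accepts an IMultiStreamSource (possibly empty)
    // and returns both the loaded data and the reusable loader model.
    public static ILoaderOutput Load(
        IMultiStreamSource files, char separator, Func<string, IEnumerable<string>> readLines)
    {
        var model = new ToyTextLoaderModel(separator, readLines);
        return new LoaderOutput(model.Apply(files), model);
    }
}
```

With shapes like these, the distributed scenario described earlier becomes expressible: calling `ToyLoaderEntryPoint.Load(new FileList(), '\t', File.ReadLines)` specifies the loader without reading any file, and each worker can later call `Model.Apply` on its own set of files instead of respecifying the loader from scratch.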