want to clean up the dsref.Loader and lib/load.go situation #1704

dustmop · 2021-03-12T03:11:36Z

Turning a dataset from a string like "dustmop/my_dataset" into an actual dataset object involves these steps:

parse: "dustmop/my_dataset" -> dsref.Ref{Username: "dustmop", Name: "my_dataset"}, simple textual parsing
resolve: dsref.Ref{...} -> dsref.Ref{..., Path: "/ipfs/QmWhatever"}, resolve the name to get the storage location, as well as initID, et al. Might use the local refstore, or might ask the registry
load: dsref.Ref{...} -> *dataset.Dataset, load the object from the storage layer (such as IPFS)

We achieve dependency inversion by defining an interface Loader in dsref/load.go (very low-level), which is implemented by lib.Instance (very high-level). This way, packages such as transform and sql can load datasets without knowing anything about the details of parsing and resolving references. This interface defines LoadDataset which only performs step 3.

However, there's also a more powerful function defined in lib/load.go called NewParseResolveLoadFunc which does all 3 steps. It does not live on an interface, but is passed down as a function value into subsystems such as transform and sql. This has been especially awkward during the conversion onto scope, as there's lots of places that talk about a Loader and sometimes they mean the former while other times they mean the latter.

I would like to fix this.

My motivation comes from the fact that starlark transforms have a function load_dataset which runs all 3 steps, contradicting the semantics of the Loader interface. I feel strongly that this public facing function should inform the terminology used throughout our codebase.

I would propose the following changes:

type Loader interface {
  LoadDataset(ctx context.Context, refstr string) (*dataset.Dataset, error)
  LoadFullyResolved(ctx context.Context, ref dsref.Ref, source string) (*dataset.Dataset, error)
}

Rename the old method LoadDataset to LoadFullyResolved. It must fail if the ref does not have a Path value, since that means it has not been fully resolved.

At the same time, add new method called LoadDataset, with the same signature as the old NewParseResolveLoadFunc. This method should match the semantics of the starlark load_dataset function, meaning it performs all 3 steps.

This change will allow us to greatly simplify the codepath from lib down to transform and sql, while also lining up our terminology much better through the codebase.

Finally, I feel I don't totally understand the source parameter, and whether both methods should have it, or whether neither should (moving it instead to part of the configuration of the Loader's implementer type, for example we would have Instance.GetLoader(source)). I wonder whether the users of the interface (specifically transform and sql) could ever pass in a meaningful source value. Rather, it seems as though low-level code shouldn't care, and the source should be treated as high-level configuration instead. Feedback is very welcome on that question.

The text was updated successfully, but these errors were encountered:

b5 · 2021-03-16T18:39:17Z

This sounds great! I'd love a Loader interface that abstracts what is currently called ParseResolveLoad, and totally agree it'll simplify things. It's a lot easier to say "to run SQL you need a loader". ParseResolveLoad came into existence for this exact purpose, and I totally agree the thing that you're pointing out here is ParseResolveLoad should really be "Load", and the other stuff should be lower level.

I think LoadFullyResolved should just be LoadResolved.

Finally, I feel I don't totally understand the source parameter, and whether both methods should have it, or whether neither should (moving it instead to part of the configuration of the Loader's implementer type, for example we would have Instance.GetLoader(source)). I wonder whether the users of the interface (specifically transform and sql) could ever pass in a meaningful source value.

Consider this CLI invocation:

$ qri sql --offline 'select * from b5/indeed_data_analysts'

The "offline" flag here is configuring the loader, saying "don't use the network to resolve datasets, instead fail if I don't have the dataset locally". The key thing here is the loader is configured for that invocation specifically. I think scope will help this type of configuration, where a Loader is created during scope construction and can accept this configuration parameter. That loader could even be dynamically constructed & cached when scope.Loader() is called.

I think the source param is very poorly named, and would welcome any thoughts on how to improve that.

Another example that doesn't exist yet but will need to one day:

$ qri sql --remote-source doug  'select * from b5/indeed_data_analysts'

Ideally we have better language than "source" to expose to users before we ship something like this, but this is a situation where you want to explicitly pull datasets you don't have from a specific peer / remote. I'm imagining big datasets where a user wants to know that it's coming from a friend / source that won't run up some API limit or bill for pulling a 30Gig dataset.

Anyway, all for these changes.

dustmop · 2021-03-16T19:38:32Z

Pardon, I think I was unclear. I understand source as far as it's eventual purpose, as you've explained it here, but I don't think it belongs on the interface at all. What you're saying seems to back up my point of view. The crux is that the callsite of Loader.LoadDataset never wants to set the source itself, but the concrete instantiation of the interface does. The source acts like a high-level configuration detail, not something that needs to vary on every call of the interface.

As a thought experiment, let's imagine something changing the source by using qri as a library:

inst := NewInstance(cfg)
inst.SetSource("my_remote")
size1 := inst.Datasets().Sql("select count(*) from me/my_dataset")
inst.SetSource("their_remote")
size2 := inst.Datasets().Sql("select count(*) from them/their_dataset")

This would work if we operate as you described, where scope constructs a loader, which locks in the source that is used for resolution for the lifetime of that scope. The argument against doing it this way is if we want a single sql statement (or transform) that uses two different resolution sources, such as:

# starlark setup
load_dataset("me/my_dataset", source="my_remote")
load_dataset("them/their_dataset", source="their_remote")

Should this use case be supported? It would mean that the top-level could not be sure that it is configuring its own resolver, which feels bad.

I don't have a better idea for the name at the moment, but I'll keep it in mind.

dustmop · 2021-03-30T23:37:31Z

Question for @b5

In the current world, or at least prior to #1729, functions would pass a source to ParseAndResolveRef, which would use the source to create a resolver, and then attempt to resolve the reference. On success, ParseAndResolveRef returns the ref (fully resolved) and the source where the ref was found, and a nil error (because it succeeded).

Now, it seems that the source input to ParseAndResolveRef is different from the one returned by it. The return value is supposed to be the place where the dataset can be found, right? Specifically it can be a url like that of the Registry, according to the function LoadDataset in lib/load.go. I'm going to refer to it as the location instead.

After ParseAndResolveRef returns, the caller uses this location by passing it into the LoadDataset function (talked about earlier in the thread, the one we want to rename to LoadResolved). It feels like an anti-pattern to me that ParseAndResolveRef returns a value that is immediately sent back to the loader, and that using any other value in its place is probably a bug. If the loader/resolver already knows where the dataset is available, it should keep that fact to itself, and not force clients to connect the dots. Thus, I am proposing to remove this last parameter (the location) from LoadDataset/LoadResolved, which leads to my actual question:

When the function that will be soon named LoadResolved is going to load the actual dataset data (for example, from local ipfs, or by pulling from the registry), is it correct to just take the original source value and once again convert it into the location? In other words, is that conversion stable? The only situation where I think it isn't is in the p2p context, where location could be some peer on the network, but that functionality is not implemented yet, and I think there's better ways to handle it down the road, once we get there. Are these any other cases where it's incorrect to convert source to location a second time?

b5 · 2021-04-02T13:28:57Z

Now, it seems that the source input to ParseAndResolveRef is different from the one returned by it.

They are different. The input "source" is configuration. It will either be one of an enumerated set of strings like "network", "p2p", or "local", which dictate different resolution strategies. It can also be a multaddr (location address), which skips a resolution strategy and tries to resolve directly from a location. The returned "location" is always either an address or the empty string, which indicates local resolution.

The return value is supposed to be the place where the dataset can be found, right?

Close. This is the place where the dataset was resolved. Because dataset histories (logbook data) and the data itself can live in different places, we need two separate phases of resolution, one to resolve the reference, and another to fetch data.

It feels like an anti-pattern to me that ParseAndResolveRef returns a value that is immediately sent back to the loader, and that using any other value in its place is probably a bug.

This seems like an anti pattern, but isn't, for two reasons:

Resolvers can be inconsistent because we're a distributed system.
We have unfinished work around dataset loading.

If the loader/resolver already knows where the dataset is available, it should keep that fact to itself, and not force clients to connect the dots.

This is what ParseResolveLoad tries to solve: a function that abstracts these details for callers into a rational composition.

I am proposing to remove this last parameter (the location) from LoadDataset/LoadResolved

No thank you. On the CLI side we report that return value to users so they know where the reference was resolve from. We should be reinforcing that patten

When the function that will be soon named LoadResolved is going to load the actual dataset data (for example, from local ipfs, or by pulling from the registry), is it correct to just take the original source value and once again convert it into the location?

I think your question points to a need for spec around dataset loading, but the solution can probably be smaller, fixed one than what we use for reference resolution. We shouldn't need "strategies" the same way we do for reference resolution. Reference resolution needs to grapple with the question of "who are you going to trust as a source of truth"? (because we're mapping names to identifiers) Data doesn't have that problem so long as it's content-addressed. Loading data over the network is about speed, given that we use merkle-proofs for trust.

Yes, the location we resolved from should always be passed to the function in charge of loading the dataset. But the meaning of the value is different when passed to loading as it is when returned from resolving. That function should treat the location parameter as a gossipy place to check for data, and should accept additional gossipy locations to check.

dustmop · 2021-04-02T18:50:40Z

Now, it seems that the source input to ParseAndResolveRef is different from the one returned by it.

They are different. The input "source" is configuration.

Right, I was confused because they're both called "source" in many places.

It will either be one of an enumerated set of strings like "network", "p2p", or "local", which dictate different resolution strategies. It can also be a multaddr (location address), which skips a resolution strategy and tries to resolve directly from a location.

The multiaddr part is aspirational, right? I see a TODO for it in resolverForMode in lib/resolve.go, just want to make sure that's still current. Just a note, in that same function the name used for this value is "mode" instead of "source", we should settle on a single name.

Also, in addition to the enumerated values ("network", "p2p"...) looks like another valid choice is the name of a remote, as per those in the cfg.Remotes configuration. We should add a check to the Remotes setter that makes sure a remote can't be created using the same as one of these special values ("network", "p2p"...). Probably also a good idea to define constants for this enumeration.

The return value is supposed to be the place where the dataset can be found, right?

Close. This is the place where the dataset was resolved. Because dataset histories (logbook data) and the data itself can live in different places, we need two separate phases of resolution, one to resolve the reference, and another to fetch data.

Interesting. So in the case that a reference is resolved by "A", but the data itself lives at "B", the resolver will return "A"? How does qri know how to load the dataset in this situation (that it should go to "B")?

It feels like an anti-pattern to me that ParseAndResolveRef returns a value that is immediately sent back to the loader, and that using any other value in its place is probably a bug.

This seems like an anti pattern, but isn't, for two reasons:

Resolvers can be inconsistent because we're a distributed system.

We have unfinished work around dataset loading.

Right, I get that, and it's what I'm trying to untangle here, because it collides with both work on Dispatch/Scope, and with InputParam renaming. I'm not saying we get rid of the idea of separate resolving and loading steps, but that the calling code which is using each shouldn't have to connect data between the two calls when they already are calculating the info that they need.

Instead of doing this (simplified pseudo code):

ref, location, err = scope.ParseAndResolveRef(ctx, refstr)
ds, err = scope.LoadDataset(ctx, ref, location)

do this:

ref, err = scope.ParseAndResolveRef(ctx, refstr)
ds, err = scope.LoadDataset(ctx, ref)

with scope either retaining the location value internally (as a map or something), or adding it to the dsref.Ref struct.

If the loader/resolver already knows where the dataset is available, it should keep that fact to itself, and not force clients to connect the dots.

This is what ParseResolveLoad tries to solve: a function that abstracts these details for callers into a rational composition.

Right but not every function can use ParseResolveLoad, usually because they only want to resolve the ref, or because the resolution and loading are happening far apart. A lot of lib/datasets.go is like this; I'm specifically trying to understand those chunks of code.

I am proposing to remove this last parameter (the location) from LoadDataset/LoadResolved

No thank you. On the CLI side we report that return value to users so they know where the reference was resolve from. We should be reinforcing that patten

I proposed doing so in my last two comments and didn't receive this objection.

I'm not staying remove source / location concepts entirely, just that I don't think they need to be on the interface as parameters because they can be kept internal to whatever is implementing the Loader interface. Speaking specifically to reporting to the user, lib/load.go implements the behavior of auto-pulling from the registry, and since that's inside Instance it already knows where the dataset lives, allowing it to print the message "pulling [ref] from [location] ..." using it's own state. Similar behavior can be implemented for pull though admittedly it's a bit more work.

When the function that will be soon named LoadResolved is going to load the actual dataset data (for example, from local ipfs, or by pulling from the registry), is it correct to just take the original source value and once again convert it into the location?

I think your question points to a need for spec around dataset loading, but the solution can probably be smaller, fixed one than what we use for reference resolution. We shouldn't need "strategies" the same way we do for reference resolution. Reference resolution needs to grapple with the question of "who are you going to trust as a source of truth"? (because we're mapping names to identifiers) Data doesn't have that problem so long as it's content-addressed. Loading data over the network is about speed, given that we use merkle-proofs for trust.

Yes, the location we resolved from should always be passed to the function in charge of loading the dataset. But the meaning of the value is different when passed to loading as it is when returned from resolving. That function should treat the location parameter as a gossipy place to check for data, and should accept additional gossipy locations to check.

I don't think I understand most of this. I wanted to get the interface closer to a "final" version, but sounds like I should just make the easier changes for now and punt on the harder parts until the design is further along.

ramfox · 2021-04-07T23:54:45Z

I had a question about how we are getting the proper source via the api, b5 answered here #1749 (comment), thought it would be relevant to copy/link in this issue for posterity

Great question. I don't think this is properly documented, but the answer we came up when specifying these changes boils down to this:

We have a category of "runtime options" that affect the way qri behaves (source is a prime example, the current user is another). They don't map well to configuration or to input parameters.
When using the CLI, runtime options should map to global flags.
When using the HTTP api, runtime options should map to HTTP request headers.
So based on that we'd expect you to pass something like:

$ curl http://localhost:2503/render -H "Content-Type: application/json" -H "QriResolverSource: p2p" -d '{ "ref": "b5/world_bank_population/readme" }'
Where QriResolverSource overrides the default resolver source. This work is not yet implemented, but we can add it one time in the retuned http handler from lib & it will work with all methods.

dustmop · 2021-04-16T21:23:19Z

Closed by #1751.

dustmop added the discussion Open-ended request for feedback label Mar 12, 2021

b5 added the resolvers dataset reference resolving & loading label Mar 16, 2021

b5 mentioned this issue Mar 18, 2021

refactor(diff): use Dispatch for Diff and /diff #1708

Merged

dustmop self-assigned this Mar 29, 2021

ramfox mentioned this issue Apr 7, 2021

refactor(lib): fold RenderMethods into DatasetMethods & follow the dispatch pattern #1749

Merged

ramfox mentioned this issue Apr 8, 2021

Qri API v0.10.0 tracker #1731

Closed

29 tasks

dustmop mentioned this issue Apr 9, 2021

refactor(load): dsref.Loader overhaul #1751

Merged

dustmop closed this as completed Apr 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

want to clean up the dsref.Loader and lib/load.go situation #1704

want to clean up the dsref.Loader and lib/load.go situation #1704

dustmop commented Mar 12, 2021

b5 commented Mar 16, 2021

dustmop commented Mar 16, 2021

dustmop commented Mar 30, 2021

b5 commented Apr 2, 2021

dustmop commented Apr 2, 2021

ramfox commented Apr 7, 2021

dustmop commented Apr 16, 2021

want to clean up the dsref.Loader and lib/load.go situation #1704

want to clean up the dsref.Loader and lib/load.go situation #1704

Comments

dustmop commented Mar 12, 2021

b5 commented Mar 16, 2021

dustmop commented Mar 16, 2021

dustmop commented Mar 30, 2021

b5 commented Apr 2, 2021

dustmop commented Apr 2, 2021

ramfox commented Apr 7, 2021

dustmop commented Apr 16, 2021