want to clean up the dsref.Loader and lib/load.go situation #1704
Comments
This sounds great! I'd love a I think
Consider this CLI invocation:
The "offline" flag here is configuring the loader, saying "don't use the network to resolve datasets, instead fail if I don't have the dataset locally". The key thing here is the loader is configured for that invocation specifically. I think scope will help this type of configuration, where a I think the Another example that doesn't exist yet but will need to one day:
Ideally we have better language than "source" to expose to users before we ship something like this, but this is a situation where you want to explicitly pull datasets you don't have from a specific peer / remote. I'm imagining big datasets where a user wants to know that it's coming from a friend / source that won't run up some API limit or bill for pulling a 30Gig dataset. Anyway, all for these changes.
Pardon, I think I was unclear. I understand As a thought experiment, let's imagine something changing the source by using qri as a library:
This would work if we operate as you described, where
Should this use case be supported? It would mean that the top-level could not be sure that it is configuring its own resolver, which feels bad. I don't have a better idea for the name at the moment, but I'll keep it in mind.
Question for @b5 In the current world, or at least prior to #1729, functions would pass a Now, it seems that the After When the function that will be soon named
They are different. The input "source" is configuration. It will either be one of an enumerated set of strings like "network", "p2p", or "local", which dictate different resolution strategies. It can also be a multiaddr (location address), which skips a resolution strategy and tries to resolve directly from a location. The returned "location" is always either an address or the empty string, which indicates local resolution.
Close. This is the place where the dataset was resolved. Because dataset histories (logbook data) and the data itself can live in different places, we need two separate phases of resolution, one to resolve the reference, and another to fetch data.
This seems like an anti pattern, but isn't, for two reasons:
This is what
No thank you. On the CLI side we report that return value to users so they know where the reference was resolved from. We should be reinforcing that pattern
I think your question points to a need for a spec around dataset loading, but the solution can probably be a smaller, fixed one compared to what we use for reference resolution. We shouldn't need "strategies" the same way we do for reference resolution. Reference resolution needs to grapple with the question of "who are you going to trust as a source of truth?" (because we're mapping names to identifiers). Data doesn't have that problem so long as it's content-addressed. Loading data over the network is about speed, given that we use merkle-proofs for trust. Yes, the
Right, I was confused because they're both called "source" in many places.
The multiaddr part is aspirational, right? I see a TODO for it in Also, in addition to the enumerated values ("network", "p2p"...) it looks like another valid choice is the name of a remote, as per those in the cfg.Remotes configuration. We should add a check to the Remotes setter that makes sure a remote can't be created using the same name as one of these special values ("network", "p2p"...). Probably also a good idea to define constants for this enumeration.
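To make the suggestion above concrete, here is a minimal sketch of constants for the special source values plus a reserved-name check for remotes. All identifiers (SourceNetwork, classifySource, validateRemoteName, and so on) are hypothetical and not taken from the qri codebase. The leading-slash test is only a rough stand-in for real multiaddr parsing, relying on the fact that the textual form of a multiaddr starts with "/".

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants for the special source values named in this thread.
const (
	SourceNetwork = "network"
	SourceP2P     = "p2p"
	SourceLocal   = "local"
)

// reservedSources lists values that must not be reused as remote names.
var reservedSources = map[string]bool{
	SourceNetwork: true,
	SourceP2P:     true,
	SourceLocal:   true,
}

// SourceKind says how a source string would be interpreted.
type SourceKind int

const (
	KindStrategy   SourceKind = iota // one of the enumerated strategies
	KindAddress                      // looks like a multiaddr
	KindRemoteName                   // anything else: the name of a configured remote
)

// classifySource is an assumption about how the three cases could be told
// apart. It does not validate the address or look up the remote.
func classifySource(source string) SourceKind {
	switch {
	case reservedSources[source]:
		return KindStrategy
	case strings.HasPrefix(source, "/"):
		return KindAddress
	default:
		return KindRemoteName
	}
}

// validateRemoteName is the kind of check a Remotes setter could run so that
// a remote cannot shadow one of the special values.
func validateRemoteName(name string) error {
	if reservedSources[name] {
		return fmt.Errorf("remote name %q is reserved", name)
	}
	return nil
}

func main() {
	fmt.Println(classifySource("p2p") == KindStrategy)
	fmt.Println(classifySource("/ip4/127.0.0.1/tcp/4001") == KindAddress)
	fmt.Println(validateRemoteName("network"))
}
```

Keeping the reserved set in one map means the classifier and the setter check cannot drift apart when a new strategy is added.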
Interesting. So in the case that a reference is resolved by "A", but the data itself lives at "B", the resolver will return "A"? How does qri know how to load the dataset in this situation (that it should go to "B")?
Right, I get that, and it's what I'm trying to untangle here, because it collides with both work on Dispatch/Scope, and with InputParam renaming. I'm not saying we get rid of the idea of separate resolving and loading steps, but that the calling code which is using each shouldn't have to connect data between the two calls when they already are calculating the info that they need. Instead of doing this (simplified pseudo code):
do this:
with
Right but not every function can use ParseResolveLoad, usually because they only want to resolve the ref, or because the resolution and loading are happening far apart. A lot of lib/datasets.go is like this; I'm specifically trying to understand those chunks of code.
I proposed doing so in my last two comments and didn't receive this objection. I'm not saying remove
I don't think I understand most of this. I wanted to get the interface closer to a "final" version, but sounds like I should just make the easier changes for now and punt on the harder parts until the design is further along.
I had a question about how we are getting the proper
Closed by #1751.
Turning a dataset from a string like "dustmop/my_dataset" into an actual dataset object involves these steps:

1. parse: "dustmop/my_dataset" -> dsref.Ref{Username: "dustmop", Name: "my_dataset"}, simple textual parsing
2. resolve: dsref.Ref{...} -> dsref.Ref{..., Path: "/ipfs/QmWhatever"}, resolve the name to get the storage location, as well as initID, et al. Might use the local refstore, or might ask the registry
3. load: dsref.Ref{...} -> *dataset.Dataset, load the object from the storage layer (such as IPFS)

We achieve dependency inversion by defining an interface Loader in dsref/load.go (very low-level), which is implemented by lib.Instance (very high-level). This way, packages such as transform and sql can load datasets without knowing anything about the details of parsing and resolving references. This interface defines LoadDataset, which only performs step 3.

However, there's also a more powerful function defined in lib/load.go called NewParseResolveLoadFunc, which does all 3 steps. It does not live on an interface, but is passed down as a function value into subsystems such as transform and sql. This has been especially awkward during the conversion onto scope, as there's lots of places that talk about a Loader, and sometimes they mean the former while other times they mean the latter.

I would like to fix this.
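For readers unfamiliar with the three steps, the following is a rough, self-contained sketch of parse, resolve, and load chained together. The Ref and Dataset types are simplified stand-ins for dsref.Ref and dataset.Dataset, the map stands in for the local refstore or registry, and every function name here is made up for illustration rather than copied from qri.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Ref is a simplified stand-in for dsref.Ref; the real type has more fields.
type Ref struct {
	Username string
	Name     string
	Path     string
}

// Dataset is a placeholder for dataset.Dataset.
type Dataset struct {
	Path string
}

// parse is step 1: purely textual, "user/name" -> Ref.
func parse(s string) (Ref, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return Ref{}, fmt.Errorf("invalid reference %q", s)
	}
	return Ref{Username: parts[0], Name: parts[1]}, nil
}

// resolve is step 2: fill in Path by looking the name up in a table.
// The map is an assumption standing in for the refstore or registry.
func resolve(store map[string]string, ref *Ref) error {
	path, ok := store[ref.Username+"/"+ref.Name]
	if !ok {
		return errors.New("reference not found")
	}
	ref.Path = path
	return nil
}

// load is step 3: fetch the object for an already-resolved ref.
// Here it only checks that Path is set and wraps it.
func load(ref Ref) (*Dataset, error) {
	if ref.Path == "" {
		return nil, errors.New("ref has no Path; resolve it first")
	}
	return &Dataset{Path: ref.Path}, nil
}

// parseResolveLoad chains all three steps, which is the behavior the issue
// attributes to NewParseResolveLoadFunc.
func parseResolveLoad(store map[string]string, s string) (*Dataset, error) {
	ref, err := parse(s)
	if err != nil {
		return nil, err
	}
	if err := resolve(store, &ref); err != nil {
		return nil, err
	}
	return load(ref)
}

func main() {
	store := map[string]string{"dustmop/my_dataset": "/ipfs/QmWhatever"}
	ds, err := parseResolveLoad(store, "dustmop/my_dataset")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ds.Path)
}
```

The sketch shows why the naming is confusing: load alone and parseResolveLoad both "load a dataset", but they accept very different inputs.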
My motivation comes from the fact that starlark transforms have a function load_dataset which runs all 3 steps, contradicting the semantics of the Loader interface. I feel strongly that this public facing function should inform the terminology used throughout our codebase.

I would propose the following changes:

1. Rename the old method LoadDataset to LoadFullyResolved. It must fail if the ref does not have a Path value, since that means it has not been fully resolved.
2. At the same time, add new method called LoadDataset, with the same signature as the old NewParseResolveLoadFunc. This method should match the semantics of the starlark load_dataset function, meaning it performs all 3 steps.

This change will allow us to greatly simplify the codepath from lib down to transform and sql, while also lining up our terminology much better through the codebase.

Finally, I feel I don't totally understand the source parameter, and whether both methods should have it, or whether neither should (moving it instead to part of the configuration of the Loader's implementer type, for example we would have Instance.GetLoader(source)). I wonder whether the users of the interface (specifically transform and sql) could ever pass in a meaningful source value. Rather, it seems as though low-level code shouldn't care, and the source should be treated as high-level configuration instead. Feedback is very welcome on that question.