-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize data sources/importers #593
Conversation
Cool idea -- kind of like a GUI-equivalent of the custom data loaders. Two pie-in-the-sky ideas along these lines:
|
That would be cool! Another pie-in-the-sky idea: importing from hardware/devices directly. I can take a stab at setting up the registry and infrastructure to get this to work, and then we can start to play around with different plugins. I wonder how we should deal with plugins in general - as we develop the ideas above, how do we decide which ones go in |
Just FYI, I'm taking a stab at this. |
Well, that was easy! Code attached. Of course, writing the data importers is going to be a lot more work ;) But at least the framework is in place. Here's a fun example that requires (goes in your
I won't include this importer in yet because really we should have a preview dialog, etc. and it's really just for testing. But the point is that we can import from anything. Need to add docs! |
I moved some of the ideas to #595 so that once this is merged we can still keep track of them. |
Awesome! So part of the image importer could be seen as a custom data loader (which already exist), with the distinction that it doesn't need a filename as input. So there are really two concepts:
Concept (2) is basically the notion of Glue's custom data loaders (and it would be easy to generalize them to take things besides filenames as inputs). Concept (1) is, perhaps, the new notion of a data importer (maybe we could call it a graphical loader or loader UI to reinforce the distinction, and prevent the two ideas from conceptually linking into each other) |
@ChrisBeaumont -yes, I agree with the distinction, and I would maybe use the terms data importer and data handler to clarify the difference (since loading can mean something similar to importing). So for example, a data importer that asks for a URL might return a file, and the data handlers then try and figure out what to do with it - is that what you mean? Now we could still allow data importers to return Another thing is that if you imagine the user selects a FITS file and a special FITS dialog then pops up to further select HDUs, etc. (if there are ambiguities), is the FITS-specific dialog then an importer or a handler? |
Another example where I'm trying to figure out the distinction is the example you mention where you can interactively say where you want the columns to be in an ASCII file (like the Excel CSV importer). Would that be a data handler and if so, does it mean that both importers and handlers can use GUIs? If that is the case, then the main distinction is that the importers tell you where to get the data from, and the handlers are an optional step that allow you to customize how to interpret it? |
One final note: if data importers can be required to return glue-independent objects, such as filenames, Numpy arrays, PIL Image objects, FITS HDUs, etc. then this opens the door for using them in different programs such as ginga, so that would be a plus. So we'd just have to figure out if there is anything we would lose by not allowing importers to return |
Thinking about this more, is the distinction that handlers are basically things that take:
and return Data objects, whereas data importers are an extra layer that go from some data source to one of the standard items listed above? If so, then I guess that the importers can be completely independent of the glue code (though they might use e.g. some of the Qt utils). Some handlers could also use GUIs. For example, we can have a FITS file handler that opens up a disambiguation dialog if needed. @ChrisBeaumont - if you agree with the above distinction, I can try and see how we can accommodate both. Ideally we should be able to register handlers and importers. I guess the handlers are similar to the Astropy unified I/O registry (each handler can define an 'identifier' function that says whether it understands the input). |
I had a slightly different separation of concerns in mind. In the general case loading data is a 4 step process:
Mapping this onto the current codebase:
I think the features discussed in this PR involve making steps 1 and 3 extensible with custom GUI code. The key distinction between step 1 and 3 is that step 1 doesn't yet have any knowledge about the content of a data source, whereas step 3 does (and conversely, step 3 doesn't care about where the data came from). I think that's a useful separation of concerns to retain, so maybe we should introduce two new plugin mechanisms. Furthermore, we might want a way of grouping custom modules for steps 1-3 together (for example, maybe I want to make a step-2 data factory that loosely parses tables, and hands this off to a step-3 GUI that shows a spreadsheet preview of the parsing, and lets the user refine column definitions, datatype inference, etc, like what Excel/Google spreadsheets do). |
This is actually already in place. Data factories don't have to return a list of Data objects, they just have to return something that |
@ChrisBeaumont - thanks for the clarifications, this is extremely helpful. Your comment about allowing steps 1-3 to be grouped is also something I was thinking about. I can try and take a stab at all this. I will implement a registry for step 3 then try and come up with concrete examples that we can test all this out with. |
@ChrisBeaumont - just a quick question - at the moment, data factories actually return a |
All the included data factories happen to return a Data object, but it's not a requirement. The output from a data factory is passed through the same machinery that qglue uses on its input. See this inner function in
And these would be auto-converted into data objects. We do not currently auto-convert HDULists into data objects, but we could (#532 is the feature request for this) So just to be clear: Step 1) User uses QFileDialog to pick a file. |
In my mind the main distinction between step 2 and step 3 is that step 2 contains most of the non-interactive IO logic, and step 3 is mostly GUI-driven. But there's a lot of flexibility in how you divide the workload across tasks. As you suggest you could imagine a scenario where step 2 does nothing, and step 3 is focused specifically on dealing with FITS files, and hence is the step that calls |
One argument to keep the HDUList-parsing logic in step 2 and not step 3: There are some custom flavors of WCS parsers to deal with more exotic transformations. HST comes to mind, they have a parallel WCS library to deal with very accurate pix2world mappings. So if you wrote an HST-specific loader in step 2 that returns a HDUList (with beefed up wcs info), then you could re-use the step 3 interface. You could also imagine converting image data in ALMA/JCMT format into an HDUList, for example. |
@ChrisBeaumont - thanks for the further clarifications! So to put this all in pseudo-code, would the following be correct? # user picks initial importer from Import menu
result1 = run_importer()
if isinstance(result1, Data):
data = result1
else:
result2 = core.data_factories.load_data(result1)
if isinstance(result2, Data):
data = result2
else:
data = parse_data(disambiguator(result2)) The disambiguator is your step 3 which can be for example to better define where to place the columns, which HDUs to pick, etc. For FITS files, load_data would return an HDUList which could be disambiguated after. However, one problem I foresee with this is that in the case where you want a GUI to pop up to determine where to place the columns in an ASCII file, how would we make it skip the data factory? We wouldn't want the table data factory to then get executed, right? Furthermore, we'd have to have a whole new system for saying that if the object is e.g. HDUList, call this disambiguator, etc. when this already exists for the data factories. [decided to snip out some more stuff to concentrate on more fun option below] Maybe a nicer (and fun) possibility is to allow data_factories to be called multiple times. So we could have a data factory that can deal with HDUList. And we can also have other data factories that can return HDULists. As long as data factories don't return something that # user picks initial importer from Import menu
result = run_importer()
if isinstance(result, Data):
data = result
else:
while not parsable_data(result):
try:
result = core.data_factories.load_data(result)
except NoDataFactories: # none can understand result
raise Exception("Data is not parsable nor recognized by data factories")
data = parse_data(result2) This allows us to use the same 'auto-finding' framework for steps 2 and 3. That is, sometimes a data factory will directly return something that can be parsed. Sometimes it will return e.g. HDUList and then we have a data factory that can spawn a GUI to disambiguate HDUList. Ultimately one could even imagine making parse_data a data factory itself! Then we keep looping until a Data object comes out of the data factories. Example of chaining:
This would allow 2, 3, 4, 5, 6, etc-step processes as needed. Finally, to clarify, this would mean two registries: DataImporters and DataFactories (and the latter already exists). What do you think? |
Hmm, thinking out loud here to see if an idea emerges: source = get_source() # step 1. The particular get_source() function probably selected via menu item
staging_structure = core.data_factories.load_data(source) # step 2
# some mechanism to choose UI on data type, type of staging structure, etc
refinement_ui = find_refinement_ui(source, staging_structure, get_source)
result = refinement_ui.refine(staging_structure) # step 3
return qglue.parse_data(result) # step 4 This actually is pretty similar to your second listing, the main difference being that there are two stages of dispatch (one for the data factory, one for the UI) and no looping. I wonder if the looping approach would be too ambiguous -- for example, maybe I have two alternative step 3 disamiguators/refiners that both take as input astropy tables, but provide different interfaces. How would |
Yes (though I'm still not sure about the names :)) |
There are several things to consider here:
In the scenario I'm proposing, the idea is that since disambiguators would be data factories with potentially high priorities, they could 'short-cut' the loading of the data and ensure the disambiguation happens first if needed. If the looping is worrying, then we could do something in between, which is to allow say up to two or three data factories to be chained. At the end of the day, I'm just suggesting that step 2, 3, and 4 all become data factories and we can re-use the same registry and customize the whole workflow. This allows us to do things like skipping step 2, skipping step 4, skipping 2 and 3, etc. |
(Sometimes the best way to know is to try though, so I would be happy to draft up a prototype of what I'm suggesting with actual real-life importers and disambiguators - if it doesn't work, the importers and disambiguators will be useful anyway) |
I think the only difference between our suggestions is that in your case you have two separate data factory registries in effect. So I think if we come up with real importers/disambiguators I don't think it will be much work to try both our workflows. |
Yeah that sounds good to me too |
Just for info, I've started writing a few importers/plugins here:
The next one I'll try is a Vizier table browser/importer. Of course, these need more polish, but the idea is just to have something functional to try out. |
@ChrisBeaumont - it seems that we are in any case in agreement with the importer part existing (step 1), so how about I try and finalize this PR and then we can already merge in the ability to register importers, and the importers are allowed to return Data. Later, we can then settle on a framework for dealing with disambiguation, etc. and start allowing importers to not return Data. Does that sound like a reasonable way forward? Basically if we start off restrictive (importers have to return Data) then any change in future should be backward-compatible? |
@@ -518,6 +524,27 @@ def _create_actions(self): | |||
a.triggered.connect(nonpartial(self._choose_save_session)) | |||
self._actions['session_save'] = a | |||
|
|||
from glue.config import importers | |||
if len(importers) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astrofrog I think that if you don't define a custom importer, then the "load dataset" menu item disappears from the file menubar (at least that's what happens for me).
@ChrisBeaumont - I addressed your comment and added some docs. I also allowed the registry item to be used as a decorator. Would you have any objections to going ahead and merging this if tests pass? (I can open a separate issue to keep track of this discussion and steps 2-3-4. |
Looks good to me! I might slightly prefer that the "Import Data -> Load from file" submenu be promoted to a non-nested "Open Dataset" menu item if there's no other importer, since it's one less hop. There's also a reference to this old name in the getting started page, so we should update this text if you'd rather keep it nested. |
@ChrisBeaumont - how about having both Open Dataset always there (non-nested) and then also Load from file in the import menu if that menu is present? I don't think the duplication would be harmful, but what do you think? (just because otherwise we'd have to mention the two cases in the docs otherwise) |
sounds good to me |
455b5c2
to
4d7daf8
Compare
Generalize data sources/importers
In the same way that we can have plug-in exporters, it would be nice to have plug-in imports that can bring up different or additional dialogs. Examples include:
This would open up all kinds of new fun ways to bring data into glue. All these could be developed as plugins of course, and wouldn't clutter up the main code base.