
Generalize data sources/importers #593

Merged: 6 commits merged into glue-viz:master on Mar 30, 2015

Conversation

@astrofrog (Member)

In the same way that we can have plug-in exporters, it would be nice to have plug-in importers that can bring up different or additional dialogs. Examples include:

  • An additional dialog for, e.g., FITS files, allowing one to select the HDU
  • A dialog replacing the standard 'open file' dialog, which could get data from online sources (e.g. Dataverse, VizieR, etc.)

This would open up all kinds of new fun ways to bring data into glue. All of these could be developed as plugins, of course, and wouldn't clutter up the main code base.

@ChrisBeaumont (Member)

Cool idea -- kind of like a GUI equivalent of the custom data loaders.

Two pie-in-the-sky ideas along these lines:

  • something akin to a data wrangler interface (http://vis.stanford.edu/wrangler/) to parse arbitrarily formatted data interactively
  • A more modest version that parses files, but then lets you interactively determine how to split up all the data across Glue Data and Component classes, rename the components, etc.

@astrofrog (Member, Author)

That would be cool! Another pie-in-the-sky idea: importing from hardware/devices directly.

I can take a stab at setting up the registry and infrastructure to get this to work, and then we can start to play around with different plugins.

I wonder how we should deal with plugins in general: as we develop the ideas above, how do we decide which ones go in glue.plugins vs. other repositories? Would it make sense to develop these plugins in separate repositories and include them via git submodules? That would allow us to include/exclude plugins in the default app without having to move code around between repositories.

@astrofrog (Member, Author)

Just FYI, I'm taking a stab at this.

@astrofrog (Member, Author)

Well, that was easy! Code below. Of course, writing the data importers is going to be a lot more work ;) but at least the framework is in place. Here's a fun example that requires opencv to be installed (it is installable with conda for Python 2.7):

(goes in your config.py file)

from glue.config import importers
from glue.core import Data

def webcam_importer():
    import cv2  # imported lazily so glue still works without OpenCV
    video_capture = cv2.VideoCapture(0)  # open the default camera
    ret, frame = video_capture.read()    # grab a single frame (OpenCV returns BGR)
    video_capture.release()              # free the camera again
    frame = frame[::-1, ::-1, ::-1]      # flip vertically/horizontally, BGR -> RGB
    # move the color channel to the first axis and wrap in a Data object
    return [Data(x=frame.swapaxes(0, 2).swapaxes(1, 2), label="Webcam snapshot")]

importers.add('Import from webcam', webcam_importer)

[screenshot: 2015-03-21 8:58:52 PM]

I won't include this importer yet, because really we should have a preview dialog, etc., and it's really just for testing. But the point is that we can import from anything.

Need to add docs!

@astrofrog (Member, Author)

I moved some of the ideas to #595 so that once this is merged we can still keep track of them.

@ChrisBeaumont (Member)

Awesome!

So part of the image importer could be seen as a custom data loader (which already exists), with the distinction that it doesn't need a filename as input. So there are really two concepts:

  1. GUIs for importing data
  2. Functions that take some kind of input and build Glue Data objects.

Concept (2) is basically the notion of Glue's custom data loaders (and it would be easy to generalize them to take things besides filenames as input). Concept (1) is, perhaps, the new notion of a data importer (maybe we could call it a graphical loader or loader UI to reinforce the distinction and prevent the two ideas from blurring into each other).
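
To make the split concrete, here's a minimal sketch of the two shapes (all names are hypothetical; only the signatures matter):

from glue.core import Data

# Concept (2): takes some input (here a filename) and builds Data objects
def image_loader(filename):
    from astropy.io import fits
    return Data(image=fits.getdata(filename), label=filename)

# Concept (1): takes no input; drives a GUI and hands the result to a loader
def url_importer():
    url = ask_user_for_url()              # hypothetical dialog
    filename = download_to_tempfile(url)  # hypothetical helper
    return [image_loader(filename)]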

@astrofrog (Member, Author)

@ChrisBeaumont - yes, I agree with the distinction, and I would maybe use the terms data importer and data handler to clarify the difference (since loading can mean something similar to importing). So, for example, a data importer that asks for a URL might return a file, and the data handlers then try and figure out what to do with it - is that what you mean?

Now we could still allow data importers to return Data objects which require no further handling, short-cutting the data handling process. There might be cases where the importer knows exactly how the data should be handled - in the case of the webcam, you could imagine making it so that it always returns a Data object with three components (R, G, and B) [although it could simply return a PIL Image object and the data handlers would deal with it in the same way as a normal RGB image].

Another thing is that if you imagine the user selects a FITS file and a special FITS dialog then pops up to further select HDUs, etc. (if there are ambiguities), is the FITS-specific dialog then an importer or a handler?

@astrofrog (Member, Author)

Another example where I'm trying to figure out the distinction is the one you mention where you can interactively say where you want the columns to be in an ASCII file (like the Excel CSV importer). Would that be a data handler, and if so, does it mean that both importers and handlers can use GUIs? If so, is the main distinction that the importers tell you where to get the data from, while the handlers are an optional step that lets you customize how to interpret it?

@astrofrog (Member, Author)

One final note: if data importers can be required to return glue-independent objects, such as filenames, Numpy arrays, PIL Image objects, FITS HDUs, etc., then this opens the door to using them in different programs such as ginga, which would be a plus. So we'd just have to figure out whether there is anything we would lose by not allowing importers to return Data objects directly.

@astrofrog (Member, Author)

Thinking about this more, is the distinction that handlers are basically things that take:

  • Filenames
  • File objects
  • In-memory objects (e.g. Numpy array, PIL object, etc.)

and return Data objects, whereas data importers are an extra layer that goes from some data source to one of the standard items listed above?

If so, then I guess that the importers can be completely independent of the glue code (though they might use e.g. some of the Qt utils).

Some handlers could also use GUIs. For example, we can have a FITS file handler that opens up a disambiguation dialog if needed.

@ChrisBeaumont - if you agree with the above distinction, I can try and see how we can accommodate both. Ideally we should be able to register handlers and importers. I guess the handlers are similar to the Astropy unified I/O registry (each handler can define an 'identifier' function that says whether it understands the input).
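
A minimal sketch of what that registry might look like (names hypothetical, loosely modeled on the Astropy unified I/O pattern):

from astropy.io import fits

handlers = []  # list of (identifier, handler) pairs

def register_handler(identifier, handler):
    handlers.append((identifier, handler))

def handle(source):
    # dispatch to the first handler whose identifier claims the input
    for identifier, handler in handlers:
        if identifier(source):
            return handler(source)
    raise TypeError("No handler understands {0!r}".format(source))

# e.g. a handler for HDULists (hdulist_to_data is a hypothetical converter)
register_handler(lambda s: isinstance(s, fits.HDUList),
                 lambda s: hdulist_to_data(s))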

@ChrisBeaumont (Member)

is the distinction that handlers are basically... return Data objects, whereas data importers are an extra layer that go from some data source to one of the standard items listed above?

I had a slightly different separation of concerns in mind. In the general case loading data is a 4 step process:

  1. User specifies a source to load data from.
  2. Computer code loads this data into a "staging" data structure.
  3. User refines the import from step 2 using a GUI. Examples include specifying which columns in a catalog to keep, how to split a plain text catalog into columns, what HDUs to keep, how to group HDUs into multiple Data objects, which records to drop or mask, etc.
  4. Computer combines the information from 2 and 3 into a list of Data objects (or a glue-agnostic data structure that can be straightforwardly converted into a list of Data objects, using the parse_data function that qglue uses.)

Mapping this onto the current codebase:

  1. We don't have any way to customize this. We always just show a QFileDialog
  2. This is what data factories do, and there is a registry for users to add custom data factories
  3. We skip step 3 currently.
  4. We rely on the data_factory to return something that parse_data is able to assemble into a list of Data objects.

I think the features discussed in this PR involve making steps 1 and 3 extensible with custom GUI code. The key distinction between step 1 and 3 is that step 1 doesn't yet have any knowledge about the content of a data source, whereas step 3 does (and conversely, step 3 doesn't care about where the data came from). I think that's a useful separation of concerns to retain, so maybe we should introduce two new plugin mechanisms. Furthermore, we might want a way of grouping custom modules for steps 1-3 together (for example, maybe I want to make a step-2 data factory that loosely parses tables, and hands this off to a step-3 GUI that shows a spreadsheet preview of the parsing, and lets the user refine column definitions, datatype inference, etc, like what Excel/Google spreadsheets do).
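
For the grouping idea, the bundle could be as simple as a named triple of the pluggable pieces (a sketch; all three members are hypothetical placeholders):

from collections import namedtuple

# one entry groups a step-1 source picker, a step-2 factory, and a step-3 GUI
ImportPipeline = namedtuple('ImportPipeline', ['get_source', 'factory', 'refine_ui'])

spreadsheet_pipeline = ImportPipeline(
    get_source=lambda: pick_text_file(),         # step 1: file dialog
    factory=lambda f: loose_table_parser(f),     # step 2: forgiving parser
    refine_ui=lambda t: spreadsheet_preview(t),  # step 3: column editor GUI
)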

@ChrisBeaumont (Member)

One final note: if data importers can be required to return glue-independent objects, such as filenames, Numpy arrays, PIL Image objects, FITS HDUs, etc. then this opens the door for using them in different programs such as ginga, so that would be a plus

This is actually already in place. Data factories don't have to return a list of Data objects; they just have to return something that qglue.parse_data can turn into Data objects. However, it's often useful to construct Data objects directly, so that a file importer can create ComponentLinks (which are pretty glue-specific). For example, Glue's FITS loader does this.
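
For instance, a factory that returns a plain dict of arrays should already work; a minimal sketch (the exact @data_factory decorator signature here is an assumption):

import numpy as np
from glue.config import data_factory

def is_npz(filename, **kwargs):
    return filename.endswith('.npz')

@data_factory('NumPy archive', is_npz)  # decorator signature assumed
def npz_reader(filename):
    # return a plain dict of arrays rather than Data -- the same machinery
    # that qglue uses then coerces this into Data objects
    archive = np.load(filename)
    return {name: archive[name] for name in archive.files}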

@astrofrog (Member, Author)

@ChrisBeaumont - thanks for the clarifications, this is extremely helpful. Your comment about allowing steps 1-3 to be grouped is also something I was thinking about. I can try and take a stab at all this. I will implement a registry for step 3, then try and come up with concrete examples that we can use to test all this out.

@astrofrog (Member, Author)

@ChrisBeaumont - just a quick question: at the moment, data factories actually return a Data object directly, so are you suggesting changing that? Imagine the case of a FITS file with multiple HDUs - what would step 2 actually do? Step 1 would return e.g. a filename, but why do we then need step 2 - can't we go straight to step 3 and have a FITS disambiguator? Is step 2 really needed?

@ChrisBeaumont (Member)

All the included data factories happen to return a Data object, but it's not a requirement. The output from a data factory is passed through the same machinery that qglue uses on its input. See this inner function in load_data (maybe this could be more clearly documented in the code). So for example, a data factory could return:

  • A dict of arrays
  • A dataframe
  • An astropy table

And these would be auto-converted into data objects. We do not currently auto-convert HDULists into data objects, but we could (#532 is the feature request for this)

So just to be clear:

Step 1) User uses QFileDialog to pick a file.
Step 2) core.data_factories.load_data dispatches to a data factory to load the file, then passes the result through a translation step to coerce into a Data object.
Step 3) Skipped
Step 4) Skipped

@ChrisBeaumont (Member)

Step one would return e.g. a filename but why do we then need step 2 - can't we then go straight to step 3 and have a FITS disambiguator? Is step 2 really needed?

In my mind the main distinction between step 2 and step 3 is that step 2 contains most of the non-interactive IO logic, and step 3 is mostly GUI-driven. But there's a lot of flexibility in how you divide the workload across tasks. As you suggest you could imagine a scenario where step 2 does nothing, and step 3 is focused specifically on dealing with FITS files, and hence is the step that calls fits.open. Alternatively, you could imagine a more generic step 3 that takes as input a more general list-like object of arrays and gives the user an interface to select which arrays to keep.

@ChrisBeaumont (Member)

One argument to keep the HDUList-parsing logic in step 2 and not step 3: there are some custom flavors of WCS parsers to deal with more exotic transformations. HST comes to mind: they have a parallel WCS library to deal with very accurate pix2world mappings. So if you wrote an HST-specific loader in step 2 that returns an HDUList (with beefed-up WCS info), then you could re-use the step 3 interface.

You could also imagine converting image data in ALMA/JCMT format into an HDUList, for example.
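
Schematically, those step-2 factories would just normalize exotic inputs into a plain HDUList so the same step-3 GUI applies (a sketch; the two helpers are hypothetical stand-ins for the real libraries):

from astropy.io import fits

def hst_loader(filename):
    hdulist = fits.open(filename)
    # attach the more accurate pix2world mapping (helper hypothetical)
    return hst_wcs_upgrade(hdulist)

def alma_loader(filename):
    # convert an ALMA/JCMT file into a standard HDUList (helper hypothetical)
    return alma_to_hdulist(filename)

# both outputs now feed the same HDUList-aware step-3 refinement GUI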

@astrofrog (Member, Author)

@ChrisBeaumont - thanks for the further clarifications! So to put this all in pseudo-code, would the following be correct?

# user picks initial importer from Import menu

result1 = run_importer()

if isinstance(result1, Data):
    data = result1
else:
    result2 = core.data_factories.load_data(result1)
    if isinstance(result2, Data):
        data = result2
    else:
        data = parse_data(disambiguator(result2))

The disambiguator is your step 3 which can be for example to better define where to place the columns, which HDUs to pick, etc.

For FITS files, load_data would return an HDUList which could be disambiguated after. However, one problem I foresee with this is that in the case where you want a GUI to pop up to determine where to place the columns in an ASCII file, how would we make it skip the data factory? We wouldn't want the table data factory to then get executed, right?

Furthermore, we'd have to have a whole new system for saying that if the object is e.g. HDUList, call this disambiguator, etc. when this already exists for the data factories.

[decided to snip out some more stuff to concentrate on the more fun option below]

Maybe a nicer (and more fun) possibility is to allow data factories to be called multiple times. So we could have a data factory that can deal with an HDUList, and we could also have other data factories that return HDULists. As long as a data factory doesn't return something that parse_data can parse, we throw the result back into load_data (with some recursion limit, of course). So e.g. (pseudo-code):

# user picks initial importer from Import menu

result = run_importer()

if isinstance(result, Data):
    data = result
else:
    # keep feeding the result back through the data factories until
    # something parsable comes out (a real implementation would cap this)
    while not parsable_data(result):
        try:
            result = core.data_factories.load_data(result)
        except NoDataFactories:  # no factory can understand result
            raise Exception("Data is not parsable nor recognized by data factories")
    data = parse_data(result)

This allows us to use the same 'auto-finding' framework for steps 2 and 3. That is, sometimes a data factory will directly return something that can be parsed. Sometimes it will return e.g. HDUList and then we have a data factory that can spawn a GUI to disambiguate HDUList.

Ultimately one could even imagine making parse_data a data factory itself! Then we keep looping until a Data object comes out of the data factories.

Example of chaining:

  • Importer returns filename -> data factory returns HDUList -> data factory returns HDU -> data factory returns Data
  • Importer returns filename -> data factory detects ASCII table and launches GUI to place columns, returns astropy Table -> data factory returns Data

This would allow 2, 3, 4, 5, 6, etc-step processes as needed.
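
One link in such a chain might look like this (a sketch; choose_hdu_dialog is a hypothetical GUI helper):

from astropy.io import fits

def is_hdulist(obj, **kwargs):
    return isinstance(obj, fits.HDUList)

def hdulist_disambiguator(hdulist):
    # pop up a GUI listing the HDUs and let the user pick one
    index = choose_hdu_dialog([hdu.name for hdu in hdulist])
    # return a single HDU: not yet parsable, so it goes back into load_data
    return hdulist[index]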

Finally, to clarify, this would mean two registries: DataImporters and DataFactories (and the latter already exists).

What do you think?

@ChrisBeaumont (Member)

Hmm, thinking out loud here to see if an idea emerges:

source = get_source()  # step 1. The particular get_source() function probably selected via menu item
staging_structure = core.data_factories.load_data(source)  # step 2
# some mechanism to choose UI on data type, type of staging structure, etc
refinement_ui = find_refinement_ui(source, staging_structure, get_source) 
result = refinement_ui.refine(staging_structure) # step 3
return qglue.parse_data(result) # step 4

This is actually pretty similar to your second listing, the main difference being that there are two stages of dispatch (one for the data factory, one for the UI) and no looping. I wonder if the looping approach would be too ambiguous -- for example, maybe I have two alternative step-3 disambiguators/refiners that both take astropy tables as input but provide different interfaces. How would core.data_factories.load_data(result) know which utility to dispatch to?

@ChrisBeaumont (Member)

Finally, to clarify, this would mean two registries: DataImporters and DataFactories (and the latter already exists).

Yes (though I'm still not sure about the names :))

@astrofrog (Member, Author)

How would core.data_factories.load_data(result) know which utility to dispatch to?

There are several things to consider here:

  • There can be ambiguities even without the loop. This is a generic problem that can be solved by allowing data factories to be assigned priorities, and user-defined data factories could default to a higher priority.
  • Internally we can make sure that all the built-in data factories (in my scenario) never result in non-deterministic behavior - there is always a 'better' path (and this can be enforced with the priorities).
  • If users register multiple disambiguators for the SAME format, then they should be making a decision about which one is more important.
  • Your latest example is actually very similar to my first pseudo-code example, and there is still the issue that if, say, you have a fixed-width ASCII table and you don't know where to put the columns, the disambiguation has to happen straight away, not after load_data, because load_data will fail since it doesn't know how to parse the file. The staging structure would then have to be the filename, so you might as well have called the disambiguator straight away.

In the scenario I'm proposing, the idea is that since disambiguators would be data factories with potentially high priorities, they could 'short-cut' the loading of the data and ensure the disambiguation happens first if needed.

If the looping is worrying, then we could do something in between, which is to allow say up to two or three data factories to be chained.

At the end of the day, I'm just suggesting that steps 2, 3, and 4 all become data factories, so we can re-use the same registry and customize the whole workflow. This allows us to do things like skipping step 2, skipping step 4, skipping 2 and 3, etc.
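
Priority-based dispatch would just be a sorted scan of the registry; a minimal sketch:

class NoDataFactories(Exception):
    """Raised when no registered factory understands the input."""

factories = []  # (priority, identifier, factory) triples

def add_factory(identifier, factory, priority=0):
    factories.append((priority, identifier, factory))

def load_data(source):
    # higher-priority entries (e.g. user-registered disambiguators) win,
    # so they get a chance to short-cut the built-in loaders
    for priority, identifier, factory in sorted(factories,
                                                key=lambda f: f[0],
                                                reverse=True):
        if identifier(source):
            return factory(source)
    raise NoDataFactories(repr(source))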

@astrofrog (Member, Author)

(Sometimes the best way to know is to try though, so I would be happy to draft up a prototype of what I'm suggesting with actual real-life importers and disambiguators - if it doesn't work, the importers and disambiguators will be useful anyway)

@astrofrog (Member, Author)

I think the only difference between our suggestions is that in your case you effectively have two separate data factory registries. So if we come up with real importers/disambiguators, I don't think it will be much work to try both of our workflows.

@ChrisBeaumont (Member)

Sometimes the best way to know is to try though

Yeah that sounds good to me too

@astrofrog (Member, Author)

Just for info, I've started writing a few importers/plugins here:

The next one I'll try is a Vizier table browser/importer. Of course, these need more polish, but the idea is just to have something functional to try out.

@astrofrog (Member, Author)

@ChrisBeaumont - it seems that we are in any case in agreement on the importer part (step 1), so how about I try to finalize this PR so we can merge in the ability to register importers, with importers required to return Data? Later, we can settle on a framework for dealing with disambiguation, etc., and start allowing importers to not return Data. Does that sound like a reasonable way forward? Basically, if we start off restrictive (importers have to return Data), then any future change should be backward-compatible?

@@ -518,6 +524,27 @@ def _create_actions(self):
a.triggered.connect(nonpartial(self._choose_save_session))
self._actions['session_save'] = a

from glue.config import importers
if len(importers) > 0:
@ChrisBeaumont (Member) commented on the diff:

@astrofrog I think that if you don't define a custom importer, then the "load dataset" menu item disappears from the file menubar (at least that's what happens for me).

@astrofrog (Member, Author)

@ChrisBeaumont - I addressed your comment and added some docs. I also allowed the registry item to be used as a decorator. Would you have any objections to going ahead and merging this if tests pass? (I can open a separate issue to keep track of this discussion and steps 2-3-4.)
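
For reference, the decorator form should let the earlier webcam example drop the explicit importers.add call, along these lines (the exact decorator spelling is an assumption):

from glue.config import importers
from glue.core import Data

@importers('Import from webcam')  # exact decorator spelling assumed
def webcam_importer():
    import cv2
    video_capture = cv2.VideoCapture(0)
    ret, frame = video_capture.read()
    video_capture.release()
    frame = frame[::-1, ::-1, ::-1]
    return [Data(x=frame.swapaxes(0, 2).swapaxes(1, 2),
                 label="Webcam snapshot")]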

@ChrisBeaumont (Member)

Looks good to me! I might slightly prefer that the "Import Data -> Load from file" submenu be promoted to a non-nested "Open Dataset" menu item if there's no other importer, since it's one less hop. There's also a reference to this old name in the getting started page, so we should update this text if you'd rather keep it nested.

@astrofrog (Member, Author)

@ChrisBeaumont - how about having both: Open Dataset always there (non-nested), and also Load from file in the import menu when that menu is present? I don't think the duplication would be harmful, but what do you think? (Just because otherwise we'd have to describe the two cases separately in the docs.)

@ChrisBeaumont (Member)

sounds good to me

@astrofrog (Member, Author)

Done:

[screenshot: 2015-03-30 8:27:34 PM]

I don't think the duplication is really a problem, though we can see if users report any confusion.

astrofrog added a commit referencing this pull request on Mar 30, 2015: Generalize data sources/importers

@astrofrog merged commit fb42fed into glue-viz:master on Mar 30, 2015