MNT: Refactoring changes to CSV adapter + CSVArrayAdapter #803
I see two structural changes here:
Some time ago, we encountered these examples and some others in the wild. We did not like the look of these Adapters: Thus, we added additional columns to the Asset--DataSource many-to-many relation table which encode which role and order each Asset plays. To be specific, with each row in the many-to-many relation table we also store, in addition to
I really like having standard separate paths for introspection (ii) and construction from DataSource/Asset (iii). That's a huge improvement. I see in the implementation of Now on shakier ground, just brainstorming... Can we get the best of both worlds? As a starting point, a convenience function like this might work:

from collections import defaultdict

def from_node_and_data_source(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_node_and_data_source(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)

(A further convenience could wrap this and figure out the Thus, we would drop the We would retain two great aspects of this PR:
There may be better implementations of these goals available, but I hope this is a promising example.
Two minor weak points in both the implementation on
Notes from discussion with @genematx

from collections import defaultdict

def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)

asset-datasource relation

asset_id  data_source_id  parameter    num
103       1               tiff_image   1
103       2               calib_image  1

parameter    num
---------    ---
# HDF5
hdf5_file    NULL
# TIFF sequence
tiff_images  1
tiff_images  2
# TIFF stack
tiff_stack   NULL

"hdf5_file"
["image1", "image2"]
Asset:
- data_uri
- size
- hash_meth
HDF5Adapter(hdf5_file: str, ...)
TIFFSequence(tiff_images: List[str], ...)
TIFFStack(tiff_image: str, ...)
class Storage:
    filesystem: str
    sql: str

class MyAdapter:
    def __init__(self, foo_file: str, metadata, structure, specs, ...):
        "Construct Adapter from info extracted from node, data_source and its assets"
        if foo_file == "":
            raise ValueError
        self._foo_file = foo_file

    def init_storage(self, storage, data_source) -> DataSource:
        "Allocate assets for writing data."
        # Always creates some Assets and attaches them.
        # Sometimes adds/alters data_source.parameters.
        # Could look at data_source.mimetype...
        return data_source  # includes data_source.assets

    @classmethod
    def from_uris(cls, *files) -> "MyAdapter":
        "Accept inherently heterogeneous/unsorted files with unknown structure and introspect."
        ...
        return cls(...)

    @classmethod
    def from_catalog(cls, node, data_source):
        return from_catalog(cls, node, data_source)

# in tiled/catalog/adapter.py
adapter_cls = adapters_by_mimetype[data_source.mimetype]
adapter = from_catalog(adapter_cls, node, data_source)
from_catalog(CSVAdapter, node, data_source)
CSVAdapter.from_catalog(node, data_source)

from collections import defaultdict

def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    for asset in data_source.assets:
        if asset.num is None:
            # This asset is associated with a parameter that takes a single URI.
            parameters[asset.parameter] = asset.data_uri
        else:
            # This asset is associated with a parameter that takes a list of URIs.
            parameters[asset.parameter].append(asset.data_uri)
    return parameters

def from_catalog(adapter_cls: type[Adapter], node: Node, data_source: DataSource) -> Adapter:
    "Usage example: from_catalog(CSVAdapter, node, data_source)"
    parameters = asset_parameters_to_adapter_kwargs(data_source)
    return adapter_cls(metadata=node.metadata, specs=node.specs, structure=data_source.structure, **parameters)
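
To make the database-to-kwargs mapping above concrete, here is a minimal runnable sketch. The Asset and DataSource dataclasses are hypothetical stand-ins that carry only the fields used here, not Tiled's real classes, and the URIs are made up. Sorting by num is an assumption added for illustration; the function as written in the notes relies on the order in which the assets arrive.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Asset:
    # Hypothetical stand-in for one row of the asset-datasource relation.
    data_uri: str
    parameter: str
    num: Optional[int] = None


@dataclass
class DataSource:
    # Hypothetical stand-in; only the attribute used below.
    assets: list = field(default_factory=list)


def asset_parameters_to_adapter_kwargs(data_source):
    "Transform database representation to Python representation."
    parameters = defaultdict(list)
    # Assumption (not in the notes): order list-valued assets by num.
    ordered = sorted(
        data_source.assets, key=lambda a: (a.num is not None, a.num or 0)
    )
    for asset in ordered:
        if asset.num is None:
            # Single-URI parameter, e.g. hdf5_file.
            parameters[asset.parameter] = asset.data_uri
        else:
            # List-of-URIs parameter, e.g. tiff_images.
            parameters[asset.parameter].append(asset.data_uri)
    return dict(parameters)


tiff_sequence = DataSource(
    assets=[
        Asset("file:///data/image2.tif", "tiff_images", num=2),
        Asset("file:///data/image1.tif", "tiff_images", num=1),
    ]
)
hdf5 = DataSource(assets=[Asset("file:///data/scan.h5", "hdf5_file")])

sequence_kwargs = asset_parameters_to_adapter_kwargs(tiff_sequence)
hdf5_kwargs = asset_parameters_to_adapter_kwargs(hdf5)
print(sequence_kwargs)
print(hdf5_kwargs)
```

With these inputs the TIFF sequence yields a list under tiff_images and the HDF5 source yields a plain string under hdf5_file, matching the two shapes shown in the notes.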
It's coming together! A couple comments on fine-tuning the interfaces.
adapter = await anyio.to_thread.run_sync(
    partial(adapter_factory, **adapter_kwargs)
    partial(
        adapter_class.from_catalog,
To cleanly separate the parameters namespace from arguments used by Tiled itself, should we consider using positional-only parameters in the init_adapter_from_catalog and the from_catalog constructors?
def init_adapter_from_catalog(cls, data_source, node, /, **kwargs):
...
I don't actually foresee collisions here, but I somewhat like the idea of keeping the namespaces formally separate.
We have yet to introduce these in Tiled (or in any Bluesky project AFAIK) but this is exactly the use case. They are available in all supported versions of Python.
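
As a rough illustration of the namespace separation described above, the sketch below uses a hypothetical stand-in function (not the real Tiled constructor) whose first two parameters are positional-only. The argument values are placeholders.

```python
def from_catalog(data_source, node, /, **parameters):
    """Hypothetical stand-in: data_source and node are positional-only."""
    return {"data_source": data_source, "node": node, "parameters": parameters}


# Because of the "/" marker, a keyword that happens to be named "node"
# does not collide with the positional argument; it lands in **parameters.
result = from_catalog("ds", "n", node="extra.csv", data_uri="file:///a.csv")

# Passing the Tiled-facing arguments by keyword no longer binds them:
# they fall into **parameters and the positional slots stay unfilled.
try:
    from_catalog(data_source="ds", node="n")
    rejected = False
except TypeError:
    rejected = True

print(result["parameters"])
print(rejected)
```

The trade-off is that callers can no longer write from_catalog(data_source=..., node=...), so any existing keyword-style call sites would need updating.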
I think this makes total sense. I don't think I ever use them as from_catalog(data_source=data_source, node=node), but I'll double-check.
Do you think data_source should take precedence? Or node?
What do you mean by precedence?
I just meant from_catalog(data_source, node, /, **kwargs) or from_catalog(node, data_source, /, **kwargs). Which one makes more sense? It seemed to me that data_source is more "important", because node just supplies metadata and specs, but I'm happy to switch the order.
Done. Made data_source and node positional only.
Gotcha. Yes, I agree with that reasoning.
I took this for a spin and failed to find any regressions.
I have some minor cleanup for the Zarr adapter (no fixes or changes in behavior, just tightening up) that I'll put up in a separate PR.
- CSVAdapter to accept kwargs for pd.read_csv (e.g. separator)
- dataframe_adapter property from CSVAdapter
- multipart/related;type=text/csv mimetype
- CSVArrayAdapter backed by an ArrayAdapter (instead of TableAdapter). It can be used to load homogeneous numerical arrays stored as csv files. The distinction between the two is intended to be done by the mimetype: "text/csv;header=present" for tables, and "text/csv;header=absent" for arrays.
- from_assets and from_uris being the two primary methods).

Checklist
- Add the ticket number which this PR closes to the comment section
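
The header=present / header=absent distinction mentioned in the summary above could be sketched roughly as follows. This uses only the standard library and a hypothetical helper, read_csv_by_mimetype, rather than the actual CSVAdapter or CSVArrayAdapter classes, so it only illustrates the dispatch idea.

```python
import csv
import io


def read_csv_by_mimetype(text, mimetype):
    """Hypothetical helper: return a table (dict of columns) or an array (list of rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    if mimetype == "text/csv;header=present":
        # Table: the first row names the columns.
        header, *body = rows
        return {name: [row[i] for row in body] for i, name in enumerate(header)}
    if mimetype == "text/csv;header=absent":
        # Array: every row is homogeneous numerical data.
        return [[float(value) for value in row] for row in rows]
    raise ValueError(f"Unsupported mimetype: {mimetype}")


table = read_csv_by_mimetype("x,y\n1,2\n3,4\n", "text/csv;header=present")
array = read_csv_by_mimetype("1,2\n3,4\n", "text/csv;header=absent")
print(table)
print(array)
```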