Tool to migrate data between SpatialData versions #680

Open
aeisenbarth opened this issue Aug 15, 2024 · 5 comments

@aeisenbarth
Contributor

Is your feature request related to a problem? Please describe.
When the specification of SpatialData changes, existing datasets do not include these changes, and in some circumstances may even become incompatible.

For the in-memory representation, the library's reader functions support reading older versions (in most cases). However, users may want to upgrade the on-disk data to the latest version. Another case is when something is not covered by backward-compatibility of readers, e.g. due to errors in the data (#655).

Describe the solution you'd like
As discussed earlier, we want a tool for migrating data in case more format changes occur in the future. I am opening this new issue to track this, since #655 was too specific and is closed.

Describe alternatives you've considered

  • Relying on backward-compatible readers, then using write to save in the latest format: Some cases may not be covered by readers.
  • Adding a new "upgrade" function to the spatialdata library: Migration is not a frequent use case and would bloat the library, especially if supporting special cases (one-time issues, erroneous data). Additionally, backward-compatibility of certain features can be easily deprecated in the library, while still preserving it in a separate tool.
  • Separate spatialdata-migrate tool
@LucaMarconato
Member

Thanks for tracking this here. An alternative to consider is opening some GitHub discussions for less common issues, as done here: #657.

For more common conversion steps, like migrating from the ShapesFormatV01 (Zarr ragged-array geopandas representation) to ShapesFormatV02 (GeoParquet), a migration tool would indeed be preferable.

@aeisenbarth
Contributor Author

aeisenbarth commented Aug 15, 2024

Current format changes:

Current versioning support in SpatialData:

  • Data format is versioned per element type (ShapesFormatV02 from element.group.attrs["version"]), starting with spatialdata>=0.2.2
  • The reader reads the old format and converts it into the current version's in-memory data structure.
  • Some format changes did not increase the version number (#655, #624), and so do not make use of this mechanism.
  • Each format is implemented as a separate if-branch, so a minor version change (V02.1) cannot easily inherit the implementation of a previous version (V02).
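
A minimal sketch of how per-version readers could be registered and reused instead of living in separate if-branches. Everything here is hypothetical and not the spatialdata API: the `READERS` registry, the `register` decorator, the reader functions, and the version strings "0.1", "0.2" and "0.2.1" are assumptions for illustration, and a plain dict stands in for `element.group.attrs`.

```python
from typing import Callable, Dict

# Hypothetical registry: format version string -> reader function.
READERS: Dict[str, Callable[[dict], dict]] = {}


def register(version: str):
    """Register a reader function for one format version."""
    def decorator(func):
        READERS[version] = func
        return func
    return decorator


@register("0.1")
def read_shapes_v01(attrs: dict) -> dict:
    # Stand-in for reading the Zarr ragged-array representation.
    return {"encoding": "zarr-ragged-array", "version": "0.1"}


@register("0.2")
def read_shapes_v02(attrs: dict) -> dict:
    # Stand-in for reading the GeoParquet representation.
    return {"encoding": "geoparquet", "version": "0.2"}


@register("0.2.1")
def read_shapes_v021(attrs: dict) -> dict:
    # A minor version reuses the previous reader and adjusts only what changed.
    result = read_shapes_v02(attrs)
    result["version"] = "0.2.1"
    return result


def read_shapes(attrs: dict) -> dict:
    """Dispatch on the stored version (a dict stands in for group attrs)."""
    version = attrs.get("version")
    if version not in READERS:
        raise ValueError(f"Unsupported shapes format version: {version!r}")
    return READERS[version](attrs)
```

With this layout, adding a minor version means registering one more function that delegates to its predecessor, rather than duplicating a branch.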

Types of changes in SpatialData format:

  • Zarr group (or hierarchy) changed: key renamed, added, removed
  • Underlying file format changed (Zarr, Parquet, etc.)
  • Metadata key changed: key renamed, added, removed, moved, value changed
  • Validation: Zarr group key, metadata key or value becomes invalid

Existing tools:

  • Django has a migration tool for databases (manage.py migrate)
  • migrate-anything is a simple generic implementation (not actively developed).
    Both work in principle like this:
    • Software version X includes migrations for all previous changes. Each migration has code for "up" (forward) and "down" (backward).
    • The state of applied migrations is determined from a migrations log (with code) stored with the data.
    • If data version Y is <X, it applies all migrations from Y to X, and stores the applied migration code with the data.
    • If data version Y is >X (software was downgraded), the software does not include code of its future migrations. It reads migration code from the data and unapplies all surplus migrations.
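
The up/down mechanism described above can be sketched roughly as follows. This is a simplified illustration, not Django's or migrate-anything's implementation: the `Migration` class, the migration names, and the `migrate` function are hypothetical, the data is a plain dict, and the log stores only migration names (the sketch keeps all migration code in the software, so reading "down" code back from the data is not shown).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Migration:
    name: str
    up: Callable[[dict], object]    # forward change
    down: Callable[[dict], object]  # reverse change


def rename_key(old: str, new: str) -> Callable[[dict], None]:
    """Build a function that renames one key in place."""
    def apply(data: dict) -> None:
        data[new] = data.pop(old)
    return apply


# Ordered list of all known migrations (hypothetical examples).
MIGRATIONS = [
    Migration("0001_rename_table",
              up=rename_key("table", "tables"),
              down=rename_key("tables", "table")),
    Migration("0002_add_units",
              up=lambda d: d.setdefault("units", "px"),
              down=lambda d: d.pop("units", None)),
]


def migrate(data: dict, log: List[str], target: int) -> None:
    """Apply or unapply migrations until exactly `target` are applied."""
    while len(log) < target:          # data is older than the target
        migration = MIGRATIONS[len(log)]
        migration.up(data)
        log.append(migration.name)
    while len(log) > target:          # data is newer than the target
        migration = MIGRATIONS[len(log) - 1]
        migration.down(data)
        log.pop()
```

For example, calling `migrate(data, log, 2)` on `{"table": 1}` with an empty log applies both steps, and `migrate(data, log, 0)` afterwards reverts them in reverse order.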

Aims:

  • Create a tool to make format changes persistent on disk (migrate on-disk data to latest format).
  • Support special cases that are not in scope of the main project (e.g. issues specific to a data source like Visium).
  • (?) Ensure reader can read all old data (#655, #624) that has not been migrated.

Requirements:

  • We don't want to store code with the data. Thus backward migration is not possible (e.g. using an old software version to migrate a newer data version back).
  • Changes are not always idempotent. Version numbers are needed to determine whether to apply a migration. (example: A migration "Replace \ by \\" must not be applied if already applied before.)
  • Some changes may require user input (renaming, or when there is ambiguity), but should also be possible in a non-interactive mode.
  • We probably don't want to modify on-disk data in-place, so writing the converted data to a new destination is fine.
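
A minimal sketch of these requirements, under stated assumptions: migrations are forward-only, each step is gated by an integer version number so that a non-idempotent change is never applied twice, and the input is left untouched while the result goes to a new object. The `STEPS` table, the `"version"` and `"path"` keys, and `migrate_copy` are hypothetical names; a real tool would write to a new on-disk destination rather than return a dict.

```python
import copy


def escape_backslashes(data: dict) -> None:
    # Non-idempotent: applying this twice would double the backslashes again.
    data["path"] = data["path"].replace("\\", "\\\\")


# Each entry: (version it applies to, version it produces, step function).
STEPS = [
    (1, 2, escape_backslashes),
]
LATEST = 2


def migrate_copy(data: dict) -> dict:
    """Return a migrated copy; the input is not modified."""
    result = copy.deepcopy(data)
    version = result["version"]
    if version > LATEST:
        # No stored code, hence no backward migration.
        raise ValueError(f"Data version {version} is newer than {LATEST}")
    for from_version, to_version, step in STEPS:
        if version == from_version:
            step(result)
            version = to_version
    result["version"] = version
    return result
```

Because the version number is checked before every step, running `migrate_copy` on already-migrated data returns an unchanged copy.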

@LucaMarconato
Member

LucaMarconato commented Aug 19, 2024

Thanks for the detailed summary of the format changes. I would proceed as follows.

I would not include these three points in a migration tool:

  • Instead, I would open 3 GitHub discussions, show the URL to the user when the problem is detected in the code (for instance here) and explain in the GitHub discussion how to fix the problem.

For the example above this would roughly be: "Open a terminal and move table into tables, or if both are present, manually open them (they are standard AnnData objects) and choose which one to keep (or merge them). In doing that, check that the metadata keys and the region_key, instance_key columns are the ones you need".

  • I would then add a migration tool only for the Zarr -> Parquet change (the only one that is reflected in the format).

I think that having a migration tool dealing with the first 3 problems would be complex and it's better to explain to the user what the problem is and how to build a solution, so that they know what happens and they can choose a solution that suits them. For instance, there is no canonical way to fix missing radii because they were negative, so I'd let the user manually choose how to address them.

What do you think about this way to proceed?

@giovp
Member

giovp commented Sep 5, 2024

Super useful summary @aeisenbarth, I also agree with @LucaMarconato that a separate tool may be overkill. I think the best way would be to stick to the format version as much as possible, and reflect this in the code, without changing the API.

(?) Ensure reader can read all old data (#655, #624) that has not been migrated.

I think this ideally would be true, we should strive to make this possible imho.

@giovp giovp added the format label Sep 6, 2024
@LucaMarconato
Member

New format changes:

  • "Adding attrs at the SpatialData object level" (#711) adds a .attrs slot for SpatialData objects, which can contain Python data that is serializable to a JSON string. This data is now automatically read from and written to disk. To reflect this format change, the .zattrs at the root of the SpatialData object contains a version string (currently 0.1). This version string is independent from the versioning of the various elements (points, tables, ...).
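
A rough sketch of the idea of a root-level version string stored next to JSON-serializable attrs. The actual key layout of the root .zattrs is not given in this thread, so the `"version"` and `"attrs"` keys and both function names below are assumptions for illustration; only the version value 0.1 comes from the description above.

```python
import json

ROOT_FORMAT_VERSION = "0.1"  # value mentioned above; the key names are assumed


def dump_root_attrs(user_attrs: dict) -> str:
    """Serialize user attrs with a root-level version string.

    json.dumps raises TypeError if the data is not JSON-serializable.
    """
    return json.dumps({"version": ROOT_FORMAT_VERSION, "attrs": user_attrs})


def load_root_attrs(text: str) -> dict:
    """Parse the JSON text and check the root-level version string."""
    document = json.loads(text)
    version = document.get("version")
    if version != ROOT_FORMAT_VERSION:
        raise ValueError(f"Unsupported root format version: {version!r}")
    return document["attrs"]
```

The root version is checked independently of any per-element version, matching the separation described above.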
