
Unable to update curation_reference field or re-upload dataset after deletion #222

Open
roman-bushuiev opened this issue Nov 24, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@roman-bushuiev

Polaris version

0.9.1

Python Version

3.11.9

Operating System

Mac

Installation

pip install polaris-lib

Description

Hi!

I was following the official documentation to upload a new dataset (MassSpecGym) to Polaris. Everything worked fine, and my dataset appeared on the Polaris Hub. Then, I wanted to submit my dataset for certification by opening a new discussion in polaris-recipes.

The discussion form required me to check the box "I confirm that I filled out at least the readme, source and curation_reference fields for my Polaris dataset." I realized that I hadn't filled in the curation_reference field because it wasn't included in the documentation example. I couldn't find any programmatic way to add it, nor a way to add it through the Polaris Hub web page. So I deleted my dataset via the Polaris Hub web page and attempted to upload it again, this time including the curation_reference field specified in the polaris.dataset.Dataset constructor. Unfortunately, I am now unable to do so because I receive an error saying that a dataset with my name (massspecgym) already exists, even though it no longer appears on the Polaris Hub.
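
Roughly, the second upload attempt looked like this (a minimal sketch rather than my exact code; the table contents and metadata values are placeholders, using the field names mentioned in the docs and the discussion form):

import pandas as pd

from polaris.dataset import Dataset
from polaris.hub.client import PolarisHubClient

# Placeholder table standing in for the real MassSpecGym data
table = pd.DataFrame({"smiles": ["CCO", "CCN"], "spectrum_id": ["ID1", "ID2"]})

dataset = Dataset(
    table=table,
    name="massspecgym",
    source="https://github.com/pluskal-lab/MassSpecGym",
    license="CC-BY-4.0",  # placeholder; must be one of the supported license types
    curation_reference="https://github.com/pluskal-lab/MassSpecGym",  # the field I originally missed
)

with PolarisHubClient() as client:
    client.upload_dataset(dataset=dataset)  # now fails with the 409 below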

Is there a way to clean up and upload it again? I would appreciate your help! I am pasting the traceback below.

File ~/miniconda/envs/massspecgym/lib/python3.11/site-packages/httpx/_models.py:761, in Response.raise_for_status(self)
    760 message = message.format(self, error_type=error_type)
--> 761 raise HTTPStatusError(message, request=request, response=self)

HTTPStatusError: Client error '409 Conflict' for url 'https://polarishub.io/api/v1/dataset/roman-bushuiev/massspecgym'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409

The above exception was the direct cause of the following exception:

PolarisHubError                           Traceback (most recent call last)
Cell In[4], line 4
      1 from polaris.hub.client import PolarisHubClient
      3 with PolarisHubClient() as client:
----> 4     client.upload_dataset(dataset=dataset)

File ~/miniconda/envs/massspecgym/lib/python3.11/site-packages/polaris/hub/client.py:560, in PolarisHubClient.upload_dataset(self, dataset, access, timeout, owner, if_exists)
    555     raise InvalidDatasetError(
    556         f"\nPlease specify a supported license for this dataset prior to uploading to the Polaris Hub.\nOnly some licenses are supported - {get_args(SupportedLicenseType)}."
...
PolarisHubError: The request to the Polaris Hub failed. The Hub responded with:
{
  "message": "Dataset 'massspecgym', with slug 'massspecgym', already exists"
}

Steps to reproduce

  1. Create a new dataset following the official documentation.
  2. Delete the dataset via Polaris Hub web page.
  3. Try to create the dataset with the same name again.

Additional output

No response

roman-bushuiev added the bug (Something isn't working) label on Nov 24, 2024
@cwognum
Collaborator

cwognum commented Nov 25, 2024

Hey @roman-bushuiev , thanks for reporting!

There are two things happening here:

  • We currently don't support editing artifact metadata. This has been on our roadmap for a long time, but it has not been prioritized yet.
  • We intentionally soft delete artifacts, which means your dataset actually still exists.

I've just hard deleted your dataset, so you should be able to recreate it now.


One comment: although your dataset is not very large in terms of raw bytes, it does have quite a high number of datapoints (200k+). With the current benchmark implementation, the way we represent the split is quite inefficient at that scale. We are working on a Benchmark V2 implementation, which we expect to announce soon (in the next ~2 weeks), that solves this. However, Benchmark V2 will only work with the Dataset V2 implementation.

All of that is to say that I would recommend recreating your dataset as a V2 dataset. This isn't documented yet, but the code should look very similar. The biggest change is that you'll have to create a Zarr archive.

Starting from your pd.DataFrame, it should be easy to convert to Zarr. You may have to tweak the dtype for some of these arrays (e.g. for strings), but that should be minimal.

import zarr
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Write each DataFrame column as a separate array in a Zarr group
root = zarr.open("path/to/archive.zarr", mode="w")
for col in df.columns:
    root.array(col, data=df[col].values)
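
For string columns specifically, you'll likely need to be explicit about the dtype. A rough sketch, assuming zarr v2 with numcodecs installed (the smiles column is just an illustrative name):

import numpy as np
import zarr
from numcodecs import VLenUTF8

# Object-dtype (string) arrays need an explicit object codec in zarr v2
smiles = np.array(["CCO", "c1ccccc1", "CC(=O)O"], dtype=object)

root = zarr.open("path/to/archive.zarr", mode="a")
root.array("smiles", data=smiles, dtype=object, object_codec=VLenUTF8())

# Alternatively, cast to a fixed-width unicode dtype:
# root.array("smiles_fixed", data=smiles.astype(str))

Numeric columns should round-trip as in the loop above without any changes.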

If you want to go down the V2 road, let me know if you need help!
