Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First version implementation of preload_data and the rest of CLMS data store #6

Closed
wants to merge 92 commits into from

Conversation

b-yogesh
Copy link
Collaborator

@b-yogesh b-yogesh commented Dec 2, 2024

This PR introduces the initial mechanism of preloading data, including cache management, downloading, and file processing.

The following classes (components) are responsible for the mechanism:

CLMS

  • Serves as the main interface to interact with the CLMS API. This class coordinates with the PreloadData class to preload the data into a local filestore.

CacheManager

  • Manages the local cache of preloaded data.
  • Maintains a dictionary (cache) that maps data_ids to their respective file paths.
  • Handles file store from the xcube data store in a local directory and refreshes the cache when necessary.

DownloadTaskManager

  • Handles the download process, including managing download requests and checking their statuses.
  • Retrieves task statuses based on dataset and file IDs or task IDs, determining whether the download is pending, completed, or cancelled.
  • Initiates data downloads in chunks and manages zip file extraction, looking specifically for geo data. Definition of geo data is defined in the function docstring in the notes.

CLMSAPITokenHandler

  • Handles the creation and refreshing of the CLMS API token given the credentials which can be obtained following the steps here

FileProcessor

  • Handles the postprocessing of downloaded data, extracting, stacking and storing geo files from downloaded zip files.

PreloadData

  • The main class responsible for orchestrating the preloading of datasets.
  • It coordinates with CacheManager, DownloadTaskManager, CLMSAPITokenHandler and FileProcessor classes to handle the complete process of caching, data downloading, making sure token is valid and post-processing of downloaded data.
  • Utilizes threading for handling multiple data preloading tasks concurrently.
  • Uses notebook.tqdm for displaying progress bars

Currently, only logs are shown in the jupyter cells that let the user know of the request status which is updated every 60 seconds to avoid sending several requests to the CLMS API. This is configurable, so it can be changed if we need a lower or higher waiting time between request status. tqdm is used here that shows the progress bar for the actual download, unzipping, and postprocessing.

A tutorial notebook has also been added to the examples folder to show how to use this data store.

There are some missing tests for a few classes, but they will be added in the next PRs

Closes #3, #4, #5

Copy link
Member

@forman forman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store Implementation

  • The preload mechanism lacks the preload handle concept (but that's ok for time being)
  • Do not limit the caching data store to file. Allow for any MutableDataStore instead.

Coding

  • Using camel-case names, do not concatenate upper-case abbreviations, use camel-case instead: CLMSAPITokenHandler is not user friendly, use ClmsApiTokenHandler instead
  • Don't use os, use fsspec instead. Actually, don't use specific local filesystem calls at all, if not urgently required.

Testing

  • Test code is often hard to read because of the bloating mock setup
  • Use mocking more carefully! It can cause architectural code changes to be very expensive in the end!
  • Test with respect to correct results or behavious, not implementation.
  • Build test class names like so: ${class_or_func}Test, not Test${class_or_func}
  • Using plain assert doesn't distinguish between expected and actual values - which makes it harder to understand what went wrong if tests fail. Prefer test classes that use assert methods from unitest.TestCase.

Docstrings

  • Should start with a short, one-line sentence that follows immediately the """ and ends with a dot .. Use active speech in this sentence.
  • Void methods are supposed to return None. Don't document this.

FYI @konstntokas

README.md Outdated Show resolved Hide resolved
# for testing
- numpy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you import a package in non-test code, add it as a true project dependency.
Do not rely on transitive dependencies.

Copy link
Collaborator Author

@b-yogesh b-yogesh Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I only use numpy in the tests, so I have not added it in the project dependency.

@@ -28,8 +34,12 @@ exclude = ["test*", "doc*"]

[project.optional-dependencies]
dev = [
"numpy",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move into dependencies list and potentially also others.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only use numpy in the tests, so I have not added it in the dependencies list. By others, do you mean the pytest, black, flake8 ... etc.?

pyproject.toml Outdated Show resolved Hide resolved
test/test_cache_manager.py Outdated Show resolved Hide resolved
xcube_clms/store.py Outdated Show resolved Hide resolved
xcube_clms/store.py Show resolved Hide resolved
xcube_clms/utils.py Outdated Show resolved Hide resolved
xcube_clms/utils.py Show resolved Hide resolved
xcube_clms/processor.py Show resolved Hide resolved
@b-yogesh b-yogesh requested a review from forman December 16, 2024 08:05
Copy link
Collaborator

@konstntokas konstntokas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved (since the preload api will change) with some comments which you may address or note down for a later PR.

xcube_clms/clms.py Outdated Show resolved Hide resolved
Comment on lines 47 to 49
PENDING = "PENDING"
COMPLETE = "COMPLETE"
UNDEFINED = "UNDEFINED"
CANCELLED = "CANCELLED"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to keep them? There was a comment in Normans review last time that this is redundant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what I understand on Norman's comment was to let the constants be here if they are used in multiple files, which is the case for these.

xcube_clms/preload.py Outdated Show resolved Hide resolved
xcube_clms/processor.py Show resolved Hide resolved
@@ -61,9 +60,6 @@ def get_data_store_params_schema(cls) -> JsonObjectSchema:
)

params = dict(
url=JsonStringSchema(
title="URL of CLMS API",
),
credentials=JsonObjectSchema(
dict(**credentials_params),
required=("client_id", "user_id", "token_uri", "private_key"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the credentials json which I got from the CLMS store, there are more fields. Are the others not needed for the store? Please double check if required is correct here.

Comment on lines +38 to +40
_PORTAL_TYPE = {"portal_type": "DataSet"}
_METADATA_FIELDS = "metadata_fields"
_FULL_SCHEMA = "fullobjects"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I would like all the hardcoded strings to be defined as constants for easier manipulation later (if required).

Comment on lines 120 to 169
response.raise_for_status()
response.raise_for_status()

except HTTPError as e:
# This is to make sure that the user gets to see the actual error
# message which raise_for_status does not show
except HTTPError:
error_details = response.text
if "application/json" in response.headers.get("Content-Type", "").lower():
error_details = response.text
if "application/json" in response.headers.get("Content-Type", "").lower():
try:
error_details = response.json()
raise HTTPError(f"HTTP error {response.status_code}: {error_details}")

except JSONDecodeError as e:
raise JSONDecodeError(f"Invalid JSON: {e}", response.text, 0)
except HTTPError as eh:
raise HTTPError(f"HTTP error occurred: {eh}")
except Timeout as et:
raise Timeout(f"Timeout error occurred: {et}")
except RequestException as e:
raise RequestException(f"Request error occurred: {e}")
except JSONDecodeError as json_e:
LOG.error(f"Failed to decode JSON error response: {json_e}")
new_error_message = (
f"HTTP error {response.status_code}: {error_details}. Original error: {e}"
)
LOG.error(new_error_message)
raise HTTPError(new_error_message, response=e.response) from e

except (
Timeout,
RequestException,
) as e:
LOG.error(f"An error occurred during the request to {url}: {e}")
raise

except Exception as e:
raise Exception(f"Unknown error occurred: {e}")
LOG.error(f"Unknown error occurred: {e}")
raise

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cannot you do something like this?

        response = requests.request()
        if response.ok:
            return response # or do something else
        else:
            raise DataStoreError(response.raise_for_status())

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can be done but if I dont do this, the actual error message is lost leaving the user confused about the reason for the exception.
I have added a comment as well explaining the reason I do this.

@b-yogesh b-yogesh mentioned this pull request Jan 10, 2025
@b-yogesh
Copy link
Collaborator Author

Closing this PR as #12 refactors this PR.

@b-yogesh b-yogesh closed this Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add CI
3 participants