First version implementation of preload_data and the rest of CLMS data store #6

b-yogesh · 2024-12-02T13:26:18Z

This PR introduces the initial mechanism for preloading data, including cache management, downloading, and file processing.

The following classes (components) are responsible for the mechanism:

CLMS

Serves as the main interface to interact with the CLMS API. This class coordinates with the PreloadData class to preload the data into a local filestore.

CacheManager

Manages the local cache of preloaded data.
Maintains a dictionary (cache) that maps data_ids to their respective file paths.
Manages a file store, created via the xcube data store framework, in a local directory and refreshes the cache when necessary.
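A minimal sketch of the data_id-to-path mapping described above. The class name `SimpleCache`, the `refresh` method, and the one-folder-per-data_id layout are assumptions made for illustration; this is not the PR's actual `CacheManager`. The listing of paths is passed in, so the sketch does not depend on any particular filesystem API.

```python
from pathlib import PurePosixPath


class SimpleCache:
    """Map data IDs to the paths of their preloaded files."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    @property
    def cache(self) -> dict[str, str]:
        # Return a copy so callers cannot mutate the internal state.
        return dict(self._cache)

    def refresh(self, paths: list[str]) -> None:
        """Rebuild the mapping from a listing of store paths.

        Assumes (hypothetically) that the last path segment is the data ID.
        """
        self._cache = {PurePosixPath(p).name: p for p in paths}
```

Calling `refresh` again with a new listing simply replaces the old mapping, which is one simple way to keep the cache in sync with the store.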

DownloadTaskManager

Handles the download process, including managing download requests and checking their statuses.
Retrieves task statuses based on dataset and file IDs or task IDs, determining whether the download is pending, completed, or cancelled.
Initiates data downloads in chunks and manages zip file extraction, looking specifically for geo data. What counts as geo data is defined in the notes of the function docstring.
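The zip filtering step could look roughly like the sketch below. The PR defines "geo data" in a docstring that is not reproduced here, so the suffix list is purely an assumption for illustration, as are the function names.

```python
import io
import zipfile

# Assumption for illustration only: treat GeoTIFF files as "geo data".
GEO_SUFFIXES = (".tif", ".tiff")


def list_geo_members(zip_bytes: bytes) -> list[str]:
    """Return the names of archive members that look like geo files."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith(GEO_SUFFIXES)
        ]


def make_demo_zip() -> bytes:
    """Build a small in-memory archive to exercise the filter."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("results/tile_1.tif", b"fake raster")
        archive.writestr("results/readme.txt", b"notes")
    return buffer.getvalue()
```

For example, `list_geo_members(make_demo_zip())` keeps only the `.tif` member and skips the text file.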

CLMSAPITokenHandler

Handles the creation and refreshing of the CLMS API token given the credentials which can be obtained following the steps
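As a rough illustration of "creation and refreshing", the sketch below renews a token shortly before it expires. `TokenCache`, `fetch_token`, and `margin_seconds` are hypothetical names; how the real handler exchanges credentials with the CLMS API is not shown here and is not assumed.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Hold an access token and renew it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        margin_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # fetch_token is a hypothetical callable returning
        # (token, lifetime_in_seconds).
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, fetching a new one when needed."""
        expired = self._clock() >= self._expires_at - self._margin
        if self._token is None or expired:
            token, lifetime = self._fetch_token()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token
```

Injecting the clock keeps the expiry logic testable without waiting for real time to pass.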

FileProcessor

Handles the postprocessing of downloaded data, extracting, stacking and storing geo files from downloaded zip files.

PreloadData

The main class responsible for orchestrating the preloading of datasets.
It coordinates with the CacheManager, DownloadTaskManager, CLMSAPITokenHandler and FileProcessor classes to handle the complete process: caching, downloading data, keeping the token valid, and post-processing the downloaded data.
Uses threading to handle multiple data preloading tasks concurrently.
Uses notebook.tqdm to display progress bars.

Currently, only logs are shown in the Jupyter cells to inform the user of the request status, which is updated every 60 seconds to avoid sending too many requests to the CLMS API. This interval is configurable, so it can be changed if a shorter or longer wait between status checks is needed. tqdm shows the progress bar for the actual download, unzipping, and postprocessing.
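The polling behaviour with a configurable wait can be sketched as follows. The status strings match the constants in `xcube_clms/constants.py`; the function names, the `get_status` callables, and the use of `ThreadPoolExecutor` are assumptions for illustration and may differ from what `PreloadData` actually does.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

PENDING = "PENDING"
COMPLETE = "COMPLETE"
CANCELLED = "CANCELLED"


def wait_for_task(
    get_status: Callable[[], str],
    poll_interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task leaves the PENDING state; return the final status."""
    while True:
        status = get_status()
        if status != PENDING:
            return status
        # Wait between checks to avoid flooding the API with requests.
        sleep(poll_interval)


def wait_for_all(
    status_getters: dict[str, Callable[[], str]],
    poll_interval: float = 60.0,
) -> dict[str, str]:
    """Wait for several tasks concurrently, one thread per task."""
    with ThreadPoolExecutor(max_workers=max(1, len(status_getters))) as pool:
        futures = {
            data_id: pool.submit(wait_for_task, getter, poll_interval)
            for data_id, getter in status_getters.items()
        }
        return {data_id: f.result() for data_id, f in futures.items()}
```

Each task sleeps in its own thread, so a slow request for one dataset does not delay the status checks for the others.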

A tutorial notebook has also been added to the examples folder to show how to use this data store.

Tests are still missing for a few classes; they will be added in upcoming PRs.

Closes #3, #4, #5

forman

Store Implementation

The preload mechanism lacks the preload handle concept (but that's OK for the time being).
Do not limit the caching data store to file. Allow for any MutableDataStore instead.

Coding

When using camel-case names, do not concatenate upper-case abbreviations: CLMSAPITokenHandler is not user friendly, use ClmsApiTokenHandler instead.
Don't use os, use fsspec instead. Actually, don't use local-filesystem-specific calls at all unless strictly required.

Testing

Test code is often hard to read because of the bloated mock setup.
Use mocking more carefully! It can make architectural code changes very expensive in the end!
Test with respect to correct results or behaviour, not implementation.
Build test class names like so: ${class_or_func}Test, not Test${class_or_func}
Using plain assert doesn't distinguish between expected and actual values, which makes it harder to understand what went wrong if tests fail. Prefer test classes that use assert methods from unittest.TestCase.
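The naming and assertion advice above might look like this in practice. `normalize_data_id` is a hypothetical function invented only to have something to test; it is not part of the PR.

```python
import unittest


def normalize_data_id(data_id: str) -> str:
    """Trim and lower-case a data ID (hypothetical helper under test)."""
    return data_id.strip().lower()


# Class name follows the ${class_or_func}Test pattern.
class NormalizeDataIdTest(unittest.TestCase):
    def test_strips_and_lowercases(self):
        # assertEqual reports both values when the check fails.
        self.assertEqual("forest-type", normalize_data_id("  Forest-Type "))

    def test_blank_id_becomes_empty(self):
        self.assertEqual("", normalize_data_id("   "))
```

The tests check results only, with no mocks, so they keep passing if the implementation of the helper is restructured.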

Docstrings

Should start with a short, one-line sentence that immediately follows the """ and ends with a period. Use active voice in this sentence.
Void methods are supposed to return None. Don't document this.
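A docstring following these two points could look like the sketch below. The function itself is hypothetical and only serves as a carrier for the docstring.

```python
def refresh_cache(cache: dict, entries: dict) -> None:
    """Replace the cached entries with the given ones.

    Args:
        cache: The mapping to update in place.
        entries: The new mapping of data IDs to paths.
    """
    # No "Returns" section: the function returns None.
    cache.clear()
    cache.update(entries)
```

The summary line starts right after the opening quotes, uses the active voice, and ends with a period.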

FYI @konstntokas

README.md

forman · 2024-12-11T14:16:54Z

environment.yml

  # for testing
+  - numpy


If you import a package in non-test code, add it as a true project dependency.
Do not rely on transitive dependencies.

Currently, I only use numpy in the tests, so I have not added it to the project dependencies.

forman · 2024-12-11T14:17:48Z

pyproject.toml

@@ -28,8 +34,12 @@ exclude = ["test*", "doc*"]

 [project.optional-dependencies]
 dev = [
+    "numpy",


Move into dependencies list and potentially also others.

I only use numpy in the tests, so I have not added it to the dependencies list. By "others", do you mean pytest, black, flake8, etc.?

pyproject.toml

test/test_cache_manager.py

xcube_clms/store.py

xcube_clms/utils.py

xcube_clms/processor.py

Co-authored-by: Norman Fomferra <norman.fomferra@brockmann-consult.de>

konstntokas

Approved (since the preload api will change) with some comments which you may address or note down for a later PR.

xcube_clms/clms.py

konstntokas · 2024-12-17T09:43:07Z

xcube_clms/constants.py

 PENDING = "PENDING"
 COMPLETE = "COMPLETE"
-UNDEFINED = "UNDEFINED"
 CANCELLED = "CANCELLED"


Why do you want to keep them? There was a comment in Norman's review last time that this is redundant.

As I understand Norman's comment, constants should stay here if they are used in multiple files, which is the case for these.

xcube_clms/preload.py

xcube_clms/processor.py

konstntokas · 2024-12-17T09:54:10Z

xcube_clms/store.py

@@ -61,9 +60,6 @@ def get_data_store_params_schema(cls) -> JsonObjectSchema:
        )

        params = dict(
-            url=JsonStringSchema(
-                title="URL of CLMS API",
-            ),
            credentials=JsonObjectSchema(
                dict(**credentials_params),
                required=("client_id", "user_id", "token_uri", "private_key"),


In the credentials JSON that I got from the CLMS store, there are more fields. Are the others not needed for the store? Please double-check whether required is correct here.

konstntokas · 2024-12-17T09:55:07Z

xcube_clms/utils.py

+_PORTAL_TYPE = {"portal_type": "DataSet"}
+_METADATA_FIELDS = "metadata_fields"
+_FULL_SCHEMA = "fullobjects"


Are they needed?

Yes. I would like all the hardcoded strings to be defined as constants for easier manipulation later (if required).

konstntokas · 2024-12-17T09:59:19Z

xcube_clms/utils.py

-            response.raise_for_status()
+        response.raise_for_status()
+
+    except HTTPError as e:
        # This is to make sure that the user gets to see the actual error
        # message which raise_for_status does not show
-        except HTTPError:
-            error_details = response.text
-            if "application/json" in response.headers.get("Content-Type", "").lower():
+        error_details = response.text
+        if "application/json" in response.headers.get("Content-Type", "").lower():
+            try:
                error_details = response.json()
-            raise HTTPError(f"HTTP error {response.status_code}: {error_details}")
-
-    except JSONDecodeError as e:
-        raise JSONDecodeError(f"Invalid JSON: {e}", response.text, 0)
-    except HTTPError as eh:
-        raise HTTPError(f"HTTP error occurred: {eh}")
-    except Timeout as et:
-        raise Timeout(f"Timeout error occurred: {et}")
-    except RequestException as e:
-        raise RequestException(f"Request error occurred: {e}")
+            except JSONDecodeError as json_e:
+                LOG.error(f"Failed to decode JSON error response: {json_e}")
+        new_error_message = (
+            f"HTTP error {response.status_code}: {error_details}. Original error: {e}"
+        )
+        LOG.error(new_error_message)
+        raise HTTPError(new_error_message, response=e.response) from e
+
+    except (
+        Timeout,
+        RequestException,
+    ) as e:
+        LOG.error(f"An error occurred during the request to {url}: {e}")
+        raise
+
    except Exception as e:
-        raise Exception(f"Unknown error occurred: {e}")
+        LOG.error(f"Unknown error occurred: {e}")
+        raise
+


Can't you do something like this?

response = requests.request()
if response.ok:
    return response  # or do something else
else:
    raise DataStoreError(response.raise_for_status())

That can be done, but if I don't do this, the actual error message is lost, leaving the user confused about the reason for the exception.
I have also added a comment explaining why I do this.
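The point about keeping the server's explanation can be illustrated without any HTTP library. The sketch below mirrors the logic in the diff (prefer the JSON body, fall back to raw text) as a pure function; `describe_http_error` is a hypothetical name and not part of the PR.

```python
import json


def describe_http_error(status_code: int, content_type: str, body: str) -> str:
    """Build an error message that keeps the server's own explanation."""
    details = body
    if "application/json" in content_type.lower():
        try:
            details = json.loads(body)
        except json.JSONDecodeError:
            # Fall back to the raw text if the body is not valid JSON.
            pass
    return f"HTTP error {status_code}: {details}"
```

A message built this way can be attached to the re-raised exception, so the user sees why the API rejected the request rather than only the status code.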

b-yogesh · 2025-01-13T08:31:06Z

Closing this PR as #12 refactors this PR.

b-yogesh added 30 commits November 4, 2024 20:54

Implement initial methods in clms.py

4082cab

Refactor code

a55d133

Refactor code again

52b1e2a

Implement get_data_ids generator

31ccd70

Tiny typo fix

b124ef1

Implement has_data

39d8b37

Initial version of describe_data (not working entirely)

3b34c3a

Implement describe_data

70a0456

Implement get_open_data_params_schema

c075479

Implemented access token class

49cd69b

Implemented access token class

4ffa9ab

[In progress] - open_data implementation

f8ccbfb

[In progress] - open_data implementation - implemented _prepare_downl…

94ff10d

…oad_request

[In progress] - open_data implementation - more impl.

26a6322

[In progress] - open_data implementation - more impl.

f6645f2

Update make_api_request

7951b2e

temporary bbox and crs handling

e950bb7

fix condition

5ccede3

fix get_data_store_params_schema

24e93d1

Add constants

2e5f46f

Refactoring

fc938bc

Add schema for preload_data

76169d4

Remove error message truncation

d680129

Update get_metadata method

a91bd5c

Add initial unsupported datasets check

8aa6014

Raise exceptions for unsupported data and add preload params schema

13febb6

Update schema methods

63ba00c

Add TODOs

a049218

Implement first TODO: change data_id def

a6ccac0

Implement first TODO: add include_attr bool impl

3a1c329

Update README.md

a7a612d

b-yogesh requested review from konstntokas and forman December 10, 2024 17:04

b-yogesh added 5 commits December 11, 2024 09:21

Update README.md

b9c9e03

Update CLMSDataStoreTutorial.ipynb

dc622e9

Update .gitignore

f3efb86

Add numpy for tests

0850fb6

Add missing license text

8ebf8f7

b-yogesh mentioned this pull request Dec 11, 2024

Implement initial methods in clms.py #2

Closed

forman requested changes Dec 11, 2024

b-yogesh and others added 6 commits December 12, 2024 14:52

Apply suggestions from code review

7243361

Co-authored-by: Norman Fomferra <norman.fomferra@brockmann-consult.de>

Rename classes and fix tests

36583c4

Convert file_store and cache to properties

78c8943

Remove test_store.py

da3928e

Remove None return doc

54e2f7f

Improve docstrings

02f0261

b-yogesh mentioned this pull request Dec 12, 2024

Extract interface from FileProcessor and use that #11

Open

b-yogesh added 7 commits December 12, 2024 16:44

Update README.md

bdb362d

Move functions away from utils to respective files

b637c77

Move constants to their respective classes

cc44bdb

Improve make_api_request

f178dce

Improve make_api_request #2

b30329c

Improve tests

bb57381

Datastore to MutableDataStore

706c6df

b-yogesh requested a review from forman December 16, 2024 08:05

konstntokas approved these changes Dec 17, 2024

Remove init comments

27ab2ef

b-yogesh mentioned this pull request Jan 10, 2025

Refactoring Preload API #12

Merged

b-yogesh closed this Jan 13, 2025
