First version implementation of preload_data and the rest of CLMS data store #6

Closed
wants to merge 92 commits into from
4082cab
Implement initial methods in clms.py
b-yogesh Nov 4, 2024
a55d133
Refactor code
b-yogesh Nov 7, 2024
52b1e2a
Refactor code again
b-yogesh Nov 7, 2024
31ccd70
Implement get_data_ids generator
b-yogesh Nov 7, 2024
b124ef1
Tiny typo fix
b-yogesh Nov 7, 2024
39d8b37
Implement has_data
b-yogesh Nov 7, 2024
3b34c3a
Initial version of describe_data (not working entirely)
b-yogesh Nov 7, 2024
70a0456
Implement describe_data
b-yogesh Nov 8, 2024
c075479
Implement get_open_data_params_schema
b-yogesh Nov 8, 2024
49cd69b
Implemented access token class
b-yogesh Nov 8, 2024
4ffa9ab
Implemented access token class
b-yogesh Nov 8, 2024
f8ccbfb
[In progress] - open_data implementation
b-yogesh Nov 11, 2024
94ff10d
[In progress] - open_data implementation - implemented _prepare_downl…
b-yogesh Nov 12, 2024
26a6322
[In progress] - open_data implementation - more impl.
b-yogesh Nov 12, 2024
f6645f2
[In progress] - open_data implementation - more impl.
b-yogesh Nov 14, 2024
7951b2e
Update make_api_request
b-yogesh Nov 15, 2024
e950bb7
temporary bbox and crs handling
b-yogesh Nov 15, 2024
5ccede3
fix condition
b-yogesh Nov 15, 2024
24e93d1
fix get_data_store_params_schema
b-yogesh Nov 15, 2024
2e5f46f
Add constants
b-yogesh Nov 15, 2024
fc938bc
Refactoring
b-yogesh Nov 15, 2024
76169d4
Add schema for preload_data
b-yogesh Nov 15, 2024
d680129
Remove error message truncation
b-yogesh Nov 15, 2024
a91bd5c
Update get_metadata method
b-yogesh Nov 15, 2024
8aa6014
Add initial unsupported datasets check
b-yogesh Nov 15, 2024
13febb6
Raise exceptions for unsupported data and add preload params schema
b-yogesh Nov 18, 2024
63ba00c
Update schema methods
b-yogesh Nov 18, 2024
a049218
Add TODOs
b-yogesh Nov 19, 2024
a6ccac0
Implement first TODO: change data_id def
b-yogesh Nov 19, 2024
3a1c329
Implement first TODO: add include_attr bool impl
b-yogesh Nov 19, 2024
efc0c16
Fix has_data and describe_data
b-yogesh Nov 19, 2024
d4810bd
[WIP] queue download refactor
b-yogesh Nov 19, 2024
e821dd2
[WIP] refactor existing code into preload and clms classes
b-yogesh Nov 20, 2024
6d640f2
[WIP] implement queue downloads.
b-yogesh Nov 22, 2024
5fbd5aa
[WIP] Further impl. preload
b-yogesh Nov 25, 2024
7926ce4
[WIP] Further impl. preload
b-yogesh Nov 26, 2024
d064097
[WIP] Working download impl.
b-yogesh Nov 27, 2024
50b4c52
[WIP] Add initial cancel handler
b-yogesh Nov 27, 2024
faaec47
[WIP] Add initial cancel handler
b-yogesh Nov 29, 2024
20e44b4
[WIP] Refactoring
b-yogesh Dec 2, 2024
167baf7
[WIP] Improved download data
b-yogesh Dec 2, 2024
3ea8e7f
[WIP] Improved download data + merging
b-yogesh Dec 3, 2024
0be046b
Finish implementation
b-yogesh Dec 5, 2024
16ee337
Added tests for utils.py
b-yogesh Dec 5, 2024
e793038
Added tests for api_token.py
b-yogesh Dec 5, 2024
3abef8e
Refactor clms.py
b-yogesh Dec 6, 2024
425780d
Add docstrings clms.py
b-yogesh Dec 6, 2024
db902a3
Add tests for clms.py
b-yogesh Dec 6, 2024
3e3ce73
add preload const
b-yogesh Dec 6, 2024
5d7125d
Refactor
b-yogesh Dec 6, 2024
dceb47d
Add Docstrings and Type Annotations
b-yogesh Dec 6, 2024
7dbdb4b
Add example notebook
b-yogesh Dec 6, 2024
0d4018d
Minor fixes
b-yogesh Dec 6, 2024
865ead4
Add missing docs
b-yogesh Dec 6, 2024
3e1b545
Fix tests
b-yogesh Dec 10, 2024
975e498
Update error message
b-yogesh Dec 10, 2024
2176db7
Modify list to iterator in get_data_ids
b-yogesh Dec 10, 2024
715d1cf
Update CLMSDataStoreTutorial.ipynb
b-yogesh Dec 10, 2024
c7c79a8
Update README.md
b-yogesh Dec 10, 2024
a88eaf9
Create unittest.yml
b-yogesh Dec 10, 2024
70188bd
Add test_cache_manager.py
b-yogesh Dec 10, 2024
56fcf42
Add test_token_handler.py
b-yogesh Dec 10, 2024
65a271f
Rename .github/unittest.yml to .github/workflows/unittest.yml
b-yogesh Dec 10, 2024
7a36783
Add test_processor.py
b-yogesh Dec 10, 2024
d98c712
Merge remote-tracking branch 'origin/yogesh_preload-data' into yogesh…
b-yogesh Dec 10, 2024
ef40d9a
Update env
b-yogesh Dec 10, 2024
1c7888b
Remove redundant api class
b-yogesh Dec 10, 2024
4546279
Update pyproject.toml
b-yogesh Dec 10, 2024
f91ecd1
add ipywidgets
b-yogesh Dec 10, 2024
2e2250a
fix test_clms.py
b-yogesh Dec 10, 2024
dd3229a
Add CLMS url as constant
b-yogesh Dec 10, 2024
5b0eda2
Fix tests
b-yogesh Dec 10, 2024
a7a612d
Update README.md
b-yogesh Dec 10, 2024
b9c9e03
Update README.md
b-yogesh Dec 11, 2024
dc622e9
Update CLMSDataStoreTutorial.ipynb
b-yogesh Dec 11, 2024
f3efb86
Update .gitignore
b-yogesh Dec 11, 2024
0850fb6
Add numpy for tests
b-yogesh Dec 11, 2024
8ebf8f7
Add missing license text
b-yogesh Dec 11, 2024
7243361
Apply suggestions from code review
b-yogesh Dec 12, 2024
36583c4
Rename classes and fix tests
b-yogesh Dec 12, 2024
78c8943
Convert file_store and cache to properties
b-yogesh Dec 12, 2024
da3928e
Remove test_store.py
b-yogesh Dec 12, 2024
54e2f7f
Remove None return doc
b-yogesh Dec 12, 2024
02f0261
Improve docstrings
b-yogesh Dec 12, 2024
bdb362d
Update README.md
b-yogesh Dec 12, 2024
b637c77
Move functions away from utils to respective files
b-yogesh Dec 12, 2024
cc44bdb
Move constants to their respective classes
b-yogesh Dec 12, 2024
f178dce
Improve make_api_request
b-yogesh Dec 13, 2024
b30329c
Improve make_api_request #2
b-yogesh Dec 13, 2024
bb57381
Improve tests
b-yogesh Dec 13, 2024
706c6df
Datastore to MutableDataStore
b-yogesh Dec 13, 2024
27ab2ef
Remove init comments
b-yogesh Jan 6, 2025
29 changes: 29 additions & 0 deletions .github/workflows/unittest.yml
@@ -0,0 +1,29 @@
name: Unittest xcube-clms

on:
  push:
  release:
    types: [published]

jobs:
  unittest:
    runs-on: ubuntu-latest
    steps:
      - name: checkout xcube-clms
        uses: actions/checkout@v4

      - name: Set up MicroMamba
        uses: mamba-org/setup-micromamba@v1
        with:
          environment-file: environment.yml

      - name: Run unit tests
        shell: bash -l {0}
        run: |
          pytest --cov=xcube_clms --cov-report=xml

      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v4
        with:
          verbose: true
          token: ${{ secrets.CODECOV_TOKEN }}
2 changes: 2 additions & 0 deletions .gitignore
@@ -160,3 +160,5 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

examples/notebooks/preload_cache/
106 changes: 105 additions & 1 deletion README.md
@@ -1 +1,105 @@
# xcube-clms
# xcube-clms

[![Unittest xcube-clms](https://github.com/xcube-dev/xcube-clms/actions/workflows/unittest.yml/badge.svg)](https://github.com/xcube-dev/xcube-clms/actions/workflows/unittest.yml)
[![Codecov xcube-clms](https://codecov.io/gh/xcube-dev/xcube-clms/graph/badge.svg?token=n6X9zQIkXb)](https://codecov.io/gh/xcube-dev/xcube-clms)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![License](https://img.shields.io/github/license/xcube-dev/xcube-clms)](https://github.com/xcube-dev/xcube-clms/blob/main/LICENSE)

The `xcube-clms` Python package provides an
[xcube data store](https://xcube.readthedocs.io/en/latest/api.html#data-store-framework)
that enables access to datasets hosted by the
[Copernicus Land Monitoring Service (CLMS)](https://land.copernicus.eu/en).
The data store is called `"clms"` and implemented as an [xcube plugin](https://xcube.readthedocs.io/en/latest/plugins.html).
It uses the [CLMS API](https://eea.github.io/clms-api-docs/introduction.html)
under the hood.

## Setup <a name="setup"></a>

### Installing the xcube-clms plugin from the repository <a name="install_source"></a>

To install xcube-clms directly from the git repository, clone the repository,
`cd` into `xcube-clms`, and follow the steps below:

```bash
conda env create -f environment.yml
conda activate xcube-clms
pip install .
```

This sets up a new conda environment, installs all the dependencies required
for `xcube-clms`, and then installs `xcube-clms` directly from the repository
into the environment.

### Create credentials to access the CLMS API

Create the credentials required for the CLMS API as a `json` file by following
the [documentation](https://eea.github.io/clms-api-docs/authentication.html).
The credentials are required when initializing the CLMS data store. Please
follow the instructions in the
`examples/notebooks/CLMSDataStoreTutorial.ipynb` notebook
on how to pass the credentials from the `json` file to the store.
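
As a rough sketch of what this could look like, the snippet below reads the
`json` file into a dictionary. The file name `clms-credentials.json` and the
`credentials` keyword shown in the commented usage are assumptions made for
illustration only; the tutorial notebook remains the authoritative reference.

```python
import json
from pathlib import Path


def load_clms_credentials(path):
    """Read the CLMS API credentials file and return its contents as a dict."""
    with Path(path).open("r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical usage (file name and `credentials` keyword are assumptions):
#
#   from xcube.core.store import new_data_store
#   credentials = load_clms_credentials("clms-credentials.json")
#   store = new_data_store("clms", credentials=credentials)
```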

## Testing <a name="testing"></a>

To run the unit test suite:

```bash
pytest
```

## Additional Notes about the data store

This data store introduces an initial mechanism for preloading data, which covers cache management, downloading, and file processing.
It is currently experimental and will change in future versions.

The preload interface was added because of how the CLMS API works: the user creates a data request, which waits in a queue for an undetermined time before it is processed. Once it is processed, the resulting zip files are downloaded, unzipped, and extracted into a cache, which can then be opened using a file store.

Preloading lets the data store request datasets from the CLMS API in a non-blocking way. It handles sending the download request, waiting in the queue, periodically checking the request status, downloading the data, and extracting and post-processing it.

The preload mechanism is invoked by calling `.preload_data(*data_ids)` on the CLMS data store instance.
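
The request, poll, and download sequence can be sketched roughly as follows.
This is not the code of this data store: `FakeApi`, `preload`, and
`preload_in_background` are hypothetical names, and the in-memory `FakeApi`
only stands in for the real CLMS API calls.

```python
import threading
import time


def preload(data_id, api, poll_interval=0.01):
    """Request a dataset, poll until the request completes, then download it.

    `api` is a hypothetical object with `request`, `status`, and `download`
    methods; it stands in for the CLMS API calls made by the data store.
    """
    task_id = api.request(data_id)
    while api.status(task_id) != "completed":
        time.sleep(poll_interval)  # wait in the queue without busy-looping
    return api.download(task_id)


class FakeApi:
    """In-memory stand-in that reports 'completed' after a few status checks."""

    def __init__(self, checks_until_done=3):
        self._remaining = checks_until_done

    def request(self, data_id):
        return f"task-{data_id}"

    def status(self, task_id):
        self._remaining -= 1
        return "completed" if self._remaining <= 0 else "pending"

    def download(self, task_id):
        return f"{task_id}.zip"


def preload_in_background(data_id, api, results):
    """Run `preload` in a thread so the caller is not blocked."""
    thread = threading.Thread(
        target=lambda: results.__setitem__(data_id, preload(data_id, api))
    )
    thread.start()
    return thread
```

The caller gets the thread back immediately and can keep working; the result
appears in `results` once the simulated request has completed.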

The following classes (components) are responsible for this mechanism:

**CLMS**

- Serves as the main interface to interact with the CLMS API. This class coordinates with the PreloadData class to preload the data into a local filestore.

**CacheManager**

- Manages the local cache of preloaded data.
- Maintains a dictionary (cache) that maps data_ids to their respective file paths.
- Manages the xcube file store located in a local directory and refreshes the cache when necessary.
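
A minimal sketch of such a cache, assuming that every top-level entry in the
cache directory is named after its data_id. The class name `SimpleCache` and
this directory layout are assumptions for illustration and are not taken from
the implementation.

```python
from pathlib import Path


class SimpleCache:
    """Hypothetical stand-in for a cache that maps data_ids to file paths."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache = {}
        self.refresh()

    def refresh(self):
        """Rebuild the mapping from the entries currently in the cache directory."""
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.cache = {
            entry.name: str(entry) for entry in sorted(self.cache_dir.iterdir())
        }

    def has(self, data_id):
        """Return True if the data_id is present in the cache mapping."""
        return data_id in self.cache
```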

**DownloadTaskManager**

- Handles the download process, including managing download requests and checking their statuses.
- Retrieves task statuses based on dataset and file IDs or task IDs, determining whether the download is pending, completed, or cancelled.
- Initiates data downloads in chunks and manages zip file extraction, looking specifically for geo data. What counts as geo data is defined in the notes of the function docstring.
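
The selective extraction step might look like the sketch below. The list of
extensions is an assumption made for this example; the actual definition of
geo data lives in the function docstring mentioned above. Chunked downloading
is not shown.

```python
import io
import zipfile

# Assumption for illustration: which file extensions count as geo data.
GEO_EXTENSIONS = (".tif", ".tiff", ".nc")


def extract_geo_files(zip_bytes, target_dir):
    """Extract only files with a geo-data extension; return their names."""
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(GEO_EXTENSIONS):
                archive.extract(name, target_dir)
                extracted.append(name)
    return extracted
```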

**ClmsApiTokenHandler**

- Handles the creation and refreshing of the CLMS API token from the credentials, which can be obtained by following the steps [here](https://eea.github.io/clms-api-docs/authentication.html).
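
The refresh logic can be illustrated with a small, library-free sketch. The
class `TokenCache`, the injected `fetch_token` callable, and the one-hour
default lifetime are all assumptions; the real handler builds a JWT grant and
exchanges it with the CLMS API, which is not shown here.

```python
import time


class TokenCache:
    """Hypothetical helper that caches a token and refreshes it when expired."""

    def __init__(self, fetch_token, lifetime_seconds=3600, clock=time.time):
        self._fetch_token = fetch_token  # callable returning a new token string
        self._lifetime = lifetime_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, fetching a new one if the old one expired."""
        if self._token is None or self._clock() >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = self._clock() + self._lifetime
        return self._token
```

Injecting the clock keeps the expiry behaviour testable without waiting for
real time to pass.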

**FileProcessor**

- Handles the post-processing of downloaded data: extracting, stacking, and storing geo files from the downloaded zip files.

**PreloadData**

- The main class responsible for orchestrating the preloading of datasets.
- Coordinates with the _CacheManager_, _DownloadTaskManager_, _ClmsApiTokenHandler_ and _FileProcessor_ classes to handle the complete process: caching, downloading data, making sure the token is valid, and post-processing the downloaded data.
- Uses threading to handle multiple data preloading tasks concurrently.
- Uses `tqdm.notebook` to display progress bars.
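
To illustrate the concurrent part, here is a sketch that uses
`concurrent.futures.ThreadPoolExecutor` from the standard library. This is a
substitute chosen for brevity, not necessarily how _PreloadData_ manages its
threads, and `preload_all` and `preload_one` are hypothetical names. Progress
bars are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def preload_all(data_ids, preload_one, max_workers=4):
    """Run `preload_one(data_id)` for each data_id concurrently.

    Returns a dict mapping each data_id to its result, or to the exception
    it raised, so one failing dataset does not hide the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            data_id: pool.submit(preload_one, data_id) for data_id in data_ids
        }
        for data_id, future in futures.items():
            try:
                results[data_id] = future.result()
            except Exception as exc:  # keep going, record the failure
                results[data_id] = exc
    return results
```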

## CLMS API

- Requires an EU account to register on the CLMS site.
- Once registered, the user should create an access token json file as described [here](https://eea.github.io/clms-api-docs/authentication.html).
- The user can now use this json credentials file with the CLMS store (in development).

## CLMS API issues
This API has the following known problems:

- Datasets made available via requests contain a download link to a zip file that is supposed to be valid for 3 days. We found that this is not reliable: the link can expire earlier, so we cannot rely on this period to ensure that the download link still works. As a workaround, we manage our own expiry times. This issue has been raised with the CLMS service desk. Quoting their reply: `For the first issue mentioned by you: The status is completed and there is indicated that there are 2 days for expiring, but the download link is already expired, we are going to investigate this bug.`
- We use the API to find out whether a given data_id has already been requested from the CLMS server and what its status is, so that we can either get the download link directly or, if it has not been requested yet or has expired, request it. This does not work reliably: although the web UI no longer shows old downloads that have expired, the API still returns expired requests as completed, without any information that they have expired or when they will expire. Quoting the CLMS helpdesk reply: `For the second issue mentioned by you: the @datarequest_search endpoint does not seem to be working as expected, we are going to consult the API experts so to check its functioning and in case an improvement is feasible in our side, we'll let you know.` and its follow-up a week later: `After having analysed the possibility to improve the status of the downloads, our team answers the following: Currently, our download system is not able to extract information on whether the link has expired or not, therefore our API does not provide this information.` Due to this, we had to create workarounds to figure out whether a given dataset's link has expired.
- The cancel endpoint of the API does not work; this issue was raised with the helpdesk team as well. Quoting their reply: `Recently a new firewall of the CLMS Portal machine has been setup. This new firewall is blocking some of the process cancelation request. We've detected the issue and working with the IT team to solve it`.
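
One way to manage expiry times locally is sketched below. The tracker class,
its method names, and the 24-hour lifetime are assumptions chosen for
illustration; they are not the values or names used by this data store.

```python
import time

# Assumption: treat links as stale well before the advertised 3 days.
ASSUMED_LINK_TTL_SECONDS = 24 * 60 * 60


class LinkExpiryTracker:
    """Hypothetical tracker recording when each download link was first seen."""

    def __init__(self, ttl_seconds=ASSUMED_LINK_TTL_SECONDS, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen_at = {}

    def record(self, data_id):
        """Remember the time at which a download link for data_id was obtained."""
        self._seen_at[data_id] = self._clock()

    def is_usable(self, data_id):
        """True if a link was recorded and is younger than the local TTL."""
        seen = self._seen_at.get(data_id)
        return seen is not None and self._clock() - seen < self._ttl
```

When `is_usable` returns `False`, the caller would create a fresh data request
instead of trusting the status reported by the API.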
6 changes: 6 additions & 0 deletions environment.yml
@@ -6,7 +6,13 @@ dependencies:
- python>=3.10
- xarray
- xcube >= 1.7.0
- cryptography
Collaborator:

What do you need this for?

Collaborator Author:

This is required by the JWT library to create the grant which is required to get the access token for the CLMS API.

- tqdm
- ipywidgets
# for testing
- numpy
Member:

If you import a package in non-test code, add it as a true project dependency.
Do not rely on transitive dependencies.

Collaborator Author (@b-yogesh, Dec 12, 2024):

Currently, I only use numpy in the tests, so I have not added it in the project dependency.

- black
- flake8
- pytest
- pytest-cov
- pytest-recording