- Free software: BSD license
- Documentation: https://WenyuOuyang.github.io/hydrodatasource
- 📜 Chinese documentation (中文文档)
Although numerous public watershed hydrological datasets are available, there are still challenges in this field:
- Many datasets are not updated or included in subsequent versions.
- Some datasets remain uncovered by existing collections.
- Non-public datasets cannot be openly shared.
To address these issues, hydrodatasource provides a framework to organize and manage these datasets, making them more efficient for use in watershed-based research and production scenarios.
This repository complements hydrodataset, which focuses on public datasets. In contrast, hydrodatasource integrates a broader range of data resources, including non-public and custom datasets.
hydrodatasource processes data that falls primarily into three categories:
Category A (public data): These datasets are organized and managed according to predefined formats, including:
- GIS Datasets: Geographic vector data, such as watershed boundaries and station shapefiles.
- Gridded Datasets: Datasets such as ERA5Land, GPM, and AIFS, which are stored in a MinIO database.
Category B (non-public data): These datasets are often proprietary or confidential and require specific tools for formatting and integration, including:
- Custom Station Data: User-prepared station data formatted into NetCDF for seamless model usage.
- Industry Data: Professionally integrated and formatted datasets.
Custom hydrological datasets (the third category): Built from Category A and B data, these datasets adhere to predefined standard formats for specific research needs.
hydrodatasource provides standardized methods for:
- Structuring datasets according to predefined conventions.
- Integrating various data sources into a unified framework.
- Supporting data access and processing for hydrological modeling.
- Public Data: Supports data format conversion and local file operations.
- Non-Public Data: Provides tools to format and integrate user-prepared data.
- MinIO Integration: Efficient management of large-scale gridded data via API.
The repository structure supports diverse workflows, including:
- Category A GIS Data: Tools to organize and access GIS datasets.
- Category A Gridded Data: Large-scale grid data management via MinIO.
- Category B Data: Custom tools to clean and process station, reservoir, and basin time-series data.
- Custom Hydrological Datasets: Support for predefined dataset formats.
hydrodatasource interacts with the following components:
- hydrodataset: Provides necessary support for accessing public datasets.
- HydroDataCompiler: Supports semi-automated processing of non-public and custom data (currently not public).
- MinIO Database: Efficient storage and management of Category A gridded data (currently accessible only within the internal network).
Install the package via pip:
pip install hydrodatasource
Note: The project is still in the early stages of development, so installing it in development (editable) mode is recommended.
The repository adopts the following directory structure for organizing data:
├── datasets-origin # Public hydrological datasets
├── datasets-interim # Custom hydrological datasets
├── gis-origin # Public GIS datasets
├── grids-origin # Gridded datasets
├── stations-origin # Category B station data (raw)
├── stations-interim # Category B station data (processed)
├── reservoirs-origin # Category B reservoir data (raw)
├── reservoirs-interim # Category B reservoir data (processed)
├── basins-origin # Category B basin data (raw)
├── basins-interim # Category B basin data (processed)
- origin: Raw data, often from proprietary sources, in unified formats.
- interim: Preprocessed data ready for analysis or modeling.
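The layout above can be set up and checked with a few lines of standard-library Python. This is a minimal sketch, not part of hydrodatasource: the helper names `create_data_layout` and `missing_dirs` are hypothetical, and only the folder names come from the structure listed above.

```python
from pathlib import Path
import tempfile

# Folder names copied from the directory structure above.
DATA_DIRS = [
    "datasets-origin", "datasets-interim",
    "gis-origin", "grids-origin",
    "stations-origin", "stations-interim",
    "reservoirs-origin", "reservoirs-interim",
    "basins-origin", "basins-interim",
]


def create_data_layout(root):
    """Create every expected folder under `root` (hypothetical helper)."""
    root = Path(root)
    created = []
    for name in DATA_DIRS:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def missing_dirs(root):
    """Return the expected folder names that do not exist under `root`."""
    root = Path(root)
    return [name for name in DATA_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real data root.
    with tempfile.TemporaryDirectory() as tmp:
        create_data_layout(tmp)
        print("missing after creation:", missing_dirs(tmp))
```

Running `missing_dirs` against an existing data root is a quick way to spot folders that still need to be created before processing starts.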
- Public GIS Data:
  - Store vector files, such as watershed boundaries and station shapefiles, in the gis-origin folder.
  - Process the data using:

        from hydrodatasource import gis
        gis.process_gis_data(input_path="gis-origin", output_path="gis-interim")

- Gridded Datasets:
  - Store raw grid data, such as ERA5Land and GPM, in grids-origin.
  - Use the MinIO API to download or manage data stored in the database:

        from hydrodatasource import grid
        grid.download_from_minio(dataset_name="ERA5Land", save_path="grids-interim")

- Station Data:
  - Store raw data in the stations-origin folder and processed data in stations-interim.
  - Check the standard format for station data:

        from hydrodatasource import station
        station.get_station_format()

  - Format and process the data:

        station.process_station_data(input_path="stations-origin", output_path="stations-interim")

- Reservoir Data:
  - Store raw reservoir data in the reservoirs-origin folder and cleaned data in reservoirs-interim.
  - Specific tools are provided for integration and formatting.
- Basin Data:
  - Store raw basin data in the basins-origin folder and processed data in basins-interim.
  - These datasets typically include attributes and spatial information supporting hydrological modeling.
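
The origin/interim naming convention used for the station, reservoir, and basin folders can be sketched as a small path mapping. This is an illustration only: `interim_path_for` is a hypothetical helper, not a hydrodatasource function, and the file name in the usage example is made up.

```python
from pathlib import PurePosixPath


def interim_path_for(raw_path):
    """Map a path under '<kind>-origin' to the same path under '<kind>-interim'.

    Hypothetical helper illustrating the folder convention; it only rewrites
    the first path component that ends in '-origin'.
    """
    parts = list(PurePosixPath(raw_path).parts)
    suffix = "-origin"
    for i, part in enumerate(parts):
        if part.endswith(suffix):
            parts[i] = part[: -len(suffix)] + "-interim"
            return PurePosixPath(*parts)
    raise ValueError(f"no '*-origin' folder in {raw_path!r}")


if __name__ == "__main__":
    # 'site_001.csv' is an invented file name used only for demonstration.
    print(interim_path_for("stations-origin/site_001.csv"))
```

Keeping the relative path identical on both sides makes it easy to trace a processed file back to the raw file it came from.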
Custom datasets are stored in the datasets-interim folder. They are organized according to predefined standard formats to facilitate integration and subsequent model use.
The MinIO database is primarily used for storing and managing large-scale gridded data (Category A data), such as ERA5Land and other dynamic datasets:
- Configure MinIO access in the hydro_settings.yml file.
- Upload or download data:

      from hydrodatasource import minio
      minio.upload_to_minio(local_path="grids-interim/ERA5Land", dataset_name="ERA5Land")
hydrodatasource bridges diverse hydrological datasets and advanced modeling needs by providing standardized workflows and modular tools. This ensures efficient data management and integration, supporting both research and operational applications.