The Big Data Platform is part of the Open Data Hub project. It collects heterogeneous data sets from different sources and domains, processes them, and serves both the raw and the processed data through a REST interface.
For a detailed introduction, see our Big Data Platform Introduction.
Table of Contents
- Big Data Platform
- Inbound API (writer)
- Flight rules
- I want to generate a new schema dump out of Hibernate's Entity classes
- I want to update license headers of each source file
- I want to see details of this project as HTML page
- I want to use client in my Java Maven project
- I want to publish a new client sdk on our maven repository
- I want to get started with a new data-collector
- Information
The core of the platform contains the business logic of an INBOUND API, which handles connections to the database and provides an API for data collectors (see writer) in the form of a REST interface and a Java SDK (see client).
Finally, dto is a library containing all Data Transfer Objects used by the writer and client to exchange data in a standardized format.
The OUTBOUND API is called Ninja.
The writer is a REST API, which takes JSON DTOs, deserializes and validates them and finally stores them in the database. Additionally, it sets stations to active/inactive according to their presence inside the provided data. The writer itself implements the methods to write data and is therefore the endpoint for all data collectors. It uses the persistence-unit of the DAL which has full permissions on the database.
The full API description can be found inside JsonController.java.
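The activation rule mentioned above can be illustrated with a small sketch. It is not the writer's actual implementation (see JsonController.java and the DAL for that): the class `StationSyncSketch`, its method and the use of plain station codes are assumptions made for illustration, and the sketch ignores that the real sync works per station typology.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT the writer's real code.
// A known station stays active only if its code is part of the incoming data.
class StationSyncSketch {

    // known: station code -> current active flag
    // incoming: station codes present in the provided data
    static Map<String, Boolean> sync(Map<String, Boolean> known, Set<String> incoming) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        // Existing stations are active only if they are still provided
        for (String code : known.keySet()) {
            result.put(code, incoming.contains(code));
        }
        // Stations seen for the first time are created as active
        for (String code : incoming) {
            result.putIfAbsent(code, true);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> known = new LinkedHashMap<>();
        known.put("A", true);
        known.put("B", true);
        // "B" is missing from the new data, "C" is new
        System.out.println(sync(known, Set.of("A", "C")));
    }
}
```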
If you want to run the application using Docker, the environment is already set up with all dependencies for you. This is the recommended way to test and develop data collectors for this writer API. You only have to install Docker and Docker Compose and follow the instructions below.
In the root folder of this repository:
- Copy .env.example to .env
- Run docker-compose up -d
- You can follow logs with docker-compose logs -f
Now you have a Postgres instance running on port 5555 and the API on port 8999.
Let's test Postgres first:
- Log in to the DB
a) with Docker, do:
$ docker-compose exec db bash
bash-5.1# psql -U bdp bdp
b) natively, do:
PGPASSWORD=password psql -h localhost -p 5555 -U bdp bdp
- Test the installation as follows:
bdp=# set search_path to intimev2;
bdp=# \dt
                  List of relations
  Schema  |           Name           | Type  | Owner
----------+--------------------------+-------+-------
 intimev2 | edge                     | table | bdp
 intimev2 | event                    | table | bdp
 intimev2 | flyway_schema_history    | table | bdp
 intimev2 | location                 | table | bdp
 intimev2 | measurement              | table | bdp
 intimev2 | measurementhistory       | table | bdp
 intimev2 | measurementjson          | table | bdp
 intimev2 | measurementjsonhistory   | table | bdp
 intimev2 | measurementstring        | table | bdp
 intimev2 | measurementstringhistory | table | bdp
 intimev2 | metadata                 | table | bdp
 intimev2 | provenance               | table | bdp
 intimev2 | station                  | table | bdp
 intimev2 | type                     | table | bdp
 intimev2 | type_metadata            | table | bdp
(15 rows)
If you see output similar to the above, you are all set!
Please use the curl commands inside the chapter Authentication to test the writer API.
If you do not want to use Docker, you can also start this application manually.
You need Java 17, Maven, and a Postgres DB. PostgreSQL can optionally be started with our Docker setup: just call docker-compose up -d db. It runs on port 5555. Alternatively, install and start your own PostgreSQL instance.
The database, schema and the privileged user must already exist. If that is not the case, create them:
-- These values are already set inside the application.properties file, so you do
-- not need to configure anything except the port if you keep them like this!
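create database bdp;
-- Role and schema names are identifiers, so they must not be single-quoted
create user bdp with login password 'password';
-- Connect to the new database (\c bdp in psql) before creating the schema
create schema if not exists intimev2;
grant all on schema intimev2 to bdp;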
To start the writer, do the following:
- Open writer/src/main/resources/application.properties and configure it. This step can be omitted if you use our dockerized PostgreSQL. For your own Postgres, just change the port to 5432 and make sure you use the same names as shown above. Otherwise, configure those parameters as well.
- Start the Java application with mvn spring-boot:run
The application itself will create tables and other database objects for you. If you prefer to do that manually, set spring.flyway.enabled=false and execute the SQL files inside writer/src/main/resources/db/migration yourself. Replace ${default_schema} with your default schema, most probably intimev2.
Please use the curl commands inside the chapter Authentication to test the writer API.
We use Keycloak to authenticate. That service provides an access_token that can be used to send POST requests to the writer. See the Open Data Hub Authentication / Quick Howto for further details.
curl -X POST -L "https://auth.opendatahub.testingmachine.eu/auth/realms/noi/protocol/openid-connect/token" \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'client_id=odh-mobility-datacollector-development' \
--data-urlencode 'client_secret=7bd46f8f-c296-416d-a13d-dc81e68d0830'
With this call you get an access_token that can then be used as follows in all writer API methods. Here is just an example to get all stations:
curl -X GET "http://localhost:8999/json/stations" \
--header 'Content-Type: application/json' \
--header 'Authorization: bearer your-access-token'
You should get an empty JSON list as result.
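For data collectors written in Java, the two curl calls above can be mirrored with the JDK's built-in java.net.http client (Java 11 or later). The following is a minimal sketch, not part of the official client SDK: the class `WriterAuthSketch` and its method names are made up for illustration, and extracting the access_token from the JSON token response is left out because it would need a JSON library.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch only: builds the same requests as the curl examples above.
class WriterAuthSketch {

    static final String TOKEN_URL =
        "https://auth.opendatahub.testingmachine.eu/auth/realms/noi/protocol/openid-connect/token";

    // Builds the form-encoded client_credentials token request
    static HttpRequest tokenRequest(String clientId, String clientSecret) {
        String form = "grant_type=client_credentials"
            + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
            + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(TOKEN_URL))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(form))
            .build();
    }

    // Builds the GET request for the stations endpoint of a locally running writer
    static HttpRequest stationsRequest(String accessToken) {
        return HttpRequest.newBuilder(URI.create("http://localhost:8999/json/stations"))
            .header("Content-Type", "application/json")
            .header("Authorization", "bearer " + accessToken)
            .GET()
            .build();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: an access_token obtained from the token endpoint
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(stationsRequest(args[0]), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Building the requests separately from sending them keeps the sketch easy to inspect without a running writer or Keycloak instance.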
Write an email to help@opendatahub.com if you want to get the client_secret and an Open Data Hub OAuth2 account for a non-development setup.
DAL is the Data Access Layer, which communicates with the underlying DB and is used by the writer modules. The communication is handled through the ORM Hibernate and its spatial component for geometries. The whole module was developed using PostgreSQL as database and PostGIS as an extension.
Connection pooling is handled by HikariCP for high speed connections to the DB.
In some cases geometry transformations and elaborations had to be executed at application level, and therefore GeoTools was added as a dependency.
To configure the DAL module to communicate with your database, you need to provide configuration and credentials inside writer/src/main/resources/application.properties.
Defaults can be found at writer/src/main/resources/META-INF/persistence.xml.
Please note that values inside the application.properties file override values inside persistence.xml.
We use a schema-generator to generate the schema for the database. After that, you can manually check the difference between that schema and the old one and provide a new Flyway script inside writer/src/main/resources/db/migration.
Hibernate, our object-relational mapping (ORM) framework, handles the schema validation only (for security reasons). Usually, we set the value hibernate.hbm2ddl.auto = validate during development and hibernate.hbm2ddl.auto = none at runtime for performance reasons on startup.
This chapter describes the most important DAL entities:
- station
- datatype
- record
- edge
Station
The station represents the origin of the data, which needs an identifier, a name, a coordinate and a so-called stationtype. It should also contain the origin of the data, the current active state (whether it is actively used or not) and whether it has a parent station, which is used to model a hierarchical station structure. For all remaining data that enriches the station, we created a field metadata. It can hold any kind of meta information in the form of a JSON object. To understand the functionality and the main job of this entity, check the source code in Station.java.
Example:
A station can be of stationtype MeteoStation, has an identifier 89935GW and a position "latitude":46.24339407235059,"longitude":11.199431152658656. It can have additional information like address, municipality, opening times etc., which would be modelled as metadata entries.
DataType
The data type represents the typology of the data in the form of a unique name and a unit. Description and metric of measurements can also be provided.
Example:
A temperature can have a unit °C and can be an average value of the last 300 seconds (called period).
Record
A record represents a single measurement containing a value, a timestamp, a data-type, a station, and a provenance. Provenance indicates which data collector in which version collected the data. It is needed to implement traceability between collectors and inserted data, to identify data for cleansing or bug fixes.
Example:
We measure on Fri Dec 16 2016 10:47:33 a data type temperature (see data type example) of 20.4 for a meteo station called 89935GW (see station example).
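The three examples above (station, data type, record) can be tied together in a small sketch. These Java records are hypothetical and only mirror the fields named in the text; the real entities live in the DAL (see Station.java), and the metadata and provenance values below are made up.

```java
import java.time.LocalDateTime;
import java.util.Map;

// Hypothetical model of the examples above, NOT the DAL entity classes.
class EntitySketch {

    record Station(String stationType, String id, double latitude, double longitude,
                   Map<String, Object> metadata) {}

    record DataType(String name, String unit, String metric) {}

    // period is the aggregation window in seconds
    record Measurement(Station station, DataType dataType, LocalDateTime timestamp,
                       double value, int period, String provenance) {}

    static Measurement example() {
        Station station = new Station("MeteoStation", "89935GW",
            46.24339407235059, 11.199431152658656,
            Map.of("municipality", "placeholder")); // metadata content is made up
        DataType temperature = new DataType("temperature", "\u00B0C", "average");
        return new Measurement(station, temperature,
            LocalDateTime.of(2016, 12, 16, 10, 47, 33), 20.4, 300,
            "example-collector 1.0.0"); // provenance value is made up
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```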
Edge
An edge represents the spatial geometry between two stations. We model this internally as a station triple: origin, edge, destination, because currently only stations can be exposed through our API. We add a line-geometry to that triple to describe the entity geographically. Here, origin and destination are two stations of any type that represent two points on the map. The edge is also a station of type LinkStation, which has no coordinates. It is the description of the edge.
Example:
A street between two stations, where the measured data could be how many cars passed it.
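As a rough illustration of the triple described above, the following sketch models origin, edge and destination together with a line geometry. All type names are hypothetical and the geometry is simplified to a list of coordinate pairs; the DAL itself uses Hibernate's spatial types.

```java
import java.util.List;

// Hypothetical sketch of the origin / edge / destination triple.
class EdgeSketch {

    // A station with coordinates, used as origin or destination
    record PointStation(String id, double latitude, double longitude) {}

    // The edge itself is a station of type LinkStation without coordinates
    record LinkStation(String id, String description) {
        String stationType() { return "LinkStation"; }
    }

    // The triple plus a line geometry, given as [longitude, latitude] pairs
    record Edge(PointStation origin, LinkStation edge, PointStation destination,
                List<double[]> line) {}

    // Simplest possible geometry: a straight line from origin to destination
    static Edge straightLine(PointStation origin, LinkStation link, PointStation destination) {
        List<double[]> line = List.of(
            new double[] {origin.longitude(), origin.latitude()},
            new double[] {destination.longitude(), destination.latitude()});
        return new Edge(origin, link, destination, line);
    }

    public static void main(String[] args) {
        Edge street = straightLine(
            new PointStation("A", 46.0, 11.0),
            new LinkStation("A-B", "street between A and B"),
            new PointStation("B", 46.5, 11.5));
        System.out.println(street.edge().stationType() + " with "
            + street.line().size() + " points");
    }
}
```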
If you need more information about specific entities or classes, try to use the javadoc or source code inside DAL.
Data transfer objects (DTOs) are used to define the structure of the data exchange. They are used between data provider and data persister (writer). They consist of fields which are all primitives and easily serializable. The DTO module is a Java library contained in all modules of the Big Data Platform, simply because it defines the communication structure between them.
The following chapters describe the most used DTOs.
StationDto: Describes a place where measurements get collected. It is the origin of the data. We define the structure inside StationDto.java.
DataTypeDto: Describes a specific type of data. We define the structure inside DataTypeDto.java.
SimpleRecordDto: Describes the measured value. We define the structure inside SimpleRecordDto.java.
The client contains the API through which components can communicate with the BDP writer. Just include the client maven dependency in your project and use the existing JSON client implementation.
The API contains several methods. We describe the most important ones here; for the rest, see the JSONPusher.java implementation.
Object getDateOfLastRecord(String stationCode, String dataType, Integer period)
This method is required to get the date of the last valid record.
Object syncStations(List<StationDto> data)
This method is used to create, update, and deactivate stations of a specific typology; data must be a list of StationDto objects.
Object syncDataTypes(List<DataTypeDto> data)
This method is used to create and update (and therefore upsert) data types; data must be a list of DataTypeDto objects.
Object pushData(DataMapDto<? extends RecordDtoImpl> dto)
This is where you put all the data you want to pass to the writer. The data is stored in the form of a tree.
Each branch can have multiple child branches, but can also hold data itself, which means the tree can have arbitrary depth. Right now, by our internal conventions, we store everything on the second level, like this:
+- Station
   |
   +- DataType
      |
      `- Data
As value you can put a list of SimpleRecordDto objects, which contain all the data points for a specific station and a specific type. Each point is represented as timestamp and value. To better understand the structure, see the DataMapDto.java source.
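To make the two-level convention more tangible, here is a simplified stand-in for the tree using nested maps: station code, then data type, then a list of data points. It is only a sketch under that assumption; `DataTreeSketch` and `Point` are hypothetical names and do not reproduce the real DataMapDto or SimpleRecordDto classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the tree: station -> data type -> data points.
class DataTreeSketch {

    // One data point: a timestamp and a value
    record Point(long timestamp, Object value) {}

    private final Map<String, Map<String, List<Point>>> tree = new LinkedHashMap<>();

    // Creates the station and data type branches on demand, then appends the point
    void add(String stationCode, String dataType, long timestamp, Object value) {
        tree.computeIfAbsent(stationCode, k -> new LinkedHashMap<>())
            .computeIfAbsent(dataType, k -> new ArrayList<>())
            .add(new Point(timestamp, value));
    }

    // Returns the points stored on the second level, or an empty list
    List<Point> points(String stationCode, String dataType) {
        return tree.getOrDefault(stationCode, Map.of()).getOrDefault(dataType, List.of());
    }

    public static void main(String[] args) {
        DataTreeSketch data = new DataTreeSketch();
        long ts = 1481885253000L; // example timestamp in epoch milliseconds
        data.add("89935GW", "temperature", ts, 20.4);
        System.out.println(data.points("89935GW", "temperature"));
    }
}
```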
To generate a new schema dump out of Hibernate's entity classes, see README.md inside /infrastructure/utils/schema-generator.
To update license headers in each source code file, run mvn license:format. To configure the header template, edit the files in LICENSE/templates/ and set the correct attributes inside each pom.xml. See the license-maven-plugin homepage for details. Use the quicklicense.sh script to update all source code license headers at once.
Run mvn site to create an HTML page with all details of this project. Results can be found under <project>/target/site/; the entry point is, as usual, index.html.
To use the client in your Java Maven project, include the following snippet in your pom.xml file:
<repositories>
<repository>
<id>maven-repo.opendatahub.com</id>
<url>https://maven-repo.opendatahub.com/snapshot</url>
</repository>
</repositories>
Include the client dependency for data collectors:
<dependency>
<groupId>com.opendatahub.timeseries.bdp</groupId>
<artifactId>client</artifactId>
<version>7.3.0</version>
</dependency>
You can also use a version range, like [7.3.0,8.0.0). Find the latest version in our release channel on GitHub.
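The range notation means "7.3.0 inclusive up to 8.0.0 exclusive". The sketch below spells out that interval check for plain numeric major.minor.patch versions. It is an illustration only: `VersionRangeSketch` is a made-up helper, and Maven's own version comparison also handles qualifiers such as -SNAPSHOT, which this sketch does not.

```java
// Hypothetical helper illustrating the half-open range [lower, upper).
class VersionRangeSketch {

    // Splits "7.3.0" into {7, 3, 0}; only numeric major.minor.patch is supported
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < 3; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    // Compares two versions component by component
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return 0;
    }

    // True if lowerInclusive <= version < upperExclusive
    static boolean inRange(String version, String lowerInclusive, String upperExclusive) {
        return compare(version, lowerInclusive) >= 0 && compare(version, upperExclusive) < 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("7.4.0", "7.3.0", "8.0.0"));
        System.out.println(inRange("8.0.0", "7.3.0", "8.0.0"));
    }
}
```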
This chapter is for the NOI team only. It describes how to publish a new client on our Maven repo, either manually or via the GitHub Action workflow, and either as a "release" or a "snapshot" version.
SNAPSHOT RELEASES: If you push code to the main branch that changes either dto or client, the GitHub Action workflow deploys a new snapshot version of those libraries. The version is then the latest version tag on the prod branch with a -SNAPSHOT suffix. For example, if the version tag is v7.4.0, then the new snapshot version string is 7.4.0-SNAPSHOT (the initial v is removed).
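The naming rule can be written down in a few lines. This is only a sketch of the rule as described here, not the code the GitHub Action workflow actually runs; `SnapshotVersionSketch` is a made-up name.

```java
// Hypothetical helper: derives the snapshot version string from a version tag.
class SnapshotVersionSketch {

    static String fromTag(String tag) {
        // Drop the leading "v" of the tag, if present
        String version = tag.startsWith("v") ? tag.substring(1) : tag;
        // Append the snapshot suffix
        return version + "-SNAPSHOT";
    }

    public static void main(String[] args) {
        System.out.println(fromTag("v7.4.0")); // prints 7.4.0-SNAPSHOT
    }
}
```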
PRODUCTION RELEASES: Push your code to the prod branch and tag it with a semantic versioning tag prefixed by v, for example v7.5.0. As you might notice, in the past we had version tags without that prefix, but the new GitHub Action workflow requires it, so please always include it in future.
Create a file ~/.m2/settings.xml and copy/paste the following code:
<settings>
<servers>
<server>
<id>maven-repo.opendatahub.com-release</id>
<username>your-remote-repos-username</username>
<password>your-remote-repos-password</password>
</server>
<server>
<id>maven-repo.opendatahub.com-snapshot</id>
<username>your-remote-repos-username</username>
<password>your-remote-repos-password</password>
</server>
</servers>
</settings>
Replace your-remote-repos-username and your-remote-repos-password with your Maven repo credentials. We have a group on our AWS/IAM called s3-odh-maven-repo that gives permissions to push to the Maven repo. Assign that group to your user if needed, or search for s3-odh-maven-repo on our password server for credentials.
Update all pom.xml files with the correct version. Here is an example to create a snapshot release with version 8.0.1 (do not put a v prefix):
./infrastructure/utils/quickrelease.sh snapshot 8.0.1
Use ./infrastructure/utils/quickrelease.sh release 8.0.1 for a production release.
Call mvn --projects dto --projects client --also-make clean install deploy
Refer to the Contributing chapter and our HelloWorld Example Data Collector inside https://github.com/noi-techpark/bdp-commons to start a new data collector.
For support, please contact help@opendatahub.com.
If you'd like to contribute, please follow our Getting Started instructions.
More documentation can be found at https://docs.opendatahub.com.
The code in this project is licensed under the GNU General Public License Version 3. See the LICENSE file for more information.
This project is REUSE compliant, more information about the usage of REUSE in NOI Techpark repositories can be found here.
Since the CI for this project checks for REUSE compliance you might find it useful to use a pre-commit hook checking for REUSE compliance locally. The pre-commit-config file in the repository root is already configured to check for REUSE compliance with help of the pre-commit tool.
Install the tool by running:
pip install pre-commit
Then install the pre-commit hook via the config file by running:
pre-commit install