PostgreSQL database for Open Data #21

Open
wants to merge 1 commit into master
4 changes: 3 additions & 1 deletion opendata-python/Makefile
@@ -2,7 +2,9 @@ venv:
	pipenv install tox tox-pyenv twine

test: venv
-	pipenv run tox
+	docker-compose -f docker-compose.test.yaml up -d
+	pipenv run tox
+	docker-compose down

build: venv
	pipenv run python setup.py sdist bdist_wheel
62 changes: 62 additions & 0 deletions opendata-python/README.md
@@ -166,3 +166,65 @@ activities[99].metadata
...
}}
```

### Connecting to a PostgreSQL database
Although having all the Open Data files available as plain files on your computer has advantages (especially for less tech-savvy users), querying the data is slow and can be complicated.
To overcome this, it is possible to store all the data in a [PostgreSQL](https://www.postgresql.org/) database as well.

Setting up PostgreSQL (documentation [here](https://www.postgresql.org/docs/11/tutorial-install.html)) can be a hassle, so there is a `docker-compose.yaml` included in this repository that *should* work out of the box by running `docker-compose up` in the directory where the file is stored.
I am not going down the rabbit hole of explaining how to install docker and docker-compose here (a quick search will yield enough results for that). One comment: on macOS and Linux the installation is mostly painless, but on Windows it often is not, and I would advise against using docker there.
As an alternative, you can use a local installation of PostgreSQL (by default username=opendata, password=password and database name=opendata are assumed).
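
If your local PostgreSQL does not use these defaults, the connection details can be changed through the database settings in `opendata/conf.py` or passed to `OpenDataDB` directly; a minimal sketch (the values shown are just the defaults again):
```python
# Option 1: override the settings before importing OpenDataDB
# (the defaults are read from opendata/conf.py when the class is defined).
from opendata.conf import settings

settings.db_host = 'localhost'
settings.db_port = '5432'
settings.db_user = 'opendata'
settings.db_password = 'password'
settings.db_name = 'opendata'

# Option 2: pass the connection details to OpenDataDB explicitly.
from opendata.db.main import OpenDataDB

opendatadb = OpenDataDB(host='localhost', port='5432', user='opendata',
                        password='password', database='opendata')
```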

When PostgreSQL is installed correctly and running, inserting data into the database is as easy as:
```python
from opendata import OpenData
from opendata.db.main import OpenDataDB
from opendata.models import LocalAthlete

od = OpenData()
opendatadb = OpenDataDB()
opendatadb.create_tables() # This is only needed once

athlete = od.get_remote_athlete('0031326c-e796-4f35-8f25-d3937edca90f')

opendatadb.insert_athlete(athlete)
```
Please note: this only inserts the athlete into the database, not the athlete's activities.
To add all the activities too:
```python
for activity in athlete.activities():
    opendatadb.insert_activity(activity, athlete)
```
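
Putting both steps together, mirroring a number of athletes could look like the sketch below (the list of IDs is just an example):
```python
# Example athlete IDs; substitute the ones you are interested in.
athlete_ids = [
    '0031326c-e796-4f35-8f25-d3937edca90f',
]

for athlete_id in athlete_ids:
    athlete = od.get_remote_athlete(athlete_id)
    opendatadb.insert_athlete(athlete)
    for activity in athlete.activities():
        opendatadb.insert_activity(activity, athlete)
```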

At this point there are 2 tables in the opendata database: "athletes" and "activities".
The database schemas for both tables can be viewed [here](opendata/db/models.py).

If you are familiar with raw SQL you can query the database directly, but if you prefer to stay in Python land, I have you covered too: under the hood this library uses the [SQLAlchemy](https://www.sqlalchemy.org/) ORM.
For some general documentation on how that works, see [here](https://docs.sqlalchemy.org/en/latest/orm/tutorial.html).
Querying the data is possible using SQLAlchemy's query language (documentation [here](https://docs.sqlalchemy.org/en/latest/orm/query.html)).

For example, to get a count of all activities that have power:
```python
from opendata.db import models
from sqlalchemy.sql import not_

session = opendatadb.get_session()
session.query(models.Activity).filter(not_(models.Activity.power.all('nan'))).count()
```
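
Because the per-sample columns (time, power, heart rate, and so on) are stored as one-dimensional float arrays, query results can be loaded back into pandas. A small sketch, assuming at least one activity with power and time samples has been inserted:
```python
import pandas as pd

from opendata.db import models
from sqlalchemy.sql import not_

session = opendatadb.get_session()

# Grab one activity that has power data.
activity = session.query(models.Activity).filter(
    not_(models.Activity.power.all('nan'))).first()

# The array columns map back onto the original CSV columns.
df = pd.DataFrame({'time': activity.time, 'power': activity.power})
print(df.describe())
```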

Filters can be [chained](https://docs.sqlalchemy.org/en/latest/glossary.html#term-method-chaining) to apply multiple filters in one query:
```python
from datetime import datetime

from opendata.db import models
from sqlalchemy.sql import not_

session = opendatadb.get_session()
session.query(models.Activity).filter(models.Activity.datetime <= datetime(2017, 1, 1)).\
    filter(not_(models.Activity.power.all('nan'))).count()
```
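
The other standard SQLAlchemy query methods, such as `order_by` and `limit`, should work as well; for example, a sketch that fetches the ten most recent activities with power data:
```python
from opendata.db import models
from sqlalchemy.sql import not_

session = opendatadb.get_session()
recent = (session.query(models.Activity)
          .filter(not_(models.Activity.power.all('nan')))
          .order_by(models.Activity.datetime.desc())
          .limit(10)
          .all())
```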

You can also query for nested keys/values in the metadata (stored in the "meta" column, because the `metadata` attribute is reserved by SQLAlchemy):
```python
session.query(models.Activity).filter(models.Activity.metrics.contains({'workout_time': '2703.00000'})).count()
```
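
The sessions in the examples above are left open; either call `session.close()` when you are done, or use the `session()` context manager that `OpenDataDB` provides (see `opendata/db/main.py`), which closes it for you:
```python
from opendata.db import models

with opendatadb.session() as session:
    # The session is closed automatically when the block exits.
    print(session.query(models.Activity).count())
```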
12 changes: 12 additions & 0 deletions opendata-python/docker-compose.test.yaml
@@ -0,0 +1,12 @@
version: '3.3'

services:
  postgres:
    image: postgres
    restart: always
    ports:
      - "5433:5432"
    environment:
      POSTGRES_USER: opendata
      POSTGRES_PASSWORD: password
      POSTGRES_DB: opendata
14 changes: 14 additions & 0 deletions opendata-python/docker-compose.yaml
@@ -0,0 +1,14 @@
version: '3.3'

services:
  postgres:
    image: postgres
    restart: always
    ports:
      - "5432:5432"
    volumes:
      - ./postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: opendata
      POSTGRES_PASSWORD: password
      POSTGRES_DB: opendata
7 changes: 6 additions & 1 deletion opendata-python/opendata/conf.py
@@ -28,5 +28,10 @@
    data_prefix='data',
    metadata_prefix='metadata',
    datasets_prefix='datasets',
-    local_storage=config['Storage']['local_storage_path']
+    local_storage=config['Storage']['local_storage_path'],
+    db_host='localhost',
+    db_port='5432',
+    db_user='opendata',
+    db_password='password',
+    db_name='opendata',
)
Empty file.
11 changes: 11 additions & 0 deletions opendata-python/opendata/db/constants.py
@@ -0,0 +1,11 @@
csv_to_db_mapping = {
    'secs': 'time',
    'km': 'distance',
    'spd': 'speed',
    'power': 'power',
    'cad': 'cadence',
    'hr': 'heartrate',
    'alt': 'altitude',
    'slope': 'slope',
    'temp': 'temperature',
}
83 changes: 83 additions & 0 deletions opendata-python/opendata/db/main.py
@@ -0,0 +1,83 @@
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from opendata.conf import settings
from opendata.utils import filename_to_datetime
from . import models
from .constants import csv_to_db_mapping


class OpenDataDB:
    def __init__(self, host=settings.db_host, port=settings.db_port,
                 user=settings.db_user, password=settings.db_password,
                 database=settings.db_name):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.database = database
        self.Session = sessionmaker()

    def get_engine(self):
        return create_engine(
            f'postgres://{self.user}:{self.password}@{self.host}:{self.port}/{self.database}'
        )

    @contextmanager
    def engine(self):
        engine = self.get_engine()
        yield engine
        engine.dispose()

    def get_session(self):
        return self.Session(bind=self.get_engine())

    @contextmanager
    def session(self):
        session = self.get_session()
        yield session
        session.close()

    def create_tables(self):
        with self.session() as session, self.engine() as engine:
            models.Base.metadata.create_all(engine)
            session.commit()

    def insert_athlete(self, athlete):
        with self.session() as session:
            session.add(models.Athlete(
                id=athlete.id,
                meta=athlete.metadata
            ))
            session.commit()

    def insert_activity(self, activity, athlete=None):
        with self.session() as session:
            if activity.metadata is not None \
                    and 'METRICS' in activity.metadata:
                metrics = activity.metadata.pop('METRICS')
            else:
                metrics = None

            db_activity = models.Activity(
                id=activity.id,
                datetime=filename_to_datetime(activity.id),
                meta=activity.metadata,
                metrics=metrics,
            )

            if athlete is not None:
                db_activity.athlete = athlete.id

            for column in csv_to_db_mapping.keys():
                if column in activity.data:
                    setattr(
                        db_activity,
                        csv_to_db_mapping[column],
                        activity.data[column].values.tolist()
                    )

            session.add(db_activity)
            session.commit()
42 changes: 42 additions & 0 deletions opendata-python/opendata/db/models.py
@@ -0,0 +1,42 @@
from sqlalchemy import Column, Float, ForeignKey, String
from sqlalchemy.dialects import postgresql
from sqlalchemy.types import DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class Athlete(Base):
    __tablename__ = 'athletes'

    id = Column(String, primary_key=True)
    meta = Column(postgresql.JSONB)
    activities = relationship('Activity')

    def __repr__(self):
        return f'<Athlete({self.id})>'


class Activity(Base):
    __tablename__ = 'activities'

    id = Column(String, primary_key=True)
    athlete = Column(String, ForeignKey('athletes.id'))
    datetime = Column(DateTime)

    meta = Column(postgresql.JSONB)
    metrics = Column(postgresql.JSONB)

    time = Column(postgresql.ARRAY(Float, dimensions=1))
    distance = Column(postgresql.ARRAY(Float, dimensions=1))
    speed = Column(postgresql.ARRAY(Float, dimensions=1))
    power = Column(postgresql.ARRAY(Float, dimensions=1))
    cadence = Column(postgresql.ARRAY(Float, dimensions=1))
    heartrate = Column(postgresql.ARRAY(Float, dimensions=1))
    altitude = Column(postgresql.ARRAY(Float, dimensions=1))
    slope = Column(postgresql.ARRAY(Float, dimensions=1))
    temperature = Column(postgresql.ARRAY(Float, dimensions=1))

    def __repr__(self):
        return f'<Activity({self.id})>'
6 changes: 5 additions & 1 deletion opendata-python/opendata/utils.py
@@ -11,8 +11,12 @@ def date_string_to_filename(date_string):
    return suffix + '.csv'


+def filename_to_datetime(filename):
+    return datetime.strptime(filename, FILENAME_FORMAT_WITH_EXTENSION)
+
+
def filename_to_date_string(filename):
-    dt = datetime.strptime(filename, FILENAME_FORMAT_WITH_EXTENSION)
+    dt = filename_to_datetime(filename)
    return dt.strftime(DATE_STRING_FORMAT) + 'UTC'


2 changes: 2 additions & 0 deletions opendata-python/setup.py
@@ -14,6 +14,8 @@
"boto3",
"pandas",
"pkgsettings",
"psycopg2",
"sqlalchemy==1.2.5",
],
tests_require=[
"pytest",
5 changes: 5 additions & 0 deletions opendata-python/tests/conftest.py
@@ -5,6 +5,11 @@
import pytest


pytest_modules = ['db']

settings.db_port = '5433'


def dummy_metadata(dir_name):
    metadata = {
        'ATHLETE': 'some metadata',
1 change: 1 addition & 0 deletions opendata-python/tests/db/__init__.py
@@ -0,0 +1 @@
from opendata.db.main import OpenDataDB
63 changes: 63 additions & 0 deletions opendata-python/tests/db/conftest.py
@@ -0,0 +1,63 @@
from uuid import uuid4

import pytest
from opendata.conf import settings
from opendata.db import models, main
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker


@pytest.fixture(scope='function')
def test_db_empty():
    engine = create_engine(
        f'postgres://{settings.db_user}:{settings.db_password}@{settings.db_host}:{settings.db_port}')
    conn = engine.connect()
    conn.execute('commit')

    db_name = f'opendata_test_{uuid4().hex}'
    conn.execute(f'create database {db_name}')
    conn.execute('commit')

    yield db_name

    engine.dispose()
    conn.close()


@pytest.fixture(scope='function')
def test_db(test_db_empty):
    opendatadb = main.OpenDataDB(database=test_db_empty)
    opendatadb.create_tables()
    return test_db_empty


@pytest.fixture(scope='function')
def db_engine(test_db):
    engine = create_engine(
        f'postgres://{settings.db_user}:{settings.db_password}@{settings.db_host}:{settings.db_port}/{test_db}'
    )

    yield engine

    engine.dispose()


@pytest.fixture(scope='function')
def db_conn(db_engine):
    conn = db_engine.connect()
    conn.execute('commit')

    yield conn

    conn.close()


@pytest.fixture(scope='function')
def db_session(db_engine):
    Session = sessionmaker()
    session = Session(bind=db_engine)

    yield session

    session.close()