Split docs #31

Merged
merged 11 commits into from
Sep 17, 2020
595 changes: 24 additions & 571 deletions README.md

Large diffs are not rendered by default.

28 changes: 28 additions & 0 deletions dev/dev.md
@@ -0,0 +1,28 @@
# Developer instructions

As a developer, you can build the code like this:

make install

For testing, add a local database with expected credentials, for instance like this:

sudo -u postgres psql
postgres=# CREATE USER tbtest WITH PASSWORD 'tbtest';
postgres=# CREATE DATABASE tbtest WITH OWNER = tbtest;
postgres=# exit

or this:

docker run --name test-postgres -p 5432:5432 -e POSTGRES_PASSWORD=tbtest -e POSTGRES_USER=tbtest -e POSTGRES_DB=tbtest -d postgres

And run tests:

make test

To update dependencies run:

pip-compile -o ci/requirements.txt -U

To install dependencies run:

pip install -r ci/requirements.txt
47 changes: 47 additions & 0 deletions timely_beliefs/docs/accuracy.md
@@ -0,0 +1,47 @@
# Metrics of probabilistic accuracy

## Table of contents

1. [Accuracy and error metrics](#accuracy-and-error-metrics)
1. [Probabilistic forecasts](#probabilistic-forecasts)
1. [Probabilistic reference](#probabilistic-reference)
1. [References](#references)

## Accuracy and error metrics

To our knowledge, there is no standard metric for accuracy.
However, there are some standard metrics for what can be considered to be its opposite: error.
By default, we give back the Mean Absolute Error (MAE),
the Mean Absolute Percentage Error (MAPE)
and the Weighted Absolute Percentage Error (WAPE).
Each of these metrics is a representation of how wrong a belief is (believed to be),
with its usefulness depending on the use case.
For example, for intermittent demand time series (i.e. sparse data with lots of zero values), MAPE is not a useful metric.
For an intuitive representation of accuracy that works in many cases, we suggest using:

>>> df["accuracy"] = 1 - df["wape"]

With this definition:

- 100% accuracy denotes that all values are correct
- 50% accuracy denotes that, on average, the values are wrong by half of the reference value
- 0% accuracy denotes that, on average, the values are wrong by exactly the reference value (i.e. zeros or twice the reference value)
- negative accuracy denotes that, on average, the values are off-the-chart wrong (by more than the reference value itself)
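
For a rough illustration of how these error metrics relate to each other and to the suggested accuracy measure, here is a plain pandas sketch (the column names and values are hypothetical, and this is not the package's internal implementation):

import pandas as pd

# hypothetical point forecasts and reference (observed) values
df = pd.DataFrame({"forecast": [90.0, 110.0, 100.0], "reference": [100.0, 100.0, 100.0]})
abs_error = (df["forecast"] - df["reference"]).abs()

mae = abs_error.mean()                                # Mean Absolute Error
mape = (abs_error / df["reference"].abs()).mean()     # undefined where the reference is zero
wape = abs_error.sum() / df["reference"].abs().sum()  # weighs errors by the reference values
accuracy = 1 - wape                                   # the intuitive accuracy measure suggested above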

## Probabilistic forecasts

The previous metrics (MAE, MAPE and WAPE) are technically not defined for probabilistic beliefs.
However, there is a straightforward generalisation of MAE called the Continuous Ranked Probability Score (CRPS), which is used instead.
The other metrics follow by dividing by the deterministic reference value.
For simplicity of use of the `timely-beliefs` package,
the metric names in the BeliefsDataFrame are the same regardless of whether the beliefs are deterministic or probabilistic.
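
For intuition, the CRPS of a discrete probabilistic forecast against a deterministic reference value can be computed with the standard ensemble formulation CRPS = E|X − y| − ½ E|X − X′|. Below is a minimal numpy sketch of that formula (with hypothetical values; this is not the package's implementation):

import numpy as np

# hypothetical probabilistic forecast: possible event values with their probability mass
values = np.array([90.0, 100.0, 110.0])
probs = np.array([0.25, 0.5, 0.25])  # sums to 1
y = 95.0  # deterministic reference value

# CRPS = E|X - y| - 0.5 * E|X - X'|, with X and X' independent draws from the forecast
crps = np.sum(probs * np.abs(values - y)) - 0.5 * np.sum(
    np.outer(probs, probs) * np.abs(values[:, None] - values[None, :])
)

For a deterministic forecast this reduces to the absolute error, which is how the CRPS generalises MAE; the MAPE and WAPE analogues then follow by dividing by the reference value, as described above.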

## Probabilistic reference

It is possible that the reference itself is a probabilistic belief rather than a deterministic belief.
Our implementation of CRPS handles this case, too, by calculating the distance between the cumulative distribution functions of each forecast and reference [(Hans Hersbach, 2000)](#references).
As the denominator for calculating MAPE and WAPE, we use the expected value of the probabilistic reference.

## References

- Hans Hersbach. [Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems](https://journals.ametsoc.org/doi/pdf/10.1175/1520-0434%282000%29015%3C0559%3ADOTCRP%3E2.0.CO%3B2) in Weather and Forecasting, Volume 15, No. 5, pages 559-570, 2000.
21 changes: 21 additions & 0 deletions timely_beliefs/docs/confidence.md
@@ -0,0 +1,21 @@
# Keeping track of confidence

To keep track of the confidence of a data point, timely-beliefs works with probability distributions.
More specifically, the BeliefsDataFrame contains points of interest on the cumulative distribution function (CDF),
and leaves it to the user to set an interpolation policy between those points.
This allows you to describe both discrete possible event values (as a probability mass function) and continuous possible event values (as a probability density function).
A point of interest on the CDF is described by the `cumulative_probability` index level (ranging between 0 and 1) and the `event_value` column (a possible value).

The default interpolation policy is to interpret the CDF points as discrete possible event values,
leading to a non-decreasing step function as the CDF.
In case an event value with a cumulative probability of 1 is missing, the last step is extended to 1 (i.e. the chance of an event value greater than the largest available event value is taken to be 0).
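
To make the default policy concrete, here is a standalone sketch of such a step-function CDF in plain Python (the CDF points are hypothetical; this is not the package's code):

import numpy as np

# hypothetical CDF points of one probabilistic belief, sorted by cumulative probability
cumulative_probabilities = np.array([0.1587, 0.5, 0.8413])
event_values = np.array([90.0, 100.0, 110.0])

def cdf(x: float) -> float:
    """Non-decreasing step function: P(event value <= x)."""
    below_or_equal = event_values <= x
    if not below_or_equal.any():
        return 0.0
    if x >= event_values[-1]:
        return 1.0  # the last step is extended to 1
    return float(cumulative_probabilities[below_or_equal][-1])

assert cdf(80) == 0.0 and cdf(100) == 0.5 and cdf(999) == 1.0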

A deterministic belief consists of a single row in the BeliefsDataFrame.
Regardless of the cumulative probability actually listed (we take 0.5 by default),
the default interpolation policy will interpret the single CDF point as an event value stated with 100% certainty.
The reason why we choose a default cumulative probability of 0.5 instead of 1 is that, in our experience, sources more commonly intend to report their expected value rather than an event value with absolute confidence.

A probabilistic belief consists of multiple rows in the BeliefsDataFrame,
with a shared `event_start`, `belief_time` and `source`, but different `cumulative_probability` values.
_For a future release we are considering adding interpolation policies to interpret the CDF points as describing a normal distribution or a (piecewise) uniform distribution,
to offer out-of-the-box support for resampling continuous probabilistic beliefs._
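
For illustration, the shape of such a probabilistic belief can be sketched with plain pandas (the index names match those of the BeliefsDataFrame, but the construction below does not use the package's own constructors and the values are just an example):

import pandas as pd
from datetime import datetime, timezone

event_start = datetime(2000, 1, 3, 9, tzinfo=timezone.utc)
belief_time = datetime(2000, 1, 1, tzinfo=timezone.utc)
index = pd.MultiIndex.from_tuples(
    [(event_start, belief_time, "Source A", p) for p in (0.1587, 0.5, 0.8413)],
    names=["event_start", "belief_time", "source", "cumulative_probability"],
)
probabilistic_belief = pd.DataFrame({"event_value": [90.0, 100.0, 110.0]}, index=index)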
143 changes: 143 additions & 0 deletions timely_beliefs/docs/db.md
@@ -0,0 +1,143 @@
# Database storage

## Table of contents

1. [Derived database classes](#derived-database-classes)
1. [Table creation and session](#table-creation-and-session)
1. [Subclassing](#subclassing)

## Derived database classes

The timely-beliefs library supports persisting your beliefs data in a database.
All relevant classes have a subclass which also derives from [sqlalchemy's declarative base](https://docs.sqlalchemy.org/en/13/orm/extensions/declarative/index.html?highlight=declarative).

The timely-beliefs library comes with database-backed classes for the three main components of the data model - `DBTimedBelief`, `DBSensor` and `DBBeliefSource`.
Objects from these classes can be used just like their super classes, so for instance `DBTimedBelief` objects can be used for creating a `BeliefsDataFrame`.

### Table creation and session

You can let sqlalchemy create the tables in your database session and start using the DB classes (or subclasses, see below) and program code without much work on your part.
The database session is under your control ― where or how you get it depends on the context you're working in.
Here is an example of how to set up a session and also have sqlalchemy create the tables:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from timely_beliefs.db_base import Base as TBBase

SessionClass = sessionmaker()
session = None

def create_db_and_session():
    global session  # the module-level session defined above

    engine = create_engine("your-db-connection-string")
    SessionClass.configure(bind=engine)

    # create the timely-beliefs tables (if they do not exist yet)
    TBBase.metadata.create_all(engine)

    if session is None:
        session = SessionClass()

    # maybe add some initial sensors and sources to your session here ...

    return session

Note how we're using timely-beliefs' sqlalchemy base (we're calling it `TBBase`) to create the tables.
This does not create any other tables you might have in your data model:

session = create_db_and_session()
for table_name in ("belief_source", "sensor", "timed_beliefs"):
    assert table_name in TBBase.metadata.tables.keys()

Now you can add objects to your database and query them:

from datetime import datetime, timedelta
from pytz import timezone

from timely_beliefs import DBBeliefSource, DBSensor, DBTimedBelief

source = DBBeliefSource(name="my_mom")
session.add(source)

sensor = DBSensor(name="AnySensor")
session.add(sensor)

session.flush()

now = datetime.now(tz=timezone("Europe/Amsterdam"))
belief = DBTimedBelief(
    sensor=sensor,
    source=source,
    belief_time=now,
    event_start=now + timedelta(minutes=3),
    value=100,
)
session.add(belief)

q = session.query(DBTimedBelief).filter(DBTimedBelief.event_value == 100)
assert q.count() == 1
assert q.first().source == source
assert q.first().sensor == sensor
assert sensor.beliefs == [belief]



### Subclassing

`DBTimedBelief`, `DBSensor` and `DBBeliefSource` can also be subclassed, for customization purposes.
Possible reasons are to add more attributes or to use an existing table with a different name.

Adding fields is probably most interesting for sensors and maybe belief sources.
Below is an example of a db-backed sensor to which we wanted to give a location.
We added three attributes, `latitude`, `longitude` and `location_name`:

from sqlalchemy import Column, Float, String
from timely_beliefs import DBSensor


class DBLocatedSensor(DBSensor):
    """A sensor with a location lat/long and location name"""

    latitude = Column(Float(), nullable=False)
    longitude = Column(Float(), nullable=False)
    location_name = Column(String(80), nullable=False)

    def __init__(
        self,
        latitude: float = None,
        longitude: float = None,
        location_name: str = "",
        **kwargs,
    ):
        self.latitude = latitude
        self.longitude = longitude
        self.location_name = location_name
        DBSensor.__init__(self, **kwargs)
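
Using such a subclass then works just like the base class (a hypothetical example; the `name` keyword is the same one used for `DBSensor` above):

sensor = DBLocatedSensor(
    name="rooftop_sensor",
    latitude=52.37,
    longitude=4.90,
    location_name="Amsterdam",
)
session.add(sensor)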

Changing the table name is trickier. Here is a class where we do that.
This one uses a Mixin class (which is also used to create the `DBTimedBelief` class we saw above) ― so we have to do more work, but we also have more freedom to influence lower-level things, such as the `__tablename__` attribute and pointing to a custom table for belief sources ("my_belief_source"):


from sqlalchemy import Column, Float, ForeignKey, Integer
from sqlalchemy.orm import backref, relationship
from sqlalchemy.ext.declarative import declared_attr
from timely_beliefs import TimedBeliefDBMixin


class JoyfulBeliefInCustomTable(Base, TimedBeliefDBMixin):

    __tablename__ = "my_timed_belief"

    happiness = Column(Float(), default=0)

    @declared_attr
    def source_id(cls):
        return Column(Integer, ForeignKey("my_belief_source.id"), primary_key=True)

    source = relationship(
        "RatedSourceInCustomTable", backref=backref("beliefs", lazy=True)
    )

    def __init__(self, sensor, source, happiness: float = None, **kwargs):
        self.happiness = happiness
        TimedBeliefDBMixin.__init__(self, sensor, source, **kwargs)
        Base.__init__(self)


Note that we don't say where the sqlalchemy `Base` comes from here: it is the declarative base from your own project.
If you also create tables from timely-beliefs' Base (see above), you end up with more tables than you probably want to use.
That is not a blocker, but for cleanliness you might want to get all tables from the timely-beliefs Base, or define all table implementations yourself, as with `JoyfulBeliefInCustomTable` above.
10 changes: 10 additions & 0 deletions timely_beliefs/docs/lineage.md
@@ -0,0 +1,10 @@
# Lineage

Get the (number of) sources contributing to the BeliefsDataFrame:

>>> df.lineage.sources
array([<BeliefSource Source A>, <BeliefSource Source B>], dtype=object)
>>> df.lineage.number_of_sources
2

Many more convenient properties can be found in `df.lineage`.
73 changes: 73 additions & 0 deletions timely_beliefs/docs/resampling.md
@@ -0,0 +1,73 @@
# Resampling

BeliefsDataFrames come with a custom resample method `.resample_events()` to infer new beliefs about underlying events over time (upsampling) or aggregated events over time (downsampling).

Resampling a BeliefsDataFrame can be an expensive operation, especially when the frame contains beliefs from multiple sources and/or probabilistic beliefs.

## Table of contents

1. [Upsampling](#upsampling)
1. [Downsampling](#downsampling)

## Upsampling

Upsample to events with a resolution of 5 minutes:

>>> from datetime import timedelta
>>> df = timely_beliefs.examples.example_df
>>> df5m = df.resample_events(timedelta(minutes=5))
>>> df5m.sort_index(level=["belief_time", "source"]).head(9)
event_value
event_start belief_time source cumulative_probability
2000-01-03 09:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.1587 90.0
0.5000 100.0
0.8413 110.0
2000-01-03 09:05:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.1587 90.0
0.5000 100.0
0.8413 110.0
2000-01-03 09:10:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.1587 90.0
0.5000 100.0
0.8413 110.0

When resampling, the event resolution of the underlying sensor remains the same (it's still a fixed property of the sensor):

>>> df.sensor.event_resolution == df5m.sensor.event_resolution
True

However, the event resolution of the BeliefsDataFrame is updated, as well as knowledge horizons and knowledge times:

>>> df5m.event_resolution
datetime.timedelta(seconds=300)
>>> -df.knowledge_horizons[0] # note negative horizons denote "after the fact", and the original resolution was 15 minutes
Timedelta('0 days 00:15:00')
>>> -df5m.knowledge_horizons[0]
Timedelta('0 days 00:05:00')

## Downsampling

Downsample to events with a resolution of 2 hours:

>>> df2h = df.resample_events(timedelta(hours=2))
>>> df2h.sort_index(level=["belief_time", "source"]).head(15)
event_value
event_start belief_time source cumulative_probability
2000-01-03 09:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.158700 90.0
0.500000 100.0
1.000000 110.0
2000-01-03 10:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.025186 225.0
0.079350 235.0
0.133514 240.0
0.212864 245.0
0.329350 250.0
0.408700 255.0
0.579350 260.0
0.750000 265.0
1.000000 275.0
2000-01-03 12:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.158700 360.0
0.500000 400.0
1.000000 440.0
>>> -df2h.knowledge_horizons[0]
Timedelta('0 days 02:00:00')

Notice the time-aggregation of probabilistic beliefs about the two events between 10 AM and noon.
Three possible outcomes for each of the two events lead to nine possible worlds, because downsampling assumes by default that the values indicate discrete possible outcomes.
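
A conceptual sketch of that combinatorial step in plain Python (assuming, for illustration, that downsampling sums the event values and that the two events are independent; this is not the package's resampling code):

from itertools import product

# hypothetical discrete outcomes (value, probability) for the two events between 10 AM and noon
event_10am = [(225.0, 0.25), (250.0, 0.5), (275.0, 0.25)]
event_11am = [(250.0, 0.25), (275.0, 0.5), (300.0, 0.25)]

# nine possible worlds: combine the values (here simply summed) and multiply the probabilities
possible_worlds = [
    (v1 + v2, p1 * p2) for (v1, p1), (v2, p2) in product(event_10am, event_11am)
]
assert len(possible_worlds) == 9
assert abs(sum(p for _, p in possible_worlds) - 1) < 1e-9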