Split docs #31

Merged
merged 11 commits into from
Sep 17, 2020
595 changes: 24 additions & 571 deletions README.md

Large diffs are not rendered by default.

28 changes: 28 additions & 0 deletions dev/dev.md
@@ -0,0 +1,28 @@
# Developer instructions

As a developer, you can build the code like this:

make install

For testing, add a local database with expected credentials, for instance like this:

sudo -u postgres psql
postgres=# CREATE USER tbtest WITH PASSWORD 'tbtest';
postgres=# CREATE DATABASE tbtest WITH OWNER = tbtest;
postgres=# exit

or this:

docker run --name test-postgres -p 5432:5432 -e POSTGRES_PASSWORD=tbtest -e POSTGRES_USER=tbtest -e POSTGRES_DB=tbtest -d postgres

And run tests:

make test

To update dependencies run:

pip-compile -o ci/requirements.txt -U

To install dependencies run:

pip install -r ci/requirements.txt
47 changes: 47 additions & 0 deletions timely_beliefs/docs/accuracy.md
@@ -0,0 +1,47 @@
# Metrics of probabilistic accuracy

## Table of contents

1. [Accuracy and error metrics](#accuracy-and-error-metrics)
1. [Probabilistic forecasts](#probabilistic-forecasts)
1. [Probabilistic reference](#probabilistic-reference)
1. [References](#references)

## Accuracy and error metrics

To our knowledge, there is no standard metric for accuracy.
However, there are some standard metrics for what can be considered to be its opposite: error.
By default, we give back the Mean Absolute Error (MAE),
the Mean Absolute Percentage Error (MAPE)
and the Weighted Absolute Percentage Error (WAPE).
Each of these metrics is a representation of how wrong a belief is (believed to be),
with its usefulness depending on the use case.
For example, for intermittent demand time series (i.e. sparse data with lots of zero values), MAPE is not a useful metric.
For an intuitive representation of accuracy that works in many cases, we suggest using:

>>> df["accuracy"] = 1 - df["wape"]

With this definition:

- 100% accuracy denotes that all values are correct
- 50% accuracy denotes that, on average, the values are wrong by half of the reference value
- 0% accuracy denotes that, on average, the values are wrong by exactly the reference value (i.e. zeros or twice the reference value)
- negative accuracy denotes that, on average, the values are off-the-chart wrong (by more than the reference value itself)
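
For a rough illustration of how these error metrics relate to each other and to the suggested accuracy measure, here is a plain pandas sketch (the column names and values are hypothetical, and this is not the package's internal implementation):

import pandas as pd

# hypothetical point forecasts and reference (observed) values
df = pd.DataFrame({"forecast": [90.0, 110.0, 100.0], "reference": [100.0, 100.0, 100.0]})
abs_error = (df["forecast"] - df["reference"]).abs()

mae = abs_error.mean()                                # Mean Absolute Error
mape = (abs_error / df["reference"].abs()).mean()     # undefined where the reference is zero
wape = abs_error.sum() / df["reference"].abs().sum()  # weighs errors by the reference values
accuracy = 1 - wape                                   # the intuitive accuracy measure suggested above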

## Probabilistic forecasts

The previous metrics (MAE, MAPE and WAPE) are technically not defined for probabilistic beliefs.
However, there is a straightforward generalisation of MAE called the Continuous Ranked Probability Score (CRPS), which is used instead.
The other metrics follow by dividing by the deterministic reference value.
For simplicity of use of the `timely-beliefs` package,
the metric names in the BeliefsDataFrame are the same regardless of whether the beliefs are deterministic or probabilistic.
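
For intuition, the CRPS of a discrete probabilistic forecast against a deterministic reference value can be computed with the standard ensemble formulation CRPS = E|X − y| − ½ E|X − X′|. Below is a minimal numpy sketch of that formula (with hypothetical values; this is not the package's implementation):

import numpy as np

# hypothetical probabilistic forecast: possible event values with their probability mass
values = np.array([90.0, 100.0, 110.0])
probs = np.array([0.25, 0.5, 0.25])  # sums to 1
y = 95.0  # deterministic reference value

# CRPS = E|X - y| - 0.5 * E|X - X'|, with X and X' independent draws from the forecast
crps = np.sum(probs * np.abs(values - y)) - 0.5 * np.sum(
    np.outer(probs, probs) * np.abs(values[:, None] - values[None, :])
)

For a deterministic forecast this reduces to the absolute error, which is how the CRPS generalises MAE; the MAPE and WAPE analogues then follow by dividing by the reference value, as described above.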

## Probabilistic reference

It is possible that the reference itself is a probabilistic belief rather than a deterministic belief.
Our implementation of CRPS handles this case, too, by calculating the distance between the cumulative distribution functions of each forecast and reference [(Hans Hersbach, 2000)](#references).
As the denominator for calculating MAPE and WAPE, we use the expected value of the probabilistic reference.

## References

- Hans Hersbach. [Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems](https://journals.ametsoc.org/doi/pdf/10.1175/1520-0434%282000%29015%3C0559%3ADOTCRP%3E2.0.CO%3B2) in Weather and Forecasting, Volume 15, No. 5, pages 559-570, 2000.
21 changes: 21 additions & 0 deletions timely_beliefs/docs/confidence.md
@@ -0,0 +1,21 @@
# Keeping track of confidence

To keep track of the confidence of a data point, timely-beliefs works with probability distributions.
More specifically, the BeliefsDataFrame contains points of interest on the cumulative distribution function (CDF),
and leaves it to the user to set an interpolation policy between those points.
This allows you to describe both discrete possible event values (as a probability mass function) and continuous possible event values (as a probability density function).
A point of interest on the CDF is described by the `cumulative_probability` index level (ranging between 0 and 1) and the `event_value` column (a possible value).

The default interpolation policy is to interpret the CDF points as discrete possible event values,
leading to a non-decreasing step function as the CDF.
In case an event value with a cumulative probability of 1 is missing, the last step is extended to 1 (i.e. the chance of an event value greater than the largest available event value is taken to be 0).
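
To make the default policy concrete, here is a standalone sketch of such a step-function CDF in plain Python (the CDF points are hypothetical; this is not the package's code):

import numpy as np

# hypothetical CDF points of one probabilistic belief, sorted by cumulative probability
cumulative_probabilities = np.array([0.1587, 0.5, 0.8413])
event_values = np.array([90.0, 100.0, 110.0])

def cdf(x: float) -> float:
    """Non-decreasing step function: P(event value <= x)."""
    below_or_equal = event_values <= x
    if not below_or_equal.any():
        return 0.0
    if x >= event_values[-1]:
        return 1.0  # the last step is extended to 1
    return float(cumulative_probabilities[below_or_equal][-1])

assert cdf(80) == 0.0 and cdf(100) == 0.5 and cdf(999) == 1.0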

A deterministic belief consists of a single row in the BeliefsDataFrame.
Regardless of the cumulative probability actually listed (we take 0.5 by default),
the default interpolation policy will interpret the single CDF point as an event value stated with 100% certainty.
The reason why we choose a default cumulative probability of 0.5 instead of 1 is that, in our experience, sources more commonly intend to report their expected value rather than an event value with absolute confidence.

A probabilistic belief consists of multiple rows in the BeliefsDataFrame,
with a shared `event_start`, `belief_time` and `source`, but different `cumulative_probability` values.
_For a future release we are considering adding interpolation policies to interpret the CDF points as describing a normal distribution or a (piecewise) uniform distribution,
to offer out-of-the-box support for resampling continuous probabilistic beliefs._
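
For illustration, the shape of such a probabilistic belief can be sketched with plain pandas (the index names match those of the BeliefsDataFrame, but the construction below does not use the package's own constructors and the values are just an example):

import pandas as pd
from datetime import datetime, timezone

event_start = datetime(2000, 1, 3, 9, tzinfo=timezone.utc)
belief_time = datetime(2000, 1, 1, tzinfo=timezone.utc)
index = pd.MultiIndex.from_tuples(
    [(event_start, belief_time, "Source A", p) for p in (0.1587, 0.5, 0.8413)],
    names=["event_start", "belief_time", "source", "cumulative_probability"],
)
probabilistic_belief = pd.DataFrame({"event_value": [90.0, 100.0, 110.0]}, index=index)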
143 changes: 143 additions & 0 deletions timely_beliefs/docs/db.md
@@ -0,0 +1,143 @@
# Database storage

## Table of contents

1. [Derived database classes](#derived-database-classes)
1. [Table creation and session](#table-creation-and-session)
1. [Subclassing](#subclassing)

## Derived database classes

The timely-beliefs library supports persisting your beliefs data in a database.
All relevant classes have a subclass which also derives from [sqlalchemy's declarative base](https://docs.sqlalchemy.org/en/13/orm/extensions/declarative/index.html?highlight=declarative).

The timely-beliefs library comes with database-backed classes for the three main components of the data model - `DBTimedBelief`, `DBSensor` and `DBBeliefSource`.
Objects from these classes can be used just like their super classes, so for instance `DBTimedBelief` objects can be used for creating a `BeliefsDataFrame`.

### Table creation and session

You can let sqlalchemy create the tables in your database session and start using the DB classes (or subclasses, see below) and program code without much work on your part.
The database session is under your control ― where or how you get it depends on the context you're working in.
Here is an example of how to set up a session and also have sqlalchemy create the tables:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from timely_beliefs.db_base import Base as TBBase

SessionClass = sessionmaker()
session = None

def create_db_and_session():
    global session  # the module-level session defined above

    engine = create_engine("your-db-connection-string")
    SessionClass.configure(bind=engine)

    # create the timely-beliefs tables (if they do not exist yet)
    TBBase.metadata.create_all(engine)

    if session is None:
        session = SessionClass()

    # maybe add some initial sensors and sources to your session here ...

    return session

Note how we're using timely-beliefs' sqlalchemy base (we're calling it `TBBase`) to create the tables.
This does not create any other tables you might have in your data model:

session = create_db_and_session()
for table_name in ("belief_source", "sensor", "timed_beliefs"):
    assert table_name in TBBase.metadata.tables.keys()

Now you can add objects to your database and query them:

from datetime import datetime, timedelta
from pytz import timezone

from timely_beliefs import DBBeliefSource, DBSensor, DBTimedBelief

source = DBBeliefSource(name="my_mom")
session.add(source)

sensor = DBSensor(name="AnySensor")
session.add(sensor)

session.flush()

now = datetime.now(tz=timezone("Europe/Amsterdam"))
belief = DBTimedBelief(
    sensor=sensor,
    source=source,
    belief_time=now,
    event_start=now + timedelta(minutes=3),
    value=100,
)
session.add(belief)

q = session.query(DBTimedBelief).filter(DBTimedBelief.event_value == 100)
assert q.count() == 1
assert q.first().source == source
assert q.first().sensor == sensor
assert sensor.beliefs == [belief]



### Subclassing

`DBTimedBelief`, `DBSensor` and `DBBeliefSource` can also be subclassed, for customization purposes.
Possible reasons are to add more attributes or to use an existing table with a different name.

Adding fields is probably most interesting for sensors and maybe belief sources.
Below is an example of a db-backed sensor to which we wanted to give a location.
We added three attributes, `latitude`, `longitude` and `location_name`:

from sqlalchemy import Column, Float, String
from timely_beliefs import DBSensor


class DBLocatedSensor(DBSensor):
    """A sensor with a location lat/long and location name"""

    latitude = Column(Float(), nullable=False)
    longitude = Column(Float(), nullable=False)
    location_name = Column(String(80), nullable=False)

    def __init__(
        self,
        latitude: float = None,
        longitude: float = None,
        location_name: str = "",
        **kwargs,
    ):
        self.latitude = latitude
        self.longitude = longitude
        self.location_name = location_name
        DBSensor.__init__(self, **kwargs)
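
Using such a subclass then works just like the base class (a hypothetical example; the `name` keyword is the same one used for `DBSensor` above):

sensor = DBLocatedSensor(
    name="rooftop_sensor",
    latitude=52.37,
    longitude=4.90,
    location_name="Amsterdam",
)
session.add(sensor)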

Changing the table name is trickier. Here is a class where we do that.
This one uses a Mixin class (which is also used to create the `DBTimedBelief` class we saw above) ― so we have to do more work, but we also have more freedom to influence lower-level things, such as the `__tablename__` attribute and pointing to a custom table for belief sources ("my_belief_source"):


from sqlalchemy import Column, Float, ForeignKey, Integer
from sqlalchemy.orm import backref, relationship
from sqlalchemy.ext.declarative import declared_attr
from timely_beliefs import TimedBeliefDBMixin


class JoyfulBeliefInCustomTable(Base, TimedBeliefDBMixin):

    __tablename__ = "my_timed_belief"

    happiness = Column(Float(), default=0)

    @declared_attr
    def source_id(cls):
        return Column(Integer, ForeignKey("my_belief_source.id"), primary_key=True)

    source = relationship(
        "RatedSourceInCustomTable", backref=backref("beliefs", lazy=True)
    )

    def __init__(self, sensor, source, happiness: float = None, **kwargs):
        self.happiness = happiness
        TimedBeliefDBMixin.__init__(self, sensor, source, **kwargs)
        Base.__init__(self)


Note that we don't say where the sqlalchemy `Base` comes from here: it is the declarative base from your own project.
If you also create tables from timely-beliefs' Base (see above), you end up with more tables than you probably want to use.
That is not a blocker, but for cleanliness you might want to get all tables from the timely-beliefs Base, or define all table implementations yourself, as with `JoyfulBeliefInCustomTable` above.
10 changes: 10 additions & 0 deletions timely_beliefs/docs/lineage.md
@@ -0,0 +1,10 @@
# Lineage

Get the (number of) sources contributing to the BeliefsDataFrame:

>>> df.lineage.sources
array([<BeliefSource Source A>, <BeliefSource Source B>], dtype=object)
>>> df.lineage.number_of_sources
2

Many more convenient properties can be found in `df.lineage`.
73 changes: 73 additions & 0 deletions timely_beliefs/docs/resampling.md
@@ -0,0 +1,73 @@
# Resampling

BeliefsDataFrames come with a custom resample method `.resample_events()` to infer new beliefs about underlying events over time (upsampling) or aggregated events over time (downsampling).

Resampling a BeliefsDataFrame can be an expensive operation, especially when the frame contains beliefs from multiple sources and/or probabilistic beliefs.

## Table of contents

1. [Upsampling](#upsampling)
1. [Downsampling](#downsampling)

## Upsampling

Upsample to events with a resolution of 5 minutes:

>>> from datetime import timedelta
>>> df = timely_beliefs.examples.example_df
>>> df5m = df.resample_events(timedelta(minutes=5))
>>> df5m.sort_index(level=["belief_time", "source"]).head(9)
event_value
event_start belief_time source cumulative_probability
2000-01-03 09:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.1587 90.0
0.5000 100.0
0.8413 110.0
2000-01-03 09:05:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.1587 90.0
0.5000 100.0
0.8413 110.0
2000-01-03 09:10:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.1587 90.0
0.5000 100.0
0.8413 110.0

When resampling, the event resolution of the underlying sensor remains the same (it's still a fixed property of the sensor):

>>> df.sensor.event_resolution == df5m.sensor.event_resolution
True

However, the event resolution of the BeliefsDataFrame is updated, as well as knowledge horizons and knowledge times:

>>> df5m.event_resolution
datetime.timedelta(seconds=300)
>>> -df.knowledge_horizons[0] # note negative horizons denote "after the fact", and the original resolution was 15 minutes
Timedelta('0 days 00:15:00')
>>> -df5m.knowledge_horizons[0]
Timedelta('0 days 00:05:00')

## Downsampling

Downsample to events with a resolution of 2 hours:

>>> df2h = df.resample_events(timedelta(hours=2))
>>> df2h.sort_index(level=["belief_time", "source"]).head(15)
event_value
event_start belief_time source cumulative_probability
2000-01-03 09:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.158700 90.0
0.500000 100.0
1.000000 110.0
2000-01-03 10:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.025186 225.0
0.079350 235.0
0.133514 240.0
0.212864 245.0
0.329350 250.0
0.408700 255.0
0.579350 260.0
0.750000 265.0
1.000000 275.0
2000-01-03 12:00:00+00:00 2000-01-01 00:00:00+00:00 Source A 0.158700 360.0
0.500000 400.0
1.000000 440.0
>>> -df2h.knowledge_horizons[0]
Timedelta('0 days 02:00:00')

Notice the time-aggregation of probabilistic beliefs about the two events between 10 AM and noon.
Three possible outcomes for each of the two events lead to nine possible worlds, because downsampling assumes by default that the values indicate discrete possible outcomes.
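
A conceptual sketch of that combinatorial step in plain Python (assuming, for illustration, that downsampling sums the event values and that the two events are independent; this is not the package's resampling code):

from itertools import product

# hypothetical discrete outcomes (value, probability) for the two events between 10 AM and noon
event_10am = [(225.0, 0.25), (250.0, 0.5), (275.0, 0.25)]
event_11am = [(250.0, 0.25), (275.0, 0.5), (300.0, 0.25)]

# nine possible worlds: combine the values (here simply summed) and multiply the probabilities
possible_worlds = [
    (v1 + v2, p1 * p2) for (v1, p1), (v2, p2) in product(event_10am, event_11am)
]
assert len(possible_worlds) == 9
assert abs(sum(p for _, p in possible_worlds) - 1) < 1e-9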