dssg · thcrock · Dec 14, 2018 · Feb 21, 2019 · Feb 21, 2019 · Feb 27, 2019
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -19,11 +19,15 @@ pages:
     - Defining an Experiment: experiments/defining.md
     - Testing Feature Configuration: experiments/feature-testing.md
     - Running an Experiment: experiments/running.md
-    - Upgrading an Experiment: experiments/upgrading.md
+    - Upgrading an Experiment:
+        to v5: experiments/upgrade-to-v5.md
+        to v6: experiments/upgrade-to-v6.md
+        to v7: experiments/upgrade-to-v7.md
     - Temporal Validation Deep Dive: experiments/temporal-validation.md
     - Cohort and Label Deep Dive: experiments/cohort-labels.md
     - Feature Generation Recipe Book: experiments/features.md
     - Experiment Algorithm: experiments/algorithm.md
     - Experiment Architecture: experiments/architecture.md
+    - Extending Experiment Features: experiments/extending-features.md
   - Audition: https://github.com/dssg/triage/tree/master/src/triage/component/audition
   - Postmodeling: https://github.com/dssg/triage/tree/master/src/triage/component/postmodeling
diff --git a/docs/sources/experiments/extending-features.md b/docs/sources/experiments/extending-features.md
@@ -0,0 +1,143 @@
+# Extending Feature Generation
+
+This document describes how to extend Triage's feature generation capabilities by writing new FeatureBlock classes and incorporating them into Experiments.
+
+## What is a FeatureBlock?
+
+A FeatureBlock represents a single feature table in the database and how to generate it. If you're familiar with `collate` parlance, a `SpacetimeAggregation` is similar in scope to a FeatureBlock. A `FeatureBlock` class can be instantiated with whatever arguments it needs, and from there can provide queries to produce its output feature table. Full-size Triage experiments tend to contain multiple feature blocks. These all live in a collection exposed as the `experiment.feature_blocks` property of the Experiment.
+
+## What existing FeatureBlock classes can I use?
+
+Class name | Experiment config key | Use
+------------ | ------------- | ------------
+triage.component.collate.SpacetimeAggregation | spacetime_aggregations  | Temporal aggregations of event-based data
+
+## Writing a new FeatureBlock class
+
+The `FeatureBlock` base class defines a set of abstract methods that any child class must implement, as well as a number of initialization arguments that it must accept and honor in order to fulfill the expectations Triage users have of feature generators. Triage expects these classes to define the queries they need to run, as opposed to generating the tables themselves, so that Triage can implement scaling by parallelization.
+
+### Abstract methods
+
+Any method here without parentheses afterwards is expected to be a property.
+
+Method | Task | Return Type
+------------ | ------------- | -------------
+final_feature_table_name | The name of the final table with all features filled in (no missing values) | string
+feature_columns | The list of feature columns in the final, postimputation table. Should exclude any index columns (e.g. entity id, date) | list
+preinsert_queries | Return all queries that should be run before inserting any data. The creation of your feature table should happen here, and the table is expected to have `entity_id(integer)` and `as_of_date(timestamp)` columns. | list
+insert_queries | Return all inserts to populate this data. Each query in this list should be parallelizable, and should be valid after all `preinsert_queries` are run. | list
+postinsert_queries | Return all queries that should be run after inserting all data | list
+imputation_queries | Return all queries that should be run to fill in missing data with imputed values. | list
+
+Any of the query list properties can be empty: for instance, if your implementation doesn't have inserts separate from table creation and is just one big query (e.g. a `CREATE TABLE AS`), you could just define `preinsert_queries` to be that one mega-query and leave the other properties as empty lists.
+
+### Properties Provided by Base Class
+
+There are several attributes/properties that can be used within subclass implementations that the base class provides. Triage experiments take care of providing this data during runtime: if you want to instantiate a FeatureBlock object on your own, you'll have to provide them in the constructor.
+
+Name | Type | Purpose
+------------ | ------------- | -------------
+as_of_dates | list | Features are created "as of" specific dates. Each of these dates is expected to be populated with a row for each member of the cohort on that date.
+cohort_table | string | The final shape of the feature table should at least include every entity id/date pair in this cohort table.
+db_engine | sqlalchemy.engine | The engine to use to access the database. Although these instances are mostly returning queries, the engine may be useful for implementing imputation.
+features_schema_name | string | The database schema where all feature tables should reside. Defaults to None, which ends up in the public schema.
+feature_start_time | string/datetime | A time before which no data should be considered for features. This is generally only applicable if your FeatureBlock is doing temporal aggregations. Defaults to None, which means no data will be excluded.
+features_ignore_cohort | bool | If False (the default), features are only computed for members of the cohort. If True, the cohort is ignored and the final feature table could include more rows.
+
+
+`FeatureBlock` child classes can, and in almost all cases will, include more configuration at initialization time that is specific to them. They probably also define many more methods to use internally. But as long as they adhere to this interface, they'll work with Triage.
+
+### Making the new FeatureBlock available to experiments
+
+Triage Experiments run on serializable configuration, and although it's possible to take fully generated `FeatureBlock` instances and bypass this (e.g. `experiment.feature_blocks = <my_collection_of_feature_blocks>`), it's not recommended. The last step is to pick a config key for use within the `features` key of experiment configs, register that key in `triage.component.architect.feature_block_generators.FEATURE_BLOCK_GENERATOR_LOOKUP`, and point it to a function that instantiates a collection of your objects based on config.
+
+## Example
+
+That's a lot of information! Let's see this in action. Let's say that we want to create a very flexible type of feature that simply runs a configured query with a parametrized as-of-date and returns its result as a feature.
+
+```python
+from triage.component.architect.feature_block import FeatureBlock
+
+
+class SimpleQueryFeature(FeatureBlock):
+    def __init__(self, query, *args, **kwargs):
+        self.query = query
+        super().__init__(*args, **kwargs)
+
+    @property
+    def final_feature_table_name(self):
+        return f"{self.features_schema_name}.mytable"
+
+    @property
+    def feature_columns(self):
+        return ['myfeature']
+
+    @property
+    def preinsert_queries(self):
+        return [f"create table {self.final_feature_table_name}" "(entity_id bigint, as_of_date timestamp, myfeature float)"]
+
+    @property
+    def insert_queries(self):
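+        if self.features_ignore_cohort:
+            final_query = self.query
+        else:
+            # Restrict the configured query's rows to entity/date pairs in the cohort
+            final_query = f"""
+                select raw.* from ({self.query}) raw
+                join {self.cohort_table} using (entity_id, as_of_date)
+            """
+        # One insert per as-of-date; the {as_of_date} placeholder is filled in here
+        return [
+            f"insert into {self.final_feature_table_name} {final_query}".format(as_of_date=date)
+            for date in self.as_of_dates
+        ]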
+
+    @property
+    def postinsert_queries(self):
+        return [f"create index on {self.final_feature_table_name} (entity_id, as_of_date)"]
+
+    @property
+    def imputation_queries(self):
+        return [f"update {self.final_feature_table_name} set myfeature = 0.0 where myfeature is null"]
+```
+
+This class would allow many different uses: basically any query a user can come up with would be a feature. To instantiate this class outside of Triage with a simple query, you could write:
+
+```python
+feature_block = SimpleQueryFeature(
+    query="select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'",
+    as_of_dates=["2016-01-01"],
+    cohort_table="my_cohort_table",
+    db_engine=triage.create_engine(<..mydbinfo..>)
+)
+
+feature_block.run_preimputation()
+feature_block.run_imputation()
+```
+
+To use it from a Triage experiment, modify the `triage.component.architect.feature_block_generators` module and submit a pull request:
+
+Before:
+
+```python
+FEATURE_BLOCK_GENERATOR_LOOKUP = {
+    'spacetime_aggregations': generate_spacetime_aggregations
+}
+```
+
+After:
+
+```python
+FEATURE_BLOCK_GENERATOR_LOOKUP = {
+    'spacetime_aggregations': generate_spacetime_aggregations,
+    'simple_query': SimpleQueryFeature,
+}
+```
+
+At this point, you could use it in an experiment configuration like this:
+
+```yaml
+
+features:
+    simple_query:
+        - query: "select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'"
+        - query: "select entity_id, as_of_date, other_quantity from other_source_table where date < '{as_of_date}'"
+```
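Note that the lookup entry above maps `simple_query` to the class itself, while the prose describes a function that instantiates a collection of objects from config. A minimal sketch of such a function follows, assuming the generator receives the list found under the config key plus the shared keyword arguments. The name `generate_simple_query_features`, its signature, and the stand-in `SimpleQueryFeature` class are hypothetical and are not taken from Triage.

```python
class SimpleQueryFeature:
    """Stand-in for the example class above (hypothetical; avoids importing Triage)."""

    def __init__(self, query, **shared_kwargs):
        self.query = query
        self.shared_kwargs = shared_kwargs


def generate_simple_query_features(config_entries, **shared_kwargs):
    """Build one feature block per entry under the `simple_query` config key.

    Hypothetical signature: Triage's real generator functions may take
    different arguments.
    """
    return [
        SimpleQueryFeature(query=entry["query"], **shared_kwargs)
        for entry in config_entries
    ]


# Mirrors the YAML configuration above after parsing.
config = {
    "simple_query": [
        {"query": "select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'"},
        {"query": "select entity_id, as_of_date, other_quantity from other_source_table where date < '{as_of_date}'"},
    ]
}

blocks = generate_simple_query_features(
    config["simple_query"],
    as_of_dates=["2016-01-01"],
    cohort_table="my_cohort_table",
    features_schema_name="features",
)
```

With more than one entry, each block would also need a distinct output table; the example class hardcodes `mytable`, so two blocks built this way would target the same table.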
diff --git a/docs/sources/experiments/feature-testing.md b/docs/sources/experiments/feature-testing.md
@@ -2,26 +2,27 @@
 
 Developing features for Triage experiments can be a daunting task. There are a lot of things to configure, a small amount of configuration can result in a ton of SQL, and it can take a long time to validate your feature configuration in the context of an Experiment being run on real data.
 
-To speed up the process of iterating on features, you can run a list of feature aggregations, without imputation, on just one as-of-date. This functionality can be accessed through the `triage` command line tool or called directly from code (say, in a Jupyter notebook) using the `FeatureGenerator` component.
+To speed up the process of iterating on features, you can run a list of feature aggregations, without imputation, on just one as-of-date. This functionality can be accessed through the `triage` command line tool or called directly from code (say, in a Jupyter notebook) using the `feature_blocks_from_config` utility.
 
 ## Using Triage CLI
-![triage featuretest cli help screen](featuretest-cli.png)
 
 The command-line interface for testing features takes in two arguments:
-	- A feature config file. Refer to [example_feature_config.yaml](https://github.com/dssg/triage/blob/master/example/config/feature.yaml). Essentially this is the content of the [example_experiment_config.yaml](https://github.com/dssg/triage/blob/master/example/config/experiment.yaml)'s `feature_aggregations` section. It consists of a YAML list, with one or more feature_aggregation rows present.
-	- An as-of-date. This should be in the format `2016-01-01`.
 
-Example: `triage experiment featuretest example/config/feature.yaml 2016-01-01`
+- An experiment config file. It should have at least a `features` section, and if a `cohort_config` section is present, it will use that to limit the number of feature rows it creates to the cohort at the given date. Other keys can be in there but are ignored. In other words, you can use your experiment config file either before or after it's fully completed.
+- An as-of-date. This should be in the format `2016-01-01`.
+
+Example: `triage experiment featuretest example/config/experiment.yaml 2016-01-01`
 
 All given feature aggregations will be processed for the given date. You will see a bunch of queries pass by in your terminal, populating tables in the `features_test` schema which you can inspect afterwards.
 
 ![triage feature test result](featuretest-result.png)
 
 ## Using Python Code
-If you'd like to call this from a notebook or from any other Python code, the arguments look similar but are a bit different. You have to supply your own sqlalchemy database engine to create a 'FeatureGenerator' object, and then call the `create_features_before_imputation` method with your feature config as a list of dictionaries, along with an as-of-date as a string. Make sure your logging level is set to INFO if you want to see all of the queries.
+If you'd like to call this from a notebook or from any other Python code, the arguments look similar but are a bit different. You have to supply the same arguments plus a few others to the `feature_blocks_from_config` function to create a set of feature blocks, and then call the `run_preimputation` method on each feature block. Make sure your logging level is set to INFO if you want to see all of the queries.
+
 
 ```
-from triage.component.architect.feature_generators import FeatureGenerator
+from triage.component.architect.feature_block_generators import feature_blocks_from_config
 from triage.util.db import create_engine
 import logging
 import yaml
@@ -32,12 +33,13 @@ logging.basicConfig(level=logging.INFO)
 db_url = 'your db url here'
 db_engine = create_engine(db_url)
 
-feature_config = [{
+feature_config = {'spacetime_aggregations': [{
 	'prefix': 'aprefix',
 	'aggregates': [
 		{
 		'quantity': 'quantity_one',
 		'metrics': ['sum', 'count'],
+        }
 	],
 	'categoricals': [
 		{
@@ -50,10 +52,15 @@ feature_config = [{
 	'intervals': ['all'],
 	'knowledge_date_column': 'knowledge_date',
 	'from_obj': 'data'
-}]
+}]}
 
-FeatureGenerator(db_engine, 'features_test').create_features_before_imputation(
-	feature_aggregation_config=feature_config,
-	feature_dates=['2016-01-01']
+feature_blocks = feature_blocks_from_config(
+    feature_config,
+    as_of_dates=['2016-01-01'],
+    cohort_table=None,
+    db_engine=db_engine,
+    features_schema_name="features_test",
 )
+for feature_block in feature_blocks:
+    feature_block.run_preimputation(verbose=True)
 ```
diff --git a/docs/sources/experiments/upgrade-to-v7.md b/docs/sources/experiments/upgrade-to-v7.md
@@ -0,0 +1,66 @@
+# Upgrading your experiment configuration to v7
+
+
+This document details the steps needed to update a triage v6 configuration to
+v7, mimicking the old behavior.
+
+Experiment configuration v7 includes only one change from v6: The features are given at a different key. Instead of `feature_aggregations`, to make space for non-collate features to be added in the future, there is now a more generic `features` key, under which collate features reside at `spacetime_aggregations`.
+
+
+Old:
+
+```
+feature_aggregations:
+    -
+        prefix: 'prefix'
+        from_obj: 'cool_stuff'
+        knowledge_date_column: 'open_date'
+        aggregates_imputation:
+            all:
+                type: 'constant'
+                value: 0
+        aggregates:
+            -
+                quantity: 'homeless::INT'
+                metrics: ['count', 'sum']
+        intervals: ['1 year', '2 year']
+        groups: ['entity_id']
+```
+
+New:
+
+```
+features:
+    spacetime_aggregations:
+        -
+            prefix: 'prefix'
+            from_obj: 'cool_stuff'
+            knowledge_date_column: 'open_date'
+            aggregates_imputation:
+                all:
+                    type: 'constant'
+                    value: 0
+            aggregates:
+                -
+                    quantity: 'homeless::INT'
+                    metrics: ['count', 'sum']
+            intervals: ['1 year', '2 year']
+            groups: ['entity_id']
+```
+
+## Upgrading the experiment config version
+
+At this point, you should be able to bump the top-level experiment config version to v7:
+
+Old:
+
+```
+config_version: 'v6'
+```
+
+New:
+
+```
+config_version: 'v7'
+```
+
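The two changes described in this guide (moving `feature_aggregations` under `features.spacetime_aggregations` and bumping `config_version`) can also be applied to a config that is already loaded as a dictionary. The following is a minimal sketch, not a Triage utility; the function name `upgrade_v6_to_v7` is hypothetical.

```python
import copy


def upgrade_v6_to_v7(config):
    """Return a v7 copy of a v6 experiment config dict (illustrative sketch)."""
    upgraded = copy.deepcopy(config)  # leave the caller's config untouched
    if "feature_aggregations" in upgraded:
        aggregations = upgraded.pop("feature_aggregations")
        # collate features now live under features -> spacetime_aggregations
        upgraded.setdefault("features", {})["spacetime_aggregations"] = aggregations
    upgraded["config_version"] = "v7"
    return upgraded


v6_config = {
    "config_version": "v6",
    "feature_aggregations": [
        {
            "prefix": "prefix",
            "from_obj": "cool_stuff",
            "knowledge_date_column": "open_date",
            "intervals": ["1 year", "2 year"],
            "groups": ["entity_id"],
        }
    ],
}

v7_config = upgrade_v6_to_v7(v6_config)
```

The aggregation entries themselves are carried over unchanged; only their location in the config and the version string differ.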
diff --git a/docs/sources/experiments/upgrading.md b/docs/sources/experiments/upgrading.md
diff --git a/example/config/experiment.yaml b/example/config/experiment.yaml
@@ -5,7 +5,7 @@
 # old configuration files are released. Be sure to assign the config version
 # that matches the triage.experiments.CONFIG_VERSION in the triage release
 # you are developing against!
-config_version: 'v6'
+config_version: 'v7'
 
 # EXPERIMENT METADATA
 # model_comment (optional) will end up in the model_comment column of the
@@ -72,37 +72,38 @@ label_config:
 
 
 # FEATURE GENERATION
-# The aggregate features to generate for each train/test split
-#
-# Implemented by wrapping collate: https://github.com/dssg/collate
-# Most terminology here is taken directly from collate
-#
-# Each entry describes a collate.SpacetimeAggregation object, and the
-# arguments needed to create it. Generally, each of these entries controls
-# the features from one source table, though in the case of multiple groups
-# may result in multiple output tables
-#
-# Rules specifying how to handle imputation of null values must be explicitly
-# defined in your config file. These can be specified in two places: either
-# within each feature or overall for each type of feature (aggregates_imputation,
-# categoricals_imputation, array_categoricals_imputation). In either case, a rule must be given for
-# each aggregation function (e.g., sum, max, avg, etc) used, or a catch-all
-# can be specified with `all`. Aggregation function-specific rules will take
-# precedence over the `all` rule and feature-specific rules will take
-# precedence over the higher-level rules. Several examples are provided below.
-#
-# Available Imputation Rules:
-#   * mean: The average value of the feature (for SpacetimeAggregation the
-#           mean is taken within-date).
-#   * constant: Fill with a constant value from a required `value` parameter.
-#   * zero: Fill with zero.
-#   * null_category: Only available for categorical features. Just flag null
-#                    values with the null category column.
-#   * binary_mode: Only available for aggregate column types. Takes the modal
-#                  value for a binary feature.
-#   * error: Raise an exception if any null values are encountered for this
-#            feature.
-feature_aggregations:
+features:
+    spacetime_aggregations:
+    # The aggregate features to generate for each train/test split
+    #
+    # Implemented by wrapping collate: https://github.com/dssg/collate
+    # Most terminology here is taken directly from collate
+    #
+    # Each entry describes a collate.SpacetimeAggregation object, and the
+    # arguments needed to create it. Generally, each of these entries controls
+    # the features from one source table, though in the case of multiple groups
+    # may result in multiple output tables
+    #
+    # Rules specifying how to handle imputation of null values must be explicitly
+    # defined in your config file. These can be specified in two places: either
+    # within each feature or overall for each type of feature (aggregates_imputation,
+    # categoricals_imputation, array_categoricals_imputation). In either case, a rule must be given for
+    # each aggregation function (e.g., sum, max, avg, etc) used, or a catch-all
+    # can be specified with `all`. Aggregation function-specific rules will take
+    # precedence over the `all` rule and feature-specific rules will take
+    # precedence over the higher-level rules. Several examples are provided below.
+    #
+    # Available Imputation Rules:
+    #   * mean: The average value of the feature (for SpacetimeAggregation the
+    #           mean is taken within-date).
+    #   * constant: Fill with a constant value from a required `value` parameter.
+    #   * zero: Fill with zero.
+    #   * null_category: Only available for categorical features. Just flag null
+    #                    values with the null category column.
+    #   * binary_mode: Only available for aggregate column types. Takes the modal
+    #                  value for a binary feature.
+    #   * error: Raise an exception if any null values are encountered for this
+    #            feature.
     -
         # prefix given to the resultant tables
         prefix: 'prefix'