Merge pull request #114 from e10v/dev
Switch to PyArrow for internal data and remove Pandas dependency
e10v authored Jan 5, 2025
2 parents 3aa878e + 64b43d6 commit 7450313
Showing 30 changed files with 1,158 additions and 954 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -9,11 +9,11 @@

**tea-tasting** is a Python package for the statistical analysis of A/B tests featuring:

- Student's t-test, Z-test, Bootstrap, and quantile metrics out of the box.
- Student's t-test, Z-test, bootstrap, and quantile metrics out of the box.
- Extensible API: define and use statistical tests of your choice.
- [Delta method](https://alexdeng.github.io/public/files/kdd2018-dm.pdf) for ratio metrics.
- Variance reduction with [CUPED](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf)/[CUPAC](https://doordash.engineering/2020/06/08/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/) (also in combination with the delta method for ratio metrics).
- Confidence intervals for both absolute and percentage change.
- Variance reduction using [CUPED](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf)/[CUPAC](https://doordash.engineering/2020/06/08/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/) (which can also be combined with the delta method for ratio metrics).
- Confidence intervals for both absolute and percentage changes.
- Sample ratio mismatch check.
- Power analysis.
- Multiple hypothesis testing (family-wise error rate and false discovery rate).
@@ -56,7 +56,6 @@ Learn more in the detailed [user guide](https://tea-tasting.e10v.me/user-guide/)

## Roadmap

- Switch from Pandas DataFrames to PyArrow Tables for internal data. Make Pandas dependency optional.
- A/A tests and simulations.
- More statistical tests:
- Asymptotic and exact tests for frequency data.
2 changes: 0 additions & 2 deletions docs/api/config.md
@@ -1,3 +1 @@
::: tea_tasting.config
options:
members_order: source
2 changes: 0 additions & 2 deletions docs/api/multiplicity.md
@@ -1,3 +1 @@
::: tea_tasting.multiplicity
options:
members_order: source
42 changes: 21 additions & 21 deletions docs/custom-metrics.md
@@ -2,12 +2,12 @@

## Intro

**tea-tasting** supports Student's t-test, Z-test, and [some other statistical tests](api/metrics/index.md) out of the box. However, you might want to analyze an experiment using other statistical criteria. In this case you can define a custom metric with statistical test of your choice.
**tea-tasting** supports Student's t-test, Z-test, and [some other statistical tests](api/metrics/index.md) out of the box. However, you might want to analyze an experiment using other statistical criteria. In this case, you can define a custom metric with a statistical test of your choice.

In **tea-tasting**, there are two types of metrics:

- Metrics that require only aggregated statistics for analysis.
- Metrics that require granular data for analysis.
- Metrics that require only aggregated statistics for the analysis.
- Metrics that require granular data for the analysis.

This guide explains how to define a custom metric for each type.

@@ -17,7 +17,7 @@ First, let's import all the required modules and prepare the data:
from typing import Literal, NamedTuple

import numpy as np
import pandas as pd
import pyarrow as pa
import scipy.stats
import tea_tasting as tt
import tea_tasting.aggr
@@ -26,7 +26,7 @@ import tea_tasting.metrics
import tea_tasting.utils


data = tt.make_users_data(seed=42)
data = tt.make_users_data(seed=42, return_type="pandas")
data["has_order"] = data.orders.gt(0).astype(int)
print(data)
#> user variant sessions orders revenue has_order
@@ -63,7 +63,7 @@ class ProportionResult(NamedTuple):
statistic: float
```

The second step is defining the metric class itself. Metric based on aggregated statistics should be a subclass of [`MetricBaseAggregated`](api/metrics/base.md#tea_tasting.metrics.base.MetricBaseAggregated). `MetricBaseAggregated` is a generic class with the result class as a type variable.
The second step is defining the metric class itself. A metric based on aggregated statistics should be a subclass of [`MetricBaseAggregated`](api/metrics/base.md#tea_tasting.metrics.base.MetricBaseAggregated). `MetricBaseAggregated` is a generic class with the result class as a type variable.

The metric should have the following methods and properties defined:

@@ -119,15 +119,15 @@ class Proportion(tea_tasting.metrics.MetricBaseAggregated[ProportionResult]):
)
```

Method `__init__` save metric parameters to be used in analysis. You can use utility functions [`check_scalar`](api/utils.md#tea_tasting.utils.check_scalar) and [`auto_check`](api/utils.md#tea_tasting.utils.auto_check) to check parameter values.
Method `__init__` saves metric parameters to be used in the analysis. You can use utility functions [`check_scalar`](api/utils.md#tea_tasting.utils.check_scalar) and [`auto_check`](api/utils.md#tea_tasting.utils.auto_check) to check parameter values.

Property `aggr_cols` returns an instance of [`AggrCols`](api/metrics/base.md#tea_tasting.metrics.base.AggrCols). Analysis of a proportion requires the number of rows (`has_count=True`) and the average value of the column of interest (`mean_cols=(self.column,)`) for each variant.
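
For illustration, this property in the `Proportion` class above might look like the following sketch (assuming `AggrCols` is accessible as `tea_tasting.metrics.AggrCols`):

```python
@property
def aggr_cols(self) -> tea_tasting.metrics.AggrCols:
    # Request the row count and the mean of the analyzed column per variant.
    return tea_tasting.metrics.AggrCols(
        has_count=True,
        mean_cols=(self.column,),
    )
```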

Method `analyze_aggregates` accepts two parameters: `control` and `treatment` data as instances of class [`Aggregates`](api/aggr.md#tea_tasting.aggr.Aggregates). They contain values for statistics and columns specified in `aggr_cols`.

Method `analyze_aggregates` returns an instance of `ProportionResult`, defined earlier, with analysis result.
Method `analyze_aggregates` returns an instance of `ProportionResult`, defined earlier, with the analysis result.
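
As a concrete illustration of what can be computed from aggregates alone, here is a minimal sketch of a pooled two-sample proportion Z-test (illustrative, not necessarily the exact statistic used by the class above):

```python
import numpy as np
import scipy.stats


def proportion_ztest(
    contr_count: int, contr_mean: float,
    treat_count: int, treat_mean: float,
) -> tuple[float, float]:
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (
        (contr_count*contr_mean + treat_count*treat_mean)
        / (contr_count + treat_count)
    )
    se = np.sqrt(pooled * (1 - pooled) * (1/contr_count + 1/treat_count))
    statistic = (treat_mean - contr_mean) / se
    pvalue = 2 * scipy.stats.norm.sf(abs(statistic))  # two-sided
    return statistic, pvalue
```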

Now we can analyze the proportion of users who created at least one order during the experiment. For comparison, let's also add a metric that performs Z-test on the same column.
Now we can analyze the proportion of users who created at least one order during the experiment. For comparison, let's also add a metric that performs a Z-test on the same column.

```python
experiment_prop = tt.Experiment(
@@ -142,7 +142,7 @@ print(experiment_prop.analyze(data))

## Metrics based on granular data

Now let's define a metric that performs the Mann-Whitney U test. While it's possible to use the aggregated sum of ranks in the test, this example will use granular data for analysis.
Now let's define a metric that performs the Mann-Whitney U test. While it's possible to use the aggregated sum of ranks for the test, this example uses granular data for analysis.

The result class:

@@ -152,13 +152,13 @@ class MannWhitneyUResult(NamedTuple):
statistic: float
```

Metric that analyses granular data should be a subclass of [`MetricBaseGranular`](api/metrics/base.md#tea_tasting.metrics.base.MetricBaseGranular). `MetricBaseGranular` is a generic class with the result class as a type variable.
A metric that analyzes granular data should be a subclass of [`MetricBaseGranular`](api/metrics/base.md#tea_tasting.metrics.base.MetricBaseGranular). `MetricBaseGranular` is a generic class with the result class as a type variable.

The metric should have the following methods and properties defined:

- Method `__init__` checks and saves metric parameters.
- Property `cols` returns columns to be fetched for an analysis.
- Method `analyze_dataframes` analyzes the metric using granular data.
- Method `analyze_granular` analyzes the metric using granular data.

```python
class MannWhitneyU(tea_tasting.metrics.MetricBaseGranular[MannWhitneyUResult]):
@@ -181,14 +181,14 @@ class MannWhitneyU(tea_tasting.metrics.MetricBaseGranular[MannWhitneyUResult]):
def cols(self) -> tuple[str]:
return (self.column,)

def analyze_dataframes(
def analyze_granular(
self,
control: pd.DataFrame,
treatment: pd.DataFrame,
control: pa.Table,
treatment: pa.Table,
) -> MannWhitneyUResult:
res = scipy.stats.mannwhitneyu(
treatment[self.column],
control[self.column],
treatment[self.column].combine_chunks().to_numpy(zero_copy_only=False),
control[self.column].combine_chunks().to_numpy(zero_copy_only=False),
use_continuity=self.correction,
alternative=self.alternative,
)
@@ -200,9 +200,9 @@ class MannWhitneyU(tea_tasting.metrics.MetricBaseGranular[MannWhitneyUResult]):

Property `cols` should return a sequence of strings.

Method `analyze_dataframes` accepts two parameters: control and treatment data as Pandas DataFrames. Even with [data backend](data-backends.md) different from Pandas, **tea-tasting** will retrieve the data and transform into a Pandas DataFrame.
Method `analyze_granular` accepts two parameters: control and treatment data as PyArrow Tables. Even with a [data backend](data-backends.md) different from PyArrow, **tea-tasting** will retrieve the data and transform it into a PyArrow Table.
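
The `combine_chunks().to_numpy(zero_copy_only=False)` calls in the class above reflect that a PyArrow Table column is a ChunkedArray, for example:

```python
import pyarrow as pa

table = pa.table({"revenue": [1.0, 2.0, 3.0]})
# Flatten the chunks, then convert to NumPy, copying if zero-copy isn't possible.
revenue = table["revenue"].combine_chunks().to_numpy(zero_copy_only=False)
```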

Method `analyze_dataframes` returns an instance of `MannWhitneyUResult`, defined earlier, with analysis result.
Method `analyze_granular` returns an instance of `MannWhitneyUResult`, defined earlier, with the analysis result.

Now we can perform the Mann-Whitney U test:

@@ -237,7 +237,7 @@ print(experiment.analyze(data))
#> mwu_revenue - - - [-, -] 0.0300
```

In this case, **tea-tasting** perform two queries on experimental data:
In this case, **tea-tasting** performs two queries on the experimental data:

- With the aggregated statistics required for the analysis of metrics of type `MetricBaseAggregated`.
- With the granular data columns required for the analysis of metrics of type `MetricBaseGranular`.
@@ -249,4 +249,4 @@ Follow these recommendations when defining custom metrics:
- Use parameter and attribute names consistent with the ones that are already defined in **tea-tasting**. For example, use `pvalue` instead of `p_value` or `correction` instead of `use_continuity`.
- End confidence interval boundary names with `"_ci_lower"` and `"_ci_upper"`.
- During initialization, save parameter values in metric attributes using the same names. For example, use `self.correction = correction` instead of `self.use_continuity = correction`.
- Use globals settings as default values for standard parameters, such as `alternative` or `confidence_level`. See the [reference](api/config.md#tea_tasting.config.config_context) for the full list of standard parameters. You can also define and use your own global parameters.
- Use global settings as default values for standard parameters, such as `alternative` or `confidence_level`. See the [reference](api/config.md#tea_tasting.config.config_context) for the full list of standard parameters. You can also define and use your own global parameters.
70 changes: 43 additions & 27 deletions docs/data-backends.md
@@ -27,6 +27,7 @@ First, let's prepare a demo database:

```python
import ibis
import polars as pl
import tea_tasting as tt


@@ -35,7 +36,7 @@ con = ibis.duckdb.connect()
con.create_table("users_data", users_data)
#> DatabaseTable: memory.main.users_data
#> user int64
#> variant uint8
#> variant int64
#> sessions int64
#> orders int64
#> revenue float64
@@ -51,7 +52,7 @@ See the [Ibis documentation on how to create connections](https://ibis-project.o

## Querying experimental data

Method `con.create_table` in the example above returns an instance of Ibis Table which already can be used in the analysis of the experiment. But let's see how to use an SQL query to create Ibis Table:
Method `con.create_table` in the example above returns an Ibis Table which already can be used in the analysis of the experiment. But let's see how to use an SQL query to create an Ibis Table:

```python
data = con.sql("select * from users_data")
@@ -61,30 +62,39 @@ print(data)
#> select * from users_data
#> schema:
#> user int64
#> variant uint8
#> variant int64
#> sessions int64
#> orders int64
#> revenue float64
```

It's a very simple query. In real world, you might need to use joins, aggregations, and CTEs to get the data. You can define any SQL query supported by your data backend and use it to create Ibis Table.
It's a very simple query. In the real world, you might need to use joins, aggregations, and CTEs to get the data. You can define any SQL query supported by your data backend and use it to create an Ibis Table.
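
For instance, a more involved query could combine a CTE with a join. This is only a sketch; the `assignments` and `fact_orders` tables are hypothetical:

```python
# Hypothetical tables: assignments (one row per user) and fact_orders.
data = con.sql("""
    with user_orders as (
        select user_id, count(*) as orders, sum(amount) as revenue
        from fact_orders
        group by user_id
    )
    select a.user_id, a.variant, a.sessions,
        coalesce(o.orders, 0) as orders,
        coalesce(o.revenue, 0.0) as revenue
    from assignments as a
    left join user_orders as o on a.user_id = o.user_id
""")
```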

Keep in mind that **tea-tasting** assumes that:

- Data is grouped by randomization units, such as individual users.
- There is a column indicating variant of the A/B test (typically labeled as A, B, etc.).
- There is a column indicating the variant of the A/B test (typically labeled as A, B, etc.).
- All necessary columns for metric calculations (like the number of orders, revenue, etc.) are included in the table.

An Ibis Table is a lazy object: it doesn't fetch the data when created. You can use the Ibis DataFrame API to query the table and fetch the result:

```python
print(data.head(5).to_pandas())
#> user variant sessions orders revenue
#> 0 0 1 2 1 9.166147
#> 1 1 0 2 1 6.434079
#> 2 2 1 2 1 7.943873
#> 3 3 1 2 1 15.928675
#> 4 4 0 1 1 7.136917
with pl.Config(
float_precision=5,
tbl_cell_alignment="RIGHT",
tbl_formatting="NOTHING",
trim_decimal_zeros=False,
):
print(data.head(5).to_polars())
#> shape: (5, 5)
#> user variant sessions orders revenue
#> --- --- --- --- ---
#> i64 i64 i64 i64 f64
#> 0 1 2 1 9.16615
#> 1 0 2 1 6.43408
#> 2 1 2 1 7.94387
#> 3 1 2 1 15.92867
#> 4 0 1 1 7.13692
```

## Ibis example
@@ -104,7 +114,7 @@ print(aggr_data)
#> select * from users_data
#> schema:
#> user int64
#> variant uint8
#> variant int64
#> sessions int64
#> orders int64
#> revenue float64
@@ -122,10 +132,19 @@
`aggr_data` is another Ibis Table defined as a query over the previously defined `data`. Let's fetch the result:

```python
print(aggr_data.to_pandas())
#> variant sessions_per_user orders_per_session orders_per_user revenue_per_user
#> 0 0 1.996045 0.265726 0.530400 5.241079
#> 1 1 1.982802 0.289031 0.573091 5.730132
with pl.Config(
float_precision=5,
tbl_cell_alignment="RIGHT",
tbl_formatting="NOTHING",
trim_decimal_zeros=False,
):
print(aggr_data.to_polars())
#> shape: (2, 5)
#> variant sessions_per_user orders_per_session orders_per_user revenue_per_user
#> --- --- --- --- ---
#> i64 f64 f64 f64 f64
#> 0 1.99605 0.26573 0.53040 5.24108
#> 1 1.98280 0.28903 0.57309 5.73013
```

Internally, Ibis compiles a Table to an SQL query supported by the backend:
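
You can inspect the compiled query with `ibis.to_sql`:

```python
# Render the SQL that the DuckDB backend will run for aggr_data.
print(ibis.to_sql(aggr_data))
```
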
@@ -151,7 +170,7 @@ See [Ibis documentation](https://ibis-project.org/tutorials/getting_started) for more details.

## Experiment analysis

The example above shows how to query the metric averages. But for statistical inference it's not enough. For example, Student's t-test and Z-test also require number of rows and variance. And analysis of ratio metrics and variance reduction with CUPED require covariances.
The example above shows how to query the metric averages. But for statistical inference, that's not enough. For example, Student's t-test and Z-test also require the number of rows and the variance. Additionally, analysis of ratio metrics and variance reduction with CUPED requires covariances.
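
To see where covariances come in, here is a sketch of the delta-method variance of a ratio of two per-user means (see the delta method paper linked in the README):

```python
def delta_method_ratio_var(
    mean_x: float, mean_y: float,
    var_x: float, var_y: float,
    cov_xy: float, count: int,
) -> float:
    # First-order Taylor approximation of var(mean_x / mean_y).
    # Note the covariance term: it can't be derived from per-column stats.
    return (
        var_x / mean_y**2
        - 2 * mean_x * cov_xy / mean_y**3
        + mean_x**2 * var_y / mean_y**4
    ) / count
```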

Querying all the required statistics manually can be a daunting and error-prone task. But don't worry—**tea-tasting** does this work for you. You just need to specify the metrics:

@@ -171,9 +190,9 @@ print(result)
#> revenue_per_user 5.24 5.73 9.3% [-2.4%, 22%] 0.123
```

In the example above, **tea-tasting** fetches all the required statistics with a single query and then uses them to analyse the experiment.
In the example above, **tea-tasting** fetches all the required statistics with a single query and then uses them to analyze the experiment.

Some statistical methods, like Bootstrap, require granular data for the analysis. In this case, **tea-tasting** fetches the detailed data as well.
Some statistical methods, like bootstrap, require granular data for analysis. In this case, **tea-tasting** fetches the detailed data as well.
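
For intuition, a percentile bootstrap of the difference in means might look like this sketch (illustrative, not **tea-tasting**'s implementation):

```python
import numpy as np


def bootstrap_mean_diff_ci(
    contr, treat, n_resamples=10_000, confidence_level=0.95, seed=42,
):
    rng = np.random.default_rng(seed)
    contr, treat = np.asarray(contr), np.asarray(treat)
    diffs = np.array([
        rng.choice(treat, treat.size).mean() - rng.choice(contr, contr.size).mean()
        for _ in range(n_resamples)
    ])  # resampling with replacement
    alpha = 1 - confidence_level
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))
```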

## Example with CUPED
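
Before the example, a quick sketch of the CUPED adjustment itself (not **tea-tasting**'s internal code):

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    # theta = cov(x, y) / var(x) minimizes the variance of the adjusted metric.
    theta = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())
```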

@@ -184,7 +203,7 @@ users_data_with_cov = tt.make_users_data(seed=42, covariates=True)
con.create_table("users_data_with_cov", users_data_with_cov)
#> DatabaseTable: memory.main.users_data_with_cov
#> user int64
#> variant uint8
#> variant int64
#> sessions int64
#> orders int64
#> revenue float64
@@ -215,14 +234,11 @@ print(result_with_cov)

## Polars example

An example of analysis using a Polars DataFrame as input data:
Here’s an example of how to analyze data using a Polars DataFrame:

```python
import polars as pl


polars_data = pl.from_pandas(users_data)
print(experiment.analyze(polars_data))
data_polars = pl.from_arrow(users_data)
print(experiment.analyze(data_polars))
#> metric control treatment rel_effect_size rel_effect_size_ci pvalue
#> sessions_per_user 2.00 1.98 -0.66% [-3.7%, 2.5%] 0.674
#> orders_per_session 0.266 0.289 8.8% [-0.89%, 19%] 0.0762
7 changes: 3 additions & 4 deletions docs/index.md
@@ -9,11 +9,11 @@

**tea-tasting** is a Python package for the statistical analysis of A/B tests featuring:

- Student's t-test, Z-test, Bootstrap, and quantile metrics out of the box.
- Student's t-test, Z-test, bootstrap, and quantile metrics out of the box.
- Extensible API: define and use statistical tests of your choice.
- [Delta method](https://alexdeng.github.io/public/files/kdd2018-dm.pdf) for ratio metrics.
- Variance reduction with [CUPED](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf)/[CUPAC](https://doordash.engineering/2020/06/08/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/) (also in combination with the delta method for ratio metrics).
- Confidence intervals for both absolute and percentage change.
- Variance reduction using [CUPED](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf)/[CUPAC](https://doordash.engineering/2020/06/08/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/) (which can also be combined with the delta method for ratio metrics).
- Confidence intervals for both absolute and percentage changes.
- Sample ratio mismatch check.
- Power analysis.
- Multiple hypothesis testing (family-wise error rate and false discovery rate).
@@ -56,7 +56,6 @@ Learn more in the detailed [user guide](https://tea-tasting.e10v.me/user-guide/)

## Roadmap

- Switch from Pandas DataFrames to PyArrow Tables for internal data. Make Pandas dependency optional.
- A/A tests and simulations.
- More statistical tests:
- Asymptotic and exact tests for frequency data.