Fixed stratified splitting with Dask #1883

tgaddair · 2022-04-06T21:52:13Z

No description provided.

github-actions · 2022-04-06T22:17:27Z

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 42m 36s ⏱️ -10m 57s
3 386 tests +10 · 3 308 ✔️ +12 · 78 💤 ±0 · 0 ❌ -2
10 158 runs +30 · 9 900 ✔️ +32 · 258 💤 ±0 · 0 ❌ -2

Results for commit 4f6e168. ± Comparison against base commit e60626f.

♻️ This comment has been updated with latest results.

jppgks · 2022-09-06T09:18:46Z

For a partitioned dataset, performing a stratified split on each partition individually yields a global stratified split.

Proof.
Let $P$ be the set of partitions of dataset $D$, and $S$ the set of unique values in the stratify column. When every partition is stratified-split individually, the number of records with value $s$ that land in the training set is, $\forall s \in S:$

$$ \begin{align} &\sum_{p \in P} \%_{train} \times | p_{S = s} | & \%_{train} \texttt{ of every partition's } s \texttt{ values will be in that partition's train} \\ = &\%_{train} \times \sum_{p \in P} | p_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the sum of all } s \texttt{ value counts over all partitions} \\ = &\%_{train} \times | D_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the total number of } s \texttt{ values} \\ \end{align} $$

Hence $\%_{train}$ of the records in $D$ with value $s$ in the stratify column land in the training set. The same argument applies to the validation and test splits.

Assumes every partition fits in memory.
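The argument above can be checked numerically with a small sketch. This is not the code from `ludwig/data/split.py`; the helper `stratified_split_partition`, the toy partitions, and the `label` column are hypothetical names chosen for illustration, and plain Python lists stand in for Dask partitions. The stratum counts are picked so that 70% of each count is a whole number, which matches the proof's implicit assumption that the fraction is taken exactly.

```python
import random
from collections import Counter, defaultdict


def stratified_split_partition(rows, key, train_frac, seed=0):
    """Split one partition so that each stratum sends train_frac of its rows to train.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for row in rows:
        by_value[row[key]].append(row)

    train, rest = [], []
    for group in by_value.values():
        rng.shuffle(group)
        cut = int(round(train_frac * len(group)))
        train.extend(group[:cut])
        rest.extend(group[cut:])
    return train, rest


def make_partition(n_a, n_b):
    """Build a toy partition with n_a rows labelled 'a' and n_b rows labelled 'b'."""
    return [{"label": "a"} for _ in range(n_a)] + [{"label": "b"} for _ in range(n_b)]


# Three partitions with different class mixes (an assumption of this sketch).
partitions = [make_partition(20, 10), make_partition(10, 40), make_partition(30, 30)]
train_frac = 0.7

# Split every partition on its own, then pool the per-partition train sets.
global_train = []
for i, partition in enumerate(partitions):
    part_train, _ = stratified_split_partition(partition, "label", train_frac, seed=i)
    global_train.extend(part_train)

total_counts = dict(Counter(row["label"] for p in partitions for row in p))
train_counts = dict(Counter(row["label"] for row in global_train))

for value in sorted(total_counts):
    print(value, train_counts[value], "of", total_counts[value])
```

When a stratum's count in a partition is not an exact multiple, each partition rounds independently, so the global proportion holds only approximately, with the deviation growing with the number of partitions and strata.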

jppgks · 2022-09-06T19:02:06Z

From @justinxzhao: Move split_partition outside split scope

jppgks · 2022-09-06T19:03:18Z

From @ShreyaR: might want to repartition after splitting into train/valid/test. Determine when to make the repartition call

ludwig/data/split.py

ShreyaR

Looks great! Feel free to merge once tests are passing

Added test for stratified sampling

037fd80

jppgks force-pushed the stratified-sampling branch from 2d8c7b8 to 037fd80 August 30, 2022 13:46

have test fail using new split syntax

f5d2456

jppgks added 4 commits September 6, 2022 10:04

add failing unit test

5d91d25

Merge remote-tracking branch 'origin/master' into stratified-sampling

ce8345b

bring back integration test after pulling in master

b4d723a

jppgks marked this pull request as ready for review September 6, 2022 11:35

jppgks added 2 commits September 7, 2022 15:50

extract helper out of class

9ee5bd7

Merge remote-tracking branch 'origin/master' into stratified-sampling

c999ada

ShreyaR requested changes Sep 7, 2022

ludwig/data/split.py (outdated, resolved)

ludwig/data/split.py (resolved)

jppgks added 2 commits September 9, 2022 16:30

Merge remote-tracking branch 'origin/master' into stratified-sampling

6b5c9f8

remove the need for local backend

4f6e168

jppgks requested a review from ShreyaR September 14, 2022 09:58

ShreyaR approved these changes Sep 14, 2022

remove unused imports

2153533

jppgks merged commit 7bab539 into master Sep 14, 2022

jppgks deleted the stratified-sampling branch September 14, 2022 17:27
