
Fixed stratified splitting with Dask #1883

Merged
merged 11 commits into from
Sep 14, 2022


tgaddair
Collaborator

@tgaddair tgaddair commented Apr 6, 2022

No description provided.

@github-actions

github-actions bot commented Apr 6, 2022

Unit Test Results

         6 files  ±  0         6 suites  ±0   2h 42m 36s ⏱️ - 10m 57s
  3 386 tests +10  3 308 ✔️ +12    78 💤 ±0  0  - 2 
10 158 runs  +30  9 900 ✔️ +32  258 💤 ±0  0  - 2 

Results for commit 4f6e168. ± Comparison against base commit e60626f.

♻️ This comment has been updated with latest results.

@jppgks jppgks force-pushed the stratified-sampling branch from 2d8c7b8 to 037fd80 Compare August 30, 2022 13:46
@jppgks
Contributor

jppgks commented Sep 6, 2022

For a partitioned dataset, we can apply a stratified split to each partition individually to obtain a global stratified split.

Proof.
Let $P$ be the set of partitions of dataset $D$, and $S$ the set of unique values in the stratify column. When every partition is stratified-split individually, then $\forall s \in S:$

$$
\begin{align}
&\sum_{p \in P} \%_{train} \times | p_{S = s} | & \%_{train} \texttt{ of every partition's } s \texttt{ values will be in that partition's train} \\
= &\%_{train} \times \sum_{p \in P} | p_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the sum of all } s \texttt{ value counts over all partitions} \\
= &\%_{train} \times | D_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the total number of } s \texttt{ values}
\end{align}
$$

So $\%_{train}$ of the records in $D$ with value $s$ in the stratify column land in the training set. The same argument applies to the valid and test splits.
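The argument can be checked on a toy example. The following is a minimal standard-library sketch, not the code from this PR: partitions are modeled as plain lists of dicts, and `split_partition` and `stratified_split` here are illustrative helpers whose signatures are assumptions. Note that each partition rounds its per-class counts to whole rows, so the global proportions are exact only when every per-partition count divides evenly; otherwise they hold up to rounding.

```python
import random
from collections import Counter, defaultdict


def split_partition(rows, column, probabilities, seed=0):
    """Stratified split of ONE partition into (train, valid, test) lists.

    Hypothetical helper for illustration; `probabilities` is
    (train, valid, test) and is assumed to sum to 1.
    """
    rng = random.Random(seed)
    # Group the partition's rows by their value in the stratify column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)

    train, valid, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n = len(members)
        # Round to whole rows; this is where per-partition rounding enters.
        n_train = min(round(probabilities[0] * n), n)
        n_valid = min(round(probabilities[1] * n), n - n_train)
        train.extend(members[:n_train])
        valid.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])
    return train, valid, test


def stratified_split(partitions, column, probabilities, seed=0):
    """Split every partition independently, then concatenate the results."""
    train, valid, test = [], [], []
    for i, partition in enumerate(partitions):
        tr, va, te = split_partition(partition, column, probabilities, seed + i)
        train.extend(tr)
        valid.extend(va)
        test.extend(te)
    return train, valid, test


def make_partition(n_a, n_b):
    """Build a partition with n_a rows labeled 'a' and n_b rows labeled 'b'."""
    return [{"label": "a"}] * n_a + [{"label": "b"}] * n_b


if __name__ == "__main__":
    # Three partitions with very different class mixes.
    partitions = [make_partition(70, 30), make_partition(20, 80), make_partition(50, 50)]
    train, valid, test = stratified_split(partitions, "label", (0.7, 0.1, 0.2))
    for name, split in (("train", train), ("valid", valid), ("test", test)):
        print(name, dict(Counter(row["label"] for row in split)))
```

Even though the class mix differs per partition, each class ends up with 70% of its global total in train, because the per-partition counts simply add up.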

Assumes every partition fits in memory

@jppgks jppgks marked this pull request as ready for review September 6, 2022 11:35
@jppgks
Contributor

jppgks commented Sep 6, 2022

From @justinxzhao: Move split_partition outside split scope

@jppgks
Contributor

jppgks commented Sep 6, 2022

From @ShreyaR: might want to repartition after splitting into train/valid/test. Determine when to make the repartition call
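To illustrate why repartitioning can matter here: assuming the split is applied partition by partition, each output split holds only a fraction of every input partition, so its partitions can end up small or uneven. The sketch below is a standard-library illustration of rebalancing rows into evenly sized partitions; the `repartition` function is a hypothetical stand-in, not the PR's code. With Dask itself, `DataFrame.repartition(npartitions=...)` is the relevant call; where the PR ends up invoking it is not shown in this conversation.

```python
def repartition(partitions, npartitions):
    """Rebalance rows into `npartitions` lists whose sizes differ by at most 1.

    Hypothetical helper for illustration; row order is preserved.
    """
    rows = [row for partition in partitions for row in partition]
    base, extra = divmod(len(rows), npartitions)
    result, start = [], 0
    for i in range(npartitions):
        # The first `extra` partitions take one additional row each.
        size = base + (1 if i < extra else 0)
        result.append(rows[start:start + size])
        start += size
    return result


if __name__ == "__main__":
    # A train split whose partitions became uneven after splitting.
    uneven = [[1], [2, 3, 4, 5, 6], [], [7]]
    print(repartition(uneven, 2))
```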

ludwig/data/split.py (outdated, resolved)
ludwig/data/split.py (resolved)
@jppgks jppgks requested a review from ShreyaR September 14, 2022 09:58
Contributor

@ShreyaR ShreyaR left a comment


Looks great! Feel free to merge once tests are passing

@jppgks jppgks merged commit 7bab539 into master Sep 14, 2022
@jppgks jppgks deleted the stratified-sampling branch September 14, 2022 17:27
jppgks pushed a commit that referenced this pull request Sep 14, 2022
3 participants