Fixed stratified splitting with Dask #1883
For a partitioned dataset, we can apply a stratified split to each partition individually to obtain a global stratified split. Proof:
Assumes every partition fits in memory.

Given $P$, the set of partitions of dataset $D$, and $S$, the set of unique values in the stratify column: when stratified splitting every partition individually, $\forall s \in S$:

$$
\begin{align}
&\sum_{p \in P} \%_{train} \times | p_{S = s} | & \%_{train} \texttt{ of every partition's } s \texttt{ values will be in that partition's train} \\
= &\%_{train} \times \sum_{p \in P} | p_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the sum of all } s \texttt{ value counts over all partitions} \\
= &\%_{train} \times | D_{S = s} | & \texttt{which is } \%_{train} \texttt{ of the total number of } s \texttt{ values} \\
\end{align}
$$

So $\%_{train}$ of the records in $D$ with value $s$ in the stratify column will land in the training set. The proof for the valid and test splits is equivalent.
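The argument can be checked numerically. The following is a minimal sketch in plain Python, not the code from this PR: the `stratified_split` helper, the three partitions, and the 0.7 train fraction are all made up for illustration. It splits each partition on its own and then measures the share of each stratify value that ended up in the combined train set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, train_frac, rng):
    """Split ONE partition so that each stratum sends ~train_frac of its rows to train.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)

    train, rest = [], []
    for stratum_rows in by_stratum.values():
        rng.shuffle(stratum_rows)
        cut = round(train_frac * len(stratum_rows))
        train.extend(stratum_rows[:cut])
        rest.extend(stratum_rows[cut:])
    return train, rest


rng = random.Random(0)

# Three made-up partitions with very different label mixes.
partitions = [
    [("a", i) for i in range(100)] + [("b", i) for i in range(20)],
    [("a", i) for i in range(50)] + [("b", i) for i in range(50)],
    [("a", i) for i in range(10)] + [("b", i) for i in range(90)],
]
train_frac = 0.7

# Split every partition individually and pool the per-partition train sets.
global_train = []
for part in partitions:
    train, _ = stratified_split(part, key=lambda r: r[0], train_frac=train_frac, rng=rng)
    global_train.extend(train)

# Share of each label that landed in train, measured over the whole dataset.
total = Counter(label for part in partitions for label, _ in part)
in_train = Counter(label for label, _ in global_train)
train_share = {label: in_train[label] / total[label] for label in total}
```

Here every per-partition stratum size times 0.7 is a whole number, so `train_share` comes out at 0.7 for both labels. When the products are not whole numbers, each partition rounds its own cut, so the global share matches the target only up to that accumulated rounding.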
From @justinxzhao: Move |
From @ShreyaR: might want to repartition after splitting into train/valid/test. Determine when to make the repartition call
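To illustrate why repartitioning might help: a per-partition split leaves every output split with as many partitions as the input, but with far fewer rows in each. The sketch below uses plain Python lists and a hypothetical `rebalance` helper to show the effect; with Dask DataFrames the corresponding operation would be `DataFrame.repartition`, and when to call it is the open question raised above.

```python
def rebalance(partitions, target_size):
    """Concatenate partitions and re-chunk them into pieces of at most target_size rows.

    Hypothetical stand-in for a repartition step; illustration only.
    """
    rows = [row for part in partitions for row in part]
    return [rows[i:i + target_size] for i in range(0, len(rows), target_size)]


# Made-up test split: taking 10% of four 1000-row partitions leaves 100 rows in each.
test_partitions = [list(range(p * 100, (p + 1) * 100)) for p in range(4)]

rebalanced = rebalance(test_partitions, target_size=400)

sizes_before = [len(p) for p in test_partitions]  # four small partitions
sizes_after = [len(p) for p in rebalanced]        # one consolidated partition
```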
Looks great! Feel free to merge once tests are passing