[SYSTEMDS-3540] Check columns for One-Hot-Encoding before compression #2054

Open
wants to merge 16 commits into
base: main

@smyomous smyomous commented Jul 24, 2024

[SYSTEMDS-3540] Checking if column groups are One-Hot-Encoded before compression

This patch adds checks on column groups to verify whether multiple columns, taken together, are one-hot encoded. When they are, these column groups are compressed into an IdentityDictionary, which exploits the one-hot structure and optimizes subsequent operations. We observe a reduction in the number of column groups of up to 98%, with most experiments seeing a reduction of over 80%. However, the current implementation of the compression of one-hot-encoded columns adds a large execution-time overhead on larger datasets.
Detailed documentation can be found in scripts/perftest/ohe_checks/README.md.
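
To illustrate the idea behind the check, here is a minimal Python sketch. It is not the SystemDS implementation (which is written in Java and works on column groups); it only assumes a dense row-major matrix and treats a set of columns as one-hot encoded when every selected cell is 0 or 1 and each row contains exactly one 1. The function names `is_one_hot_group` and `one_hot_to_codes` are hypothetical and chosen for this example only.

```python
from typing import List, Sequence


def is_one_hot_group(rows: Sequence[Sequence[float]], cols: Sequence[int]) -> bool:
    """Return True if the selected columns, taken together, are one-hot encoded.

    Assumed criterion: every selected cell is 0 or 1, and each row has exactly one 1.
    """
    if not cols:
        return False
    for row in rows:
        ones = 0
        for c in cols:
            v = row[c]
            if v == 1:
                ones += 1
            elif v != 0:
                return False  # a value other than 0/1 rules out one-hot encoding
        if ones != 1:
            return False  # no 1 or more than one 1 in this row
    return True


def one_hot_to_codes(rows: Sequence[Sequence[float]], cols: Sequence[int]) -> List[int]:
    """Collapse a verified one-hot group into one integer code per row.

    The code is the position (within cols) of the single 1 in that row.
    """
    return [next(i for i, c in enumerate(cols) if row[c] == 1) for row in rows]


# Toy data: columns 0-2 are a one-hot group, column 3 is a numeric feature.
data = [
    [1, 0, 0, 3.5],
    [0, 1, 0, 1.0],
    [0, 0, 1, 2.0],
    [0, 1, 0, 7.5],
]

print(is_one_hot_group(data, [0, 1, 2]))  # True
print(one_hot_to_codes(data, [0, 1, 2]))  # [0, 1, 2, 1]
print(is_one_hot_group(data, [0, 1]))     # False: the third row has no 1 in columns 0 and 1
```

In this sketch the three binary columns collapse into a single list of codes, which is the kind of structure the patch aims to exploit. The actual IdentityDictionary-based encoding in SystemDS is not shown here.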

Contributor

@Baunsgaard Baunsgaard left a comment

Overall good; however, some cleanup tasks remain and we need to fix the tests.

conf/log4j-compression.properties: 2 review comments (outdated, resolved)
scripts/perftest/ohe_checks/README.md: 2 review comments (outdated, resolved)
scripts/perftest/ohe_checks/experiments.sh: 1 review comment (outdated, resolved)

@Baunsgaard
Contributor

Simply mark the comments as resolved once you have addressed them.

@Baunsgaard
Contributor

Try to look into the errors from the tests. To do this, click on the failing tests

[image]

and decipher where the bug is.

One example is:

[image]
