Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: filter() function return remain_columns and deleted_columns for statistics #8767

Merged
merged 5 commits into from
Nov 14, 2022
Merged

feat: filter() function return remain_columns and deleted_columns for statistics #8767

merged 5 commits into from
Nov 14, 2022

Conversation

lichuang
Copy link
Contributor

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

feat: filter() function return remain_columns and deleted_columns for statistics

Closes #8766

@vercel
Copy link

vercel bot commented Nov 12, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
databend ⬜️ Ignored (Inspect) Nov 13, 2022 at 11:17AM (UTC)

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Nov 12, 2022
@sundy-li
Copy link
Member

This will make filter queries work slower than before which is unacceptable.

If you want to have the deleted column, there are some other ways like using Series::take

@BohuTANG BohuTANG removed their request for review November 13, 2022 00:57
@lichuang
Copy link
Contributor Author

This will make filter queries work slower than before which is unacceptable.

If you want to have the deleted column , there are some other ways like using Series::take

filter() use BooleanColumn as its filter args, i have make some poc that can get set and unset bits indices from BooleanColumn, and use Series::take to get remain and delete columns respectively, but it will loop the columns twice, if refactar filter() to return remain and deleted tuple, it will only need to be loop once.

@BohuTANG
Copy link
Member

BohuTANG commented Nov 13, 2022

I prefer not to add the deleted to the filter function, it's a common function and is a hot path.
Could you give more information on how we use the filter function in your case? cc @lichuang

@sundy-li
Copy link
Member

sundy-li commented Nov 13, 2022

it will loop the columns twice, if refactar filter() to return remain and deleted tuple, it will only need to be loop once.

An extra memory copy for the column value is much more costable than an extra loop.

@lichuang
Copy link
Contributor Author

I prefer not to add the deleted to the filter function, it's a common function and is a hot path. Could you give more information on how we use the filter function in your case? cc @lichuang

i agree that filter() is the hot path, and extra memory copy is expensive, so can we add a arg like Option<()> to indicate that if or not return deleted_column? cc @sundy-li

@sundy-li I have another question, if Series::take can filter all the column by BooleanColumn, then why all the sub-type of Column need to impl filter() function, why just use Series::take instead?

@sundy-li
Copy link
Member

filter can use vectorized selection to improve performance.

mask_chunks
        .by_ref()
        .enumerate()
        .for_each(|(mask_index, mut mask)| {
            while mask != 0 {
                let n = mask.trailing_zeros() as usize;
                let i = mask_index * CHUNK_SIZE + n + start_index;
                builder.push(c.get_data(i));
                mask = mask & (mask - 1);
            }
        });

take is used to take value by random.

so can we add a arg like Option<()> to indicate that if or not return deleted_column?

I don't think we should add extra arg to do that. Let's make the filter method work as its name.

The simplest way to have is
deleted_columns = column.filter(Neg::neg(bitmap) )

@BohuTANG BohuTANG merged commit 1340868 into databendlabs:main Nov 14, 2022
@lichuang lichuang deleted the filter_add_delete_column branch December 9, 2022 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

filter() function return remain_columns and deleted_columns for statistics
3 participants