Support for ascii encoding for a subset of rows. #32

Closed
wants to merge 2 commits into from

@kgpai kgpai commented Aug 12, 2021

Currently, for any string vector we store state on whether it is fully ASCII or not. When a vector is fully ASCII, we can use the ASCII fast path for any string computation. Most datasets are predominantly ASCII; however, the presence of even one UTF string forces the whole vector onto the UTF path. With this change we still maintain asciiness for the whole vector, but we also maintain state on the rows we have processed.

Note:

  • We now also store asciiStateMap, a selectivity vector of all rows we have processed so far. If we are given a selectivity vector containing a row that we haven't seen, we return nullopt; otherwise we return the asciiness of the whole vector.
  • We don't store the encoding mode state in the vector anymore.

TODO: This PR doesn't address the case where a vector is reused; it might then be possible to have old or invalid ASCII state. This will be addressed in a follow-up PR.
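
To make the lookup rule in the note above concrete, here is a minimal sketch. It is not the Velox implementation: `AsciiTrackedColumn`, `RowSet`, `computeAscii` and `isAsciiString` are hypothetical names, and a plain `std::vector<bool>` stands in for a selectivity vector. It also assumes that a query returns nullopt as soon as any selected row has not been processed.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a selectivity vector: rows[i] is true when row i
// is selected.
using RowSet = std::vector<bool>;

// Returns true when every byte of `s` is 7-bit ASCII.
inline bool isAsciiString(const std::string& s) {
  for (unsigned char c : s) {
    if (c >= 0x80) {
      return false;
    }
  }
  return true;
}

// Toy string column (an illustration only, not Velox's SimpleVector).
class AsciiTrackedColumn {
 public:
  explicit AsciiTrackedColumn(std::vector<std::string> values)
      : values_(std::move(values)), processed_(values_.size(), false) {}

  // Scans the selected rows, folds the result into the whole-vector flag and
  // records the rows as processed.
  void computeAscii(const RowSet& rows) {
    assert(rows.size() == values_.size());
    for (std::size_t i = 0; i < rows.size(); ++i) {
      if (rows[i]) {
        allAscii_ = allAscii_ && isAsciiString(values_[i]);
        processed_[i] = true;
      }
    }
  }

  // Returns nullopt when any selected row has not been processed yet;
  // otherwise returns the asciiness of the whole vector.
  std::optional<bool> isAscii(const RowSet& rows) const {
    assert(rows.size() == values_.size());
    for (std::size_t i = 0; i < rows.size(); ++i) {
      if (rows[i] && !processed_[i]) {
        return std::nullopt;
      }
    }
    return allAscii_;
  }

 private:
  std::vector<std::string> values_;
  RowSet processed_;      // Rows whose asciiness has been computed.
  bool allAscii_ = true;  // Asciiness over all processed rows.
};
```

Note that in this sketch the answer for already-processed rows can flip from true to false once a later scan encounters a non-ASCII row, because only a single whole-vector flag is kept.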

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 12, 2021
@kgpai kgpai requested review from mbasmanova and pedroerp August 12, 2021 05:39
@mbasmanova mbasmanova left a comment

@kgpai Some comments and questions.

velox/expression/Expr.cpp (outdated, resolved)
velox/functions/lib/StringEncodingUtils.h (outdated, resolved)
velox/functions/lib/StringEncodingUtils.h (outdated, resolved)
velox/functions/lib/StringEncodingUtils.h (outdated, resolved)
velox/vector/ConstantVector.h (outdated, resolved)
velox/expression/Expr.cpp (outdated, resolved)
velox/expression/Expr.cpp (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
@mbasmanova mbasmanova left a comment

@kgpai Some comments and questions below.

velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
// If T is velox::StringView, specifies the string encoding mode
folly::Optional<functions::stringCore::StringEncodingMode> encodingMode_ =
folly::none;
// If T is f4d::StringView, we create a bitmap to store asciiness for each
Contributor

  • f4d -> velox or better yet just drop it; "we create a bitmap to" can also be dropped for brevity: "If T is a StringView, stores ascii-ness ...."

velox/vector/SimpleVector.h (outdated, resolved)
velox/functions/common/StringFunctions.cpp (outdated, resolved)
velox/functions/common/tests/StringFunctionsTest.cpp (outdated, resolved)
velox/vector/ConstantVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/expression/Expr.cpp (outdated, resolved)

kgpai commented Aug 18, 2021

In a follow-up PR, I will:

  • Remove the StringEncoding enum.
  • Add tests and support for the switch form.

velox/expression/ControlExpr.cpp (resolved)
// always then path
test("if(1=1, lower(C1), lower(C2))", StringEncodingMode::ASCII);
Contributor Author

Some of these tests are no longer relevant (e.g., the test that checks encoding on a partially populated vector). The other tests cover switch expressions, support for which is coming in a subsequent PR.

@mbasmanova mbasmanova left a comment

@kgpai A few quick questions.

velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)

rows.template applyToSelected([&](auto row) {
asciiRows_.setValid(row, valid);
asciiSetRows_.setValid(row, true);
Contributor

Can this be shortened to asciiSetRows_.select(rows)?
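
A small sketch of why the two forms would be interchangeable, assuming `select(rows)` turns on every row that is set in `rows` and leaves the other rows unchanged. `RowBitmap`, `markPerRow` and `markBulk` are hypothetical stand-ins written for this illustration; they are not the Velox classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy row bitmap (a stand-in, not velox::SelectivityVector).
class RowBitmap {
 public:
  explicit RowBitmap(std::size_t size, bool value = false)
      : bits_(size, value) {}

  std::size_t size() const { return bits_.size(); }
  bool isValid(std::size_t row) const { return bits_[row]; }
  void setValid(std::size_t row, bool value) { bits_[row] = value; }

  // Bulk operation: turns on every row that is on in `other` and leaves the
  // remaining rows untouched (assumed semantics).
  void select(const RowBitmap& other) {
    assert(other.size() <= size());
    for (std::size_t i = 0; i < other.size(); ++i) {
      if (other.bits_[i]) {
        bits_[i] = true;
      }
    }
  }

  // Calls `func(row)` for every row that is on.
  template <typename Func>
  void applyToSelected(Func&& func) const {
    for (std::size_t i = 0; i < bits_.size(); ++i) {
      if (bits_[i]) {
        func(i);
      }
    }
  }

  bool operator==(const RowBitmap& other) const { return bits_ == other.bits_; }

 private:
  std::vector<bool> bits_;
};

// Per-row form, as in the snippet under review.
inline void markPerRow(RowBitmap& seen, const RowBitmap& rows) {
  rows.applyToSelected([&](std::size_t row) { seen.setValid(row, true); });
}

// Bulk form suggested in the review comment.
inline void markBulk(RowBitmap& seen, const RowBitmap& rows) {
  seen.select(rows);
}
```

Under that assumption both helpers leave `seen` in the same state; the bulk form simply avoids the per-row callback.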

velox/vector/SimpleVector.h (outdated, resolved)
copyAsciiDataFrom(const BaseVector* vector, const SelectivityVector& rows) {
auto source = vector->asUnchecked<SimpleVector<StringView>>();
if (vector->isConstantEncoding()) {
resizeAsciiRows(asciiRows_, rows.end());
Contributor

The resize calls should be the same in both branches and hence can be moved outside of the "if".

velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
maxEncoding(resultEncoding, inputVector->getStringEncoding().value());
}

resultVector->updateIsAscii(*inputVector, rows);
Contributor

@kgpai My reading of the implementation of the updateIsAscii method is that it copies the isAscii flags from the other vector for the specified rows. However, here we need a different behavior: we want the ascii flag to be true only for rows where all input rows are ascii.

Contributor Author

Nice catch! Fixed!
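
The difference between the two behaviors discussed in this thread can be sketched as follows. This is an illustration only, not the code from the PR: `RowFlags`, `copyAsciiFrom` and `combineAscii` are hypothetical names, and per-row flags are modeled as `std::vector<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-row flags: flags[i] is true when row i is ASCII (or, for
// a row selection, when row i is selected).
using RowFlags = std::vector<bool>;

// Copy semantics: overwrites the result flags with those of a single input
// for the selected rows. Calling this once per input keeps only the flags of
// the last input.
inline void copyAsciiFrom(
    RowFlags& result,
    const RowFlags& input,
    const RowFlags& rows) {
  assert(result.size() == rows.size() && input.size() == rows.size());
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (rows[i]) {
      result[i] = input[i];
    }
  }
}

// AND semantics: a selected row is ASCII only if that row is ASCII in every
// input. Unselected rows are left false.
inline RowFlags combineAscii(
    const std::vector<RowFlags>& inputs,
    const RowFlags& rows) {
  RowFlags result(rows.size(), false);
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (!rows[i]) {
      continue;
    }
    bool allAscii = true;
    for (const RowFlags& input : inputs) {
      assert(input.size() == rows.size());
      allAscii = allAscii && input[i];
    }
    result[i] = allAscii;
  }
  return result;
}
```

With inputs `{1, 1, 0, 1}` and `{1, 0, 1, 1}` and rows 0 to 2 selected, copying from each input in turn marks row 2 as ASCII even though the first input is not ASCII there, while the AND form marks only row 0.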

@facebook-github-bot

@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mbasmanova mbasmanova left a comment

@kgpai Krishna, thank you for updating the PR. Here are some further comments on the non-test changes.

velox/expression/ControlExpr.cpp (outdated, resolved)
velox/expression/ControlExpr.cpp (resolved)
velox/expression/ControlExpr.cpp (outdated, resolved)
velox/expression/Expr.cpp (resolved)
velox/expression/Expr.cpp (resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/expression/Expr.cpp (outdated, resolved)
velox/expression/ControlExpr.cpp (outdated, resolved)
velox/expression/ControlExpr.cpp (resolved)
velox/expression/Expr.cpp (resolved)
velox/expression/ControlExpr.cpp (outdated, resolved)
(*result)->as<SimpleVector<StringView>>()->setStringEncoding(
*resultEncoding);
}
propagateIsAscii(vectorFunction_.get(), inputValues_, result, applyRows);
Contributor Author

scanVectorFunctionInputsStringEncoding calculates asciiness before applyVectorFunction is called, and we set it on the result after the call via propagateIsAscii.

velox/expression/Expr.cpp (resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
@@ -56,13 +56,15 @@ class SelectivityVector {
   // are set to false.
   static SelectivityVector empty(vector_size_t size);

-  void resize(int32_t size) {
+  void resize(int32_t size, bool value = true) {
Contributor Author

I will create a separate PR for this then.
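
For context, the signature change in the diff above lets the caller choose the value given to newly added rows. A minimal sketch of that idea, assuming existing rows keep their state and only new rows receive `value`; `GrowableRowBitmap` is a hypothetical class for illustration, not `velox::SelectivityVector`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy bitmap whose resize() takes the value for newly added rows.
class GrowableRowBitmap {
 public:
  explicit GrowableRowBitmap(std::size_t size, bool value = true)
      : bits_(size, value) {}

  // Grows or shrinks the bitmap. Existing rows keep their value; rows added
  // by growing are set to `value`.
  void resize(std::size_t size, bool value = true) {
    bits_.resize(size, value);
  }

  std::size_t size() const { return bits_.size(); }

  bool isValid(std::size_t row) const {
    assert(row < bits_.size());
    return bits_[row];
  }

 private:
  std::vector<bool> bits_;
};
```

A bitmap that tracks processed rows would presumably grow with `value = false`, so that new rows start out as not yet seen.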

@facebook-github-bot

@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mbasmanova mbasmanova left a comment

@kgpai Some further comments.

velox/vector/ConstantVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
velox/vector/SimpleVector.h (outdated, resolved)
@kgpai kgpai force-pushed the fast_ascii branch 8 times, most recently from 233f771 to 94f3792 Compare September 1, 2021 05:45
@facebook-github-bot

@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@kgpai merged this pull request in b48cd5f.

@kgpai kgpai mentioned this pull request Sep 9, 2021
facebook-github-bot pushed a commit that referenced this pull request Sep 10, 2021
Summary:
With the recent ascii compute changes (See: #32), the StringEncoding enum is redundant and no longer required.

Pull Request resolved: #183

Reviewed By: mbasmanova

Differential Revision: D30847517

Pulled By: kgpai

fbshipit-source-id: 7c850f2536d17af179d343138b154ba934fb9c28
rui-mo added a commit to rui-mo/velox that referenced this pull request Jul 28, 2022
rui-mo added a commit to rui-mo/velox that referenced this pull request Aug 12, 2022
rui-mo added a commit to rui-mo/velox that referenced this pull request Aug 22, 2022
rui-mo added a commit to rui-mo/velox that referenced this pull request Sep 7, 2022
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Sep 26, 2022
…incubator#27)

* Filter validation for Parquet reader at runtime

* Style

* Style

* Format

Removed special handling for avg (facebookincubator#31)

[OPPRO-173] Make batch size configurable (facebookincubator#32)

support dwrf format