-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for ascii encoding for a subset of rows. #32
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Some comments and questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Some comments and questions below.
velox/vector/SimpleVector.h
Outdated
// If T is velox::StringView, specifies the string encoding mode | ||
folly::Optional<functions::stringCore::StringEncodingMode> encodingMode_ = | ||
folly::none; | ||
// If T is f4d::StringView, we create a bitmap to store asciiness for each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- f4d -> velox or better yet just drop it; "we create a bitmap to" can also be dropped for brevity: "If T is a StringView, stores ascii-ness ...."
In a succeeding PR , I will :
|
// always then path | ||
test("if(1=1, lower(C1), lower(C2))", StringEncodingMode::ASCII); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these tests are no longer relevant (for e.g the test that checks encoding on partially populated vector. The other tests, test switch expressions which is coming in a subsequent pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai A few quick questions.
velox/vector/SimpleVector.h
Outdated
|
||
rows.template applyToSelected([&](auto row) { | ||
asciiRows_.setValid(row, valid); | ||
asciiSetRows_.setValid(row, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be shortened to asciiSetRows_.select(rows) ?
velox/vector/SimpleVector.h
Outdated
copyAsciiDataFrom(const BaseVector* vector, const SelectivityVector& rows) { | ||
auto source = vector->asUnchecked<SimpleVector<StringView>>(); | ||
if (vector->isConstantEncoding()) { | ||
resizeAsciiRows(asciiRows_, rows.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resize calls should be the same in both branches, hence, can be moved outside of the "if"
velox/expression/Expr.cpp
Outdated
maxEncoding(resultEncoding, inputVector->getStringEncoding().value()); | ||
} | ||
|
||
resultVector->updateIsAscii(*inputVector, rows); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai My reading of the implementation of the updateIsAscii method is that it copies isAscii flags from other vector for the specified rows. However, here we need a different behavior. We want ascii flag to be true only for rows where all input rows are ascii.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch ! Fixed !
@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
2 similar comments
@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Krishna, thank you for updating the PR. Here are some further comments on the non-test changes.
velox/expression/Expr.cpp
Outdated
(*result)->as<SimpleVector<StringView>>()->setStringEncoding( | ||
*resultEncoding); | ||
} | ||
propagateIsAscii(vectorFunction_.get(), inputValues_, result, applyRows); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scanVectorFunctionInputsStringEncoding does calculate asciiness before calling applyVectorFunction. And we set it after the call via propagateIsAscii..
velox/vector/SelectivityVector.h
Outdated
@@ -56,13 +56,15 @@ class SelectivityVector { | |||
// are set to false. | |||
static SelectivityVector empty(vector_size_t size); | |||
|
|||
void resize(int32_t size) { | |||
void resize(int32_t size, bool value = true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create a seperate PR for this then.
@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
1 similar comment
@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Some further comments.
233f771
to
94f3792
Compare
@kgpai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
…incubator#27) * Filter validation for Parquet reader at runtime * Style * Style * Format Removed special handling for avg (facebookincubator#31) [OPPRO-173] Make batch size configurable (facebookincubator#32) support dwrf format
Currently for any String vector we store state on whether its fully ascii or not. When a vector is fully ascii it allows us to use the ascii fast path for any string computation. Most datasets are predominantly ascii , however presence of even one utf string will cause it to take utf path. We now maintain asciiness for the whole Vector but also maintain state on the rows we have processed.
Note:
TODO: This PR doesnt address case when a Vector is reused ; It might be possible then to have old / invalid ascii state. This will be addressed in a follow up PR.