Better parsing for the words- and docsfile #1695

Open
wants to merge 18 commits into
base: master

Flixtastic
Contributor

Separate PR to further improve the parsing during text index building.

@Flixtastic
Contributor Author

One question: I didn't find a way to convert the result of absl::StrSplit to a std::ranges range or view, and therefore resorted to using another cppcoro generator. I've seen the idea of avoiding these generators, but am I right that the new iterators can only be used by writing classes that implement them?
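
For reference, a minimal sketch of what such a hand-written iterator class could look like, using only the standard library. `LazySplit` and `splitToVector` are hypothetical names for illustration and are not part of this PR; the sketch assumes a single-character delimiter and does not attempt to model a full `std::ranges` view. C++20 also provides `std::views::split`, which may make a custom class unnecessary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of the PR: splits `text` at `delim` lazily,
// producing one std::string_view per field on demand.
class LazySplit {
 public:
  class Iterator {
   public:
    Iterator() = default;  // A default-constructed iterator marks the end.
    Iterator(std::string_view rest, char delim)
        : rest_{rest}, delim_{delim}, atEnd_{false} {
      advance();
    }
    std::string_view operator*() const { return current_; }
    Iterator& operator++() {
      advance();
      return *this;
    }
    bool operator!=(const Iterator& other) const {
      return atEnd_ != other.atEnd_;
    }

   private:
    // Extract the next field, or switch to the end state if none is left.
    void advance() {
      if (exhausted_) {
        atEnd_ = true;
        return;
      }
      const std::size_t pos = rest_.find(delim_);
      if (pos == std::string_view::npos) {
        current_ = rest_;  // Last field: everything that remains.
        exhausted_ = true;
      } else {
        current_ = rest_.substr(0, pos);
        rest_.remove_prefix(pos + 1);
      }
    }
    std::string_view rest_;
    std::string_view current_;
    char delim_ = '\0';
    bool exhausted_ = false;
    bool atEnd_ = true;
  };

  LazySplit(std::string_view text, char delim) : text_{text}, delim_{delim} {}
  Iterator begin() const { return Iterator{text_, delim_}; }
  Iterator end() const { return Iterator{}; }

 private:
  std::string_view text_;
  char delim_;
};

// Usage: the class works directly in a range-based for loop.
std::vector<std::string> splitToVector(std::string_view text, char delim) {
  std::vector<std::string> fields;
  for (std::string_view field : LazySplit{text, delim}) {
    fields.emplace_back(field);
  }
  return fields;
}
```

The iterator only implements what a range-based for loop needs (`operator*`, `operator++`, `operator!=`); satisfying the C++20 iterator concepts would require additional member types and operators.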

codecov bot commented Dec 28, 2024

Codecov Report

Attention: Patch coverage is 88.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.87%. Comparing base (acb6633) to head (bea5936).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/index/IndexImpl.Text.cpp | 75.00% | 9 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1695   +/-   ##
=======================================
  Coverage   89.86%   89.87%           
=======================================
  Files         389      390    +1     
  Lines       37308    37339   +31     
  Branches     4204     4205    +1     
=======================================
+ Hits        33527    33557   +30     
+ Misses       2485     2483    -2     
- Partials     1296     1299    +3     


Member

@joka921 joka921 left a comment

Thank you very much, this is absolutely going in the right direction.
I have some initial comments on the cleanup; let me know if you need further advice.

src/index/IndexImpl.Text.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
src/parser/WordsAndDocsFileParser.cpp (outdated, resolved)
test/WordsAndDocsFileLineCreator.h (outdated, resolved)
test/WordsAndDocsFileLineCreator.h (outdated, resolved)
test/WordsAndDocsFileParserTest.cpp (outdated, resolved)
Flixtastic and others added 2 commits January 9, 2025 12:44
…sts in WordsAndDocsFileParserTest.cpp. Renamed methods in WordsAndDocsFileLineCreator.h to reduce ambiguity. Incorporated requested small changes of PR.
@Flixtastic Flixtastic requested a review from joka921 January 9, 2025 15:54
Signed-off-by: Johannes Kalmbach <johannes.kalmbach@gmail.com>
Member

@joka921 joka921 left a comment

Only small suggestions.

Also have a look at the SonarCloud issues.

src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
src/parser/WordsAndDocsFileParser.h (outdated, resolved)
test/WordsAndDocsFileParserTest.cpp (outdated, resolved)
ASSERT_EQ(std::get<0>(testLine), std::get<0>(expectedResult.at(i)));
ASSERT_EQ(std::get<1>(testLine), std::get<1>(expectedResult.at(i)));
ASSERT_EQ(std::get<2>(testLine), std::get<2>(expectedResult.at(i)));
ASSERT_EQ(std::get<3>(testLine), std::get<3>(expectedResult.at(i)));
Member

That is much better with the helper functions. (There are even cleaner ways with better error messages in GoogleTest, but this refactoring is nice because now all improvements can be applied locally!)
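
One possible simplification, sketched here with the standard library only: `std::tuple` provides element-wise `operator==`, so the four per-field `std::get` comparisons can collapse into a single comparison per line. The `Line` alias and `firstMismatch` are hypothetical; the actual field types in the PR may differ. In GoogleTest, the same idea would amount to a single `ASSERT_EQ` on the whole tuple, since GoogleTest can print tuples in failure messages.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical line type; the field types used in the PR may differ.
using Line = std::tuple<std::string, bool, std::size_t, std::size_t>;

// Returns the index of the first differing position, or -1 if both
// sequences are equal. A length mismatch is reported at the first index
// that exists in only one of the two sequences.
std::ptrdiff_t firstMismatch(const std::vector<Line>& actual,
                             const std::vector<Line>& expected) {
  const std::size_t common = std::min(actual.size(), expected.size());
  for (std::size_t i = 0; i < common; ++i) {
    // One whole-tuple comparison replaces four std::get<N> comparisons.
    if (actual[i] != expected[i]) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  if (actual.size() != expected.size()) {
    return static_cast<std::ptrdiff_t>(common);
  }
  return -1;
}
```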

Member

@joka921 joka921 left a comment

Also a very small suggestion.

src/index/IndexImpl.Text.cpp (outdated, resolved)
@Flixtastic Flixtastic requested a review from joka921 January 10, 2025 18:00
@sparql-conformance

@Flixtastic
Contributor Author

One possible solution to the current coverage problem is to create a file IndexImplHelpers.h with a corresponding .cpp, move the helper methods there, and test them separately. This would lead to even more references being passed to the functions.

I am currently unsure whether to do this. Alternatively, there may be another way to reduce the nesting at the places where the helper functions are now used.

Member

@joka921 joka921 left a comment

Only a very small thing, otherwise this now looks much cleaner.

@@ -53,8 +53,7 @@ cppcoro::generator<WordsFileLine> IndexImpl::wordsInTextRecords(
   std::string_view textView = text;
   textView = textView.substr(0, textView.rfind('"'));
   textView.remove_prefix(1);
-  auto normalizedWords = tokenizeAndNormalizeText(textView, localeManager);
-  for (auto word : normalizedWords) {
+  for (auto word : tokenizeAndNormalizeText(textView, localeManager)) {
     WordsFileLine wordLine{word, false, contextId, 1};
Member

Do we benefit from a std::move(word) here?
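
A small sketch of what is at stake, under the assumption that the loop variable owns its data (for example a `std::string`); if the tokenizer yields `std::string_view`, a move would not save anything. `CountingString` and `Line` are hypothetical stand-ins for the real types, used only to make copies observable.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Counts how often a CountingString is copy-constructed.
static int copyCount = 0;

// Hypothetical stand-in for an owning string type such as std::string.
struct CountingString {
  std::string value;
  explicit CountingString(std::string v) : value(std::move(v)) {}
  CountingString(const CountingString& other) : value(other.value) {
    ++copyCount;
  }
  CountingString(CountingString&& other) noexcept = default;
};

// Hypothetical, simplified analogue of WordsFileLine (an aggregate).
struct Line {
  CountingString word;
  bool isEntity;
  int contextId;
  int score;
};

// Initializing the aggregate from an lvalue copies the word.
Line makeByCopy(CountingString word) { return Line{word, false, 0, 1}; }

// Initializing from std::move(word) moves the word instead of copying it.
Line makeByMove(CountingString word) {
  return Line{std::move(word), false, 0, 1};
}
```

In this sketch `makeByCopy` performs one copy per call, while `makeByMove` performs none. Whether this is measurable in the index builder depends on the word lengths and on the type the loop actually iterates over.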
