Speed up the word prefix databases computation time #436
Conversation
Force-pushed from `f45a56c` to `bda0c67`
Force-pushed from `45f647a` to `c5a4e46`
Force-pushed from `6226243` to `51d1e64`
```rust
// We extract and mmap our chunk file to be able to get it for next processes.
let mut file = chunk.into_inner();
let mmap = unsafe { memmap2::Mmap::map(&file)? };
let cursor_mmap = CursorClonableMmap::new(ClonableMmap::from(mmap));
let chunk = grenad::Reader::new(cursor_mmap)?;
word_docids.push(chunk);

// We reconstruct our typed-chunk back.
file.rewind()?;
let chunk = grenad::Reader::new(file)?;
```
Could you factorize this into a function? Could `into_clonable_grenad` suit this case?
Done by bumping the grenad version to v0.4.1 and changing the behavior of `into_clonable_grenad` (renamed `as_cloneable_grenad`) by making it take a reference to a `grenad::Reader`. Doing so reduced the complexity of the quoted code; a sketch of such a helper follows.
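For reference, here is a minimal sketch of what such a helper can look like. The `ClonableMmap` wrapper is reconstructed here for self-containment (the real one lives in milli's helpers), and it assumes a `get_ref` accessor on `grenad::Reader` since v0.4.1; treat it as an illustration, not the crate's exact code.

```rust
use std::fs::File;
use std::io;
use std::sync::Arc;

use memmap2::Mmap;

/// Cheap-to-clone view over an mmap (stand-in for the crate's `ClonableMmap`).
#[derive(Clone)]
struct ClonableMmap(Arc<Mmap>);

impl AsRef<[u8]> for ClonableMmap {
    fn as_ref(&self) -> &[u8] {
        &self.0
    }
}

/// Stand-in for the crate's `CursorClonableMmap` alias.
type CursorClonableMmap = io::Cursor<ClonableMmap>;

/// Borrow a file-backed reader, mmap its file, and return a new reader over a
/// cheaply clonable cursor, leaving the original reader usable.
fn as_cloneable_grenad(
    reader: &grenad::Reader<File>,
) -> anyhow::Result<grenad::Reader<CursorClonableMmap>> {
    let file = reader.get_ref(); // assumed accessor on grenad v0.4.1
    let mmap = unsafe { Mmap::map(file)? };
    let cursor = io::Cursor::new(ClonableMmap(Arc::new(mmap)));
    Ok(grenad::Reader::new(cursor)?)
}
```

Because the reader is only borrowed, the caller can still rewind the original file and rebuild the typed chunk, which is exactly what the snippet above needed two readers for.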
Is this PR breaking? (API break or DB break)
There shouldn't be any breaking change, either in the Rust API or in the written formats. If there is one, it's a bug.
All good to me, thanks!
I'll merge then, thank you for your review!
```rust
let current_prefix_fst = self.index.words_prefixes_fst(self.wtxn)?;

// We retrieve the common words between the previous and new prefix word fst.
let common_prefix_fst_words = fst_stream_into_vec(
    previous_words_prefixes_fst.op().add(&current_prefix_fst).intersection(),
);
let common_prefix_fst_words: Vec<_> = common_prefix_fst_words
    .as_slice()
    .linear_group_by_key(|x| x.chars().nth(0).unwrap())
    .collect();

// We retrieve the newly added words between the previous and new prefix word fst.
let new_prefix_fst_words = fst_stream_into_vec(
    current_prefix_fst.op().add(&previous_words_prefixes_fst).difference(),
);

// We compute the set of prefixes that are no more part of the prefix fst.
let del_prefix_fst_words = fst_stream_into_hashset(
    previous_words_prefixes_fst.op().add(&current_prefix_fst).difference(),
);
```
We must stop creating enormous `Vec`s when we process the first update. In this code snippet, on an initial update, we compute the difference between the previous prefix fst (empty) and the current prefix fst (big) and store the result in a `Vec` and a `HashSet`, which takes time and space. See the sketch below.
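A hedged sketch of that fast path, with hypothetical helper names: it mirrors what `fst_stream_into_vec` and `fst_stream_into_hashset` produce, but short-circuits when the previous fst is empty so the pointless set algebra is skipped on an initial update.

```rust
use std::collections::HashSet;

use fst::{Set, Streamer};

/// Compute the (new, deleted) prefix collections, skipping the fst set
/// algebra entirely on the initial update. Name and signature are
/// illustrative, not milli's actual API.
fn diff_prefix_fsts(
    previous: &Set<Vec<u8>>,
    current: &Set<Vec<u8>>,
) -> (Vec<String>, HashSet<Vec<u8>>) {
    // Initial update: the previous fst is empty, so everything in `current`
    // is new and nothing was deleted; no differences need to be materialized.
    if previous.is_empty() {
        let mut new_prefixes = Vec::new();
        let mut stream = current.stream();
        while let Some(bytes) = stream.next() {
            new_prefixes.push(String::from_utf8_lossy(bytes).into_owned());
        }
        return (new_prefixes, HashSet::new());
    }

    // General case: prefixes present in `current` but not in `previous`...
    let mut new_prefixes = Vec::new();
    let mut stream = current.op().add(previous).difference();
    while let Some(bytes) = stream.next() {
        new_prefixes.push(String::from_utf8_lossy(bytes).into_owned());
    }

    // ...and prefixes of `previous` that disappeared from `current`.
    let mut deleted = HashSet::new();
    let mut stream = previous.op().add(current).difference();
    while let Some(bytes) = stream.next() {
        deleted.insert(bytes.to_vec());
    }

    (new_prefixes, deleted)
}
```

A fuller fix might stream the new prefixes instead of collecting them at all, but even this guard avoids the intersection and deletion passes on a first run.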
456: Remove useless grenad merging r=Kerollmops a=Kerollmops

This PR must be merged after #454. It removes the part of the code that was merging all of the grenad Readers, a merge we don't need: the indexer should already have merged them, so we should only have one final grenad Reader. This reduces the CPU usage and memory pressure we were uselessly incurring.

`@ManyTheFish` are you sure I can skip merging the `word_docids` database?

Here is the benchmark comparison with the previously merged PR #454:

```
group                                              indexing_reintroduce-appending-sorted-values_c05e42a8    indexing_remove-useless-grenad-merging_d5b8b5a2
-----                                              -----------------------------------------------------    -----------------------------------------------
indexing/Indexing movies with default settings     1.06    16.6±1.04s    ? ?/sec                            1.00    15.7±0.93s    ? ?/sec
indexing/Indexing songs with default settings      1.16    60.1±7.07s    ? ?/sec                            1.00    51.7±5.98s    ? ?/sec
indexing/Indexing songs without faceted numbers    1.06    55.4±6.14s    ? ?/sec                            1.00    52.2±4.13s    ? ?/sec
```

And here is the comparison with multi-batch indexing before #436: we gain time on the benchmarks that index datasets in multiple batches, but there is _so much_ variance that it's not clear.

```
group                                                              indexing_benchmark-multi-batch-indexing-before-speed-up_45f52620    indexing_remove-useless-grenad-merging_d5b8b5a2
-----                                                              ----------------------------------------------------------------    -----------------------------------------------
indexing/Indexing geo_point                                        1.07    6.6±0.08s        ? ?/sec                                    1.00    6.2±0.11s        ? ?/sec
indexing/Indexing songs in three batches with default settings    1.12    57.7±2.14s       ? ?/sec                                    1.00    51.5±3.80s       ? ?/sec
indexing/Indexing songs with default settings                      1.00    47.5±2.52s       ? ?/sec                                    1.09    51.7±5.98s       ? ?/sec
indexing/Indexing songs without any facets                         1.00    43.5±1.43s       ? ?/sec                                    1.12    48.8±3.73s       ? ?/sec
indexing/Indexing songs without faceted numbers                    1.00    47.1±2.23s       ? ?/sec                                    1.11    52.2±4.13s       ? ?/sec
indexing/Indexing wiki                                             1.00    917.3±30.01s     ? ?/sec                                    1.09    998.7±38.92s     ? ?/sec
indexing/Indexing wiki in three batches                            1.09    1091.2±32.73s    ? ?/sec                                    1.00    996.5±15.70s     ? ?/sec
```

What do you think `@irevoire`? Should we change the benchmarks to make them do more runs?

Co-authored-by: Kerollmops <clement@meilisearch.com>
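To make the "useless merging" removal concrete, here is a hypothetical helper illustrating the invariant this PR relies on: the indexer hands over chunks it has already merged, so the update side should receive exactly one final reader instead of re-merging. The function name is illustrative only.

```rust
/// Illustrative helper: assert the indexer already merged its chunks and
/// return the single final grenad reader instead of merging again.
fn expect_single_reader<R>(mut readers: Vec<grenad::Reader<R>>) -> grenad::Reader<R> {
    // Re-merging here would redo work the indexer already did, costing CPU
    // and memory for nothing.
    debug_assert_eq!(readers.len(), 1, "indexer should have merged its chunks");
    readers.pop().expect("at least one grenad reader")
}
```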
457: Avoid iterating on big databases when useless r=Kerollmops a=Kerollmops

This PR makes the prefix database updates avoid iterating on big grenad files when it is unnecessary. We introduced this regression in #436 but it went unnoticed.

---

According to the following benchmark results, we take more time when we index documents in one run than before #436. It is probably due to the fact that, now, instead of computing the prefix databases by iterating on LMDB, we directly iterate on the grenad file. Those could be slower to iterate on and could be the cause of the slowdown. I just pushed a commit that tests this branch with the new, unreleased version of grenad, where some work was done to speed up the iteration on grenad files. [The benchmarks for this last commit](https://github.com/meilisearch/milli/actions/runs/1927187408) are currently running. You can [see the diff](meilisearch/grenad@v0.4.1...main) between the v0.4 and the unreleased v0.5 version of grenad.

```diff
group                                                                indexing_benchmark-multi-batch-indexing-before-speed-up_45f52620    indexing_stop-iterating-on-big-grenad-files_ac8b85c4
-----                                                                ----------------------------------------------------------------    ----------------------------------------------------
+ indexing/Indexing songs in three batches with default settings    1.12    57.7±2.14s       ? ?/sec                                    1.00    51.3±2.76s       ? ?/sec
- indexing/Indexing wiki                                             1.00    917.3±30.01s     ? ?/sec                                    1.10    1008.4±38.27s    ? ?/sec
+ indexing/Indexing wiki in three batches                            1.10    1091.2±32.73s    ? ?/sec                                    1.00    995.5±24.33s     ? ?/sec
```

Co-authored-by: Kerollmops <clement@meilisearch.com>
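As an illustration of the idea behind this PR, a tiny guard is enough: if no prefix is common, newly added, or deleted, iterating the grenad file cannot change the prefix databases, so the whole pass can be skipped. Names and types here are illustrative, not milli's actual API.

```rust
use std::collections::HashSet;

/// Returns true only when at least one prefix is common, newly added, or
/// deleted, i.e. when iterating the big grenad chunk can actually produce
/// database changes.
fn prefix_update_needed(
    common_prefix_fst_words: &[Vec<String>],
    new_prefix_fst_words: &[String],
    del_prefix_fst_words: &HashSet<Vec<u8>>,
) -> bool {
    !common_prefix_fst_words.is_empty()
        || !new_prefix_fst_words.is_empty()
        || !del_prefix_fst_words.is_empty()
}
```

A caller would check this before opening the grenad reader and return early when it is false.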
This PR depends on the fixes done in #431 and must be merged after it.

In this PR we bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids`, and `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs far fewer CPU cycles, an era where these update structures can use the previously computed set of new word docids from the newly indexed batch of documents.

`WordPrefixPairProximityDocids` is an update structure, which means it is an object that we feed with some parameters and which modifies the LMDB database of an index when asked to. This structure specifically computes the list of word prefix pair proximities, which corresponds to a list of pairs of words associated with a proximity (the distance between both words), where the second element of the pair is not a full word but a prefix, e.g. `s`, `se`, `a`. Each word prefix pair proximity is associated with the list of document ids that contain the pair of word and prefix at the given proximity.

The origin of the performance issue this struct brings is that it starts its job from scratch: it clears the LMDB database before rewriting everything, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things. A simplified model of the incremental alternative follows.
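To make the incremental idea concrete, here is a deliberately simplified, self-contained model of the computation, using a plain `HashMap` and `Vec<u32>` where the real code streams grenad readers and roaring bitmaps; names and types are illustrative only.

```rust
use std::collections::HashMap;

/// Simplified model: fold the (word1, word2, proximity) -> docids pairs
/// extracted from the *newly indexed batch only* into
/// (word1, prefix, proximity) -> docids entries, for every known prefix of
/// word2. The real update structure merges the result into LMDB instead of
/// returning a map, and never clears the database first.
fn word_prefix_pair_proximities(
    new_pairs: &[((String, String, u8), Vec<u32>)],
    prefixes: &[String], // prefixes from the words prefixes fst, e.g. "s", "se"
) -> HashMap<(String, String, u8), Vec<u32>> {
    let mut output: HashMap<(String, String, u8), Vec<u32>> = HashMap::new();
    for ((word1, word2, proximity), docids) in new_pairs {
        for prefix in prefixes {
            if word2.starts_with(prefix.as_str()) {
                output
                    .entry((word1.clone(), prefix.clone(), *proximity))
                    .or_default()
                    .extend(docids.iter().copied());
            }
        }
    }
    output
}
```

Because the input is restricted to the pairs coming from the new batch, the cost scales with the size of the update rather than with the size of the whole index, which is the point of the change.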