-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text Index lag time for realtime consuming segments #6431
Comments
Partial data set for not-working search is
|
@kishoreg @siddharthteotia Could you guys help me to point to correct direction to fix this issue? |
@siddharthteotia I just tried now and can see the query working, So that means there is a huge delay for text Index to index realtime data? or the results to appear in search. Are there any settings that controlling the text index timings? |
@siddharthteotia The behavior I am seeing is that the text index not pulling results from a segment that is in CONSUMING / IN PROGRESS status..! I am posting the consuming segment metadata here { |
Further Analysis showed there is roughly 3 minutes delay in Text Index updates flowing in . Following are the timelines measured for TextIndex . So it looks TextIndex is doing a three minute bulk index update rather than incremental. Is there any way to reduce this ~3 mts to instant like happening for inverted Index.
|
The issue is we are seeing a delay in getting realtime segment ingested. in this case it is about 3 minutes. |
@siddharthteotia @kishoreg reducing the rowcount to 1000 from 100000 is giving me a better performance of 20 seconds delay. The timings are below. However this causes the segment count to grow x100 . Could you guys please provide me the direction to fix the issue. I have to assume this is a bug and might need a code change? 2021-01-14 22:51:06.986000 |
We need to optimize the text index snapshot refresh code. It uses a single background thread to refresh index across all consuming segments. We need to make the refresh multi threaded. |
Let me think a little bit on the details of code changes. We can sync up then. |
Also the issue is not due to the delay in getting realtime segment ingested. This is a lag/consistency issue in the text index. We need to make the refresh faster to reduce the lag. Data is getting consumed correctly and timely into the segment |
I had suggested to reduce the number of column s with text index. Not to reduce the number of columns in the table. This can potentially make the refresh code faster. How did you test this? |
@siddharthteotia ,sorry for the misleading comment(corrected my comment above). I did not modified the table . I tested by reducing number of columns with text index. I deleted the text index definitions for the all columns except one i.e. log. The Steps I followed: However it is not making any difference. The lag still remains the same. But, when I reduced the number of rows per segment it does reduced the lag time. However the query performance in searches reduced slightly. I think it is because of lots of small segments. So it will be better to fix the lag time code. |
@siddharthteotia If I understand correctly the issue is because of the way how text index scans the CONSUMING segments. it keeps refreshing the entire CONSUMING segments. If the segment size is large then the refreshing takes longer which causes the lag. Please correct me if I am wrong. |
@siddharthteotia PhysicalColumnIndexContainer._invertedIndex is pulled from PinotDataBuffer (segmentReader.getIndexFor()) which I assume is memory or Consuming segments and it is same for _fstIndex, _rangeIndex, _h3Index, _jsonIndex etc.. But for _textIndex it is loading from segmentIndexDir which is Physical file or I guess Completed Segments. So we might have to create a new constructor for LuceneTextIndexReader Do you think this make any sense? |
I have a proposal to fix this. This is essentially an enhancement. Let's meet and discuss. Happy to help you make the code changes else I will find time to make the changes. In any case, let's meet. |
Sync'd up with @pabrahamusa . There was a bug in realtime text index. Had provided a custom build to Matt with fix and more logging. The lag issue was fixed with the build. Matt is doing some more testing. |
@siddharthteotia can you briefly describe the bug? I thought we had handled text index from within mutable segment? |
@siddharthteotia Afraid to say that further testing revealed the issue is not fully fixed. The lag time is ~20 - 40 seconds when the segment row count is 100000. However, when I bumped it to 900000 the lag time steeped up to ~7 - 9 minutes..! . Please note that inverted index is always giving stable results ~10 secs no matter the row count of segments. It will be great if we fix the bug in a way that text index also behaves the same as inverted index. |
We can go through the logs tomorrow based on the new build that I gave with additional logging and debug further. I thought you wanted to reduce the lag from 3 mins to 30secs or so which the new build has done. Also, your comparison is not fair. If your functionality can be achieved with inverted index, then text index should not be used at all. Inverted index should always be the first choice.. But on the log column in your schema, you can't really use inverted index for the kind of search you want to perform. Secondly, arbitrarily increasing the number of rows in consuming segment isn't a good idea either. If 100k is giving the optimal performance, my suggestion would be to stick with it. |
@siddharthteotia, we can go through this and I am sure we can get this fixed. Also the fix you provided last week is working as expected for ~100K rows and thanks a lot for that. From my preliminary testing query search performance appears same for both text index and inverted index. it is just the NRT lag that have the diff. The queries I tried with inverted index is with term query or exact match i.e
Unfortunately I cannot use this inverted index search as my requirement is to make a proper text search on a column. As I am migrating from Elastic Search I can use the Lucene index regex patterns supported by pinot text index which makes things easier for me. |
I am having Text Index and and inverted Index . And for some data Text Index not able to pull out results. for the same search Inverted Index works all the time.
Eg: - The Text Index Search for following is not working
select * from mylog where text_match(log, '1DB7BAD2-406A-D773-B1DA-291664314C4F')
but it works for
select * from mylog where text_match(log, '0D82F520-62C8-9914-14B8-4C2331E54075')
it looks like somehow some data are skipped from text_index. However both search will works fine with regexp_like inverted index.
The text was updated successfully, but these errors were encountered: