very slow queries when "expand" is needed from subqueries #7220
Comments
Hi @rdelangh A couple of points here:
Thanks Luigi |
@luigidellaquila |
Hi @rdelangh No, I mean just running the query without the index. Something like
Just wondering if a full scan of the cluster is more efficient than an inefficient index access plus an inefficient random record access... Thanks Luigi |
@luigidellaquila |
Sorry, right syntax
|
Well, no, with these numbers for sure it will take a lot.
And what about this?
Thanks Luigi |
@luigidellaquila
I learned that I have to specify the index properties (ENV, PRT_DATE) as last criteria in the WHERE, otherwise the index will not be used. Correct? |
Result: So not good. |
Yep, I see... But you did not answer my question: how many records have My guess is that both index calls are returning a lot of records, so the executor is doing a lot of operations. Thanks Luigi |
There are about 19M records for the date-range (= 1 day) |
@luigidellaquila |
I'm afraid you will hardly optimize it under 60 seconds. The Thanks Luigi |
@luigidellaquila Disappointing. |
@rdelangh I think if you get rid of that direct index lookup (i.e. select from index:...) you will be able to write a query with plain filters and without the EXPAND. This should speed up the execution to a few seconds. Thanks Luigi |
hi @luigidellaquila
Result: So not good. |
How about parallelism? I noticed that this is by default automatic, but
|
On the Lucene side, can you avoid the leading wildcard? It is really the wrong way to use an inverted index such as Lucene. Maybe try writing a more expressive Lucene query. |
hi @robfrank |
And can you please explain what you mean with "writing a more expressive lucene query" ? |
Maybe another strategy then: can we configure how the Lucene indexes are built?
How can we do that? Can you please share any examples of how the CREATE INDEX command would need to be adapted to do such Lucene optimisation? |
If you can recap to me with some samples of data and what kind of queries / results you need, I can try to help |
The quickest way to reduce the number of relevant records is to filter first on that CALLED-number. Only a Lucene index supports leading and trailing wildcard searches, and does so fairly quickly (a matter of seconds, or 1-2 minutes). Scanning the Lucene index for this yields only the RIDs of the filtered records. But we then also need the filtering on PRT_DATE, so we need to "expand" the RIDs from the previous step. It is apparently this "expand" step that is very inefficient and makes the queries run for hours! For example (only a single day is searched, yet it takes many minutes to complete, see above):
The inner query from the Lucene index returns in this case some 29K RIDs. The total number of records for a single day is approx. 19M. Therefore, swapping the order of the SELECTs would not give better results: it would first select all RIDs for one full day, then "expand" these approx. 20M RIDs, then filter these records on their CALLED-number value with LIKE. I have tried this; it did not terminate even after 3 hours... |
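To make the ordering argument above concrete, here is a minimal Python sketch on synthetic toy records (not real CDR data, and not OrientDB code). The helper `count_matches` and the sample values are hypothetical; it only counts how many records survive the first filter and therefore have to be loaded ("expanded") before the second filter can run.

```python
from datetime import date

# Synthetic toy records: (called_number, prt_date). Purely illustrative.
records = [
    ("3208001234", date(2016, 12, 22)),
    ("3247112345", date(2016, 12, 22)),
    ("3208001999", date(2016, 12, 23)),
    ("3299988877", date(2016, 12, 22)),
    ("3208001555", date(2016, 12, 22)),
]


def number_matches(record):
    # Stand-in for the wildcard search on the CALLED-number.
    return "08001" in record[0]


def day_matches(record):
    # Stand-in for the one-day PRT_DATE filter.
    return record[1] == date(2016, 12, 22)


def count_matches(first_filter, second_filter):
    """Apply two filters in order.

    Returns (final match count, number of records that had to be
    loaded for the second filter).
    """
    survivors = [r for r in records if first_filter(r)]
    matches = sum(1 for r in survivors if second_filter(r))
    return matches, len(survivors)


number_first = count_matches(number_matches, day_matches)
day_first = count_matches(day_matches, number_matches)
print("number first:", number_first)
print("day first:   ", day_first)
```

Both orders give the same final count; only the number of intermediate records differs. With the figures quoted in the thread (about 29K RIDs from the Lucene step versus about 19M records per day) that gap is far larger than in this toy data.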
I don't know if it could work, but you're using OrientDB mostly as a KV datastore. My suggestion is to try to build a multi-field index on the cdr and date/datetime fields and use range queries: http://orientdb.com/docs/last/Full-Text-Index.html#numeric-and-date-range-queries-from-2214 so your query would be something like SELECT count( * ) FROM (
select expand(rid) from index:idx_cdr_af_20170212_3 WHERE key LUCENE ' cdr_filed_name:*08001* prt_date:[201612221000 TO 201612221100] ')
For sure, you will need to do some experiments. |
hi @robfrank , I was able to create such multi-field Lucene index, but the resulting counts from that kind of query seem impossible: With property SERVEDIMSI being a STRING, and PRT_DATE being a DATETIME:
Then I search for a combination of SERVEDIMSI (exact string) and PRT_DATE (range):
whereas there exists only one single (=1) record in the class for these criteria! Also other syntaxes do not seem to be correct:
returns records for many different values of SERVEDIMSI and many different PRT_DATE ranges outside of my specified range. Also:
-> what is the correct syntax for such Lucene indexes used by OrientDB-SQL, and where can I find documentation about this syntax? |
can anyone please provide me the correct syntax to perform a query against a Lucene index that has been defined on multiple properties, such as the examples above in my previous post? |
hi @robfrank |
can anyone please help me out with the correct multi-field Lucene queries syntax ? |
Lucene's default operator is 'OR', so the query you wrote is asking for SERVEDIMSI:... OR PRT_DATE:.
The + means MUST. As a reference: https://lucene.apache.org/core/5_3_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description |
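As an illustration of the point about `+`, here is a small Python sketch that builds a Lucene classic-syntax query string in which every clause is required. The helper name `must_all` is hypothetical; the field names and values are the ones that appear elsewhere in this thread. It only assembles the string and does not talk to OrientDB or Lucene.

```python
def must_all(clauses):
    """Join (field, expression) pairs, prefixing each with '+'.

    In Lucene's classic query syntax a leading '+' marks a clause as
    required (MUST); without it, clauses are combined with the
    default OR operator.
    """
    return " ".join(f"+{field}:{expr}" for field, expr in clauses)


query = must_all([
    ("SERVEDIMSI", "206012221582810"),
    ("PRT_DATE", "[201703120000 TO 201703120001]"),
])
print(query)
```

The resulting string is what would go between the quotes after the LUCENE keyword in the SQL statement.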
The non-Lucene query shows that the record definitely exists:
However the Lucene multi-field query does not work:
|
I'll check, because I usually don't write queries using the index: notation, but a select from "class" where [field1,field2] lucene "query" . |
hi @robfrank , |
I added some working tests on our suite, you can look at them in the linked commits. |
hi @robfrank , can you please explain what the "linked commits" are, and where I find your tests? |
@rdelangh - the links (to the commits) are right above Rob's comment. Scott |
Link to the commit: 67648cf. The complete test class: https://github.com/orientechnologies/orientdb/blob/2.2.x/lucene/src/test/java/com/orientechnologies/lucene/test/LuceneRangeTest.java Is it enough? |
hi @robfrank ,
A direct query of one particular record:
The query via this index:
The query via the class:
Sorry, but I do not find the error in my syntax, if any. |
Could you try with dateTime values at minute precision? E.g.: orientdb {db=mobile}> select * from cdr_eno_20170312 WHERE [SERVEDIMSI,PRT_DATE] LUCENE '+SERVEDIMSI:206012221582810 +PRT_DATE:[201703120000 TO 201703120001]' Just as a test, while I investigate whether it is possible to query at least at second-level intervals |
@robfrank
|
Can you please send me a sample data set and what/how do you want to search over them?
|
hi @robfrank ,
|
hi,
|
Ok, thank you for the clarification. Having a little sample of real data, in form of SQL insert, little database or java code, will help a lot. Feel free to send it privately. |
Mind you, the study of how we might be able to obtain our query results is still independent of the fact that the documented syntax for running a multi-field Lucene query is not working in OrientDB... |
hello @robfrank , |
Hi @rdelangh, I'm on it, stay tuned. Thanks for the sample data |
About range queries: I understand the problem, which is "obviously" related to time zones. I'm working on a fix. As a very temporary workaround, just for testing, you should convert your dates to UTC by adding or subtracting the number of hours needed to "reach" UTC:
|
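The UTC workaround described above can be sketched in Python as follows. This is only an assumption-laden illustration: the helper names `to_utc_bound` and `utc_range` are hypothetical, the `yyyyMMddHHmm` format is inferred from the range examples in this thread, and the UTC+1 offset in the example is a placeholder for whatever offset the database actually uses.

```python
from datetime import datetime, timedelta, timezone

STAMP_FORMAT = "%Y%m%d%H%M"  # yyyyMMddHHmm, as in the range examples


def to_utc_bound(local_text, utc_offset_hours):
    """Convert a local 'yyyyMMddHHmm' stamp to the same instant in UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.strptime(local_text, STAMP_FORMAT).replace(tzinfo=local_tz)
    return local_dt.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def utc_range(start_local, end_local, utc_offset_hours):
    """Build a Lucene range expression with both bounds shifted to UTC."""
    start = to_utc_bound(start_local, utc_offset_hours)
    end = to_utc_bound(end_local, utc_offset_hours)
    return f"[{start} TO {end}]"


# Assumed offset of UTC+1, chosen only for the example.
example = utc_range("201703120000", "201703120001", 1)
print(example)
```

A fixed offset is used here to keep the sketch dependency-free; it does not account for daylight saving changes, so a real conversion would need the proper time zone rules.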
I created a dedicated issue related to timezone mismatch: #7382 |
OrientDB Version: <= 2.2.18
Java Version: N/A
OS: N/A
Expected behavior
This query completed in 9 seconds, entirely due to caching from the previous query (which lasted for about 1 minute)
Query executed in 261.234 sec
-> is there no better (= faster) way to run these counts? It is a factor of 3 slower than exactly the same query on a traditional relational database with partitioning of a single big table.
So why would we use a big-database concept...