Replies: 9 comments 48 replies
-
I love that we're starting to have a conversation about this. Adding some color from our experience with indexing. We have a bitmap index to address some of the performance issues with using Below are the core pieces that we have. As noted above, some of it may not be relevant in the future, as the current design is made to work with The general idea is we have an We have a separate process that is responsible for keeping the index up to date with the latest information. I don't currently have a strong opinion on how this (keeping the index up to date) should be handled by DataFusion.
Similar to other areas of DataFusion, I think it would be very cool if a simple index implementation could be included out of the box, acting as a reference implementation, while still providing the relevant traits to allow users to customize as needed. Finally, I really like the idea of having index-specific nodes so that we can see the index's impact when analyzing logical / physical plans.
-
I was thinking about this some more, and I wonder if this could be made generic enough that things like Delta Lake or Apache Iceberg could be registered as index providers, so that custom table providers aren't needed to use them.
-
Something else I have been thinking about is whether it would make sense to pull
-
Matthew, you do have a lot of flexibility and control with this capability.
Cory
…--
Cory Isaacson
http://www.coryisaacson.com
On Apr 8, 2024 at 8:46 PM -0400, Matthew Turner ***@***.***>, wrote:
That's an interesting idea. Something like a datafusion-table-providers crate and different implementations could live there that are feature gated?
-
I just came across They basically have I wonder if this could work for us as an index for Parquet files with something like I haven't had the chance to dig into
-
Hi 👋 My current software stack is really close to Apple's Record-Layer, meaning that I'm using FoundationDB to store "LogicalDBs". A Each index entry points to a "primary-key" item, allowing me to target the right data subspace. For example, if I have a Rust structure with an indexed value field, my key-values will roughly look like this:
Then, in my current implementation, I scan the "my_index" subspace with the right range and, for each primary key found, issue a new scan in the required pk subspace. This is actually using the It is also quite tedious code to maintain, and we would like to leverage DataFusion for this. This would allow us to use DataFusion expressions to query and manipulate multiple indexes, which is something I will need in order to write complex queries. I feel like my needs would be fulfilled by having something described as
I deeply agree with @matthewmturner, and I would love to contribute to this feature 😄
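The two-step lookup described above (range-scan the index subspace, then fetch each primary key from the primary subspace) can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory `BTreeMap` stands in for FoundationDB, and `Store`, `records`, `my_index`, and `query_by_value_range` are hypothetical names, not the poster's actual key layout or the Record-Layer API.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an ordered key-value store (assumption: the real
/// system uses FoundationDB subspaces; two ordered maps play that role here).
struct Store {
    /// Primary subspace: primary key -> record payload.
    records: BTreeMap<u64, String>,
    /// Index subspace: (indexed value, primary key) -> empty value.
    my_index: BTreeMap<(i64, u64), ()>,
}

impl Store {
    fn new() -> Self {
        Store {
            records: BTreeMap::new(),
            my_index: BTreeMap::new(),
        }
    }

    /// Write the record and its index entry together.
    /// A real store would also remove the stale index entry on update.
    fn insert(&mut self, pk: u64, indexed_value: i64, payload: &str) {
        self.records.insert(pk, payload.to_string());
        self.my_index.insert((indexed_value, pk), ());
    }

    /// Step 1: range-scan the index for `lo <= value < hi` to find primary keys.
    /// Step 2: look each primary key up in the primary subspace.
    fn query_by_value_range(&self, lo: i64, hi: i64) -> Vec<(u64, &str)> {
        if lo >= hi {
            return Vec::new();
        }
        self.my_index
            .range((lo, u64::MIN)..(hi, u64::MIN))
            .filter_map(|(&(_, pk), _)| {
                self.records.get(&pk).map(|payload| (pk, payload.as_str()))
            })
            .collect()
    }
}
```

Results come back in index order (by indexed value, then primary key), which is the kind of property a query engine could exploit if the index were exposed to the planner.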
-
This feature sounds interesting! I recently made a post about a potential integration of µWheel into DataFusion as an index to significantly speed up temporal aggregation queries. Example query:
SELECT SUM(fare_amount) FROM yellow_tripdata
WHERE tpep_dropoff_datetime >= '?' AND tpep_dropoff_datetime < '?'
I'm interested in helping out to make it supported natively by DataFusion!
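To illustrate why a pre-aggregating index can help with this kind of query, here is a minimal sketch that keeps one running SUM per fixed-size time bucket and answers bucket-aligned range queries without touching the raw rows. It is not µWheel's data structure or API; `SumIndex`, `bucket_secs`, and the fall-back-on-`None` convention are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical pre-aggregation index: one SUM per fixed-size time bucket.
struct SumIndex {
    bucket_secs: i64,
    /// Bucket start timestamp (seconds) -> sum of the values in that bucket.
    buckets: BTreeMap<i64, f64>,
}

impl SumIndex {
    fn new(bucket_secs: i64) -> Self {
        assert!(bucket_secs > 0, "bucket size must be positive");
        SumIndex {
            bucket_secs,
            buckets: BTreeMap::new(),
        }
    }

    /// Fold one (timestamp, value) pair into its bucket.
    fn insert(&mut self, ts: i64, value: f64) {
        let start = ts.div_euclid(self.bucket_secs) * self.bucket_secs;
        *self.buckets.entry(start).or_insert(0.0) += value;
    }

    /// SUM over `[start, end)`. Returns `None` when the range is not aligned
    /// to bucket boundaries, so the caller can fall back to a regular scan.
    fn sum(&self, start: i64, end: i64) -> Option<f64> {
        let aligned = start.rem_euclid(self.bucket_secs) == 0
            && end.rem_euclid(self.bucket_secs) == 0;
        if !aligned || start > end {
            return None;
        }
        Some(self.buckets.range(start..end).map(|(_, s)| *s).sum())
    }
}
```

The cost of an aligned query depends on the number of buckets in the range rather than the number of rows, which is where the speedup for temporal aggregations would come from.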
-
Not adding much to the discussion here, but might be interesting nonetheless: I'm looking into I'm watching the
-
Comment based on my experience building index execution in some mature query engines: So would it make sense to extend pushdown predicates (rowfilter) as an extensible filter node? IMO it has several advantages:
I am still learning DataFusion, but I am committed to contributing.
-
@westonpace and I had a good conversation about indexes that I wanted to document for anyone else who might be thinking of something similar.
The high-level observation we have is that both LanceDB and InfluxDB have special TableProviders with special knowledge of how data is stored, and each effectively implements some form of custom "indexing".
Indexing in this case means using the predicate from a query to rule out files / ranges of files before execution.
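As a rough illustration of "using the predicate to rule out files", the sketch below prunes a list of files by comparing a range predicate against per-file min/max statistics. The types `FileStats` and `RangePredicate` and the function `prune_files` are hypothetical and are not DataFusion, LanceDB, or InfluxDB APIs.

```rust
/// Min/max statistics for one column of one file (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    path: String,
    min: i64,
    max: i64,
}

/// A simple range predicate: `lo <= column AND column < hi`.
struct RangePredicate {
    lo: i64,
    hi: i64,
}

/// Keep only the files whose [min, max] range can overlap the predicate.
/// A kept file may still contain no matching rows: pruning only rules
/// files out, it never proves that a match exists.
fn prune_files<'a>(files: &'a [FileStats], pred: &RangePredicate) -> Vec<&'a FileStats> {
    files
        .iter()
        .filter(|f| f.max >= pred.lo && f.min < pred.hi)
        .collect()
}
```

The same overlap test applies at finer granularity (row groups, data pages); only the source of the statistics changes.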
DataFusion itself also has a form of this type of index in the ListingTable (code link), where it prunes files based on statistics, and then inside the ParquetExec itself (link), where it again prunes row groups and data pages based on metadata.
Some things we are thinking about in InfluxDB are passing row selections into ParquetExec from an outside source (aka a special index) with an API like #9929
Here are some diagrams from Weston showing LanceDB's indexing
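For the row-selection idea mentioned above, one plausible shape for what an external index hands to the reader is a list of alternating "skip N rows / select N rows" runs. The sketch below builds such runs from sorted matching row indices. `RowRun` and `runs_from_matches` are hypothetical names for illustration; this is not the API proposed in #9929, and it does not use the `parquet` crate's own selection types.

```rust
/// One run of consecutive rows that is either read or skipped (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum RowRun {
    Select(usize),
    Skip(usize),
}

/// Convert sorted matching row indices into alternating runs that together
/// cover exactly `total_rows` rows. Duplicate indices are ignored, and
/// indices at or beyond `total_rows` end the scan.
fn runs_from_matches(matches: &[usize], total_rows: usize) -> Vec<RowRun> {
    let mut runs = Vec::new();
    let mut next_row = 0; // first row not yet covered by any run
    for &row in matches {
        if row >= total_rows {
            break;
        }
        if row < next_row {
            continue; // duplicate of a row that is already covered
        }
        if row > next_row {
            runs.push(RowRun::Skip(row - next_row));
        }
        // Extend the current Select run, or start a new one after a Skip.
        if let Some(RowRun::Select(n)) = runs.last_mut() {
            *n += 1;
        } else {
            runs.push(RowRun::Select(1));
        }
        next_row = row + 1;
    }
    if next_row < total_rows {
        runs.push(RowRun::Skip(total_rows - next_row));
    }
    runs
}
```

A run-length form like this stays compact when matches are clustered, which is the common case for sorted or partitioned data.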
Here are some options we discussed
Option 1: Each system that needs this type of indexing continues to use a custom TableProvider
Pros:
Cons:
__order and LanceDB uses Row Masks)
A specific example of duplication is that InfluxDB does not use ListingProvider when reading Parquet files, but has something very similar internally (augmented with our own catalog information, etc.)
Option 2: Add APIs to pass additional knowledge about indexes / indexed reads into DataFusion somehow
Pros:
Cons:
We discussed some options:
CREATE INDEX ....) 🤔
Thoughts?