Indexing Support in DataFusion? #9963

alamb · 2024-04-05T13:44:43Z

alamb
Apr 5, 2024
Collaborator

@westonpace and I had a good conversation about indexes that I wanted to document for anyone else who might be thinking of something similar.

The high level observation we have is that both LanceDB and InfluxDB have special TableProviders with special knowledge of how data is stored, and each effectively implements some form of custom "indexing".

Indexing in this case means using the predicate from a query to rule out files / ranges of files before execution.

DataFusion itself also has a form of this type of index in the ListingTable (code link) where it prunes files based on statistics, and then inside the ParquetExec itself (link) where it again prunes row groups and data pages based on metadata.
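To make the statistics-based pruning idea concrete, here is a minimal sketch using only the Rust standard library. It is not the ListingTable or ParquetExec code; `FileStats` and `prune_files` are hypothetical names, and it assumes a single `i64` column with known min/max values per file.

```rust
/// Hypothetical per-file min/max statistics for a single i64 column.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub path: String,
    pub min: i64,
    pub max: i64,
}

/// Keep only the files whose [min, max] range could contain a row
/// satisfying `lo <= value <= hi`. Every other file is ruled out
/// before any data is read.
pub fn prune_files(files: &[FileStats], lo: i64, hi: i64) -> Vec<&FileStats> {
    files
        .iter()
        .filter(|f| f.max >= lo && f.min <= hi)
        .collect()
}
```

The same overlap test applies one level down to row groups and data pages, with the statistics taken from the Parquet metadata instead of a catalog.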

Some things we are thinking about in InfluxDB are passing row selections into ParquetExec from an outside source (aka a special index) with an API like #9929
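As a rough illustration of what a "row selection" coming from an outside index could look like, the sketch below turns a sorted list of matching row ids into alternating skip/select runs. The `Run` enum and `to_runs` function are hypothetical and are not the API proposed in #9929; the sketch only shows the shape of the data.

```rust
/// Hypothetical run-length row selection: alternately skip or select rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Skip(usize),
    Select(usize),
}

/// Convert sorted row ids (as an external index might return them)
/// into skip/select runs covering rows `0..total_rows`.
pub fn to_runs(row_ids: &[usize], total_rows: usize) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    let mut next = 0; // first row not yet covered by a run
    for &id in row_ids {
        if id >= total_rows {
            break; // ignore ids past the end of the file
        }
        if id < next {
            continue; // ignore duplicate ids
        }
        if id > next {
            runs.push(Run::Skip(id - next));
        }
        // Extend the previous Select run when the rows are contiguous.
        let extended = match runs.last_mut() {
            Some(Run::Select(n)) if id == next => {
                *n += 1;
                true
            }
            _ => false,
        };
        if !extended {
            runs.push(Run::Select(1));
        }
        next = id + 1;
    }
    if next < total_rows {
        runs.push(Run::Skip(total_rows - next));
    }
    runs
}
```

For row ids `[2, 3, 7]` in a 10-row file this yields skip 2, select 2, skip 3, select 1, skip 2, which a reader could use to avoid decoding the skipped rows.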

Here are some diagrams from Weston showing LanceDB's indexing

Here are some options we discussed

Option 1: Each system that needs this type of indexing (continue)s to use a custom `TableProvider`

Pros:

Can be done today, and is what both InfluxDB and LanceDB do
General: makes no assumptions about how indexing is implemented

Cons:

Can't leverage all the machinery of DataFusion likely resulting in duplication (e.g. if you have multiple indexes / predicates)
No good way to serialize / pass any required indexing state between nodes without serializing them as arrow arrays (InfluxDB uses a __order and LanceDB uses Row Masks)

A specific example of duplication is that InfluxDB does not use ListingProvider when reading parquet files, but has something very similar internally (but augmented with our own catalog information, etc)

Option 2: Add APIs to pass additional knowledge about indexes / indexed reads into DataFusion somehow

Pros:

Reduce the implementation effort for using custom indexes by sharing code

Cons:

Unclear what those APIs could be

We discussed some options:

IndexSearch node
TakeExec (index lookup) -- really like an indexed scan somehow
Standard way to represent row ids (Row Masks, Row Indexes, etc)?
Representation of the indexes themselves (e.g. serialization / deserialization). Maybe DDL (CREATE INDEX ....) 🤔
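On the "standard way to represent row ids" point, the two representations mentioned carry the same information and can be converted into each other. A minimal sketch, assuming a dense boolean mask and a sorted list of positions (both function names are hypothetical):

```rust
/// Row mask -> row indexes: list the positions whose mask bit is set.
pub fn mask_to_indexes(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &selected)| if selected { Some(i) } else { None })
        .collect()
}

/// Row indexes -> row mask: one boolean per row, true when selected.
/// Indexes outside `0..num_rows` are ignored.
pub fn indexes_to_mask(indexes: &[usize], num_rows: usize) -> Vec<bool> {
    let mut mask = vec![false; num_rows];
    for &i in indexes {
        if i < num_rows {
            mask[i] = true;
        }
    }
    mask
}
```

A mask costs space proportional to the number of rows, while an index list costs space proportional to the number of matches, so which one is preferable likely depends on selectivity.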

Thoughts?

matthewmturner · 2024-04-06T04:46:14Z

matthewmturner
Apr 6, 2024

I love that we're starting to have a conversation about this. Adding some color from our experience with indexing.

We have a bitmap index to address some of the performance issues with using ListingTable at scale (lots of files) and we are starting to build additional functionality into it that could work with a custom table provider. We aren't 100% set on the API / needed structures yet, and it may change as we continue to work on it (right now we use a ListingTable with a custom FileFormat that provides the indexing capabilities), but I think we're pretty happy with where it's going.

Below are the core pieces that we have. As noted above, some of it may not be relevant in future as this current design is made to work with FileFormat / ListingTable

The general idea is we have an Index struct that is part of a table and holds information like the columns that are indexed (columns), how to access the index (provider, i.e. from disk or in memory), and the strategy for using the index (index_strategy). A single index could have different information such as bitmaps, file metadata / statistics, row group / page info, etc., and each could have its own execution strategy. With all of this, and the filter expr, the index is responsible for producing file_groups that can be used by FileScanConfig instead of the ListingTable having to get them with a list call.

We have a separate process that is responsible for keeping the index up to date with the latest information. I don't have a strong opinion currently on how this (keeping the index up to date) should be handled by datafusion.

#[derive(Debug)]
pub struct Index {
    columns: HashSet<String>,
    provider: Arc<dyn IndexProvider>,
    index_strategy: Arc<dyn IndexStrategy>,
}

impl Index {
    pub async fn evaluate(
        &self,
        state: &SessionState,
        conf: &FileScanConfig,
        predicate: &Arc<dyn PhysicalExpr>,
    ) -> Result<Vec<Vec<PartitionedFile>>> {
        // Body omitted in the original post.
        todo!()
    }
}

pub trait IndexStrategy: Debug + Send + Sync {
    fn execute(
        &self,
        mask: FileMask,
        files: Arc<Vec<Arc<String>>>,
        bitmaps: HashMap<BitmapKey, Arc<Bitmap>>,
        conf: &FileScanConfig,
    ) -> Vec<Vec<PartitionedFile>>;
}

#[async_trait]
pub trait IndexProvider: Debug + Send + Sync {
    async fn get_bitmaps(
        &self,
        context: &SessionContext,
        keys: Option<&[BitmapKey]>,
    ) -> Result<HashMap<BitmapKey, Arc<Bitmap>>>;
    async fn get_files(&self, context: &SessionContext) -> Result<Arc<Vec<Arc<String>>>>;

    async fn get_last_modified(&self) -> Result<DateTime<Utc>>;

    fn get_name(&self) -> Arc<String>;
}
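To illustrate the flow described above (bitmaps plus a filter producing file groups), here is a self-contained sketch that uses only the standard library. It is not the implementation behind the traits above: `BitmapIndex`, its methods, the `u64` bitmap limited to 64 files, and the equality-only predicates are all simplifying assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical bitmap index: for each (column, value) pair, bit `i` is set
/// when file `i` may contain rows with that value. Limited to 64 files here.
pub struct BitmapIndex {
    files: Vec<String>,
    bitmaps: HashMap<(String, String), u64>,
}

impl BitmapIndex {
    pub fn new(files: Vec<String>) -> Self {
        assert!(files.len() <= 64, "this sketch supports at most 64 files");
        Self { files, bitmaps: HashMap::new() }
    }

    /// Record that file `file_idx` contains `value` in `column`.
    pub fn add(&mut self, column: &str, value: &str, file_idx: usize) {
        assert!(file_idx < self.files.len(), "unknown file index");
        let key = (column.to_string(), value.to_string());
        *self.bitmaps.entry(key).or_insert(0) |= 1u64 << file_idx;
    }

    /// Evaluate a conjunction of `column = value` predicates by AND-ing the
    /// bitmaps, then spread the surviving files over at most
    /// `target_partitions` file groups. A value missing from the index
    /// prunes every file, which assumes the index is complete.
    pub fn evaluate(&self, predicates: &[(&str, &str)], target_partitions: usize) -> Vec<Vec<String>> {
        let mut mask = u64::MAX;
        for (column, value) in predicates {
            let key = (column.to_string(), value.to_string());
            mask &= self.bitmaps.get(&key).copied().unwrap_or(0);
        }
        let n = target_partitions.max(1);
        let mut groups: Vec<Vec<String>> = vec![Vec::new(); n];
        let mut kept = 0;
        for (i, file) in self.files.iter().enumerate() {
            if ((mask >> i) & 1) == 1 {
                groups[kept % n].push(file.clone());
                kept += 1;
            }
        }
        groups.retain(|g| !g.is_empty());
        groups
    }
}
```

In a real integration the returned groups would presumably be turned into `PartitionedFile` values for `FileScanConfig`, so no list call against object storage is needed at plan time.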

Similar to other areas of datafusion I think it would be very cool if a simple index implementation could be included out of the box, acting as a reference implementation, while still providing the relevant traits to allow users to customize as needed.

Finally, I really like the idea of having index specific nodes so that we can see the index's impact when analyzing logical / physical plans.

4 replies

alamb Apr 8, 2024
Collaborator Author

With all of this, and the filter expr, the index is responsible for producing file_groups that can be used by FileScanConfig instead of the ListingTable having to get them with a list call.

We have some version of this in InfluxDB 3.0 too

One slight variation we are considering is somehow encoding in file_groups not just which files but also what ranges within those files (e.g. #9929)

alamb Apr 8, 2024
Collaborator Author

I think @cisaacson is also using a custom index structure with DataFusion so maybe he has some insights to share

matthewmturner Apr 8, 2024

@alamb indeed, I read up on the row selection API and it looks promising. Are there any existing benchmarks that show the performance improvements that can be achieved using that?

alamb Apr 8, 2024
Collaborator Author

Not yet -- but I hope to make some (though likely not until next week)

matthewmturner · 2024-04-06T15:01:24Z

matthewmturner
Apr 6, 2024

I was thinking about this some more and I wonder if this could be made in a generic enough way that things like deltalake or apache iceberg could be registered as index providers so that custom table providers aren't needed to use those.

0 replies

alamb · 2024-04-08T19:48:13Z

alamb
Apr 8, 2024
Collaborator Author

Something else I have been thinking about is if it would make sense to pull ListingTable out of the core datafusion crate. If we did that then I think the boundaries of what could be built with a table provider (ListingTable being a good example) and what is built in might be more clear 🤔

9 replies

alamb Apr 13, 2024
Collaborator Author

Thank you @phillipleblanc --

https://github.com/spiceai/spiceai/tree/trunk/crates/data_components looks basically like adapters for various databases like duckdb, sqlite etc as a TableProvider

I think @devinjdangelo and @backkem are working on some similar ideas in datafusion-federation: https://github.com/datafusion-contrib/datafusion-federation/tree/main/sources

If you would like, I would be happy to make a repo in datafusion-contrib (datafusion-contrib/datafusion-connector) for this code? Or maybe we could make different repos for the different connectors

datafusion-contrib/datafusion-connector-duckdb
datafusion-contrib/datafusion-connector-sqlite
datafusion-contrib/datafusion-connector-flightsql
...

🤔

backkem Apr 15, 2024

Quick note on datafusion-federation: The project is focused specifically on federating query sub-plans. Rather than federating single tables, the goal is to federate the largest possible sub-plan that can be computed by a remote query engine. The sources are therefore slightly different TableProviders: instead of a single table scan, they send a larger query (SQL, Substrait or other) to a remote engine.
That being said, we're happy to support a datafusion-table-providers crate where possible.

phillipleblanc Apr 15, 2024

https://github.com/spiceai/spiceai/tree/trunk/crates/data_components looks basically like adapters for various databases like duckdb, sqlite etc as a TableProvider

Yes, exactly.

If you would like, I would be happy to make a repo in datafusion-contrib (datafusion-contrib/datafusion-connector) for this code? Or maybe we could make different repos for the different connectors

Perfect, I think having a single repo would make the most sense, since there is a lot of common code between different SQL providers. And as @matthewmturner mentioned, we could feature gate the different connectors.

Rather than federating single tables, the goal is to federate the largest possible sub-plan that can be computed by a remote query engine.

Yeah, we could make these complementary - the TableProviders/sources with the connection/query logic could live in the datafusion-connector crate. And if you want to federate them, then the datafusion-federation crate would adapt them to work with the modified federation query planner.

alamb Apr 17, 2024
Collaborator Author

Hi @phillipleblanc -- I made https://github.com/datafusion-contrib/datafusion-table-providers and added you as an admin in case you want to start your project there / move any of the table providers.

phillipleblanc Apr 17, 2024

Thank you, I'll start moving over PostgreSQL, MySQL and SQLite table providers soon. I'm at the OSS Summit in Seattle this week, so I might not get to it until next week.

cisaacson · 2024-04-09T00:48:26Z

cisaacson
Apr 9, 2024

Matthew, you do have a lot of flexibility and control with this capability. Cory

…

-- Cory Isaacson http://www.coryisaacson.com

On Apr 8, 2024 at 8:46 PM -0400, Matthew Turner wrote: That's an interesting idea. Something like a datafusion-table-providers crate and different implementations could live there that are feature gated?

0 replies

matthewmturner · 2024-04-29T01:33:02Z

matthewmturner
Apr 29, 2024

I just came across sum_tree from Zed's dev log which is basically a concurrent B+ tree with customizable summary logic for each node.

They basically have SumTree<Chunks> and then ChunkSummary which wraps TextSummary.

I wonder if this could work for us as an index for parquet files with something like SumTree<Files> and FileSummary which wraps RowGroupSummary where the summaries are from file and row group statistics.
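A very rough sketch of that idea follows, with the caveat that it is not sum_tree and not a B+ tree: it is a flat two-level structure where a file summary is derived by merging row group summaries, which is the part a summary tree would generalize. `Summary`, `FileEntry`, and `lookup` are hypothetical names, and a single `i64` column is assumed.

```rust
/// Min/max/row-count summary for one column; a hypothetical stand-in for
/// the statistics stored in Parquet metadata.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Summary {
    pub min: i64,
    pub max: i64,
    pub rows: u64,
}

impl Summary {
    /// Combine two summaries, as a tree node would combine its children.
    pub fn merge(self, other: Summary) -> Summary {
        Summary {
            min: self.min.min(other.min),
            max: self.max.max(other.max),
            rows: self.rows + other.rows,
        }
    }

    pub fn may_contain(&self, value: i64) -> bool {
        self.min <= value && value <= self.max
    }
}

pub struct FileEntry {
    pub path: String,
    pub row_groups: Vec<Summary>,
}

impl FileEntry {
    /// File-level summary derived from the row group summaries.
    pub fn summary(&self) -> Option<Summary> {
        self.row_groups.iter().copied().reduce(Summary::merge)
    }
}

/// Return (path, row group index) pairs that may contain `value`.
pub fn lookup(files: &[FileEntry], value: i64) -> Vec<(String, usize)> {
    let mut hits = Vec::new();
    for file in files {
        // Skip the whole file when its merged summary rules the value out.
        let file_may_match = file.summary().map_or(false, |s| s.may_contain(value));
        if !file_may_match {
            continue;
        }
        for (i, rg) in file.row_groups.iter().enumerate() {
            if rg.may_contain(value) {
                hits.push((file.path.clone(), i));
            }
        }
    }
    hits
}
```

A real tree would add interior nodes that merge the summaries of many files, so that whole subtrees could be skipped instead of checking every file.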

I haven't had the chance to dig into sum_tree (just read that article on a long drive) but thought I would bring it up here.

1 reply

matthewmturner Apr 29, 2024

I looked at the sum_tree crate and it actually doesn't appear to be a published crate although the code is open source. I plan to inquire with them to see if there are plans for publishing.

PierreZ · 2024-05-02T12:16:13Z

PierreZ
May 2, 2024

Hi 👋
I was about to start a discussion about how I could leverage DataFusion to handle my custom indexes in a different way than Option 1, glad to see there is a global discussion about this already 😄

My current software stack is something really close to Apple's Record-Layer, meaning that I'm using FoundationDB to store "LogicalDBs". A LogicalDB is simply a subspace of key-values organized into different sub-subspaces, including one where I have indexes.

Each index entry points to a "primary-key" item, allowing me to target the right data subspace.

For example, if I have a Rust structure with a value-field indexed, my key-values will roughly look like this:

Key (represented as an ordered tuple)	Value
(ldb-1, data, pk_1)	Some data encoded
(ldb-1, idx, my_index, 42, pk_1)

Then, in my current implementation, I'm scanning the "my_index" subspace with the right range, and for each primary key found, asking for a new scan in the required pk subspace. This is actually Option 1, and it has the cons listed in the first post.
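The index-scan-then-lookup pattern described above can be sketched with ordered maps standing in for the FoundationDB subspaces. This is only an illustration under assumptions: `LogicalDb` and its methods are hypothetical, the indexed value is an `i64`, and stale index entries on overwrite are not handled.

```rust
use std::collections::BTreeMap;

/// Two ordered maps standing in for the key-value subspaces of one LogicalDB.
#[derive(Default)]
pub struct LogicalDb {
    /// (ldb, data, pk) -> encoded record
    data: BTreeMap<String, String>,
    /// (ldb, idx, my_index, value, pk) -> empty value
    my_index: BTreeMap<(i64, String), ()>,
}

impl LogicalDb {
    /// Write the record and its index entry together.
    pub fn insert(&mut self, pk: &str, indexed_value: i64, record: &str) {
        self.data.insert(pk.to_string(), record.to_string());
        self.my_index.insert((indexed_value, pk.to_string()), ());
    }

    /// Range-scan the index for `lo <= value < hi`, then fetch each
    /// primary key found from the data subspace.
    pub fn query_range(&self, lo: i64, hi: i64) -> Vec<(String, String)> {
        if lo >= hi {
            return Vec::new();
        }
        let start = (lo, String::new());
        let end = (hi, String::new());
        self.my_index
            .range(start..end)
            .filter_map(|((_, pk), _)| {
                self.data.get(pk).map(|record| (pk.clone(), record.clone()))
            })
            .collect()
    }
}
```

Results come back in index order (by value, then by primary key), which is the property an indexed scan operator could expose to the planner.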

It is also quite tedious code to maintain, and we would like to leverage DataFusion for this. This would allow us to use DataFusion expressions to query and manipulate multiple indexes, which is something I will need to write complex queries.

I feel like my need would be fulfilled by having something described as TakeExec (index lookup) -- really like an indexed scan somehow, so that DataFusion handles the deduplication when multiple indexes are involved, before performing the actual data scan.

Similar to other areas of datafusion I think it would be very cool if a simple index implementation could be included out of the box, acting as a reference implementation, while still providing the relevant traits to allow users to customize as needed.

I deeply agree with @matthewmturner, and I would love to contribute to this feature 😄

10 replies

alamb May 6, 2024
Collaborator Author

https://github.com/datafusion-contrib/datafusion-index-provider created 🚀

PierreZ May 6, 2024

https://github.com/datafusion-contrib/datafusion-index-provider created 🚀

Awesome, thanks 😊 I'm on holiday this week, but I will start working on this next week. I need to dive a bit into the Record-Layer to see how they are handling indexes, that could give some context 🤔

PierreZ May 31, 2024

I'm not giving a lot of updates, but I'm having fun reading through datafusion 😄 I am close to having something working.

alamb May 31, 2024
Collaborator Author

BTW if you want to see a peek of where we are headed with parquet indexing, check this out: #10701 (I don't think this is a general purpose indexing story, but it is pretty neat)

PierreZ Jun 3, 2024

Thank you, that's really helpful to have more OLAP-oriented index usage

Max-Meldrum · 2024-05-15T14:32:14Z

Max-Meldrum
May 15, 2024

This feature sounds interesting!

I recently made a post about a potential integration of µWheel into DataFusion as an index to speed up temporal aggregation queries significantly.

Example query:

SELECT SUM(fare_amount) FROM yellow_tripdata
WHERE tpep_dropoff_datetime >= '?' AND tpep_dropoff_datetime < '?'
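To show why a pre-aggregating index can speed up this kind of query, here is a deliberately simplified sketch. It is not how µWheel is implemented: it keeps one partial SUM per fixed one-hour bucket and answers a range SUM from those buckets when both bounds are hour-aligned. `SumIndex` and `BUCKET_SECS` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Bucket width in seconds (one hour); an arbitrary choice for this sketch.
pub const BUCKET_SECS: i64 = 3600;

/// Partial sums keyed by bucket start time (seconds since the epoch).
#[derive(Default)]
pub struct SumIndex {
    buckets: BTreeMap<i64, f64>,
}

impl SumIndex {
    /// Fold one row into the bucket that contains its timestamp.
    pub fn insert(&mut self, ts_secs: i64, value: f64) {
        let bucket = ts_secs.div_euclid(BUCKET_SECS) * BUCKET_SECS;
        *self.buckets.entry(bucket).or_insert(0.0) += value;
    }

    /// SUM over `start <= ts < end`. Returns None when the range is not
    /// bucket-aligned, in which case the caller must fall back to a scan.
    pub fn sum(&self, start: i64, end: i64) -> Option<f64> {
        let aligned = start.rem_euclid(BUCKET_SECS) == 0 && end.rem_euclid(BUCKET_SECS) == 0;
        if !aligned || start > end {
            return None;
        }
        Some(self.buckets.range(start..end).map(|(_, v)| *v).sum::<f64>())
    }
}
```

The cost of an aligned query depends on the number of buckets in the range and not on the number of rows, which is where the speedup for temporal aggregations would come from.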

I'm interested in helping out to make it supported natively by DataFusion!

15 replies

Max-Meldrum Aug 2, 2024

Thanks for the feedback! @alamb

Yeah, I tried different approaches and operating at the LogicalPlan was the easiest. So the optimization is currently done at the logical level, though not as a local optimizer pass: it happens in the QueryPlanner::create_physical_plan function.

Am I correct that with the optimizer local rewrite you mean to rewrite the logical plan to a new native LogicalPlan variant for UWheel execution? (LogicalPlan::UWheel(...)?)

alamb Aug 2, 2024
Collaborator Author

Am I correct that with the optimizer local rewrite you mean to rewrite the logical plan to a new native LogicalPlan variant for UWheel execution? (LogicalPlan::UWheel(...)?)

I was thinking more of implementing OptimizerRule and then registering it with SessionContext::add_optimizer_rule

There is an example of this in optimizer_rule.rs though that rewrites Exprs not the LogicalPlan

Given my understanding of what you are doing (replacing parts of the query with values from the Uwheel index) I don't think you need a new LogicalPlan variant. I was thinking maybe you could use one of the existing variants that can provide plan time values. Using a LogicalPlan::TableScan with a MemTable that has the values from the uwheel index is probably what I would look into

But I apologize if this doesn't make sense / I don't understand the transformation you are trying to do

Max-Meldrum Aug 3, 2024

Given my understanding of what you are doing (replacing parts of the query with values from the Uwheel index) I don't think you need a new LogicalPlan variant. I was thinking maybe you could use one of the existing variants that can provide plan time values. Using a LogicalPlan::TableScan with a MemTable that has the values from the uwheel index is probably what I would look into

But I apologize if this doesn't make sense / I don't understand the transformation you are trying to do

No, it does make sense. I believe I previously misunderstood the approach. I'll do a rework using the OptimizerRule.

Max-Meldrum Aug 6, 2024

Now refactored in uwheel/datafusion-uwheel@f88ea80

Max-Meldrum Aug 14, 2024

This new blog post introduces the datafusion-uwheel crate. Including how it works, how to use it, performance, and next steps.

Any feedback is valuable.

jeromegn · 2024-08-11T01:19:21Z

jeromegn
Aug 11, 2024

Not adding much to the discussion here, but might be interesting nonetheless: I'm looking into datafusion to handle the SQL parsing and query planning side of a DBMS I have in mind and supporting indexes would make my life a lot easier. I could use option 1 in the meantime.

I'm watching the datafusion-contrib/datafusion-index-provider repo closely. Unfortunately, I am not nearly knowledgeable enough about DataFusion to contribute, for now!

1 reply

adriangb Aug 11, 2024

For general index support you may want to check out https://github.com/datafusion-contrib/datafusion-async-parquet-index. I have some updates I can roll into there that I'll try to get across in the next couple of days.

zhousun · 2024-08-11T18:08:05Z

zhousun
Aug 11, 2024

Comment based on my experience building index execution in some mature query engines:
For most secondary index types, it is beneficial to model them as a 'special and optimized way to run a filter'.
Hash/range, fulltext, vector, generalized inverted index, and spatial indexes all fall into this category.

So would it make sense to extend pushdown predicates (rowfilter) as an extendable filter node?
FilterNode: all index filters and pushdown predicates.

IMO it has several advantages:

For index-heavy systems, it is very important to run all index filters together using index intersection.
Certain index queries (ANN) need to interact with other filters to run a pre-filter.
It enables some adaptive execution (falling back to not using the index).
It is also clearer when displaying the execution plan.
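As a small illustration of index intersection, the sketch below intersects sorted row id lists, one per index filter, starting from the most selective list. The function names are hypothetical, and returning `None` when no index filter applies is just one way to model the fallback to a plain scan.

```rust
use std::cmp::Ordering;

/// Merge-style intersection of two sorted, de-duplicated row id lists.
pub fn intersect_sorted(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Intersect the results of all index filters, smallest list first so the
/// running result stays small. Returns None when there is no index filter,
/// meaning the caller should fall back to an unindexed scan.
pub fn intersect_all(lists: &[Vec<u64>]) -> Option<Vec<u64>> {
    let mut by_size: Vec<&Vec<u64>> = lists.iter().collect();
    by_size.sort_by_key(|l| l.len());
    let mut iter = by_size.into_iter();
    let first = iter.next()?.clone();
    Some(iter.fold(first, |acc, l| intersect_sorted(&acc, l)))
}
```

Running all filters through one node like this is what lets the engine fetch only the rows that survive every index, instead of applying each index in isolation.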

I am still learning DataFusion but I am committed to contributing.

8 replies

Epicism Aug 15, 2024

@alamb I may be completely wrong, but it appears that the scan method can only use indexes that identify the chunk, not the specific rows. If I want to build an index that pulls specific rows am I able to pass this information down further?

westonpace Aug 15, 2024

I think, if we want DF to become aware of indices (and not just something internal to a scan) then maybe rewrites / optimizer rules would be the way to go? It seems like a planner / optimizer task to recognize that an index could be applied to speed up a filter and convert a scan into an indexed_scan or something equivalent.

Or, for uWheel, as I understand it, it would be rewriting aggregation.
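A toy version of such a rewrite is sketched below. It deliberately does not use DataFusion's `LogicalPlan` or `OptimizerRule` types: `Plan` is a hypothetical miniature plan enum with equality filters only, and `use_index` is a hypothetical rule that turns a filter directly over a scan into an indexed scan when the (table, column) pair is known to be indexed.

```rust
/// Hypothetical miniature logical plan, only for illustrating the rewrite.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String },
    IndexedScan { table: String, column: String, value: i64 },
    /// Equality filter `column = value` over `input`.
    Filter { column: String, value: i64, input: Box<Plan> },
    Aggregate { func: String, input: Box<Plan> },
}

/// Rewrite `Filter(Scan)` into `IndexedScan` when an index exists for the
/// (table, column) pair; recurse through everything else unchanged.
pub fn use_index(plan: Plan, indexed: &[(&str, &str)]) -> Plan {
    match plan {
        Plan::Filter { column, value, input } => match *input {
            Plan::Scan { table } => {
                let has_index = indexed.iter().any(|(t, c)| *t == table && *c == column);
                if has_index {
                    Plan::IndexedScan { table, column, value }
                } else {
                    Plan::Filter { column, value, input: Box::new(Plan::Scan { table }) }
                }
            }
            other => Plan::Filter {
                column,
                value,
                input: Box::new(use_index(other, indexed)),
            },
        },
        Plan::Aggregate { func, input } => Plan::Aggregate {
            func,
            input: Box::new(use_index(*input, indexed)),
        },
        leaf => leaf,
    }
}
```

Because the rewritten node is visible in the plan, the index's effect would show up when inspecting the plan, which several comments in this thread asked for.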

alamb Aug 15, 2024
Collaborator Author

In my mental database model, if it is possible to use a different index to scan each table, each of those different indexes would map to a different instance of DataFusion TableProvider

So one way to support multiple indexes for a table might be:

Register a TableProvider that uses the first index, and plan the query
Register a different TableProvider for the second index for the same table and plan the query again
Use a cost model to decide which of the plans is "better" and run that

Another way might be to

make the initial LogicalPlan with a single TableProvider
implement a rewrite pass as @westonpace suggests, that replaces the TableProvider
Run the pass on 2 copies of the initial LogicalPlan with the two indexes --> 2 different table providers
Decide between the two different plans.

The downside of the two approaches above is that they are combinatorial in the number of indexes and tables (a query with 10 tables, each with 2 possible indexes, would result in 2^10 possible plans)

Typically the way other databases handle this is some heuristics to prune the search space during enumeration, which one could also implement as a DataFusion optimizer pass
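To put numbers on the combinatorics, and to show one very simple pruning heuristic, here is a sketch. Both functions are hypothetical, and `pick_cheapest` assumes per-table index costs are independent of each other, which real cost models do not generally guarantee (for example when join order interacts with sort order).

```rust
/// Number of candidate plans when every table independently picks one of
/// its access paths. A table with no index still has one path (full scan).
pub fn plan_count(indexes_per_table: &[u32]) -> u128 {
    indexes_per_table
        .iter()
        .map(|&n| u128::from(n.max(1)))
        .product()
}

/// Greedy heuristic: for each table pick the index with the lowest
/// estimated cost, ignoring interactions between tables. This is linear
/// in the total number of indexes instead of exponential.
/// Returns None for a table that has no index.
pub fn pick_cheapest(costs: &[Vec<f64>]) -> Vec<Option<usize>> {
    costs
        .iter()
        .map(|table_costs| {
            table_costs
                .iter()
                .enumerate()
                .min_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i)
        })
        .collect()
}
```

With 10 tables and 2 indexes each, `plan_count` gives 1024 candidate plans, while the greedy pass looks at only 20 cost estimates.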

zhousun Aug 15, 2024

It really depends on the index implementation:
for example, for an inverted-index-based implementation (either a hash key or fulltext or probably a vector index), the optimal way to execute is to use both / all indexes and do index intersection.
But for a range key, we do want to pick which index to use. Still, it can be a local decision and could happen at runtime (within the TableProvider).

Pre-aggregation like uWheel might be different as it completely affects the plan (another case would be picking a sort order to avoid sorting or enable streaming GroupBy). But for these cases, it is almost a rule based decision, always better to use the index.

zhousun Aug 15, 2024

To share more context: I am starting on an open indexing framework (general hash/range, search and pre-aggregation) for arbitrary data files :) and am working to have DataFusion as one of the first integrations.
Would love to jump on a call if anyone is interested.
