Decouple Streaming Use-Case from File IO Abstractions #7994

tustvold · 2023-10-30T23:28:27Z

Is your feature request related to a problem or challenge?

Currently we accommodate streaming workloads within DataFusion by overloading the file IO abstractions.

This is not always a very good fit and results in a number of workarounds:

Providing PartitionedFile information to methods that write new files via FileSinkConfig - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileSinkConfig.html#structfield.file_groups
Passing unbounded_input to parallel write logic that really shouldn't care - https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79
Special logic to handle reading files that aren't regular files - Update ObjectStore 0.7.0 and Arrow 46.0.0 #7282 (comment)
The somewhat confusing notion of an "infinite" file - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html#structfield.infinite_source
An ObjectStore::append method that doesn't naturally fit with the design goals of the project - https://github.com/apache/arrow-rs/pull/4978/files#diff-fbf0cdd67af2b24703a831d28c932bfe0535234e3b745d3a5437b9c0bc881ddeR89
Schema inference within ListingTable currently assumes files are finite
ObjectMeta is passed around in plans with the implicit assumption that files are not changing underneath
And more...

As DataFusion gets more sophisticated about handling catalogs, reading/writing partitioned data, caching data, this overloading is getting more and more arcane and hard to reason about, and I think it is overdue we do something to address it.

Describe the solution you'd like

I would like to separate the notions of FileSink and FileScan from a StreamSink and StreamSource, this would allow abstractions that better fit their respective use-cases.

In particular

FileSink and FileScan can focus on reading/writing partitioned immutable files following standard big data practices
StreamSink and StreamSource can focus on reading/writing CSV / JSON (/ Avro) data from streaming sources

Not only would this simplify the current code, but would also expand the streaming support in DataFusion

Allows for more efficient non-blocking IO, as linux FIFO's support poll(2) (unlike general files)
Potential future integrations with data streaming systems such as Kafka, etc...

Describe alternatives you've considered

No response

Additional context

No response

alamb · 2023-10-30T23:48:54Z

I would like to separate the notions of FileSink and FileScan from a StreamSink and StreamSource, this would allow abstractions that better fit their respective use-cases.

Sounds like a great idea to me.

devinjdangelo · 2023-10-30T23:55:00Z

+1 from me as well. The more we finely optimize batch/streaming reads and writes the more the specific implementations ideally will diverge a bit. I felt this when working on parallelizing writes as some code that work great for batch caused inescapable infinite loops for infinite sources, requiring adding special cases to the code. Separating the two cases will allow both use cases to optimize appropriately without harming the other use case.

tustvold · 2023-11-01T07:54:19Z

I created a POC of ripping append semantics out of ListingTable in #8017 and the good news is this is very achievable 😄

metesynnada · 2023-11-01T12:02:26Z

I, too, am in favor of this split; the StreamSink methodology's implementation significantly enhances the unified approach, maintaining the integrity of features on both ends. At the moment, my agenda is filled with high-priority tasks, but I am available to assist with code reviews and provide guidance on the overall structure of the code. Additionally, I will contribute some tests to the codebase to verify that we are preserving the streaming functionality. I believe this will also serve to optimize the Datafusion use cases. Thank you for your hard work, @tustvold, @alamb, and @devinjdangelo

Btw, If a Google Doc or any similar platform is accessible, I would be glad to contribute to it.

tustvold · 2023-11-01T20:38:14Z

Whilst discussing potential API designs with @alamb the question came up of whether the existing TableProvider abstractions are now sufficiently mature that we could just use them directly. In particular the combination of TableProvider::insert_into and TableProviderFactory I think may cover all the necessary extension points for people to implement streaming workloads as they deem fit.

I tried to show what this might look like in #8021 PTAL.

If this would be sufficient for the streaming use-cases, perhaps we can avoid needing to define further abstractions, and can simply remove the streaming support from ListingTable proper, i.e. a more polished version of #8017?

alamb · 2023-11-08T21:34:33Z

There is a lot of related discussion on #8021 (comment)

metesynnada · 2023-11-11T11:32:26Z

I wrote an initial design document, @alamb and @tustvold , can you check this out? I have tried to cover all use cases while making it easier to develop the batch mode feature. The document allows commenting, so please share your thoughts 😀

alamb · 2023-11-11T12:44:11Z

I wrote an initial design document, @alamb and @tustvold , can you check this out? I have tried to cover all use cases while making it easier to develop the batch mode feature. The document allows commenting, so please share your thoughts 😀

Thanks @metesynnada -- I will do so, likely later today or tomorrow

tustvold · 2023-11-11T13:09:38Z

Thank you for writing this up, before getting into specific questions about the design, I think it would help to articulate what the objectives here are? Some possibilities might be:

To replicate the functionality currently provided by ListingTable
To facilitate the addition of new streaming sources beyond FIFO files

From my reading of the document, the proposed abstractions appear to largely mirror the equivalent TableProvider abstractions used in #8021, whilst also quite closely fitting to the current FIFO file mechanism based around serialized byte streams. This makes me suspect that neither of the above is quite the vision here?

On a more concrete level, it occurs to me that if StreamStoreRegistry were moved under StreamingTableFactory, all of this functionality would be encapsulated under one "extension point". Perhaps this might provide a mechanism to iterate and evolve these APIs incrementally, e.g. in datafusion-contrib or similar, without needing to front-load the design effort? This would allow new streaming sources to be added, and the abstractions evolved as necessary, all in one place? I for one anticipate Kafka, and certainly Kinesis, will require some non-trivial iteration on these APIs to accommodate their particular quirks. Just a suggestion, but it might not only allow us to make progress on this ticket quicker, but also yield a better development experience for ongoing work on the streaming functionality, to not be split across repositories.

alamb · 2023-11-11T14:27:28Z

I have read the document and I really like it. Thank you @metesynnada.

Use Case

My understanding is that the goal of this exercise is not primarily about the specifics of decoupling append/FIFO from ListingTable, but the more general goal of easily using DataFusion to build systems with functionality similar to Apache Flink, as described in #4285

In particular the goal is to make it easy to

Read data from "streaming" sources
Write data to "streaming" sinks

Where "streaming" means the data is NOT composed of a fixed set of immutable objects (as in object storage), but rather some number of "streams" which can be appended / consumed. Some examples are FIFO files, Kafka, RabbitMQ, Kinesis, etc.

One of the major challenges in trying to implement such streaming systems today is that ListingTable, as its name implies, is implemented assuming the data is being read or written to an object store, with both the benefits and limitation of the object_store API.

Design Feedback

I really like the idea to create a parallel StreamingTable to ListingTable that is backed by a different API than object store and that can be used to build such streaming usecases. It much better matches:

The differences in reading from a streaming vs immutable discrete objects in object store
The difference in writing data out in a streaming fashion compared to immutable discrete objects in object store

Requests / ideas

As we expand DataFusion in this streaming direction, I would like to request we take this opportunity to define some more crate boundaries so the ListingTable is not so tightly integrated / intertwined, and likewise StreamingTable is not so so intertwined -- perhaps we can aim to end up with three new crates datafusion_listing_table datafusion_streaming_table and datafusion_data_format

Next steps:

Would it be possible to create a PoC / proposal with the basic APIs and make sure they fit together and into the rest of DataFusion? This is likely to be a large change, so I think getting the skeleton in place and then filling out the details in subsequent PRs (rather than one massive one) would be my preferred process.

alamb · 2023-11-11T14:34:58Z

I think we can probably sketch out enough of this StreamingTable API to migrate the existing FIFO support pretty quickly. I would be happy to help / review quickly

tustvold · 2023-11-11T14:53:38Z

I would like to request that we don't overload the file format abstractions for streaming use-cases as that is part of the confusion I am trying to eliminate. There isn't a great deal of logic that can be shared anyway. Otherwise sounds good to me.

I still would favor not trying to front-load defining StreamingTable and instead allow it to form naturally as new backends are added, I am not confident that log oriented stores like kinesis and Kafka map well to the proposed API and it seems premature to be designing an abstraction before having multiple implementations to abstract 😅

alamb · 2023-11-11T15:28:14Z

I agree that starting with a basic StreamingTable abstraction, and copying over only API features from the currentListingTable that are needed (and not replicating APIs until necessary) makes a lot of sense

I would like to request that we don't overload the file format abstractions for streaming use-cases as that is part of the confusion I am trying to eliminate.

Do you mean the FileFormat trait? At the moment, it seems like it is tightly tied to ObjectStore both in the infer methods and create_writer_physical_plan via FileSinkConfig so I agree it is not clear how to make this easily fit into the streaming design

My understanding that the StreamingTable would an analgous trait (like StreamFormat perhaps).

ozankabak · 2023-11-11T22:55:05Z

My understanding that the StreamingTable would an analgous trait (like StreamFormat perhaps).

Exactly.

I think we can probably sketch out enough of this StreamingTable API to migrate the existing FIFO support pretty quickly. I would be happy to help / review quickly

Thank you 🚀

We will start implementing this next week to get the core APIs and the functionality in place. Your help in fine-tuning the design through reviews during this process will be much appreciated. @metesynnada and I scoped out the work and it seems the implementation should go relatively smoothly.

tustvold · 2023-11-13T23:03:42Z

What do people think about perhaps polishing up #8021 into a v0.1 for StreamingTable? I believe most of the functionality is there, and I would be happy to do this.

Just thinking about ways I could help get this over the line...

alamb · 2023-11-14T01:07:38Z

Thank you @tustvold -- I think that would be very helpful assuming @metesynnada / @ozankabak are not already working on it.

ozankabak · 2023-11-14T06:46:56Z

We started working on it and hope to get things ready for review this week. We will appreciate your help on reviewing it. Thank you @alamb and @tustvold

tustvold · 2023-11-14T06:53:48Z

Is there any value in starting small with something based off #8021? I'm confident I could have something ready for review this morning? This would allow us to take the time to get the v1 design right, whilst unblocking things in the meantime?

ozankabak · 2023-11-14T07:47:02Z

I want to avoid any possible overlap between the progress so far on our end and your time. Given that we will have something out this week, I think we can best utilize your time if you carefully review the upcoming PRs and help us optimize the design and code through that process. Once we have the basics out, there will also be some important work to remove leftover code in the ListingTable end to optimize it fully for bounded files, your help will be crucial there too.

tustvold · 2023-11-14T07:56:01Z

My time is immaterial, I just want to get this over the line. If it all gets replaced in 2 weeks, I would be more than happy.

As it stands now though, from my perspective, we have had a largely functionally complete PR that preserves the current functionality for weeks, and the approach was signed off by @metesynnada - #7994 (comment) 2 weeks ago. I am therefore I think pretty understandably frustrated at the prospect of yet further delay for reasons that haven't been articulated.

I would very much like to just get this over the line so that the work in DF and IOx that is blocked on it can move forward. If that means doing work that will be thrown away, so be it, it has mostly already been done.

ozankabak · 2023-11-14T09:44:18Z

Dear @tustvold,

Just like you, we also would like to get to a state where everyone can move forward, as we have been negatively impacted by this situation. I would like to remind you that we did not create the situation we are all in. The breaking arrow-rs change, which is the root cause of the time pressure, was done without us knowing. We were only notified of it after your merged it.

The fact that we agreed on the general aspects of the approach to adopt in terms of the roadmap does not mean we committed to a very specific timeframe. If we are to discuss frustrations, we can contribute quite a few talking points. I don't want to go down that path, as it will not help remedy the situation. I invite you to consider doing the same.

Whatever happened, happened. We are all active contributors to this project, and regardless of the situation we should focus on finding productive ways to work together. @alamb and I engaged in such a discussion last week to chart a path forward, and we are acting in good faith to get to a resolution as quickly as possible. The first step was to come up with a proposal, and we did in a timely manner. @alamb took a detailed look at it, and it seems we are on the same page w.r.t. to the general approach within it. Now, we are in the process of implementing it. I have said before that we will have something out this week, and @alamb will help us with reviews and suggestions to move the ball forward quickly when it is out (#8151). We are making progress as a community with a focus on moving forward. Please, I request from you to not exacerbate the situation needlessly. If you would like to help, we do welcome your help -- I already mentioned some possible ways how.

Datafusion is a great project with a huge role to play in the data ecosystem if we are able to form and maintain a cohesive community around it. I hope we don't lose sight of this due to temporary roadblocks like the one we are going through right now.

tustvold · 2023-11-14T09:47:05Z

I have already updated #8021 inline with the design proposal, if this is acceptable I would very much like to consider moving forwards with it as a stopgap. I leave the rest with you

If we are to discuss frustrations, we can contribute quite a few talking points.
I request from you to not exacerbate the situation needlessly

I apologise if that is how it has come across, that has not been my intention, I purely wish to keep things move things forward. I did not anticipate this change being any more fraught than the fairly frequent breaking changes we routinely make. Change is necessary but is disruptive and I could have done a better job communicating this.

ozankabak · 2023-11-14T10:16:09Z

Thank you, I appreciate you sharing your reflection publicly. I am glad that we are able to come together and collaborate towards a solution. @metesynnada and I are reviewing #8021 to see how we can use it as a base to accelerate the work towards implementing the proposal. We will circle back shortly.

* Implement FIFO using extension points (#7994) * Clippy * Rename to StreamTable and make public * Add StreamEncoding * Rework sort order * Fix logical conflicts * Format * Add DefaultTableProvider * Fix doc * Fix project sort keys and CSV headers * Respect batch size on read * Tests are updated * Resolving clippy --------- Co-authored-by: metesynnada <100111937+metesynnada@users.noreply.github.com>

* Remove FileWriterMode Support (#7994) * Don't ignore test * Error on insert to single file * Improve DisplayAs

* POC: Remove ListingTable Append Support (#7994) * Prepare object_store 0.8.0 * Fix datafusion-cli test * Update arrow version * Update tests * Update pin * Unignore fifo test * Update lockfile

tustvold added the enhancement New feature or request label Oct 30, 2023

tustvold mentioned this issue Oct 30, 2023

Fix AsyncPutWriter (#7991) #7992

Closed

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Oct 31, 2023

Don't canonicalize ListingTableUrl (apache#7994)

cd58dbe

tustvold mentioned this issue Oct 31, 2023

Defer ListingTable Path Creation To Write #8007

Closed

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Oct 31, 2023

Don't canonicalize ListingTableUrl (apache#7994)

da0554d

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Oct 31, 2023

Don't canonicalize ListingTableUrl (apache#7994)

72778b1

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Oct 31, 2023

Don't canonicalize ListingTableUrl (apache#7994)

b96615a

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Oct 31, 2023

Don't canonicalize ListingTableUrl (apache#7994)

7edff96

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Nov 1, 2023

POC: Remove ListingTable Append Support (apache#7994)

1d7e8bd

tustvold mentioned this issue Nov 1, 2023

Remove FileWriterMode and ListingTableInsertMode (#7994) #8017

Merged

tustvold mentioned this issue Nov 1, 2023

Remove ObjectStore::append apache/arrow-rs#5016

Merged

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Nov 1, 2023

Implement FIFO using extension points (apache#7994)

bf58d3e

tustvold mentioned this issue Nov 1, 2023

Implement StreamTable and StreamTableProvider (#7994) #8021

Merged

alamb mentioned this issue Nov 2, 2023

Minor: Improve documentation for PartitionStream and StreamingTableExec #8035

Merged

alamb mentioned this issue Nov 9, 2023

Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open

alamb mentioned this issue Nov 11, 2023

[Proposal] Streaming execution support roadmap #4285

Open

alamb mentioned this issue Nov 13, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 13, 2023 #8151

Closed

9 tasks

tustvold mentioned this issue Nov 15, 2023

File IO on Tokio Workers #8188

Open

tustvold closed this as completed in #8021 Nov 15, 2023

tustvold self-assigned this Nov 15, 2023

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Nov 15, 2023

Remove ListingTable Append Support (apache#7994)

76624a6

tustvold added a commit to tustvold/arrow-datafusion that referenced this issue Nov 15, 2023

Remove FileWriterMode Support (apache#7994)

1e86301

tustvold added a commit that referenced this issue Nov 17, 2023

Remove FileWriterMode and ListingTableInsertMode (#7994) (#8017)

325a3fb

* Remove FileWriterMode Support (#7994) * Don't ignore test * Error on insert to single file * Improve DisplayAs

This was referenced Dec 14, 2023

Remove ListingTable and FileScanConfig Unbounded #8540

Closed

Remove ListingTable Single File Option #8548

Closed

matthewgapp mentioned this issue Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decouple Streaming Use-Case from File IO Abstractions #7994

Decouple Streaming Use-Case from File IO Abstractions #7994

tustvold commented Oct 30, 2023 •

edited

Loading

alamb commented Oct 30, 2023

devinjdangelo commented Oct 30, 2023

tustvold commented Nov 1, 2023

metesynnada commented Nov 1, 2023 •

edited

Loading

tustvold commented Nov 1, 2023

alamb commented Nov 8, 2023

metesynnada commented Nov 11, 2023 •

edited

Loading

alamb commented Nov 11, 2023

tustvold commented Nov 11, 2023 •

edited

Loading

alamb commented Nov 11, 2023

alamb commented Nov 11, 2023

tustvold commented Nov 11, 2023 •

edited

Loading

alamb commented Nov 11, 2023

ozankabak commented Nov 11, 2023

tustvold commented Nov 13, 2023

alamb commented Nov 14, 2023

ozankabak commented Nov 14, 2023

tustvold commented Nov 14, 2023 •

edited

Loading

ozankabak commented Nov 14, 2023

tustvold commented Nov 14, 2023 •

edited

Loading

ozankabak commented Nov 14, 2023 •

edited

Loading

tustvold commented Nov 14, 2023 •

edited

Loading

ozankabak commented Nov 14, 2023

Decouple Streaming Use-Case from File IO Abstractions #7994

Decouple Streaming Use-Case from File IO Abstractions #7994

Comments

tustvold commented Oct 30, 2023 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Oct 30, 2023

devinjdangelo commented Oct 30, 2023

tustvold commented Nov 1, 2023

metesynnada commented Nov 1, 2023 • edited Loading

tustvold commented Nov 1, 2023

alamb commented Nov 8, 2023

metesynnada commented Nov 11, 2023 • edited Loading

alamb commented Nov 11, 2023

tustvold commented Nov 11, 2023 • edited Loading

alamb commented Nov 11, 2023

Use Case

Design Feedback

Requests / ideas

Next steps:

alamb commented Nov 11, 2023

tustvold commented Nov 11, 2023 • edited Loading

alamb commented Nov 11, 2023

ozankabak commented Nov 11, 2023

tustvold commented Nov 13, 2023

alamb commented Nov 14, 2023

ozankabak commented Nov 14, 2023

tustvold commented Nov 14, 2023 • edited Loading

ozankabak commented Nov 14, 2023

tustvold commented Nov 14, 2023 • edited Loading

ozankabak commented Nov 14, 2023 • edited Loading

tustvold commented Nov 14, 2023 • edited Loading

ozankabak commented Nov 14, 2023

tustvold commented Oct 30, 2023 •

edited

Loading

metesynnada commented Nov 1, 2023 •

edited

Loading

metesynnada commented Nov 11, 2023 •

edited

Loading

tustvold commented Nov 11, 2023 •

edited

Loading

tustvold commented Nov 11, 2023 •

edited

Loading

tustvold commented Nov 14, 2023 •

edited

Loading

tustvold commented Nov 14, 2023 •

edited

Loading

ozankabak commented Nov 14, 2023 •

edited

Loading

tustvold commented Nov 14, 2023 •

edited

Loading