Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change back to non async interface for try_into_logical_plan #3955

Closed
wants to merge 2 commits into from

Conversation

yahoNanJing
Copy link
Contributor

Which issue does this PR close?

Closes #3954.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Oct 25, 2022
@yahoNanJing
Copy link
Contributor Author

It's caused by #3907. I think there's some discussion before for changing the async interface to non async. I'm wondering why it's changed back to async again in #3907.

Hi @alamb, @andygrove and @avantgardnerio, what do you think of this?

@avantgardnerio
Copy link
Contributor

there's some discussion before

@yahoNanJing I am not familiar with the discussion (or don't recall)... do you have a link to the PR or github issue?

wondering why it's changed back to async

Because creating TableProviders may have to be an async operation for ones like Deltalake that need to go load schema from the network.

I looked into the alternative: having two methods on TableProviderFactory:

  1. fn async create(url: String) - what we have now
  2. fn with_schema(url: String, schema: SchemaRef) so that in theory when deserializing TableProviders we could skip the async operation by using the schema which should have been serialized in the scan.

Unfortunately, it did not look trivial to serialize all the state that a DeltaTable sets up during planning, so based on @andygrove 's suggestion I switched it to async.

@yahoNanJing
Copy link
Contributor Author

Thanks @avantgardnerio for the explanation. In case of network usage or other IO usage, we can refer to the object store interface implementation with spawn blocking.

@avantgardnerio
Copy link
Contributor

with spawn blocking.

This is the PR in question. This is the deserialization line of code that calls the async method. If you know another way to do this, I would be open to specific recommendations. I don't understand how spawn blocking would help, but I am open to it.

The only other option I was able to think of is to serialize the whole state of the DeltaTable including the ObjectStore config and the JSON history of the delta log.

@alamb
Copy link
Contributor

alamb commented Oct 25, 2022

Thanks @avantgardnerio for the explanation. In case of network usage or other IO usage, we can refer to the object store interface implementation with spawn blocking.

Yeah, the async is very infectious -- so adding async definitely causes non trivial downstream churn.

@yahoNanJing would an alternative approach be to create a PR in Ballista to update the code in question to be async?

@avantgardnerio perhaps @yahoNanJing is suggesting something like https://docs.rs/tokio/1.21.2/tokio/runtime/struct.Runtime.html#method.block_on which blocks a thread and waits for an async method to resolve

@avantgardnerio
Copy link
Contributor

update the code in question to be async?

FWIW, I have already done this, I just haven't filed a PR yet. I can do that easily if we decide to stick with async.

@alamb
Copy link
Contributor

alamb commented Oct 25, 2022

@yahoNanJing I can't tell from this PR or #3954 about why you are concerned about this change. Specifically does async prevent you from doing something? Or did you just want clarification on why the API was changed to async? Or something else?

Can you please let us know what you would prefer

  1. backout the async change to the serialize API while we have a discussion on the matter
  2. keep it in and proceed
  3. Something else

Note that @andygrove filed #3957 recently to help with this sort of discussion

@yahoNanJing
Copy link
Contributor Author

yahoNanJing commented Oct 26, 2022

Hi @avantgardnerio and @alamb, I think it's a similar case for the execute interface in ExecutionPlan. There're some discussion there, #2307, #2434.

Specifically does async prevent you from doing something?

Currently there's no blocking issue for the async change to ballista. But I'm not sure whether it will bring issues in the future. From my point of view, generally, the interface of creating the logical plan should be serialize API. For some edge case, we can use some workaround like https://docs.rs/tokio/1.21.2/tokio/runtime/struct.Runtime.html#method.block_on does.

Maybe we can involve more guys for the API design, like @tustvold, @andygrove, etc.

@avantgardnerio
Copy link
Contributor

the interface of creating the logical plan should be serialize API

I have the unfortunate tendency to agree with @yahoNanJing here... see fun examples like the log4j CVE and Java's famous URL.equals() examples of abuse of network access in parsing logic. That's why I originally spent so much time working on the separate async fn create() and fn with_schema() functions.

It was very convenient to switch this to async, but from a principled perspective I think it's the wrong move. I can work @houqp and the folks over at delta-rs to serde the whole DeltaTable and ObjectStore.

Since that PR is already working though, it would be nice to have some time to verify the assumption that the issue is resolvable before merging this PR.

Thoughts @andygrove and @alamb ?

@@ -86,5 +86,5 @@ pub trait TableProvider: Sync + Send {
#[async_trait]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably remove #[async_trait] as well

@tustvold
Copy link
Contributor

I'm not intimately familiar with the details here, but when given a choice between async or not, I almost always favour the latter. 😅. Certainly the idea of deserialization logic making network calls a little bit surprising. If it is possible to achieve the end result without making this async, that seems like a better path to me

@avantgardnerio
Copy link
Contributor

@yahoNanJing and everyone, I'd like to close this PR in favor of #3978 which will do the same thing, but not break deltalake. As long as we get it merged soon (before 14?) I think no harm was done, right?

@alamb
Copy link
Contributor

alamb commented Oct 26, 2022

So basically making an API async I think allows for more network enabled planning (e.g async table providers etc when the catalog information is not already at hand). I don't have a strong opinion either way. I agree with @yahoNanJing there are ways to do serialization without requiring network access (by leaving placeholders, for example, and then do a subsequent pass to look up anything from the network that is needed)

@yahoNanJing
Copy link
Contributor Author

Thanks @avantgardnerio. I agree to close this PR if #3978 works for both you and non-async 😄

@alamb
Copy link
Contributor

alamb commented Oct 27, 2022

#3978 was merged so closing this one

@alamb alamb closed this Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change back to non async interface for try_into_logical_plan
5 participants