
feat: parallel queries #475

Merged: 59 commits into main from dom/parallel on Oct 3, 2024

Conversation

@domoritz (Member) commented Aug 2, 2024

Updated version of #438

Rather than sending requests in series, send them in parallel but serialize the order before returning them from the query manager.

Fixes #388
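
As a rough illustration of the approach (not the actual QueryManager code; `serializeResults` and `runQuery` are hypothetical names): fire all requests at once, but chain the returned promises so callers observe results in submission order.

```js
// Sketch only: "send in parallel, return in order" with plain promises.
function serializeResults(requests, runQuery) {
  // Fire all requests immediately so the backend can work on them in parallel.
  const pending = requests.map(req => runQuery(req));

  // Chain the results so callers observe them in the original submission order.
  let previous = Promise.resolve();
  return pending.map(p => {
    const ordered = previous.then(() => p);
    previous = ordered.catch(() => {}); // keep the chain alive even if a query fails
    return ordered;
  });
}
```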

@jheer (Member) left a comment

LGTM, once comments are addressed.

Review threads (all outdated and resolved):
packages/core/test/coordinator-test.js
.vscode/launch.json
eslint.config.js
packages/core/src/util/query-result.js (two threads)
this._logger.debug('Cache');
result.fulfill(cached);
result.prepare(type === 'exec' ? null : data);
Member:

It's a separate issue, but this makes me think we should double check caching interactions with exec queries. I don't think we want to be going to the cache for any exec queries...

Member Author:

It looks like cache is set to false by default and exec makes a query without cache so it should always be off.
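
For context, a hypothetical sketch of the guard being described; the option and field names are illustrative, not the actual Mosaic query manager code.

```js
// Hypothetical guard: only consult the cache for cacheable, non-exec requests.
function lookupCache(cache, request) {
  const { type, cache: useCache = false } = request; // cache defaults to false
  if (type === 'exec' || !useCache) return undefined; // exec requests skip the cache
  return cache.get(String(request.query));
}
```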

Member:

Good, sounds like younger me did think it through :)

@jheer (Member) commented Aug 22, 2024

One last question: did you try testing this against all of our current backends? (WASM, Node.js, Python, Rust)

One of the reasons I originally enforced serialized queries was Node.js DuckDB failures. I believe it was due to an interaction at the time with the Arrow extension.

@domoritz (Member Author):

I tested (at least casually) the flights 200k and 10m examples on

  • Node server (socket and REST)
  • Python server (socket and REST)
  • WASM

There is a problem when I brush on one chart and move the cursor over another chart: Mosaic then tries to run the query that builds the index and a query against the index at the same time (and the latter fails since the index does not exist yet).

I guess this problem (index not ready yet) can also happen if the index creation is slow and you start interacting before the index exists. I think the solution here is to always prioritize exec queries and not send any queries until the exec is done. I'll implement that unless you have another suggestion. We may still want to prevent activation of a chart while the mouse is down. At some point it might be nice to have an actual dependency model for queries, but that's a lot more invasive.
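
A minimal sketch of the "block on exec" idea (hypothetical names, not the code added in this PR): exec requests run one at a time, and regular queries wait behind any pending exec before going out in parallel.

```js
// Hypothetical scheduler sketch: queries wait behind pending exec requests.
class RequestQueue {
  constructor(send) {
    this.send = send;                  // async function that submits one request
    this.lastExec = Promise.resolve(); // tail of the exec chain
  }

  submit(request) {
    // Swallow errors from earlier exec calls so they only fail their own caller.
    const barrier = this.lastExec.catch(() => {});
    if (request.type === 'exec') {
      // Exec requests (e.g. index creation) run strictly one after another.
      const done = barrier.then(() => this.send(request));
      this.lastExec = done;
      return done;
    }
    // Regular queries wait for pending exec work, then are sent immediately,
    // so they still run in parallel with each other.
    return barrier.then(() => this.send(request));
  }
}
```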

@jheer (Member) commented Aug 22, 2024

> I guess this problem (index not ready yet) can also happen if the index creation is slow and you start interacting before the index exists. I think the solution here is to always prioritize exec queries and not send any queries until the exec is done. I'll implement that unless you have another suggestion. We may still want to prevent activation of a chart while the mouse is down. At some point it might be nice to have an actual dependency model for queries, but that's a lot more invasive.

I think blocking on exec queries probably makes sense. There are other specs with derived data sets on load that also assume serialization -- though in some cases multiple exec queries may be issued as a single DB request, I don't think this is always the case.

I agree that a full dependency model is a bigger lift. That said, checking (and if necessary, blocking) on data cube table construction within the data cube indexer might not be so bad as a stop-gap.

Either way, we should make sure we have accompanying unit tests for these cases.

@domoritz (Member Author):

Added logic to prevent multiple parallel exec requests and also added a unit test.
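
Not the actual unit test, but a sketch of the property such a test can check, reusing the hypothetical RequestQueue sketch from above: concurrently submitted exec requests never overlap.

```js
import assert from 'node:assert';

// Sketch only: relies on the hypothetical RequestQueue defined in the earlier sketch.
async function execRequestsAreSerialized() {
  let inFlight = 0;
  let maxInFlight = 0;

  const send = async () => {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise(resolve => setTimeout(resolve, 10)); // simulate backend latency
    inFlight -= 1;
  };

  const queue = new RequestQueue(send);
  await Promise.all([
    queue.submit({ type: 'exec', sql: 'CREATE TABLE a AS SELECT 1' }),
    queue.submit({ type: 'exec', sql: 'CREATE TABLE b AS SELECT 2' })
  ]);

  assert.strictEqual(maxInFlight, 1); // the two exec requests never ran concurrently
}
```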

@jheer (Member) left a comment

Overall looks good to me, pending my review comments. But please do not merge until after #519 lands and I've cut a v0.11.0 release. I plan to cut a release without parallel queries and then merge this PR for the subsequent release, in part to permit more careful testing.

In addition, we need to check all of the database connectors. I think rest and wasm should be fine. I didn't re-check the Jupyter widget but I think it will be ok for parallel use, too. However, the socket connector includes its own blocking, forcing sequential query execution. We need to revisit that.

Review thread (outdated, resolved): packages/core/src/QueryManager.js
this._logger.debug('Cache');
result.fulfill(cached);
result.ready(type === 'exec' ? null : data);
Member:

This line looks harmless, but raises a question: do we ever see exec calls hit the cache? I don't think we should... they should be issued with cache: false | undefined. And if we don't, shouldn't the cache result resolve to undefined anyway? So I don't think we need this conditional.

Member Author:

Right, we don't cache exec requests but actually could benefit from it. Below, we add the promise for pending requests to the cache so we don't submit the same request twice.

This makes me think that we maybe want to have a separate map of pending requests that we put requests in no matter whether their results should be cached or not.
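
A hedged sketch of that idea: a separate map that coalesces identical in-flight requests, independent of whether the final result goes into the cache. Names are hypothetical.

```js
// Hypothetical sketch: dedupe identical in-flight requests via a pending map.
const pending = new Map();

function submitOnce(key, send, cache) {
  if (pending.has(key)) return pending.get(key); // reuse the in-flight request

  const promise = send()
    .then(result => {
      if (cache) cache.set(key, result);   // only cacheable results persist
      return result;
    })
    .finally(() => pending.delete(key));   // drop from the pending map either way

  pending.set(key, promise);
  return promise;
}
```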

Member:

I don’t think it is safe to cache exec queries, as they can modify DB state. For example if I add a table, drop it, then add it again (which could maybe happen for a data cube index table), I don’t want the last exec call ignored due to caching.

Member Author:

Good example. A common request such as creating an index would be nice to cache, though. I guess for now we can leave exec caching off and hopefully improve parallelization once we have full query dependency tracking.

this._logger.debug(`Request: ${(performance.now() - t0).toFixed(1)}`);
result.fulfill(data);
result.ready(type === 'exec' ? null : data);
Member:

Same comment as above.

Member Author:

We need this here so that we don't err in QueryResult.fulfill.

@@ -42,7 +42,7 @@ async fn handle_post(

 pub fn app() -> Result<Router> {
     // Database and state setup
-    let db = ConnectionPool::new(":memory:", 16)?;
+    let db = ConnectionPool::new(":memory:", 1)?; // TODO: we can only use one connection since temp tables are scoped per connection
Member:

Once #519 lands, data cube indexes will no longer be temp tables. But other Mosaic exec calls may make temp tables, so I don't think we can expand the connection pool...

Member Author:

Yes, will leave this at 1 for now.

What are your thoughts on not using temp tables at all anymore? We could change all the examples to write tables into a separate namespace so they are still easy to clean up.
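
For illustration only, here is what the namespace idea could look like when issued through a connector's exec function; the schema name and helper functions are hypothetical, not part of Mosaic.

```js
// Hypothetical sketch: regular tables in a dedicated schema instead of temp tables.
const SCHEMA = 'mosaic_scratch'; // illustrative name

async function createScratchTable(exec, name, selectQuery) {
  await exec(`CREATE SCHEMA IF NOT EXISTS ${SCHEMA}`);
  await exec(`CREATE OR REPLACE TABLE ${SCHEMA}.${name} AS ${selectQuery}`);
}

async function cleanupScratchTables(exec) {
  // Dropping the schema removes all scratch tables at once.
  await exec(`DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE`);
}
```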

@domoritz (Member Author):

Updated the code to address your comments.

> However, the socket connector includes its own blocking, forcing sequential query execution. We need to revisit that.

I tested with the ws connector and it works fine. We are unnecessarily trying to parallelize queries that are then executed in series by the connector, yes. I could add a flag to not parallelize in the query manager but that introduces an extra code path so I'd be in favor of merging as is.

@jheer (Member) commented Sep 29, 2024

> I tested with the ws connector and it works fine. We are unnecessarily trying to parallelize queries that are then executed in series by the connector, yes. I could add a flag to not parallelize in the query manager but that introduces an extra code path so I'd be in favor of merging as is.

I agree that we don't need another flag. My point was just to note that we need to update the socket connector to actually get benefits of parallel queries there.

@domoritz (Member Author):

Yes, that would be nice. I don't see an obvious way to do that right now, though. When we get a result on the socket, we just send binary, so we can't easily change the order of responses. We would need to add a control signal that is sent before a result in order to return query results out of order. Should we do this in this pull request or in a follow-up (I can file an issue for it)?
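
One possible shape for such a control signal (a hypothetical message format and socket setup, not a proposal from this PR): a small JSON frame announcing the query id, followed by the binary Arrow payload for that query.

```js
// Hypothetical framing sketch for out-of-order socket responses.
const socket = new WebSocket('ws://localhost:3000/'); // illustrative endpoint
socket.binaryType = 'arraybuffer';

const requests = new Map(); // queryId -> { resolve, reject }
let nextId = 0;
let expecting = null;       // queryId announced by the last control frame

function send(query) {
  const queryId = ++nextId;
  socket.send(JSON.stringify({ queryId, ...query }));
  return new Promise((resolve, reject) => requests.set(queryId, { resolve, reject }));
}

socket.onmessage = event => {
  if (typeof event.data === 'string') {
    const msg = JSON.parse(event.data);
    if (msg.error) {
      requests.get(msg.queryId)?.reject(msg.error); // error frames carry the id directly
      requests.delete(msg.queryId);
    } else {
      expecting = msg.queryId; // control frame: the next binary frame answers this query
    }
  } else {
    requests.get(expecting)?.resolve(event.data); // binary frame: the Arrow payload
    requests.delete(expecting);
    expecting = null;
  }
};
```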

@domoritz (Member Author) commented Oct 3, 2024

I filed #547 as a follow-up.

@domoritz enabled auto-merge (squash) on October 3, 2024 at 02:31
@domoritz merged commit 425fba3 into main on Oct 3, 2024
3 checks passed
@domoritz deleted the dom/parallel branch on October 3, 2024 at 02:32