Feature: improve duckdbDataSource performance #278

Merged
merged 10 commits into from
Aug 17, 2023

Conversation

@onlyjackfrost onlyjackfrost commented Aug 14, 2023

Description

  • Add performance analysis tool
  • Modify Duckdb data source execute method:
    • Parallelize execution and use connection.all instead of nextChunk

Note:
With the same data, stream.nextChunk takes 1.5 to 2 times as long as connection.all.

codecov-commenter commented Aug 16, 2023

Codecov Report

Patch coverage: 84.84% and project coverage change: -0.13% ⚠️

Comparison is base (232f449) 90.21% compared to head (b4980f9) 90.08%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #278      +/-   ##
===========================================
- Coverage    90.21%   90.08%   -0.13%     
===========================================
  Files          350      351       +1     
  Lines         5967     6044      +77     
  Branches       803      816      +13     
===========================================
+ Hits          5383     5445      +62     
- Misses         423      437      +14     
- Partials       161      162       +1     
Flag Coverage Δ
build 89.57% <ø> (ø)
core 93.93% <83.01%> (-0.26%) ⬇️
extension-authenticator-canner 78.37% <ø> (ø)
extension-dbt 97.43% <ø> (ø)
extension-debug-tools 98.11% <ø> (ø)
extension-driver-bq 84.93% <ø> (ø)
extension-driver-canner 84.65% <ø> (ø)
extension-driver-clickhouse 88.09% <ø> (ø)
extension-driver-duckdb 91.66% <86.95%> (-1.29%) ⬇️
extension-driver-ksqldb 89.49% <ø> (-0.85%) ⬇️
extension-driver-pg 96.11% <ø> (ø)
extension-driver-snowflake 96.26% <ø> (ø)
extension-huggingface 86.25% <ø> (ø)
extension-store-canner 97.54% <ø> (ø)
integration-testing 90.27% <ø> (ø)
serve 87.17% <ø> (ø)
test-utility ∅ <ø> (∅)

Files Changed Coverage Δ
packages/core/src/lib/utils/analyzer.ts 82.69% <82.69%> (ø)
...xtension-driver-duckdb/src/lib/duckdbDataSource.ts 93.02% <84.61%> (-2.39%) ⬇️
packages/core/src/lib/utils/index.ts 100.00% <100.00%> (ø)
...ages/extension-driver-duckdb/src/lib/sqlBuilder.ts 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

@onlyjackfrost onlyjackfrost changed the title from "[DRAFT] Chore: performance analysis" to "Feature: improve duckdbDataSource performance" Aug 16, 2023
@kokokuo kokokuo left a comment

Aside from a few suggestions, the rest LGTM.

// write to txt file
public static writePerformanceReport() {
const filePath = path.join('./performanceRecord.txt');
// print current date, time as humun readable format

typo: human

@@ -99,49 +99,74 @@ export class DuckDBDataSource extends DataSource<any, DuckDBOptions> {
}
const { db, configurationParameters, ...options } =
this.dbMapping.get(profileName)!;
- const builtSQL = buildSQL(sql, operations);
+ const [builtSQL, streamSQL] = buildSQL(sql, operations);

Suggest adding a comment explaining why the SQL is split into two statements.


- const result = await statement.stream(...parameters);
- const firstChunk = await result.nextChunk();
+ const [result, asyncIterable] = await Promise.all([

Suggest adding a comment on why you split the SQL in two and run one with .all and the other with stream, so team members know why you don't call .all alone and convert the result to a stream, or use .stream alone.


- const result = await statement.stream(...parameters);
- const firstChunk = await result.nextChunk();
+ const [result, asyncIterable] = await Promise.all([

Could you rename asyncIterable? The current name isn't self-explanatory.

}
}),
]);
const asyncIterableStream = new Readable({

Could you extract this into a method so the execute logic isn't too long to read, e.g. convertToStream(result, iterable)?
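
A minimal sketch of what such a helper could look like, assuming the eagerly fetched rows are an array and the remaining rows arrive as an async iterable. The parameter names `firstRows` and `restRows` are hypothetical, and the sketch swaps in Node's `Readable.from` rather than the hand-built `new Readable({...})` shown in the diff:

```typescript
import { Readable } from 'stream';

// Hypothetical helper: emits the rows that were fetched eagerly first,
// then drains the remaining rows from an async iterable.
function convertToStream<T>(
  firstRows: T[],
  restRows: AsyncIterable<T>
): Readable {
  async function* allRows(): AsyncGenerator<T> {
    for (const row of firstRows) yield row;
    for await (const row of restRows) yield row;
  }
  // Readable.from creates an object-mode stream and respects backpressure.
  return Readable.from(allRows());
}
```

With a helper like this, the call site in execute would shrink to something like `convertToStream(result, asyncIterable)`, and the ordering guarantee (eager rows before streamed rows) lives in one place.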

Comment on lines 274 to 311
// set duckdb thread to number
private async setThread(db: duckdb.Database) {
  const thread = process.env['THREADS'];

  if (!thread) return;
  await new Promise((resolve, reject) => {
    db.run(`SET threads=${Number(thread)}`, (err: any) => {
      if (err) reject(err);
      this.logger.debug(`Set thread to ${thread}`);
      resolve(true);
    });
  });
}

After our discussion, if you would still like to keep this method, please add a comment describing its purpose and document its usage in the README.
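
For illustration, a documented variant of the method might look like the sketch below. It is written as a standalone function, and the `RunnableDb` interface, the function name `setThreads`, and the input validation are assumptions added here rather than part of the PR. It also returns right after `reject`, so the success path (where the snippet above writes its debug log) does not run when the statement fails:

```typescript
// Minimal shape of the callback-style `run` call used in the snippet above.
// Declared locally (an assumption) so the sketch does not depend on the duckdb package.
interface RunnableDb {
  run(sql: string, callback: (err: Error | null) => void): void;
}

/**
 * Applies the THREADS environment variable as DuckDB's `threads` setting.
 * Does nothing when the value is empty; throws when it is not a positive integer.
 */
async function setThreads(
  db: RunnableDb,
  raw: string | undefined = process.env['THREADS']
): Promise<number | undefined> {
  if (!raw) return undefined;
  const threads = Number(raw);
  if (!Number.isInteger(threads) || threads < 1) {
    throw new Error(`THREADS must be a positive integer, got "${raw}"`);
  }
  await new Promise<void>((resolve, reject) => {
    db.run(`SET threads=${threads}`, (err) => {
      // Return after rejecting so a failure never reaches the success path.
      if (err) return reject(err);
      resolve();
    });
  });
  return threads;
}
```

Validating the value before building the statement means a malformed THREADS setting fails with a readable message instead of sending `SET threads=NaN` to the database.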

@@ -26,15 +26,23 @@ export const isNoOP = (
return true;
};

const duckdbExecuteChunkSize =
process.env['DUCKDB_EXECUTE_CHUNK_SIZE'] || '2000';

Suggest documenting this env variable in the README, including the effect of setting it to a large or a small value.
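
As a possible companion to the README note, the sketch below parses the variable defensively. The helper name `resolveChunkSize` and the fallback-on-invalid-input behavior are assumptions; the diff itself only shows the string default `'2000'`:

```typescript
const DEFAULT_CHUNK_SIZE = 2000;

// Hypothetical helper: reads DUCKDB_EXECUTE_CHUNK_SIZE and falls back to the
// default when the variable is unset or is not a positive integer.
function resolveChunkSize(
  raw: string | undefined = process.env['DUCKDB_EXECUTE_CHUNK_SIZE']
): number {
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_CHUNK_SIZE;
}
```

Returning a number rather than a string keeps the value from being interpolated into SQL unchecked; what large or small values do to latency and memory is still for the README to spell out.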

@kokokuo kokokuo left a comment

LGTM 👍 Thanks for fixing.

@kokokuo kokokuo merged commit 72b1b7f into develop Aug 17, 2023
@kokokuo kokokuo deleted the chore/performance-analysis branch August 17, 2023 06:32