feat(core): Dedupe #10101

Merged
merged 61 commits into from
Oct 10, 2024

Conversation

ShireenMissi
Contributor

@ShireenMissi ShireenMissi commented Jul 18, 2024

Summary

Building on the work done in #5178, this PR creates a new version of the Remove Duplicates node with the ability to remove duplicates across executions. It supports three logic types:

  • "Remove Items With Already Seen Key Values": Dedupe based on one or more keys; the whole value is hashed and stored.
  • "Remove Items Up to Stored Incremental Key": Dedupe based on an incremental key; only the highest value processed previously is stored.
  • "Remove Items Up to Stored Date": Dedupe based on a date key; only the latest date processed is stored.

Deduplication happens in two contexts: Node and Workflow.
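
As a rough illustration of the three logic types above, here is a minimal in-memory sketch. It is not the code from this PR: the type names `DeduplicationMode` and `DeduplicationItemTypes` and the MD5 hashing call appear in the review below, but `dedupeSketch`, `StoredValue`, `toComparable`, and the way state is passed between calls are assumptions made for the example.

```typescript
import { createHash } from 'node:crypto';

// Type names mirror those quoted in the review; everything else is hypothetical.
type DeduplicationMode = 'entries' | 'latestIncrementalKey' | 'latestDate';
type DeduplicationItemTypes = string | number;

// Hypothetical stored state: a list of hashes for 'entries',
// or a single numeric watermark for the two "latest" modes.
type StoredValue = string[] | number;

function hashValue(value: DeduplicationItemTypes): string {
  return createHash('md5').update(value.toString()).digest('base64');
}

function toComparable(mode: DeduplicationMode, value: DeduplicationItemTypes): number {
  // Dates are compared as epoch milliseconds, incremental keys as plain numbers.
  return mode === 'latestDate' ? new Date(value).getTime() : Number(value);
}

function dedupeSketch(
  items: DeduplicationItemTypes[],
  mode: DeduplicationMode,
  stored?: StoredValue,
): { fresh: DeduplicationItemTypes[]; stored: StoredValue } {
  if (mode === 'entries') {
    // Keep every hash ever seen; an item is new only if its hash is unknown.
    const seen = new Set<string>(Array.isArray(stored) ? stored : []);
    const fresh = items.filter((item) => {
      const hash = hashValue(item);
      if (seen.has(hash)) return false;
      seen.add(hash);
      return true;
    });
    return { fresh, stored: Array.from(seen) };
  }

  // "Latest" modes keep one watermark; only items beyond it count as new.
  const watermark = typeof stored === 'number' ? stored : -Infinity;
  const fresh = items.filter((item) => toComparable(mode, item) > watermark);
  const latest = Math.max(watermark, ...items.map((item) => toComparable(mode, item)));
  return { fresh, stored: latest };
}
```

Passing the `stored` value returned by one call into the next call stands in for persistence across executions. Scoping that state per node or per workflow is left out of the sketch.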

Related Linear tickets, Github issues, and Community forum posts

https://linear.app/n8n/issue/NODE-1487/dedupe-p0

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with release/backport (if the PR is an urgent fix that needs to be backported)

@ShireenMissi ShireenMissi requested a review from a team as a code owner July 18, 2024 16:40
@ShireenMissi ShireenMissi marked this pull request as draft July 18, 2024 16:40
@n8n-assistant n8n-assistant bot added the core (Enhancement outside /nodes-base and /editor-ui), n8n team (Authored by the n8n team), and node/new (Creation of an entirely new node) labels Jul 18, 2024
@ShireenMissi ShireenMissi force-pushed the node-1487 branch 10 times, most recently from fb0a47d to 509ebed Compare July 24, 2024 16:17
@ShireenMissi ShireenMissi force-pushed the node-1487 branch 6 times, most recently from 97f5711 to ca02cf9 Compare September 13, 2024 15:30
@ShireenMissi ShireenMissi marked this pull request as ready for review September 15, 2024 18:09
@ShireenMissi ShireenMissi removed the node/new (Creation of an entirely new node) label Sep 15, 2024
@ShireenMissi ShireenMissi force-pushed the node-1487 branch 3 times, most recently from 6b1d5b1 to 0ea7889 Compare September 19, 2024 07:51
@ShireenMissi ShireenMissi changed the title from "feat(core): Check data processed" to "feat(core): Dedupe" Oct 9, 2024
tomi
tomi previously approved these changes Oct 9, 2024
Collaborator

@tomi tomi left a comment

Thank you for addressing all the comments 💟 Couple comments mainly about code readability that would be nice to address and a question about the sorting. But otherwise LGTM 🚀

items: DeduplicationItemTypes[],
mode: DeduplicationMode,
): DeduplicationItemTypes[] {
return items.slice().sort((a, b) => (DeduplicationHelper.compareValues(mode, a, b) ? 1 : -1));
Collaborator

The sorting is not currently stable (i.e. equal items don't have guaranteed order). Does it need to be? If it does, we should return 0 when the items are equal

Contributor Author

it took me a while to get :) and it was actually an issue, so I fixed it
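
For reference, a comparator that handles the equal case the reviewer describes could look like the sketch below. The actual fix is not shown in this conversation, so this is only one possible shape: `sortEntriesSketch`, `comparableValue`, `SortableItem`, and `LatestMode` are hypothetical names, and the sketch assumes values can be mapped to numbers (epoch milliseconds for dates).

```typescript
// Hypothetical names throughout; only the two mode strings come from the PR.
type SortableItem = string | number;
type LatestMode = 'latestIncrementalKey' | 'latestDate';

function comparableValue(mode: LatestMode, value: SortableItem): number {
  // Dates become epoch milliseconds, incremental keys become plain numbers.
  return mode === 'latestDate' ? new Date(value).getTime() : Number(value);
}

function sortEntriesSketch(items: SortableItem[], mode: LatestMode): SortableItem[] {
  // slice() keeps the caller's array untouched, as in the quoted snippet.
  return items.slice().sort((a, b) => {
    const left = comparableValue(mode, a);
    const right = comparableValue(mode, b);
    if (left === right) return 0; // equal items keep their input order
    return left > right ? 1 : -1;
  });
}
```

Returning 0 for equal items gives the engine a consistent comparator; since `Array.prototype.sort` is specified as stable from ES2019 onward, ties then preserve their original relative order.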

return createHash('md5').update(value.toString()).digest('base64');
}

private async fetchProcessedData(
Collaborator

Personally fetch sounds like it's making an http request (because of fetch api). Maybe we could rename this queryProcessedData or selectProcessedData?

Contributor Author

I changed it to findProcessedData to convey that it is a wrapper for findOne

Comment on lines +769 to +771
export type DeduplicationScope = 'node' | 'workflow';
export type DeduplicationItemTypes = string | number;
export type DeduplicationMode = 'entries' | 'latestIncrementalKey' | 'latestDate';
Collaborator

Thank you, more descriptive now 👌

this.validateMode(processedData, options);

if (['latestIncrementalKey', 'latestDate'].includes(options.mode)) {
const incomingItems = DeduplicationHelper.sortEntries(items, options.mode);
Collaborator

This method is quite long and seems to consist of two branches. WDYT about moving the branches to their own methods? I.e. something like this:

if (options.mode === 'latestIncrementalKey' || options.mode === 'latestDate') {
  return await this.dedupeAndRecordByLatest(...)
} else {
  return await this.dedupeAndRecordByEntries(...)
}

cypress bot commented Oct 9, 2024

n8n    Run #7316

Run properties:
  Project: n8n
  Run status: Passed (#7316)
  Run duration: 54m 04s
  Commit: c29aecbc15 (browsers:node18.12.0-chrome107, ShireenMissi, e2e/*)
  Committer: Shireen Missi

Test results:
  Failures: 0
  Flaky: 0
  Pending (tests annotated with .skip): 0
  Skipped (failure in a mocha hook): 0
  Passing: 438

Contributor

github-actions bot commented Oct 9, 2024

✅ All Cypress E2E specs passed

Co-authored-by: Tomi Turtiainen <10324676+tomi@users.noreply.github.com>
tomi
tomi previously approved these changes Oct 10, 2024
Collaborator

@tomi tomi left a comment

LGTM 🚀

packages/cli/src/deduplication/deduplication-helper.ts (outdated, resolved)
Contributor

⚠️ Some Cypress E2E specs are failing, please fix them before merging

Co-authored-by: Tomi Turtiainen <10324676+tomi@users.noreply.github.com>
Contributor

⚠️ Some Cypress E2E specs are failing, please fix them before merging

Contributor

✅ All Cypress E2E specs passed

@ShireenMissi ShireenMissi merged commit 52dd2c7 into master Oct 10, 2024
36 of 39 checks passed
@ShireenMissi ShireenMissi deleted the node-1487 branch October 10, 2024 15:12
@github-actions github-actions bot mentioned this pull request Oct 16, 2024
@janober
Member

janober commented Oct 16, 2024

Got released with n8n@1.64.0

Labels
core (Enhancement outside /nodes-base and /editor-ui), n8n team (Authored by the n8n team), Released

5 participants