Looking forward to a multi-thread RemoveDuplicateLine Functions! #5

Open
henryjj99 opened this issue Dec 11, 2018 · 6 comments
Labels
enhancement New feature or request

Comments

@henryjj99

No description provided.

@henryjj99
Author

In my recent project there are around 300,000 lines to run through 'RemoveDuplicateLine', but the single-threaded function in Kangaroo2 is super slow. I hope I can run it multi-threaded some day!

@henryjj99 henryjj99 changed the title Looking forward for a multi-thread RemoveDuplicateLine Functions Looking forward for a multi-thread RemoveDuplicateLine Functions! Dec 11, 2018
@henryjj99 henryjj99 changed the title Looking forward for a multi-thread RemoveDuplicateLine Functions! Looking forward to a multi-thread RemoveDuplicateLine Functions! Dec 11, 2018
@dcascaval
Owner

That seems doable - I'll see if I can prototype it soon, and will let you know. Thanks for the request!

@henryjj99
Author

Really appreciate it! Thanks

@dcascaval
Owner

dcascaval commented Jan 15, 2019

Hi Henry -

I've attached a prototype as part of a new Impala GHA, along with an example/test file to go with it. The component is called 'ParRemDupLns'. It seems to perform relatively well: a 'random' test processes 300,000 lines in under a second on my machine, though results may vary depending on the system. In any case, it should be better tuned than the sequential version.

A couple of things to note: it reorders the lines coming through (similarly to other RemoveDuplicateLine implementations). A smaller (i.e., more precise) tolerance will result in faster computation. The 'granularity' parameter only controls how the work is batched into parallel portions, so it should ONLY affect runtime, not the computed result. The optimal value depends on the system, so feel free to adjust it to whichever works fastest on your machine (as a rule of thumb, I've found values of 500-1000 to be decent).
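To make the roles of 'tolerance' and 'granularity' concrete, here is a minimal Python sketch of one way such a batched duplicate-line filter could be structured. It is not the ParRemDupLns implementation: the function names, the grid-quantization approach, and the default values are all assumptions chosen for illustration. Lines are plain tuples of two 3D endpoints, and two lines count as duplicates when their endpoints fall into the same tolerance-sized grid cells, regardless of direction. Unlike the component described above, this sketch happens to preserve input order.

```python
from concurrent.futures import ThreadPoolExecutor


def line_key(line, tolerance):
    """Hypothetical helper: quantize both endpoints to a tolerance-sized grid.

    The two quantized endpoints are sorted so that a line and its reverse
    produce the same key. Grid quantization is an approximation: two points
    closer than the tolerance can still land in neighbouring cells.
    """
    a, b = (tuple(round(c / tolerance) for c in pt) for pt in line)
    return (a, b) if a <= b else (b, a)


def remove_duplicate_lines(lines, tolerance=0.01, granularity=500):
    """Return the first occurrence of each line, compared within tolerance.

    granularity is the number of lines per batch. It only changes how the
    key computation is split up, never which lines are kept.
    """
    batches = [lines[i:i + granularity] for i in range(0, len(lines), granularity)]
    seen = {}
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in batch order, so the merge is deterministic.
        keyed = pool.map(lambda batch: [line_key(l, tolerance) for l in batch], batches)
        for batch, keys in zip(batches, keyed):
            for line, key in zip(batch, keys):
                seen.setdefault(key, line)  # keep only the first line per key
    return list(seen.values())


if __name__ == "__main__":
    sample = [
        ((0, 0, 0), (1, 0, 0)),
        ((1, 0, 0), (0, 0, 0)),      # same line, reversed
        ((0, 0, 0), (1.001, 0, 0)),  # same line within tolerance
        ((0, 0, 0), (0, 2, 0)),      # distinct line
    ]
    print(remove_duplicate_lines(sample, tolerance=0.01, granularity=2))
```

Because the merge step walks the batches in their original order, changing the batch size leaves the output unchanged, which mirrors the "runtime only" behaviour described for 'granularity'. Note that CPython threads will not actually speed up this CPU-bound work; the sketch only shows the batching structure.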

Additionally, it behaves (slightly) differently from the original version in cases where there are many (non-duplicate) lines of very similar length, in that it may cull a different set of lines. The result should still be usable.

Just replace your current Impala GHA (or download from Food4Rhino - you'll still need the other .dll dependencies) with this one to get it rolling. If you could let me know how it works for you, I'd really appreciate any feedback!

remove_duplicates.zip

@dcascaval dcascaval added the enhancement New feature or request label Jan 15, 2019
@henryjj99
Author

Hi there:

Thank you for your work! Fantastic! It works, and it runs really fast in my case. A screenshot is attached below. I am using a New Surface Pro (i5-7300U, 2 cores, 4 threads).

Best
[Screenshot: capture]

@dcascaval
Owner

Glad to hear it! I'll refine and test it a bit more and likely add it to the next version of Impala. Thanks for the suggestion!
