Allow "assume sorted" option #5

samuell · 2016-08-16T08:44:51Z

We could use a slightly different processing algorithm if we can assume the data is sorted. Sorting should be much more efficient with a pure text sorting tool anyway, for n-triples files. With sorted input, the algorithm will require far less memory and will probably be faster.

thiviyanT · 2017-12-11T12:32:59Z

By default, does rdf2smw (pre-release 0.6 version) have this option set to true? If so, it would explain why my triples were not imported as I expected.

Before running the rdf2smw script, one could always sort the data using the Unix sort command.
I achieved this with the command below, and the process was astonishingly fast, a matter of seconds, on more than 95K triples.

cat triples.nt | sort -k2,2 -k1,1 > sorted.triples.nt

@samuell Would it be possible to incorporate a similar unix command into rdf2smw? Do you think doing so would dramatically impact the performance of the code?

samuell · 2017-12-11T16:25:14Z

Thanks for the interesting suggestion @ThiviyanThanapalasingam !

I think including the unix sort command would make the software drastically more complex (because of interfacing between Go and C-code), and harder to maintain, though.

But since the sort command is so widely available, on Linux, Mac, and now even on Windows with the Windows Subsystem for Linux (WSL), one could enable a workflow where the user first sorts the file using sort, and then runs rdf2smw. rdf2smw can implement a potentially faster algorithm if it can assume sorted input, or at least one that uses a lot less memory. For example, rdf2smw internally aggregates triples per subject. With sorted input, the triples for each subject are already grouped together, so it can finish each new wiki page as soon as all triples for a particular subject have been processed, instead of keeping all the triples and pages in memory until the end.

thiviyanT · 2017-12-11T20:33:27Z

I see. Thanks for the explanation @samuell. In that case, it would be a good idea to let the OS do the heavy lifting. The command (sort triples.nt -k2,2 -k1,1 > sorted.triples.nt) can be added to the README file, right under the Usage header text. Do you think doing this solves the issue once and for all?

If you are happy with the changes that I have proposed, I would like to contribute to this project by implementing them. Please let me know what the protocol is for contributing (e.g. do I work on the master branch and then send you a pull request?).

samuell · 2017-12-11T20:38:50Z

Thanks for the input @ThiviyanThanapalasingam ! I'll look at including that in the README shortly.

Regarding contributing: awesome, that is very welcome!

I think I should set up a develop branch, and keep only released code in master, for the future. So, if you start working, you could create a new develop branch in your repo, and I'll set up the develop branch here shortly.

samuell · 2017-12-11T20:40:47Z

> and have released code in master, for the future

Recommended for Go-packages, since Go lacks an official dependency manager, and most people just pull in the master branch of libraries :)

samuell added the enhancement label Aug 16, 2016

samuell mentioned this issue Dec 12, 2017

Add bash commands for sorting triples #12

Closed
