Allow "assume sorted" option #5

Open
samuell opened this issue Aug 16, 2016 · 5 comments

Comments

@samuell
Member

samuell commented Aug 16, 2016

We could use a slightly different processing algorithm if we can assume the data is sorted. For n-triples files, the sorting itself should be much more efficient with a pure text sorting tool anyway. An algorithm that relies on sorted input would require far less memory and would probably be faster.

@thiviyanT

thiviyanT commented Dec 11, 2017

By default, does rdf2smw (pre-release 0.6 version) have this option set to true? If so, it would explain why my triples were not imported as I expected.

Before running the rdf2smw script, one could always sort the data using the unix sort command.
I did this using the command below, and the process was astonishingly fast: a matter of seconds on more than 95K triples.

cat triples.nt | sort -k2,2 -k1,1 > sorted.triples.nt

@samuell Would it be possible to incorporate a similar unix command into rdf2smw? Do you think doing so would dramatically impact the performance of the code?

@samuell
Member Author

samuell commented Dec 11, 2017

Thanks for the interesting suggestion @ThiviyanThanapalasingam !

However, I think including the unix sort command would make the software drastically more complex (because of the interfacing between Go and C code) and harder to maintain.

But since the sort command is so widely available, on Linux, Mac, and now even on Windows with the Windows Subsystem for Linux (WSL), one could enable a workflow where the user first sorts the file using sort, and then runs rdf2smw. rdf2smw can implement a potentially faster algorithm if it can assume sorted input, or at least it could use a lot less memory. For example, when aggregating triples per subject, which is done internally, the subjects will already be grouped together in the input. rdf2smw can then finish each new wiki page as soon as all triples for a particular subject have been processed, instead of keeping all the triples and pages in memory until the end.

@thiviyanT

I see. Thanks for the explanation @samuell. In that case, it would be a good idea to let the OS do the heavy lifting. The command (sort triples.nt -k2,2 -k1,1 > sorted.triples.nt) can be added to the README file, right under the Usage header. Do you think this would solve the issue once and for all?

If you are happy with the changes I have proposed, I would like to contribute to this project by implementing them. Please let me know what the protocol is for contributing (i.e. do I work on the master branch and then send you a pull request?).

@samuell
Member Author

samuell commented Dec 11, 2017

Thanks for the input @ThiviyanThanapalasingam! I'll look at including that in the README shortly.

Regarding contributing: awesome, that is very welcome!

I think I should set up a develop branch, and keep released code in master, for the future. So, if you start working, you could create a new develop branch in your repo, and I'll sort out the develop branch on my side shortly.

@samuell
Member Author

samuell commented Dec 11, 2017

> and have released code in master, for the future

This is recommended for Go packages, since Go lacks an official dependency manager, and most people just pull in the master branch of libraries :)
