textreuse #20
Comments
Just a note to whoever ends up reviewing this package: I've made some performance improvements/bug fixes, but these are all on master now (with tags). |
thanks for the heads up, still seeking reviewer |
I'm used to waiting 6 to 9 months. :-) On Tue, Sep 22, 2015 at 2:53 PM, Scott Chamberlain <notifications@github.com
Lincoln Mullen, http://lincolnmullen.com |
@noamross assigned |
General Comments
👏👏👏 This package is so f*ing elegant. It's classy, really goddam classy and I barely have anything to add and kind of want to delete everything on my hard drive in response. Some things I really, really like:
I wasn't familiar with the particular methods described here, but they were easy to learn from the vignettes and I was able to apply them to a corpus of 3000 documents in no time. The package is also a testament to how much recent tooling such as
I really couldn't find anything in the code to complain about. The following is mostly a bunch of tiny feature and documentation requests. Great job, @lmullen!
Tests
Documentation
Functionality
|
(P.S., sorry this is late.) |
@noamross Thanks so much for this very thorough review. I really appreciate the level of detail of your review, which gives me a lot more confidence in the package. I'll respond point by point as necessary as I work on implementing your suggestions. But in general, I think everything that you have suggested is worth doing, and I plan on implementing them before sending the package to CRAN. In particular, I like the idea of giving the pairwise data frames a class: much more elegant. Also, I ran into the
One other thing. I've noticed a performance problem using the hash package (ugh---so many kinds of hashes in this package!). With a corpus of 71K documents, the
Thanks, Noam! |
I'm happy to review a new implementation. |
👏 |
I've implemented almost all of your suggestions, @noamross. They are now on the master branch of the repository, or you can get the most recent version tagged
The one suggestion that you made that I did not implement is somehow storing or at least validating the minhash function when adding new documents to the LSH buckets/cache. I agree in principle that this should be done, but I couldn't think of a good way to do it. I don't think it is a big issue for the first version of the package, since the package can only deal with in-memory data at this point. That is to say, the biggest corpus that I'm likely to use it on has 71K documents, and that isn't even close to using up the memory on my laptop. In those cases, it is unlikely that someone would add to the cache instead of simply doing it all in one batch. For the future, I'm thinking about how this could be extended for out-of-memory data, which the new dplyr implementation will permit. I think I'll have to figure out how to implement your suggestion then.
Thanks again for your detailed review. It has improved the package a great deal. |
Great, let me know when is a good point to do an updated review. |
I think it's ready whenever is convenient for you. |
k, I should be able to slot it in in the next week or so. |
I've added one last feature (I promise this time!). It is a local alignment function that implements a version of the Smith-Waterman function from protein alignment for use with natural language. It is one function |
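For anyone following along who hasn't met Smith-Waterman before, the idea of local alignment over words rather than amino acids can be sketched roughly as below. This is a minimal Python illustration and not the package's R implementation: the function name align_local_words and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions picked for the example.

```python
def align_local_words(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment over word tokens (illustrative sketch).

    Returns (best_score, aligned_a, aligned_b); a gap is shown as "-".
    The scoring values are hypothetical defaults, not the package's.
    """
    x, y = a.lower().split(), b.lower().split()
    rows, cols = len(x) + 1, len(y) + 1

    # score[i][j] is the best score of any local alignment ending at x[i-1], y[j-1].
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Clamping at zero is what makes the alignment local rather than global.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(x[i - 1])
            out_b.append(y[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(x[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(y[j - 1])
            j -= 1
    return best, " ".join(reversed(out_a)), " ".join(reversed(out_b))


if __name__ == "__main__":
    print(align_local_words("he said the quick brown fox jumps high",
                            "we saw the quick brown fox leap"))
    print(align_local_words("the quick brown fox",
                            "the quick red brown fox"))
```

Because the scores are clamped at zero, the surrounding unrelated words are ignored and only the best-matching passage is returned, which is the property that makes this kind of alignment useful for spotting a borrowed passage inside otherwise different documents.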
Note: Github's "compare" functionality is super helpful for doing this review. I just did
|
Thanks for doing the second thorough review, @noamross. Much appreciated! Here are the responses to your queries/notes:
In addition, I've cleared out any remaining issues in the repository. @sckott, @karthik: Anything else to be done before this package can be part of rOpenSci? |
Looks good to me. any thoughts @karthik |
LGTM, @lmullen i think you can move pkgs into ropensci already |
Okay, I've transferred the repository and added the rOpenSci footer, etc. |
🚀 thanks! |
This package detects document similarity, and implements the minhash/lsh algorithms.
https://github.com/lmullen/textreuse/
This package anticipates that the user has documents in plain text. Future versions could provide, for example, XML readers as the tm package does, but I think that probably does not belong in this package.
Detecting document similarity is a common problem when working with natural language, so I anticipate that this package will be broadly useful for anyone working in NLP.
No, there are no other R packages that implement minhash/locality-sensitive hashing. The tm package does implement some document similarity measures, but these are similarity in terms of content rather than in terms of actual borrowing of text. In other words, it would mark two documents that both talked about football as being similar, even if they had no shared text.
That said, this package extends classes from the NLP and tm packages, so it is intended to play nice with other R NLP packages.
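To make the distinction between topical similarity and actual borrowing of text concrete, here is a rough sketch of the general technique: Jaccard similarity over word n-gram shingles, plus a minhash estimate of it. It is written in Python purely for illustration and is not the package's R code or API. The function names, the shingle size of 3, the 200 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def minhash(shingle_set, num_hashes=200):
    """Minhash signature: for each seeded hash, the minimum value over the set.

    Assumes a non-empty set. Salted BLAKE2b stands in for a family of
    independent hash functions (an assumption made for this sketch).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(p == q for p, q in zip(sig_a, sig_b)) / len(sig_a)


# Two documents that share a borrowed passage.
borrowed_a = "the quick brown fox jumps over the lazy dog near the river bank"
borrowed_b = "yesterday the quick brown fox jumps over the lazy dog in the park"

# Two documents on the same topic (football) with no shared wording.
football_a = "the home team won the football match with a late goal"
football_b = "fans cheered as their football club scored twice on saturday"

if __name__ == "__main__":
    for left, right in [(borrowed_a, borrowed_b), (football_a, football_b)]:
        sa, sb = shingles(left), shingles(right)
        print(jaccard(sa, sb), estimate_jaccard(minhash(sa), minhash(sb)))
```

The borrowed pair shares seven of fifteen distinct shingles, while the two football documents share none, so a shingle-based measure separates them even though a topic-based measure would call the football pair similar. The minhash signatures have a fixed length regardless of document size, which is what makes the LSH step feasible on larger corpora.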
Yes, I comply with all those guidelines. The exception is that I have named classes, for example, TextReuseTextDocument, bowing to the precedent set by the NLP package. I don't like the name any better than you, but that's just how they do it with those packages.
Does devtools::check() produce any errors or warnings? If so paste them below.