Caching tokenization, fingerprinting and indexing details for a growing corpus #732
Hi @anilgulecha

You have an excellent suggestion, and live code analysis is something we have in the back of our minds when implementing Dolos.

On a side note: the current implementation should lean towards O(N), since we're using a hash table to query the file tokens. However, as plagiarism increases this will evolve to O(N²) behavior, and collecting the final results tends to be a somewhat slow step as well.

So yes: we do intend to support incremental analysis in the future. However, we are currently prioritizing UX/UI improvements and building a persistent server. The latter will also make incremental analysis easier to implement.

Thank you for your input. If you have a specific use case you would like us to support (for example: integration with a learning environment), please do tell us. We are currently integrating support for our own exercise platform Dodona.
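To illustrate the complexity argument above, here is a minimal sketch of a fingerprint hash index (the class and method names are assumptions for illustration, not Dolos's actual API). Querying a new file's fingerprints is a constant-time hash lookup per fingerprint, so with little overlap the work stays near O(N); but as more files share fingerprints, each lookup returns many candidate files and collecting pairwise results trends toward O(N²).

```typescript
// Hypothetical fingerprint index: maps each fingerprint to the files containing it.
type FileId = string;

class FingerprintIndex {
  private index = new Map<number, FileId[]>();

  // Register a file's fingerprints in the shared hash table.
  add(file: FileId, fingerprints: number[]): void {
    for (const fp of fingerprints) {
      const entry = this.index.get(fp);
      if (entry) entry.push(file);
      else this.index.set(fp, [file]);
    }
  }

  // Count fingerprints shared between a query file and every indexed file.
  // Cost grows with the number of files colliding on each fingerprint.
  query(fingerprints: number[]): Map<FileId, number> {
    const shared = new Map<FileId, number>();
    for (const fp of fingerprints) {
      for (const other of this.index.get(fp) ?? []) {
        shared.set(other, (shared.get(other) ?? 0) + 1);
      }
    }
    return shared;
  }
}

const idx = new FingerprintIndex();
idx.add("a.js", [1, 2, 3]);
idx.add("b.js", [3, 4, 5]);
// New file shares fingerprint 2 with a.js, and fingerprint 3 with both files.
const matches = idx.query([2, 3]);
console.log(matches.get("a.js")); // 2
console.log(matches.get("b.js")); // 1
```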
Thanks @rien. Yes, the specific use case is integrating into a learning environment. I can offer engineering help to get this done. Are you able to provide a bit of time towards the approach, code review, and merging in the changes?
Most definitely. I have sent you an email to discuss further details.
I've been thinking about this integration and I think a few things need to be discussed or experimented with:

Happy to discuss more if needed.
Also, I think those 4 actions need to be added to the library, so that both the CLI and the server can benefit from them.
Hi Folks,
I came across Dolos while searching for a code plagiarism checker, and you've implemented exactly the approach I'd consider. This is a very good library!
From what I can tell about performance, the input (via CLI or server) is a list of N files, and the checking is roughly O(N²).

For a very common use case (a growing corpus of incoming code samples for a live coding problem), this can be suboptimal. If the intermediate tokenization/fingerprint values and the index could be saved for all previous samples, then a new incoming (N+1)th sample would only need to be checked against the previous N, bringing performance to O(N) for every renewed run on the corpus.
In terms of implementation, the values could be cached in a peer file next to every source file: if there's file.js, then file.js.dolo could hold the cached values, which can be read by the CLI. Another approach could be an SQLite file in the directory. This could be enabled by a --use-cache flag. What are the Dolos team's thoughts on this?
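The peer-file idea above could be sketched as follows. Everything here is hypothetical: the `--use-cache` flag does not exist in Dolos, the cache filename and JSON layout are invented for illustration, and `computeFingerprints` is a stand-in for real tokenization plus winnowing. The cache is invalidated by comparing the source file's modification time.

```typescript
// Sketch of the proposed peer-file cache (hypothetical; not an existing Dolos feature).
// For a source "file.js", fingerprints are cached in "file.js.dolos.json" and
// reused on later runs as long as the source file is unchanged.
import * as fs from "fs";

// Stand-in for real tokenization + fingerprinting (assumed for illustration):
// a rolling hash over 4-character windows of the raw source.
function computeFingerprints(source: string): number[] {
  const fps: number[] = [];
  for (let i = 0; i + 4 <= source.length; i++) {
    let h = 0;
    for (const ch of source.slice(i, i + 4)) h = (h * 31 + ch.charCodeAt(0)) | 0;
    fps.push(h);
  }
  return fps;
}

function fingerprintsFor(path: string, useCache: boolean): number[] {
  const cachePath = path + ".dolos.json";
  const mtime = fs.statSync(path).mtimeMs;
  if (useCache && fs.existsSync(cachePath)) {
    const cached = JSON.parse(fs.readFileSync(cachePath, "utf8"));
    // Cache hit only if the source file has not been modified since caching.
    if (cached.mtime === mtime) return cached.fingerprints;
  }
  const fingerprints = computeFingerprints(fs.readFileSync(path, "utf8"));
  if (useCache) {
    fs.writeFileSync(cachePath, JSON.stringify({ mtime, fingerprints }));
  }
  return fingerprints;
}
```

With this shape, a renewed run over a corpus of N cached files only tokenizes and fingerprints the new sample; the SQLite alternative would store the same (path, mtime, fingerprints) records in one database file instead of N peer files.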