Caching tokenization, fingerprinting and indexing details for a growing corpus #732

Open
anilgulecha opened this issue May 3, 2022 · 6 comments

@anilgulecha

Hi Folks,

I came across Dolos while searching for a code plagiarism checker, and you've implemented exactly the approach I'd consider. This is a very good library!

From what I can tell about performance, it seems like the input (via CLI or server) is a list of N files, and the checking is roughly O(N²).

For a very common use case (a growing corpus of incoming code samples for a live coding problem), this can be suboptimal. If the intermediate tokenization/fingerprint values and the index are saved for all previous samples, a new incoming (N+1)th sample only needs to be checked against the previous N, bringing the cost of every renewed run on the corpus down to O(N).

In terms of implementation, the values could be cached in a peer file next to every source file: if there's a file.js, then file.js.dolo can hold the cached values, which the CLI can read. Another approach could be a SQLite file in the directory. This could be enabled by a --use-cache flag. A rough sketch of the peer-file idea follows below.
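Roughly what I have in mind, as a sketch only: the cache layout and the tokenizeAndFingerprint parameter below are placeholders for whatever the Dolos library actually exposes, not its current API.

```ts
// Sketch: reuse cached fingerprints from a peer file (file.js -> file.js.dolo)
// when the source has not changed; otherwise recompute and refresh the cache.
import { createHash } from "crypto";
import { readFile, writeFile } from "fs/promises";

interface CachedFingerprints {
  sourceHash: string;     // hash of the source the cache was computed from
  fingerprints: number[]; // winnowed fingerprint hashes
}

async function loadOrComputeFingerprints(
  path: string,
  tokenizeAndFingerprint: (source: string) => number[], // placeholder for the Dolos pipeline
): Promise<number[]> {
  const source = await readFile(path, "utf8");
  const sourceHash = createHash("sha256").update(source).digest("hex");
  const cachePath = `${path}.dolo`;

  // Reuse the peer cache file if it still matches the current source contents.
  try {
    const cached: CachedFingerprints = JSON.parse(await readFile(cachePath, "utf8"));
    if (cached.sourceHash === sourceHash) {
      return cached.fingerprints;
    }
  } catch {
    // No cache yet (or it is unreadable): fall through and recompute.
  }

  const fingerprints = tokenizeAndFingerprint(source);
  const fresh: CachedFingerprints = { sourceHash, fingerprints };
  await writeFile(cachePath, JSON.stringify(fresh));
  return fingerprints;
}
```

With --use-cache enabled, a renewed run would call this for every file and only pay the tokenization cost for files whose contents changed.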

What are the Dolos team's thoughts on this?

@rien
Member

rien commented May 4, 2022

Hi @anilgulecha

You have an excellent suggestion, and live code analysis is something we have had in the back of our minds while implementing Dolos. The Index we use stores each fingerprint together with a reference to the files matching that fingerprint (and the location(s) of the match within each file). This makes it possible, currently only in theory, to perform incremental analysis by re-using the index once a new file arrives: the tokenization and fingerprinting of the previous N files does not need to be redone because that information is already there.

On a side note: the current implementation should lean towards O(N) since we're using a hash table to query the file tokens. However, as plagiarism increases this will evolve to an O(N²) behavior and collecting the final results tends to be a somewhat slow step as well.
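As a rough illustration of how such a shared fingerprint index behaves (illustrative names only, not the actual Dolos classes):

```ts
// Sketch of a fingerprint index: a hash map from fingerprint hash to the
// places where it occurs. Adding or querying a file is one map operation per
// fingerprint, so the work stays roughly linear in the number of files as
// long as few fingerprints are shared between many files.
interface Occurrence {
  file: string;
  start: number; // token offset where the fingerprinted window starts
}

class FingerprintIndex {
  private readonly index = new Map<number, Occurrence[]>();

  addFile(file: string, fingerprints: { hash: number; start: number }[]): void {
    for (const { hash, start } of fingerprints) {
      const occurrences = this.index.get(hash) ?? [];
      occurrences.push({ file, start });
      this.index.set(hash, occurrences);
    }
  }

  // Count how many fingerprints the queried file shares with each stored file.
  // When plagiarism is widespread, each lookup returns many occurrences and
  // the behaviour drifts towards O(N²).
  similarFiles(fingerprints: { hash: number }[]): Map<string, number> {
    const shared = new Map<string, number>();
    for (const { hash } of fingerprints) {
      for (const occurrence of this.index.get(hash) ?? []) {
        shared.set(occurrence.file, (shared.get(occurrence.file) ?? 0) + 1);
      }
    }
    return shared;
  }
}
```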

So yes: we do intend to support incremental analysis in the future. However, we are currently prioritizing UX/UI improvements and building a persistent server. The latter will also make the incremental analysis easier to implement.

Thank you for your input. If you have a specific use case you would like us to support (for example: integration with a learning environment), please do tell us. We are currently integrating support for our own exercise platform Dodona.

@anilgulecha
Author

anilgulecha commented May 4, 2022

Thanks @rien. Yes, the specific use case is integrating into a learning environment. I can offer engineering help on getting this done. Are you able to provide a bit of time towards the approach, code review, and merging in the changes?

@rien
Member

rien commented May 5, 2022

Most definitely. I have sent you an email to discuss further details.

@rien
Member

rien commented May 9, 2022

I've been thinking about this integration and I think a few things need to be discussed or experimented with:

  • @anilgulecha do you want to compare with the latest submission of each user, or every submission that was handed in? The latter option will require a bit more thought about scaling the index.
  • Do you want to use the lib, the CLI or the server? Because the server will probably be rewritten and extended with more features in the future, I suggest creating a separate microservice based on the lib.
  • If you add a single file to the analysis, would you want to receive all similar files associated to that file only, or all similarities in the current set of files? The latter will require a bunch more data to be saved.
  • I am still thinking about adhering to the current storage format (CSV files), but I think SQLite might indeed be the better option if we want to avoid keeping the full index in memory. If I find the time I will try to compare these two options; a rough schema sketch follows below.
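As a rough idea of what a SQLite-backed index could look like (the better-sqlite3 dependency and the table layout below are only an assumption for illustration, not a decided format):

```ts
// Sketch: persist the fingerprint index in SQLite so a new file can be
// compared by querying this table instead of re-reading all previous files.
import Database from "better-sqlite3";

const db = new Database("dolos-index.sqlite");

db.exec(`
  CREATE TABLE IF NOT EXISTS files (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
  );
  CREATE TABLE IF NOT EXISTS fingerprints (
    hash    INTEGER NOT NULL,
    file_id INTEGER NOT NULL REFERENCES files(id),
    start   INTEGER NOT NULL,
    stop    INTEGER NOT NULL
  );
  CREATE INDEX IF NOT EXISTS fingerprints_by_hash ON fingerprints(hash);
`);
```

This keeps the full index on disk; only the rows matching the fingerprints of a newly added file have to be loaded into memory.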

@anilgulecha
Author

  1. We don't need to define it at the user level. As far as the Dolos CLI is concerned, it can define a folder or namespace, and comparisons can span a namespace. That way end users of Dolos can choose to fill the namespace per their needs.
  2. Ideally the server supports these actions: createNamespace, addFile, addFiles, getReport (a rough interface sketch follows below). Internally, every added file is tokenized, its index is saved, and the report is updated. getReport should be an O(1) operation; addFile is an O(N) operation, where N is the size of the namespace.
  3. The above structure will enable many flexible use cases without compromising on simplicity of design.
  4. I think SQLite makes sense as a store for the index.

Happy to discuss more if needed
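A rough sketch of the proposed interface (the names follow the suggestion above; nothing here exists in Dolos yet):

```ts
// Hypothetical namespace-oriented API as proposed in this thread.
interface SimilarityReport {
  // file pairs with their similarity score, most similar first
  pairs: { left: string; right: string; similarity: number }[];
}

interface NamespaceStore {
  createNamespace(name: string): Promise<void>;
  // O(N) in the namespace size: tokenize and fingerprint the new file,
  // compare it against the stored index, then add it to the index.
  addFile(namespace: string, path: string, content: string): Promise<void>;
  addFiles(namespace: string, files: { path: string; content: string }[]): Promise<void>;
  // O(1): the report is maintained incrementally as files are added.
  getReport(namespace: string): Promise<SimilarityReport>;
}
```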

@anilgulecha
Author

Also, I think those 4 actions need to be added into the library itself, so that both the CLI and the server can benefit from them.
