Usage: code API

Vincent Hellendoorn edited this page Jul 10, 2017 · 1 revision

The code is best used as a Java library. To get started, either add the JAR to your dependencies (a Maven dependency is coming soon) or download the whole project and link it to yours. Have a look at slp.core.example to see how to do all the setup for a natural-language example (NLRunner) and a Java code example (JavaRunner); it showcases quite a few of the options that you can set.

The usual process takes five steps:

  1. Set up the LexerRunner with options such as whether to add delimiters around lines or whole files, and which lexer to use (e.g. one that preserves punctuation, or one that splits on whitespace).
  2. Set up the vocabulary, either by building it beforehand with a cut-off for infrequent words or by leaving it open entirely (which turns out to work better for source code).
  3. Set up the ModelRunner with modeling options, such as whether to treat each line as a sentence (as opposed to the whole file, e.g. for Java) and what order of n-grams to use.
  4. Set up a Model, e.g. a simple n-gram model, a model with a cache, a mixture of global, local and cache models, or an automatically nested model. You can also make it dynamic so that it learns every token right after modeling it.
  5. Run your model on whatever data you have. You can call ModelRunner.model (or .predict) to model any sequence, file, or whole directory (recursively). You are not limited to modeling just once: for example, you can model every commit in a project's history and update the model with each commit right after modeling it. Because all the models are count-based, you can wrap the modeling step in a loop and alternate it with calls to ModelRunner.learn or ModelRunner.forget to keep your model up to date without retraining the whole thing.
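
To illustrate why count-based models make the loop in step 5 cheap, here is a minimal, self-contained sketch of a toy bigram model. It is not this library's implementation or API: `ToyCountModel` and its `learn`, `forget`, `probability` and `averageProbability` methods are hypothetical names introduced only for this example, and the "commits" are made-up token lists. The point is only that learning and forgetting amount to adding and subtracting counts, so a model can be kept up to date without retraining.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model for illustration; NOT part of the library described above.
public class ToyCountModel {
    // How often each token was seen as a context (i.e. followed by another token).
    private final Map<String, Integer> contextCounts = new HashMap<>();
    // How often each (previous, next) token pair was seen.
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Learning adds counts; forgetting subtracts exactly the same counts.
    public void learn(List<String> tokens) { update(tokens, +1); }
    public void forget(List<String> tokens) { update(tokens, -1); }

    private void update(List<String> tokens, int delta) {
        for (int i = 1; i < tokens.size(); i++) {
            String prev = tokens.get(i - 1);
            contextCounts.merge(prev, delta, Integer::sum);
            bigramCounts.merge(key(prev, tokens.get(i)), delta, Integer::sum);
        }
    }

    // Unsmoothed maximum-likelihood estimate of P(next | prev); 0 for unseen contexts.
    public double probability(String prev, String next) {
        int context = contextCounts.getOrDefault(prev, 0);
        if (context <= 0) return 0.0;
        return bigramCounts.getOrDefault(key(prev, next), 0) / (double) context;
    }

    // Mean bigram probability over a token sequence (0 if it has fewer than two tokens).
    public double averageProbability(List<String> tokens) {
        if (tokens.size() < 2) return 0.0;
        double sum = 0.0;
        for (int i = 1; i < tokens.size(); i++) {
            sum += probability(tokens.get(i - 1), tokens.get(i));
        }
        return sum / (tokens.size() - 1);
    }

    // Tokens are assumed not to contain tab characters.
    private static String key(String prev, String next) {
        return prev + "\t" + next;
    }

    public static void main(String[] args) {
        // Made-up "commits", each already lexed into tokens.
        List<List<String>> commits = Arrays.asList(
                Arrays.asList("int", "x", "=", "0", ";"),
                Arrays.asList("int", "y", "=", "1", ";"),
                Arrays.asList("int", "x", "=", "1", ";"));

        ToyCountModel model = new ToyCountModel();
        for (List<String> commit : commits) {
            // Model the commit first, using only what was seen before it...
            System.out.println(commit + " -> " + model.averageProbability(commit));
            // ...then update the model with it, without retraining from scratch.
            model.learn(commit);
        }
    }
}
```

In this sketch, calling `forget` with a sequence that was previously learned restores the counts to what they were before, which is what makes alternating between modeling, learning and forgetting inexpensive. A real model would add smoothing and higher-order n-grams.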