This is an official implementation of the model described in:
"Structural Language Models of Code" [PDF]
Appeared in ICML'2020.
An online demo is available at https://AnyCodeGen.org.
This repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper.
Feel free to open a new issue for any question. We always respond quickly.
- Requirements
- Download our preprocessed dataset
- Creating a new dataset
- Datasets
- Querying the trained model
- Citation
python3 -c 'import tensorflow as tf; print(tf.__version__)'
This dataset contains ~1.3M examples (1.1GB).
mkdir data
cd data
wget https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz
tar -xvzf java-small-preprocessed.tar.gz
This will create a data/java-small/ sub-directory, containing the files that hold the training, test and validation sets, a dict file for various dataset properties and histograms, and a grammar file that is used during beam search to distinguish between terminal and non-terminal nodes.
To create and preprocess a new dataset (for example, to compare SLM to a new model on another dataset):
- Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
- Run the preprocess.sh file:
bash preprocess.sh
To download the Java-small dataset as raw *.java files, use:
To download the preprocessed dataset, use:
To download the dataset in a tokenized format that can be used in seq2seq models (for example, with OpenNMT-py), use:
The following JSON files are the files that are created by the JavaExtractor. The preprocessed and the seq2seq files are created from these JSON files:
Every line is a JSON object that contains the following fields: num_targets, num_nodes, targets, is_token, target_child_id, internal_paths, relative_paths, head_paths, head_root_path, head_child_id, linearized_tree, filepath, left_context, right_context, target_seq, line.
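As a quick sanity check on an extracted file, the sketch below parses JSON-lines input and counts how many examples lack each of the fields listed above. It is only an illustration: the helper names (`summarize_examples`, `summarize_file`) are hypothetical and not part of this repository, and the value types of the fields are not assumed.

```python
import json
from collections import Counter

# Field names as listed in this README for each extracted example.
EXPECTED_FIELDS = [
    "num_targets", "num_nodes", "targets", "is_token", "target_child_id",
    "internal_paths", "relative_paths", "head_paths", "head_root_path",
    "head_child_id", "linearized_tree", "filepath", "left_context",
    "right_context", "target_seq", "line",
]


def summarize_examples(lines):
    """Return (number of examples, Counter of fields missing per example).

    `lines` is any iterable of strings, one JSON object per line.
    Blank lines are skipped.
    """
    count = 0
    missing = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        example = json.loads(raw)
        count += 1
        for field in EXPECTED_FIELDS:
            if field not in example:
                missing[field] += 1
    return count, missing


def summarize_file(path):
    """Hypothetical convenience wrapper: summarize a JSON-lines file on disk."""
    with open(path, encoding="utf-8") as handle:
        return summarize_examples(handle)
```

Calling `summarize_file` on one of the extracted JSON files (the path depends on where you saved it) returns the example count and a counter that is empty when every example carries all of the listed fields.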
The C# dataset that we used in the paper was created using the raw (*.cs files) dataset of Allamanis et al., 2018, which can be found here: https://aka.ms/iclr18-prog-graphs-dataset.
To extract examples from the C# files, we modified the data extraction code of Brockschmidt et al., 2019: https://github.com/microsoft/graph-based-code-modelling/.
To query the trained model, use the following API, where MYCODE is the given code snippet, which includes two question marks (??) to mark the "hole" that should be completed.
curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "MYCODE"}'
For example:
curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "public static Path[] stat2Paths(FileStatus[] stats) { if (stats == null) return null; Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = ??; } return ret; }"}'
curl -X POST https://63g9yqims7.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "MYCODE"}'
For example:
curl -X POST https://63g9yqims7.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "@Override public boolean retainAll(Collection<?> collection) { boolean changed = false; for (Iterator<E> iter = iterator(); iter.hasNext(); ) { Element elem = iter.next(); if (!collection.contains(elem)) { iter.remove(); ?? } } return changed;}"}'
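The same requests can be sent from Python. The following is a minimal sketch using only the standard library; it mirrors the curl commands above by POSTing a JSON body with a single "code" key. The function names are hypothetical, and because the response format is not documented here, the reply is returned as raw text rather than parsed.

```python
import json
import urllib.request

# First endpoint from the curl commands above; pass the other URL via `url`.
API_URL = "https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict"


def build_request(code, url=API_URL):
    """Build a POST request whose body is {"code": <snippet>}.

    The snippet must contain "??" to mark the hole to be completed.
    """
    if "??" not in code:
        raise ValueError('the snippet must mark the hole with "??"')
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def query_model(code, url=API_URL, timeout=30):
    """Send the snippet to the service and return the raw response text.

    Assumption: the service is reachable and replies with UTF-8 text.
    """
    request = build_request(code, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Using `json.dumps` to build the body avoids having to escape quotes inside the snippet by hand, which is easy to get wrong on the command line.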
Structural Language Models of Code
@inproceedings{alon2020structural,
title={Structural language models of code},
author={Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},
booktitle={International Conference on Machine Learning},
pages={245--256},
year={2020},
organization={PMLR}
}