We are going to crawl some webpages where we know there is text with Python code and text without code. Once we have the required data, we will use CountVectorizer() from the scikit-learn package to vectorize our sentences so that we can later use them in some machine learning models.
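As a rough illustration of that vectorization step (the exact CountVectorizer parameters used in this repo may differ), each sentence becomes a vector of token counts over the learned vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy samples: a plain-English sentence and one containing Python code.
sentences = [
    "This should not be taken as code",
    "def set_pwd(): x = raw_input('Enter the pwd')",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)  # sparse matrix of token counts

print(sorted(vectorizer.vocabulary_))    # tokens learned from the samples
print(X.toarray())                       # one count vector per sentence
```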
Python Version: 2.7.15
Install the libraries from the requirements.txt file (using a virtual environment is recommended):
pip install -r requirements.txt
scrapping_code.py and scrapping_text.py contain the modules that collect sample sentences, both with and without Python code.
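A minimal sketch of what this kind of scraping could look like; the URL and the HTML tags below are illustrative and not necessarily the ones these modules actually use:

```python
import requests
from bs4 import BeautifulSoup

def get_code_and_text(url):
    """Return (code_samples, text_samples) scraped from one page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Code usually lives in <code>/<pre> blocks, prose in <p> tags.
    code_samples = [block.get_text() for block in soup.find_all("code")]
    text_samples = [p.get_text() for p in soup.find_all("p")]
    return code_samples, text_samples

# Hypothetical page that mixes prose and Python snippets.
code, text = get_code_and_text("https://docs.python.org/2/tutorial/introduction.html")
```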
nlp_code.py uses the scrapping_code.py and scrapping_text.py modules to create the all_sentences file, which contains examples with and without Python code.
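Conceptually, the combined file pairs each sentence with a label (1 = contains code, 0 = plain text). A minimal sketch with stand-in samples in place of the real scraped ones:

```python
# code_samples / text_samples would come from scrapping_code.py and
# scrapping_text.py; these are stand-in examples.
code_samples = ["x = raw_input('Enter the pwd')", "def set_pwd(): pass"]
text_samples = ["This should not be taken as code", "Just a plain English sentence."]

sentences = code_samples + text_samples
labels = [1] * len(code_samples) + [0] * len(text_samples)

# Persist the labelled examples so the model script can reuse them.
with open("all_sentences", "w") as f:
    for label, sentence in zip(labels, sentences):
        f.write("%d\t%s\n" % (label, sentence.replace("\n", " ")))
```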
trying_classif_models.py creates a model named nlp_code_python_{}.h5, using the vectorized sentences produced by the nlp_code module.
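A rough sketch of what such a classifier could look like; the layer sizes, training settings, and file name below are illustrative rather than what trying_classif_models.py actually does:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential
from keras.layers import Dense

# Stand-in training data: label 1 = contains code, 0 = plain text.
sentences = ["x = raw_input('Enter the pwd')", "This should not be taken as code"]
labels = np.array([1, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()

model = Sequential()
model.add(Dense(16, activation="relu", input_dim=X.shape[1]))
model.add(Dense(1, activation="sigmoid"))  # binary output: code vs. no code
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(X, labels, epochs=10, verbose=0)
model.save("nlp_code_python_example.h5")   # placeholder name following the repo's pattern
```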
predict_new_sentences.py applies the trained model to new sentences.
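Applying the saved model to a new sentence amounts to loading it, vectorizing the sentence with the same vocabulary used during training, and thresholding the predicted probability. A self-contained sketch (in practice the fitted vectorizer would be persisted alongside the model rather than refitted as done here):

```python
from keras.models import load_model
from sklearn.feature_extraction.text import CountVectorizer

# Refit on the same stand-in training sentences only to keep this sketch
# self-contained; the feature columns must match those used for training.
training_sentences = ["x = raw_input('Enter the pwd')", "This should not be taken as code"]
vectorizer = CountVectorizer().fit(training_sentences)

model = load_model("nlp_code_python_example.h5")  # model saved by the previous sketch

new_sentence = "def set_pwd(): x = raw_input('Enter the pwd')"
X_new = vectorizer.transform([new_sentence]).toarray()

probability = model.predict(X_new)[0][0]
print("contains code" if probability > 0.5 else "plain text")
```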
Run:
python trying_classif_models.py
It takes a while, since it needs to crawl the webpages and then train the model.
- Use the predict_new_sentences.py file to try out the model, e.g. run:
python predict_new_sentences.py --new_sentences 'This should not be taken as code'
python predict_new_sentences.py --new_sentences '
def set_pwd():
    x = raw_input("Enter the pwd")
    y = raw_input("Confirm the pwd")
'
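The flag above is presumably handled with argparse; a minimal sketch of how it could be wired up (the exact option name and handling in predict_new_sentences.py may differ):

```python
import argparse

parser = argparse.ArgumentParser(description="Classify a sentence as code or plain text.")
parser.add_argument("--new_sentences", required=True,
                    help="Sentence (or snippet) to run through the saved model.")
args = parser.parse_args()

# args.new_sentences holds the quoted string from the command line; it would
# then be vectorized and passed to the loaded model for prediction.
print(args.new_sentences)
```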
- More samples should be added to make a more complete example.
- Addressing the current TODOs should help improve the model.
- The model in trying_classif_models.py could be improved by adding more layers.
- Use some different input and create a similar model for sql and java examples.