Etymo Workshop

You can download 243 arXiv papers, converted to text with pdfminer, from our S3 bucket (around 4MB).

You can also download the original PDFs from our S3 bucket (around 400MB).

You can download the metadata and keywords of 10,000 processed papers from this GitHub repo or by typing:

git clone git@github.com:EtymoIO/OpenData.git

Etymo background

Description of who we are and what we do.

Text extraction

The main extraction methods, the main problems with each, and which one looks best.

Main problems:

Whitespace
Equation conversions
Figures
References
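To make the whitespace problem concrete, here is a minimal sketch of a post-processing step for converter output. The function name `clean_extracted_text` and the exact rules are assumptions made for illustration, not part of the workshop code. It rejoins words hyphenated across line breaks and collapses stray whitespace while keeping paragraph breaks. It does nothing for equations, figures or references.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Tidy text produced by a PDF-to-text converter (illustrative only)."""
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks before flattening single newlines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every run of whitespace inside a paragraph to one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

One caveat: a genuine compound that happens to break at its hyphen (for example "state-" at the end of a line) would lose the hyphen, so a dictionary check may be worth adding.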

pdf2txt

pdfminer

tesseract

PyPDF2

How to download a bundle of text conversions.

Keyword extraction

Overview of main methods

RAKE
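As a rough illustration of the idea behind RAKE, the sketch below splits text into candidate phrases at stopwords and punctuation, scores each word as degree divided by frequency, and scores each phrase as the sum of its word scores. The stopword list is a tiny placeholder and the function names are assumptions; this is not the workshop's implementation.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real use needs a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
             "of", "on", "the", "to", "we", "with"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, broken at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases it occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because word scores are summed over a phrase, longer phrases tend to rank higher, which is worth keeping in mind when comparing the output with TF-IDF.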

TF-IDF
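A minimal sketch of TF-IDF keyword ranking over a small collection, using only the standard library. It uses the plain variant, term frequency times log(N / document frequency); library implementations often add smoothing or normalisation, so their scores will differ. The function names and the `top_n` parameter are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(docs, top_n=3):
    """Return, for each document, its top_n (term, score) pairs."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results
```

A term that appears in every document gets a score of zero, which is the point of the IDF factor: it suppresses words that do not distinguish one paper from another.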

Graph Theory approach
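The outline does not say which graph method is meant, so the sketch below assumes a TextRank-style approach: words become nodes, words that appear close together are linked, and PageRank (computed here by simple power iteration) ranks the nodes. The stopword list, the `window` parameter and the function names are all illustrative assumptions.

```python
import re
from collections import defaultdict

# Placeholder stopword list (an assumption).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def build_graph(text, window=2):
    """Link each word to the next `window` non-stopwords after it."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """PageRank on an undirected graph by power iteration."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

def top_keywords(text, n=5, window=2):
    """Return the n highest-ranked words."""
    rank = pagerank(build_graph(text, window))
    return sorted(rank, key=lambda w: (-rank[w], w))[:n]
```

Words that co-occur with many different neighbours end up with the highest rank, so this method favours central terms, not merely frequent ones.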

Institution extraction

University search (Heuristic 1 and 2)
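The outline does not spell out Heuristic 1 or 2, so the following is only a guess at what a very simple starting point could look like: scan the first lines of the extracted text and keep any comma-separated segment containing a word such as "University" or "Institute". The keyword list, the `max_lines` cut-off and the function name are assumptions, and the sample header is made up.

```python
import re

# Hypothetical keyword list; extend it for other languages and institution types.
KEYWORDS = ("university", "institute", "college", "laboratory")

def find_institutions(text, max_lines=30):
    """Return affiliation-like segments from the top of a paper's text."""
    found = []
    for line in text.splitlines()[:max_lines]:
        # Affiliations are often comma-separated: "Dept. of X, University of Y, City".
        for segment in re.split(r"[,;]", line):
            segment = segment.strip()
            if any(k in segment.lower() for k in KEYWORDS) and segment not in found:
                found.append(segment)
    return found

# Made-up header, for illustration only.
header = (
    "A Study of Things\n"
    "Jane Doe, John Roe\n"
    "Department of Physics, University of Example, Exampleton\n"
    "Example Institute of Technology\n"
    "Abstract\n"
    "We study things."
)
print(find_institutions(header))
```

Restricting the search to the first lines avoids picking up institutions named in the reference list, at the cost of missing affiliations given in footnotes.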

TODO: make heuristic better

Briefly explain how it works in general terms, show examples

Build up method from very simple to more complicated.

Can we use the author's name?

Deep learning

Deep learning? Object recognition

The data

How to download a bundle of PDFs. Brief look at some examples.

