Abbreviation Extraction

This folder contains the code and the outputs of the different methods used to extract abbreviations and their full forms from HTML. The dataset used for the abbreviation extraction task consists of the first 300 pages of the Climate Report. Dataset link: https://github.com/petermr/pyami/tree/main/temp/html

The code I used to extract abbreviations from the HTML files is in this notebook: https://github.com/ananyas168/petermr/blob/main/climate_abbreviation_extraction.ipynb

The idea behind the code is as follows:

Convert each HTML file to raw text using BeautifulSoup.
Extract abbreviations using an NLP tool based on the Schwartz-Hearst algorithm (https://github.com/philgooch/abbreviation-extraction/blob/develop/README.md) and using the scispacy abbreviation extractor.
Save the extracted raw text and the abbreviations in txt format.
Create a final table with abbreviation, longform, count and wiki_lookup links (see the image below for reference).
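The first two steps can be sketched as follows. This is a minimal illustration and not the notebook's code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup (where the equivalent call would be `BeautifulSoup(html, "html.parser").get_text()`), and a simplified, hand-written version of the Schwartz-Hearst matching idea instead of the linked tool or scispacy. All function names here are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Stand-in for BeautifulSoup's get_text(): strip tags, normalise spaces."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def is_valid_short_form(short):
    """Short-form filter: 2-10 chars, at most 2 words, has a letter."""
    return (
        2 <= len(short) <= 10
        and len(short.split()) <= 2
        and any(ch.isalpha() for ch in short)
        and short[0].isalnum()
    )


def best_long_form(short, candidate):
    """Match short-form characters right to left inside the candidate text.

    The first character of the short form must start a word of the long form.
    Returns None when no match is possible.
    """
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(text):
    """Find 'long form (SF)' patterns and return {short form: long form}."""
    pairs = {}
    for match in re.finditer(r"\(([^()]+)\)", text):
        short = match.group(1).strip()
        if not is_valid_short_form(short):
            continue
        words = text[: match.start()].split()
        max_words = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-max_words:]))
        if long_form and len(long_form) > len(short):
            pairs[short] = long_form
    return pairs
```

For example, `extract_pairs(html_to_text(page_html))` returns a dictionary of abbreviation and long-form pairs for one page. The sketch only handles the "long form (SF)" pattern, so the real tools will find cases it misses.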

The structure of the folder is explained below:

Subfolder Codes contains the ipynb notebook used to extract the abbreviations with the method described above.
Subfolder Pages contains the raw text extracted from the HTML and the abbreviations, in separate txt files.

Inside subfolder Pages there is one sub-subfolder per page, named page_X, where X stands for the page number. Each sub-subfolder contains three .txt files, namely:

page_X_dictionary_SH.txt --> contains the abbreviations extracted using the Schwartz-Hearst algorithm.
page_X_dictionary_Spacy.txt --> contains the abbreviations extracted using the scispacy abbreviation extractor.
page_X_text.txt --> contains the raw text extracted from the HTML by BeautifulSoup.

where X represents the page number.
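The layout above could be produced with a small helper like the one below. This is a sketch under assumptions: the function names are hypothetical, and the line format inside the dictionary files (one tab-separated abbreviation and long form per line) is a guess, since this README does not specify it.

```python
from pathlib import Path


def page_paths(pages_root, page_number):
    """Return the three output paths for one page, following the README layout."""
    stem = f"page_{page_number}"
    page_dir = Path(pages_root) / stem
    return {
        "text": page_dir / f"{stem}_text.txt",
        "schwartz_hearst": page_dir / f"{stem}_dictionary_SH.txt",
        "scispacy": page_dir / f"{stem}_dictionary_Spacy.txt",
    }


def format_pairs(pairs):
    """Assumed file format: 'abbreviation<TAB>long form', one pair per line."""
    return "".join(f"{short}\t{long_form}\n" for short, long_form in pairs.items())


def save_page(pages_root, page_number, text, sh_pairs, spacy_pairs):
    """Write the raw text and both abbreviation dictionaries for one page."""
    paths = page_paths(pages_root, page_number)
    paths["text"].parent.mkdir(parents=True, exist_ok=True)
    paths["text"].write_text(text, encoding="utf-8")
    paths["schwartz_hearst"].write_text(format_pairs(sh_pairs), encoding="utf-8")
    paths["scispacy"].write_text(format_pairs(spacy_pairs), encoding="utf-8")
    return paths
```

Calling `save_page("Pages", 7, text, sh_pairs, spacy_pairs)` would create `Pages/page_7/` with the three files listed above.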

The final output is in the parent folder as abbreviation_table.csv. This CSV file holds the output produced by running the code, in a format that is easy to read and access.
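One way such a table could be assembled from the per-page results is sketched below. The column names come from this README, but everything else is an assumption: `count` is taken to mean the number of pages on which a pair was extracted, `wiki_lookup` is taken to be a Wikipedia search URL for the long form, and the function names are hypothetical.

```python
import csv
from collections import Counter
from urllib.parse import quote_plus

FIELDS = ["abbreviation", "longform", "count", "wiki_lookup"]

# Assumed link format: a Wikipedia search for the long form.
WIKI_SEARCH = "https://en.wikipedia.org/wiki/Special:Search?search="


def build_rows(per_page_pairs):
    """Merge per-page {abbreviation: long form} dicts into table rows.

    'count' here is the number of pages a pair appears on (an assumption).
    """
    counts = Counter()
    for pairs in per_page_pairs:
        counts.update(pairs.items())
    return [
        {
            "abbreviation": short,
            "longform": long_form,
            "count": count,
            "wiki_lookup": WIKI_SEARCH + quote_plus(long_form),
        }
        for (short, long_form), count in sorted(counts.items())
    ]


def write_table(rows, handle):
    """Write the rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

To write the file, open it with `open("abbreviation_table.csv", "w", newline="", encoding="utf-8")` and pass the handle to `write_table`.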
