Abbreviation Extraction

This folder contains the code and results for the different methods used to extract abbreviations and their full forms from HTML. The dataset used for the abbreviation extraction task consists of the first 300 pages of the Climate Report. Dataset link: https://github.com/petermr/pyami/tree/main/temp/html

The code I used to extract abbreviations from the HTML files is in the following notebook: https://github.com/ananyas168/petermr/blob/main/climate_abbreviation_extraction.ipynb

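The idea behind the code is as follows: the raw text of each HTML page is extracted with BeautifulSoup, abbreviations and their full forms are extracted from that text with both the Schwartz-Hearst algorithm and the scispacy abbreviation extractor, and the results are collected in abbreviation_table.csv.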

The structure of the folder is explained below:

  • Subfolder Codes: contains the .ipynb notebook used to extract the abbreviations with the methodology described above.
  • Subfolder Pages: contains the raw text extracted from the HTML and the extracted abbreviations, in separate .txt files.

Inside the Pages subfolder there are sub-subfolders named page_X, where X is the page number. Each sub-subfolder contains three .txt files:

  • page_X_dictionary_SH.txt --> contains the abbreviations extracted using the Schwartz-Hearst algorithm.

  • page_X_dictionary_Spacy.txt --> contains the abbreviations extracted using the scispacy abbreviation extractor.

  • page_X_text.txt --> contains the raw text extracted from the HTML by BeautifulSoup.

where X represents the page number.
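To give a feel for what the Schwartz-Hearst files contain, here is a minimal, simplified sketch of the Schwartz-Hearst matching idea in plain Python. It is not the notebook's code: the function names `find_best_long_form` and `extract_abbreviations`, the regular expression, and the sample sentence are assumptions made for illustration, and the validity checks of the full algorithm are left out. The sketch looks for a short form in parentheses, then matches its characters from right to left against the words just before the parenthesis.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left inside the candidate text.

    Returns the shortest matching long form, or None if no match exists.
    """
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        current = short_form[s_index].lower()
        if not current.isalnum():
            s_index -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != current) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Start the long form at the beginning of the word that was matched last.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_abbreviations(text):
    """Return a {short form: long form} dict for 'long form (SF)' patterns."""
    pairs = {}
    # Assumed pattern: a parenthesised token of 2 to 10 characters.
    for match in re.finditer(r"\(([A-Za-z][\w-]{1,9})\)", text):
        short_form = match.group(1)
        words_before = text[: match.start()].split()
        # Search window suggested by Schwartz and Hearst.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words_before[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs[short_form] = long_form
    return pairs


sample = (
    "The Intergovernmental Panel on Climate Change (IPCC) assesses "
    "greenhouse gas (GHG) emissions."
)
print(extract_abbreviations(sample))
```

The scispacy extractor applies a similar idea inside a spaCy pipeline, which is one reason the two dictionary files for a page may differ slightly.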

The final output is in the parent folder as abbreviation_table.csv, a CSV file that presents the output of the code in a format that is easy to read and access.
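As a rough illustration of how per-page results could be collected into a single table, the sketch below writes abbreviation pairs to CSV with the standard library. The column names (`page`, `abbreviation`, `full_form`), the function name `write_abbreviation_table`, and the sample data are hypothetical; the actual layout of abbreviation_table.csv may differ.

```python
import csv
import io


def write_abbreviation_table(pages, out_file):
    """Write one CSV row per (page, abbreviation, full form).

    `pages` maps a page number to a {abbreviation: full form} dict.
    """
    writer = csv.writer(out_file)
    # Hypothetical header; the real file may use other column names.
    writer.writerow(["page", "abbreviation", "full_form"])
    for page_number in sorted(pages):
        for abbreviation, full_form in sorted(pages[page_number].items()):
            writer.writerow([page_number, abbreviation, full_form])


# Usage with an in-memory buffer; pass an open file to write to disk instead.
buffer = io.StringIO()
write_abbreviation_table(
    {1: {"IPCC": "Intergovernmental Panel on Climate Change"}}, buffer
)
print(buffer.getvalue())
```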