Fiction and Non-Fiction Text Classification

This repository implements a classifier that distinguishes fiction from non-fiction texts using linguistic features derived from part-of-speech (POS) tagging. It replicates the paper "A Simple Approach to Classify Fictional and Non-Fictional Genres" with one modification: the NLTK POS tagger replaces the tagger used in the paper. The replicated results are comparable to the original ones, which suggests that the approach is robust to the choice of tagger. The repository also explores additional POS-based features for genre classification.

Introduction

The primary goal of this project is to classify text as fiction or non-fiction based on POS-based features. Initially, the study focuses on two key features:

Adverb-to-Adjective Ratio
Adjective-to-Pronoun Ratio

Classification uses a logistic regression model. Additional POS-based features are also explored to test how well they separate the two genres. See the original paper, listed under References, for more details.

Features

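Adverb-to-Adjective Ratio: The number of adverbs in a text divided by the number of adjectives; it measures how prevalent adverbs are relative to adjectives.
Adjective-to-Pronoun Ratio: The number of adjectives in a text divided by the number of pronouns; it measures descriptive richness relative to pronoun use.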
Custom POS-Based Features: Additional features derived from linguistic analysis are being evaluated for performance improvement.
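
As an illustration of the two ratios above, here is a minimal sketch that computes them from already-tagged tokens. It assumes Penn Treebank tags, which is what nltk.pos_tag returns by default, and it assumes that tags starting with RB count as adverbs, tags starting with JJ as adjectives, and PRP and PRP$ as pronouns. The function name pos_ratio_features is hypothetical and is not the repository's extract_two_features; the exact tag groupings used in the notebook may differ.

```python
def pos_ratio_features(tagged_tokens):
    """Return (adverb/adjective, adjective/pronoun) for (token, tag) pairs.

    Assumes Penn Treebank tags. The tag groupings below are an assumption
    made for this sketch, not taken from the repository.
    """
    adverbs = sum(1 for _, tag in tagged_tokens if tag.startswith("RB"))
    adjectives = sum(1 for _, tag in tagged_tokens if tag.startswith("JJ"))
    pronouns = sum(1 for _, tag in tagged_tokens if tag.startswith("PRP"))

    # A ratio is undefined when its denominator is zero; report NaN then.
    adv_adj = adverbs / adjectives if adjectives else float("nan")
    adj_pron = adjectives / pronouns if pronouns else float("nan")
    return adv_adj, adj_pron


# Hand-tagged toy sentence, so the example runs without downloading a tagger.
example = [
    ("She", "PRP"), ("quickly", "RB"), ("opened", "VBD"), ("the", "DT"),
    ("old", "JJ"), ("door", "NN"), ("and", "CC"), ("he", "PRP"),
    ("quietly", "RB"), ("left", "VBD"),
]
adv_adj_ratio, adj_pron_ratio = pos_ratio_features(example)
```

In practice the tagged tokens would come from something like nltk.pos_tag(nltk.word_tokenize(text)), which requires the corresponding NLTK tokenizer and tagger data to be downloaded first.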

Data Sources

Brown Corpus: A collection of texts categorized into fiction and non-fiction, provided by the NLTK library.
Baby BNC (British National Corpus): Fictional and non-fictional texts, sourced from the baby_bnc.csv file in the repository.
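
The sketch below shows one way the Brown Corpus could be read from NLTK and labelled by genre. The split of Brown categories into fiction and non-fiction is an assumption made for illustration; the categories used in the paper and in the notebook may differ. The helper names label_for_category and load_brown_documents are hypothetical.

```python
# Assumed grouping of Brown categories; adjust to match the notebook.
FICTION_CATEGORIES = {"adventure", "fiction", "mystery", "romance", "science_fiction"}
NON_FICTION_CATEGORIES = {"editorial", "government", "learned", "news", "reviews"}


def label_for_category(category):
    """Map a Brown category name to a genre label, or None if unused."""
    if category in FICTION_CATEGORIES:
        return "fiction"
    if category in NON_FICTION_CATEGORIES:
        return "non-fiction"
    return None


def load_brown_documents():
    """Return a list of (words, label) pairs, one per Brown file.

    NLTK is imported here so the rest of the sketch runs without it.
    """
    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)  # fetches the corpus if it is missing
    documents = []
    for category in sorted(FICTION_CATEGORIES | NON_FICTION_CATEGORIES):
        for file_id in brown.fileids(categories=category):
            words = list(brown.words(fileids=file_id))
            documents.append((words, label_for_category(category)))
    return documents
```

Calling load_brown_documents() needs network access the first time, because it downloads the corpus data.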

Dependencies

Ensure the following are installed:

Python 3.7+
NLTK
pandas
scikit-learn

Install dependencies using:

pip install -r requirements.txt

Usage

Prepare the Data:
- Place the baby_bnc.csv file in the repository root.
- The Brown Corpus is automatically loaded from NLTK.
Run the Notebook:
- Open and execute the Jupyter Notebook similar_results.ipynb to reproduce results or experiment with additional features.
Generate Features:
- Modify the feature extraction logic in the extract_two_features function or extend it to include new features.
Train and Test:
- Execute the classification pipeline in the notebook to test the logistic regression model with various feature combinations.
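
The train-and-test step can be sketched with scikit-learn as follows. This is a minimal illustration and not the notebook's pipeline: the feature rows are made-up numbers rather than values computed from either corpus, and the StandardScaler step is an addition for this sketch that the notebook may not use.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows of [adverb/adjective ratio, adjective/pronoun ratio].
X = [
    [0.90, 0.6], [0.80, 0.5], [1.00, 0.7], [0.85, 0.4],  # labelled fiction
    [0.40, 2.5], [0.30, 3.0], [0.50, 2.8], [0.35, 3.5],  # labelled non-fiction
]
y = ["fiction"] * 4 + ["non-fiction"] * 4

# Scale the two ratios, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predict on the training rows only to show the API; a real evaluation
# would use held-out documents (for example via train_test_split).
predictions = list(model.predict(X))
```

With real data, each row of X would hold the ratios computed for one document, and accuracy would be measured on documents that were not used for fitting.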

Results

Using the NLTK POS tagger, the model achieves results comparable to those of the original study, which suggests that the approach is robust to the choice of tagger.
Preliminary experiments with additional POS-based features show promising directions for improving classification accuracy.

Further Work

Exploring additional POS-based ratios to improve classification accuracy.
Testing the model on a broader set of corpora.
Applying other machine learning algorithms to evaluate performance enhancements.

References

Mohammed Rameez Qureshi, Sidharth Ranjan, Rajakrishnan P. Rajkumar, and Kushal Shah. "A Simple Approach to Classify Fictional and Non-Fictional Genres". Proceedings of the Second Storytelling Workshop, Florence, Italy, August 1, 2019.
NLTK documentation: https://www.nltk.org/
Scikit-learn documentation: https://scikit-learn.org/
