
German Poetry Analysis with Deep Learning

A Natural Language Processing (NLP) project for analyzing and classifying German poetry with deep learning. The goal is to predict the century in which a German poem was written from its text alone.

🎯 Project Overview

This project implements and compares two deep learning approaches to German poetry classification, each predicting the historical period (century) in which a poem was written. It combines classical NLP preprocessing with modern neural network architectures to study how German literary style has evolved.

📊 Dataset

The project uses a curated dataset of German poems (data/de_poems.parquet) containing:

  • Title: Poem titles
  • Text: Full poem content
  • Author: Poet information
  • Creation: Year of creation (converted to centuries for classification)
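
A minimal loading sketch, assuming the column names above match the parquet file exactly:

```python
import pandas as pd

# Load the poetry dataset (column names as listed above; adjust if the
# parquet file uses different casing).
poems = pd.read_parquet("data/de_poems.parquet")

# Derive the classification target: the century of creation.
# A poem written in 1799, for example, lands in bucket 17 (the 18th century).
poems["Century"] = poems["Creation"].astype(int) // 100

print(poems[["Title", "Author", "Century"]].head())
```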

🏗️ Architecture

The project implements and compares two main neural network architectures:

1. Feedforward Neural Network (models/feedforward_nn/)

  • Multi-layer perceptron (MLP) architecture
  • Word2Vec embeddings for text representation
  • Dense layers for classification
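
A minimal sketch of this approach, with illustrative layer sizes and an assumed mean-pooling of the Word2Vec vectors (the notebook's exact configuration may differ):

```python
import torch.nn as nn

class PoemMLP(nn.Module):
    """Feedforward classifier over averaged Word2Vec vectors (illustrative sizes)."""

    def __init__(self, embedding_dim=300, hidden_dim=128, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # x: (batch, embedding_dim) -- each poem represented by the mean of
        # its Word2Vec word vectors.
        return self.net(x)
```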

2. Recurrent Neural Network (models/recurrent_nn/)

  • Sequential processing of text data
  • LSTM/GRU-based architecture
  • Captures word order and long-range temporal patterns in the poem text
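
A comparable sketch for the recurrent model, again with assumed layer sizes (the notebook may use GRU cells or different hyperparameters):

```python
import torch.nn as nn

class PoemLSTM(nn.Module):
    """LSTM classifier over sequences of Word2Vec vectors (illustrative sizes)."""

    def __init__(self, embedding_dim=300, hidden_dim=128, num_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, embedding_dim) -- the poem as a sequence of word vectors.
        _, (h_n, _) = self.lstm(x)
        # The final hidden state summarizes the whole sequence.
        return self.fc(h_n[-1])
```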

🛠️ Technology Stack

Core Dependencies

  • Deep Learning: PyTorch ≥1.10.0
  • NLP Processing: spaCy ≥3.2.0, NLTK ≥3.6.0, Gensim ≥4.1.2
  • Data Science: Pandas ≥1.3.0, NumPy ≥1.20.0, Scikit-learn ≥1.0.0
  • Visualization: Matplotlib ≥3.5.0, Seaborn ≥0.11.0

Advanced Features

  • Hyperparameter Optimization: Optuna ≥2.10.0
  • Experiment Tracking: MLflow ≥1.23.0
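
Together, these pins correspond to a requirements.txt along the following lines; the file shipped in the repository is authoritative and may list additional packages (e.g. Jupyter):

```text
torch>=1.10.0
spacy>=3.2.0
nltk>=3.6.0
gensim>=4.1.2
pandas>=1.3.0
numpy>=1.20.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
seaborn>=0.11.0
optuna>=2.10.0
mlflow>=1.23.0
```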

📁 Project Structure

Project-NLP/
├── data/
│   └── de_poems.parquet         # German poetry dataset
├── models/
│   ├── feedforward_nn/
│   │   └── w2v.ipynb            # MLP implementation
│   └── recurrent_nn/
│       └── w2v.ipynb            # RNN implementation
├── README.md                    # Project documentation
└── requirements.txt             # Python dependencies

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended for training)

Installation

  1. Clone the repository

    git clone <repository-url>
    cd Project-NLP
  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download German language model for spaCy

    python -m spacy download de_core_news_sm

Usage

  1. Data Preprocessing

    • The notebooks automatically tokenize the text with spaCy's German pipeline
    • German stop words are removed and lemmatization is applied (see the preprocessing sketch below)
  2. Word2Vec Training

    • Custom Word2Vec models are trained on the poetry corpus (see the training sketch below)
    • Vector dimensions: 100-500
    • Window size: 5-20
  3. Model Training

    • Open the respective Jupyter notebooks in models/
    • Follow the cell-by-cell execution for training and evaluation
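
The preprocessing referenced in step 1 follows this pattern; the pipeline name matches the model downloaded during installation, while the helper function and its exact filtering rules are illustrative:

```python
import spacy

# German pipeline installed during setup (python -m spacy download de_core_news_sm).
nlp = spacy.load("de_core_news_sm")

def preprocess(text):
    """Tokenize, lemmatize, and drop German stop words and punctuation (illustrative)."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct]

print(preprocess("Über allen Gipfeln ist Ruh"))
```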
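
For step 2, a Word2Vec model can be trained with Gensim roughly as follows; the corpus construction reuses the poems DataFrame and preprocess helper from the sketches above, and the hyperparameters shown are illustrative values from the quoted ranges:

```python
from gensim.models import Word2Vec

# One list of preprocessed tokens per poem (see the sketches above).
corpus = [preprocess(text) for text in poems["Text"]]

w2v = Word2Vec(
    sentences=corpus,
    vector_size=300,  # within the 100-500 range
    window=10,        # within the 5-20 range
    min_count=2,
    workers=4,
)

print("Vocabulary size:", len(w2v.wv))
```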

📈 Performance Metrics

The models are evaluated with the following metrics:

  • Accuracy: Overall classification performance
  • Precision/Recall: Per-class performance
  • F1-Score: Balanced measure of precision and recall
  • ROC Curves: Model discrimination ability
  • Confusion Matrix: Detailed error analysis
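
A minimal evaluation sketch with scikit-learn, using made-up labels purely for illustration (in the notebooks, y_true and y_pred would be the held-out century labels and the model's predictions):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative century indices for a small held-out set (17 = 18th century, ...).
y_true = [16, 17, 17, 18, 18, 19]
y_pred = [16, 17, 18, 18, 18, 19]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))                        # detailed error analysis
```

ROC curves additionally require class probabilities (e.g. softmax outputs) rather than hard predictions; see sklearn.metrics.roc_curve.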
