A Natural Language Processing (NLP) project that classifies German poems by the century in which they were written, using text analysis and neural networks.
The project implements and compares several deep learning approaches to this task. It combines traditional NLP preprocessing with neural network classifiers to study how German literary style changed over time.
The project uses a curated dataset of German poems (data/de_poems.parquet) containing:
- Title: Poem titles
- Text: Full poem content
- Author: Poet information
- Creation: Year of creation (converted to centuries for classification)
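The README does not say how years are mapped to centuries. As a minimal sketch, assuming the usual convention that the years 1801 to 1900 form the 19th century, the conversion could look like this. The function name `year_to_century` and the example years are hypothetical and not taken from the notebooks.

```python
def year_to_century(year: int) -> int:
    """Map a calendar year (AD) to its century, e.g. 1801..1900 -> 19.

    Assumed convention; the notebooks may use a different one
    (for example year // 100).
    """
    if year < 1:
        raise ValueError("expected a positive (AD) year")
    return (year - 1) // 100 + 1


# Made-up example years, not values from de_poems.parquet.
creation_years = [1650, 1749, 1800, 1801, 1923]
centuries = [year_to_century(y) for y in creation_years]
print(centuries)  # [17, 18, 18, 19, 20]
```

With a pandas DataFrame, the same function could be applied to the Creation column to produce the class labels.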
The project implements and compares two main neural network architectures:
- Feedforward neural network (models/feedforward_nn/)
  - Multi-layer perceptron (MLP) architecture
  - Word2Vec embeddings for text representation
  - Dense layers for classification
- Recurrent neural network (models/recurrent_nn/)
  - Sequential processing of text data
  - LSTM/GRU-based architecture
  - Enhanced temporal pattern recognition
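To make the difference between the two architectures concrete, here is a rough PyTorch sketch. The class names, layer sizes, and the choice of a GRU over an LSTM are assumptions for illustration only; the actual notebooks may be structured differently.

```python
import torch
from torch import nn


class MLPClassifier(nn.Module):
    """Feedforward classifier over one fixed-size vector per poem."""

    def __init__(self, embed_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim), e.g. an averaged Word2Vec vector per poem
        return self.net(x)


class GRUClassifier(nn.Module):
    """Recurrent classifier over a sequence of word vectors per poem."""

    def __init__(self, embed_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        _, h_n = self.gru(x)          # h_n: (num_layers, batch, hidden_dim)
        return self.out(h_n[-1])      # classify from the last layer's final state


# Hypothetical sizes: 100-dim embeddings, 5 century classes, batch of 4 poems.
mlp_logits = MLPClassifier(100, 64, 5)(torch.randn(4, 100))
gru_logits = GRUClassifier(100, 64, 5)(torch.randn(4, 12, 100))
print(mlp_logits.shape, gru_logits.shape)
```

The MLP sees each poem as a single vector, so word order is lost; the recurrent model reads the word vectors in order and classifies from its final hidden state.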
- Deep Learning: PyTorch ≥1.10.0
- NLP Processing: spaCy ≥3.2.0, NLTK ≥3.6.0, Gensim ≥4.1.2
- Data Science: Pandas ≥1.3.0, NumPy ≥1.20.0, Scikit-learn ≥1.0.0
- Visualization: Matplotlib ≥3.5.0, Seaborn ≥0.11.0
- Hyperparameter Optimization: Optuna ≥2.10.0
- Experiment Tracking: MLflow ≥1.23.0
Project-NLP/
├── data/
│   └── de_poems.parquet      # German poetry dataset
├── models/
│   ├── feedforward_nn/
│   │   └── w2v.ipynb         # MLP implementation
│   └── recurrent_nn/
│       └── w2v.ipynb         # RNN implementation
├── README.md
└── requirements.txt          # Python dependencies
- Python 3.8+
- CUDA-compatible GPU (recommended for training)
- Clone the repository
    git clone <repository-url>
    cd Project-NLP
- Create a virtual environment
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies
    pip install -r requirements.txt
- Download the German language model for spaCy
    python -m spacy download de_core_news_sm
- Data Preprocessing
  - The notebooks automatically handle text tokenization using spaCy
  - German stop words are removed and lemmatization is applied
- Word2Vec Training
  - Custom Word2Vec models are trained on the poetry corpus
  - Vector dimensions: 100-500
  - Window size: 5-20
- Model Training
  - Open the respective Jupyter notebooks in models/
  - Follow the cell-by-cell execution for training and evaluation
The models are evaluated using the following metrics:
- Accuracy: Overall classification performance
- Precision/Recall: Per-class performance
- F1-Score: Balanced measure of precision and recall
- ROC Curves: Model discrimination ability
- Confusion Matrix: Detailed error analysis
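As an illustration of how most of these metrics can be computed with scikit-learn, here is a small sketch. The label lists are made-up placeholders, not results from this project, and ROC curves are left out because they need predicted class probabilities rather than hard labels.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Placeholder century labels for six poems (illustrative only).
y_true = [17, 18, 18, 19, 19, 19]
y_pred = [17, 18, 19, 19, 19, 18]
labels = [17, 18, 19]

# Overall fraction of correctly classified poems.
acc = accuracy_score(y_true, y_pred)

# Per-class precision, recall, and F1, in the order given by `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# Rows are true centuries, columns are predicted centuries.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy: {acc:.3f}")
for lab, p, r, f in zip(labels, precision, recall, f1):
    print(f"century {lab}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(cm)
```

Because adjacent centuries are the most likely confusions, the off-diagonal cells next to the diagonal of the confusion matrix are usually the first place to look.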