
# My Reddit: an exploratory NLP project on Reddit data

This website is a fully featured demonstrator of a Natural Language Processing (NLP) application: from data extraction to the end-user application, it covers the whole lifecycle of an NLP project with a modern, open-source data stack.

```mermaid
flowchart TD
    %% EL[T]
    reddit[[Reddit API]] --- ELT[\Extract Load - python/dagster/]
    ELT --> posts{{posts + comments}}
    %% [EL]T DBT processes
    posts --- pdbt[\Transform - DBT/]
    pdbt --> stats{{Statistics}}
    pdbt --> txt{{Texts}}
    %% Dashboards
    stats --- cube[\Aggregation/DataViz - Cube JS/] --> Dashboards
    %% Clean texts
    txt --- textacy[\Preprocessing - Textacy/] --> clean{{Clean Texts}}
    %% Subreddit prediction (trivial)
    txtcattr[\Train - Spacy/Transformers/] --> perfs{{Performances}}
    perfs --- cube
    clean --- txtcattr --> srmodel([Models])
    clean --- txtcatpr[\Batch Predictions/] --> prtxtcat{{Predictions}}
    srmodel --- txtcatpr
    %% APIs
    srmodel --> modelapi[[Prediction API - FastAPI]]
    %% Apps
    srmodel --> apps[Applications - Streamlit]
    clean --> apps
```
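As an illustration of the extract/load step above, the sketch below shows a minimal Dagster asset that pulls recent posts from the Reddit API via PRAW and returns them as a DataFrame. The asset name, subreddit, and credential handling are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of the Extract/Load step: a Dagster asset pulling
# recent posts from the Reddit API via PRAW. Names, subreddit and
# credential handling are placeholders, not this project's implementation.
import os

import pandas as pd
import praw
from dagster import asset


@asset
def reddit_posts() -> pd.DataFrame:
    """Load the latest posts of a subreddit into a DataFrame."""
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="my-reddit-elt",
    )
    rows = [
        {
            "id": submission.id,
            "title": submission.title,
            "text": submission.selftext,
            "created_utc": submission.created_utc,
        }
        for submission in reddit.subreddit("python").new(limit=100)
    ]
    return pd.DataFrame(rows)
```

In the actual pipeline, such an asset would be materialized on a schedule and loaded into the warehouse before the DBT transformations take over.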

Legend:

```mermaid
flowchart TD
    legendapi[[API]]
    legendtr[\Transformation/]
    legenddb{{Data}}
    legendweb[Web UI]
```

## Architecture

*(architecture diagram)*

## Features

### NLP Features

Implemented:

- Syntactic analysis with spaCy (see the sketch after this list)
- Topic Modeling with BERTopic
- Text classification with a custom model
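For a concrete feel of the first bullet, here is a minimal, hypothetical sketch of the kind of syntactic analysis spaCy provides. The model name (`en_core_web_sm`) and the sample sentence are placeholders, and the model has to be downloaded first with `python -m spacy download en_core_web_sm`.

```python
# Illustrative only: part-of-speech tags, dependency labels and noun chunks
# from a small spaCy pipeline. Model name and sentence are placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dagster schedules the extraction of Reddit posts.")

for token in doc:
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} head={token.head.text}")

# Noun chunks give a quick view of the entities being discussed.
print([chunk.text for chunk in doc.noun_chunks])
```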

Coming Soon:

- Language detection
- Topic Modeling algorithm comparison

### Product

- A dashboard app based on Cube.js
- A frontend that integrates all administration web UIs
- Data collection on demand & on a schedule
- Cloud database (BigQuery)
- On-demand & scheduled model training
- APIs to serve models (see the sketch after this list)
- MLOps for NLP with explainability
- Interactive apps
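As a rough sketch of the "APIs to serve models" point, a FastAPI service could wrap a trained spaCy text classifier as shown below. The model path, route, and response shape are assumptions for illustration, not the project's actual API.

```python
# Hypothetical sketch of a prediction API: FastAPI wrapping a trained
# spaCy text-classification pipeline. Model path and schema are assumptions.
import spacy
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Subreddit prediction API")
nlp = spacy.load("models/textcat")  # placeholder path to a trained model


class PredictionRequest(BaseModel):
    text: str


@app.post("/predict")
def predict(request: PredictionRequest) -> dict[str, float]:
    """Return the predicted subreddit scores for a piece of text."""
    doc = nlp(request.text)
    return dict(doc.cats)
```

Assuming the snippet lives in `main.py`, it can be served locally with `uvicorn main:app --reload`.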