OCD vs. Autism (A Reddit Thread NLP Analysis)

A project by Graham Waters, 2022


Executive Summary

The end goal for our client is likely a more clinical application of classification: assisting users who seek help on public forums by applying psychoanalysis to their text data. That application is beyond the scope of our initial study. Instead, we hope that learning how these subreddits present linguistically will reveal the most predictive features, which can serve as a first stepping stone toward such a clinical application.

Note: This study is focused solely on linguistic features present in Reddit posts and is not a formal means of diagnosis for identifying autism-spectrum or obsessive-compulsive disorder.



Data Collection

Reddit API

We collected data from the r/Autism and r/OCD subreddits using the Reddit API. We used the Python library requests to make requests to the API and parse the JSON response. The following code snippet shows the request and parsing steps.

import requests
from bs4 import BeautifulSoup

url = ''  # endpoint URL (not shown)
headers = {'Accept': 'application/json', 'Authorization': 'Bearer <token>'}
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.content, "html.parser")
# Each post sits in its own container element
posts = soup.find_all('div', class_="listing-item-container")

The BeautifulSoup library was used to parse the HTML content of the page. We then iterated over the posts and extracted the post title, link, and body.
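
The JSON route mentioned above can be sketched without any network access. This is a minimal sketch, assuming the response follows Reddit's listing layout (a `data.children` array whose items each wrap a post under `data`). The sample payload and the `extract_posts` helper are made up for illustration and are not the project's actual code.

```python
import json

# Made-up sample shaped like a Reddit listing response (not real data)
SAMPLE_RESPONSE = json.dumps({
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "permalink": "/r/example/1", "selftext": "Body one"}},
            {"kind": "t3", "data": {"title": "Second post", "permalink": "/r/example/2", "selftext": ""}},
        ]
    },
})


def extract_posts(raw_json):
    """Return a list of dicts with the title, link, and body of each post."""
    listing = json.loads(raw_json)
    posts = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        posts.append({
            "title": post.get("title", ""),
            "link": post.get("permalink", ""),
            "body": post.get("selftext", ""),
        })
    return posts


posts = extract_posts(SAMPLE_RESPONSE)
print(len(posts), posts[0]["title"])
```

Using `.get` with defaults keeps the loop from failing on posts that lack a body, such as link-only posts.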


We also collected keywords from the posts. We did this by searching for the keyword in the post's text. If the keyword was found, we appended it to a list.

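import re

def get_keywords(post, keywords):
    """Return the keywords that appear in a post's text."""
    # `keywords` is the collection of search terms (assumed to be passed in;
    # the list itself is not shown here)
    found = []
    for word in re.split(r"\W+", post.text):
        if word in keywords and word not in found:
            found.append(word)
    return found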

Feature Engineering

Text Preprocessing

We preprocessed the text data by removing punctuation and lowercasing the words. We also removed stop words, starting from NLTK's English stop word list.

import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

# Remove Punctuation
def remove_punctuation(text):
    """Remove punctuation from a string"""
    return ''.join(ch for ch in text if ch not in string.punctuation)

# Lower Case
def lowercase(text):
    """Lower case a string"""
    return text.lower()
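
The preprocessing steps described above can be chained into one function. This is a minimal sketch using only the standard library. The tiny `SAMPLE_STOP_WORDS` set stands in for the full NLTK list, and `preprocess` is a hypothetical helper name.

```python
import string

# Small stand-in for the NLTK English stop word list (assumption for illustration)
SAMPLE_STOP_WORDS = {"i", "the", "a", "and", "to", "my", "is"}

# Translation table that deletes every ASCII punctuation character
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(PUNCTUATION_TABLE)
    return [word for word in cleaned.split() if word not in stop_words]


tokens = preprocess("I checked the door, again and again!")
print(tokens)
```

Lowercasing happens first so that capitalized stop words such as "I" or "The" still match the lowercase stop word list.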

Model Building

Logistic Regression

Logistic regression is a binary classifier that expresses the log odds of the positive class as a linear function of the features. In this study, we used the scikit-learn library to build a logistic regression model. The following code snippet shows how we built our model.

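from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Features: every column except the raw text and the author name
features = df_ocd.drop(columns=['self_text', 'author'])
# y_true holds the class labels; the label column is not named in this snippet
lr = LogisticRegression(C=1e6)  # a large C means very weak regularization
lr.fit(features, y_true)
accuracy = accuracy_score(y_true, lr.predict(features))
print(f'Accuracy: {accuracy}')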
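
To make the log-odds output concrete: the model computes a weighted sum of the features, and the logistic (sigmoid) function maps that sum to a probability. The following is a minimal sketch with made-up weights, not the coefficients fitted in this study.

```python
import math


def sigmoid(log_odds):
    """Map log odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_probability(features, weights, bias):
    """Linear score (the log odds) passed through the sigmoid."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


# Hypothetical weights for two features, chosen only for illustration
weights = [0.8, -0.5]
bias = 0.1
probability = predict_probability([1.0, 2.0], weights, bias)
predicted_class = int(probability >= 0.5)
print(round(probability, 3), predicted_class)
```

A log odds of zero corresponds to a probability of 0.5, which is why 0.5 is the usual decision threshold.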


AdaBoost

AdaBoost is an ensemble learning algorithm that combines multiple weak learners into a strong learner. By default, scikit-learn's AdaBoostClassifier boosts very shallow decision trees (decision stumps). The following code snippet shows how we built our AdaBoost model.

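from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Features: every column except the raw text and the author name
features = df_ocd.drop(columns=['self_text', 'author'])
# y_true holds the class labels; the label column is not named in this snippet
adaboost = AdaBoostClassifier(n_estimators=100, random_state=0)
adaboost.fit(features, y_true)
accuracy = accuracy_score(y_true, adaboost.predict(features))
print(f'Accuracy: {accuracy}')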
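
How weak learners combine into a strong one can be sketched from scratch. The following is a minimal illustration of discrete AdaBoost with one-threshold decision stumps on made-up one-dimensional data. It is not scikit-learn's implementation, and all function names are hypothetical.

```python
import math


def best_stump(xs, ys, weights):
    """Find the threshold and direction with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= threshold else -sign for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best


def fit_adaboost(xs, ys, n_rounds=3):
    """Train a list of (alpha, threshold, sign) stumps on labels in {-1, +1}."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, sign = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, sign))
        # Raise the weight of misclassified points, lower the rest
        preds = [sign if x >= threshold else -sign for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (sign if x >= threshold else -sign) for alpha, threshold, sign in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = fit_adaboost(xs, ys, n_rounds=3)
predictions = [predict(model, x) for x in xs]
print(predictions)
```

Each stump alone misclassifies at least two of the six points, but the weighted vote of three stumps separates the toy data.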

Decision Tree

Decision trees classify an example by applying a sequence of threshold tests to its features, and they are commonly used in classification problems. They are useful because they are easy to understand and interpret. The following code snippet shows how we built our decision tree model.

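from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features: every column except the raw text and the author name
features = df_ocd.drop(columns=['self_text', 'author'])
# y_true holds the class labels; the label column is not named in this snippet
dt = DecisionTreeClassifier(random_state=0)
dt.fit(features, y_true)
accuracy = accuracy_score(y_true, dt.predict(features))
print(f'Accuracy: {accuracy}')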
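
The interpretability of a decision tree comes from the fact that every node is a single threshold test. As a minimal sketch, the code below picks one such test by minimizing Gini impurity. The feature values and labels are made up, and `gini` and `best_split` are hypothetical helper names rather than scikit-learn functions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Made-up feature: how often a hypothetical keyword appears in each post
keyword_counts = [0, 1, 1, 4, 5, 6]
subreddit = ["autism", "autism", "autism", "ocd", "ocd", "ocd"]
threshold, impurity = best_split(keyword_counts, subreddit)
print(f"split at keyword_count <= {threshold} (weighted impurity {impurity})")
```

The resulting rule reads as a plain sentence ("posts with at most this many keyword hits go left"), which is what makes a fitted tree easy to inspect.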

Keyword Vectorizer

To extract features from the text data, we first need to tokenize the text. This can be done with scikit-learn's CountVectorizer, whose token counts a TfidfTransformer then converts into tf-idf vectors. TfidfVectorizer, used in the snippet below, combines both steps.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit tf-idf on the raw post text, one document per row
texts = df_ocd['self_text'].astype(str)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Count how often each token occurs across all posts
tokenize = vectorizer.build_analyzer()
token_counts = Counter(token for text in texts for token in tokenize(text))
tf_vocab = token_counts.most_common(len(stop_words) + 5)

# Keep only the columns of terms that occur at least 5 times
frequent_terms = [term for term, count in tf_vocab if count >= 5]
keep = [vectorizer.vocabulary_[term] for term in frequent_terms]
X = X[:, keep]
print(X.shape[1], len(set(frequent_terms)))
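
For readers who want to see what a tf-idf weight is, the calculation can be sketched by hand. This minimal example assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The whitespace tokenizer, the `tfidf` helper, and the sample documents are simplifications made up for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed idf, ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# Made-up example documents
docs = ["checking the lock again", "the routine helps", "checking the routine"]
vectors = tfidf(docs)
top_term = max(vectors[0], key=vectors[0].get)
print(top_term, round(vectors[0][top_term], 3))
```

Terms that appear in every document, such as "the" here, receive the lowest weight, while terms unique to one document receive the highest.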


After building the models, we plotted the results. For the logistic regression model, we created a bar plot showing the predicted probability of being OCD versus the actual value. The following code snippet shows how we generated the figure.

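import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

ax = plt.subplot(111)
# Feature matrix, assumed to be the same columns the model was fit on
x = df_ocd.drop(columns=['self_text', 'author'])
y = lr.predict_proba(x)[:, 1]  # probability of the positive class
bar_width = .35
color_map = sns.light_palette("green", 10)
colors = color_map.as_hex()
labels = list(range(len(y)))
rects = ax.bar(labels, y, width=bar_width, label='Predicted Probability', edgecolor=None, align="center")
ax.set_yticks(np.arange(0, 1.05, .25))
ax.set_title("Logistic Regression Predictions")
fig = plt.gcf()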

For the AdaBoost model, we used a confusion matrix to show the performance of the classifier. The following code snippet shows how we generated the figure.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# y_true holds the actual class labels; the label column is not named here
cm = pd.crosstab(y_true, df_ocd.prediction, rownames=['Actual'], colnames=['Predicted'])
class_names = [str(name) for name in cm.columns]

# One horizontal bar chart per actual class, showing its predicted counts
fig, axes = plt.subplots(1, len(cm.index), squeeze=False)
for ax, (actual, row) in zip(axes[0], cm.iterrows()):
    ax.barh(class_names, row.values)
    ax.set_title(f'Actual: {actual}')
    ax.set_xlim(0, cm.values.max())
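
What a confusion matrix counts can be shown with the standard library alone. This is a minimal sketch on made-up labels. `confusion_counts` and `accuracy` are hypothetical helpers, not the pandas or scikit-learn functions used in this project.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of predictions that match the actual label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Made-up labels for illustration only
actual = ["ocd", "ocd", "autism", "autism", "ocd"]
predicted = ["ocd", "autism", "autism", "autism", "ocd"]

counts = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))
for row in labels:
    print(row, [counts[(row, col)] for col in labels])
print("accuracy:", accuracy(actual, predicted))
```

The diagonal entries are the correct predictions, so accuracy is the diagonal total divided by the number of posts.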

Finally, we drew a dendrogram. Because a dendrogram is built by hierarchical clustering, it shows how the feature vectors group together rather than the splits of the fitted decision tree. The following code snippet shows how we generated the figure.

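# Plotting the Dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage

# linkage needs a dense 2-D array of observations
X_dense = X.toarray() if hasattr(X, 'toarray') else X
linkage_obj = linkage(X_dense, metric='euclidean')
dendro = dendrogram(linkage_obj)  # draws onto the current matplotlib axes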
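
A dendrogram draws the sequence of merges produced by agglomerative clustering. The following is a minimal pure-Python sketch of that merge sequence, assuming single linkage (SciPy's default `method`) and made-up one-dimensional values in place of real feature vectors. `single_linkage_merges` is a hypothetical helper, not a SciPy function.

```python
def single_linkage_merges(points):
    """Return the merges [(cluster_a, cluster_b, distance), ...] in order."""
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Made-up 1-D values standing in for feature vectors
merges = single_linkage_merges([1.0, 1.5, 5.0, 5.2, 9.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

Each merge becomes one horizontal join in the dendrogram, drawn at a height equal to the merge distance.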


The final results are presented below.

Results
