HackerNews posts exploration using Spark RDD (Big Data)

This project uses the Spark RDD (resilient distributed dataset) API for data processing.

The dataset consists of nearly a million submitted HackerNews posts.

The Jupyter notebook walks through a number of exploratory tasks. The file spark_rdd.py has the same tasks defined as functions.
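As a rough illustration of the kind of RDD pipeline such a task involves, here is a minimal sketch that counts the most frequent words in post titles. It is not taken from spark_rdd.py. The CSV layout (a header row, one post per line, title in the second column), the helper names, and the file path in the usage comment are all assumptions made for the example.

```python
import csv


def parse_post(line):
    """Split one CSV line into fields, respecting quoted commas in titles."""
    return next(csv.reader([line]))


def title_words(fields, title_index=1):
    """Return the lower-cased words of the title column.

    title_index=1 is an assumed column position, not the real schema.
    """
    return fields[title_index].lower().split()


def top_title_words(sc, path, n=10):
    """Return the n most frequent title words as (word, count) pairs.

    sc is an existing SparkContext; path points at the posts CSV file.
    """
    lines = sc.textFile(path)
    header = lines.first()  # assumes the first line is a header row
    counts = (
        lines.filter(lambda line: line != header)
        .map(parse_post)
        .flatMap(title_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    # takeOrdered sorts ascending by key, so negate the count for "largest first".
    return counts.takeOrdered(n, key=lambda pair: -pair[1])


# Usage (needs pyspark and a working JDK 8 setup; the path is hypothetical):
#   from pyspark import SparkContext
#   sc = SparkContext("local[*]", "hackernews-sketch")
#   print(top_title_words(sc, "data/posts.csv"))
#   sc.stop()
```

The parsing helpers are plain Python functions, so they can be checked without starting Spark. Only top_title_words touches the RDD API.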

Requirements (March 2019)

You need to have JDK 8 installed. Other versions will not work.

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Make sure that the JAVA_HOME environment variable points to your JDK 8 installation.
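One way to confirm which Java version is actually picked up is to parse the output of `java -version`. The sketch below is only a convenience check, and the function names are made up for this example. It inspects whichever `java` is first on PATH, which may differ from the installation JAVA_HOME points to.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_191");
    later releases report the major version first (e.g. "11.0.1").
    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and return its major version, or None if java is not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    # `java -version` writes to stderr, so check both streams.
    return java_major_version(result.stderr + result.stdout)
```

Calling `installed_java_major()` should return 8 on a machine set up as described above.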

Packages

Using conda: conda create -n hacknews-spark python=3 jupyter numpy pandas matplotlib seaborn pyspark=2.3.0 pyarrow

pip install pyspark==2.3.0 pyarrow

Getting it working on Windows 10 (March 2019)

These steps worked for me to set JAVA_HOME, and they also work for checking it (taken from a Stack Overflow answer):

Find the JDK installation directory. On my 64-bit Windows 10 system, the Java 8 JDK is at: C:\Program Files\Java\jdk1.8.0_191

Do not use the jre directory, which has an almost identical path (e.g. C:\Program Files\Java\jre1.8.0_191).
Only use the path to the jdk directory; don't add any sub-folders such as \bin.
Don't use any other versions you may have (e.g. I also have Java 11 at C:\Program Files\Java\jdk-11.0.1).
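The three rules above can be sketched as a small check on a candidate JAVA_HOME value. This is an illustration only. It assumes JDK 8 directories are named like the example above (jdk1.8.0_ followed by an update number), and the function name is hypothetical.

```python
from pathlib import PureWindowsPath


def java_home_problems(value):
    """Return a list of problems with a candidate JAVA_HOME value (empty if none)."""
    path = PureWindowsPath(value.strip())
    problems = []
    name = path.name.lower()

    # Rule: use the JDK directory itself, not a sub-folder such as \bin.
    if name == "bin":
        problems.append("has a sub-folder (\\bin) appended")
        name = path.parent.name.lower()

    # Rule: the jre directory has an almost identical path but is not the JDK.
    if name.startswith("jre"):
        problems.append("points at a JRE, not a JDK")
    # Rule: other versions (e.g. jdk-11.0.1) should not be used.
    elif not name.startswith("jdk1.8"):
        problems.append("is not a JDK 8 directory")

    return problems
```

For example, `java_home_problems(os.environ.get("JAVA_HOME", ""))` (after `import os`) returns an empty list when the variable follows all three rules.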

Set the JAVA_HOME Variable.

Right-click 'My Computer' or 'This PC' icon in Windows Explorer and select Properties (or use some other way to get to System Properties).
Click the Advanced tab (or Advanced System Settings, depending on Windows version), then click the Environment Variables button.
Under System Variables, click New.
Enter the variable name as JAVA_HOME.
Enter the JDK installation path as the variable value.
Click OK, Apply, OK, etc.

Note: You might need to restart Windows.
