
Example repository for NLTK execution on a PySpark cluster with Cloudera Data Science Workbench.


NLTK-example

This example shows how to distribute Python packages to a PySpark cluster. It is based on this blog.

How to use

  1. Open a Python session in the workbench and run setup.sh.
  2. Set the environment variable PYSPARK_PYTHON to ./NLTK/nltk_env/bin/python.
  3. Reopen the workbench and run pyspark_nltk.py (a minimal sketch of such a job follows this list).
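
The repository's actual pyspark_nltk.py is not reproduced here, so the following is only a sketch of what such a job might look like. The archive name nltk_env.zip, the #NLTK unpack alias, and the sample sentences are assumptions for illustration:

```python
# Sketch of a job like pyspark_nltk.py (assumed contents, not the repo's file).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nltk-example")
    # Ship the zipped conda environment to every executor. The "#NLTK"
    # suffix is the directory the archive is unpacked under, which is why
    # PYSPARK_PYTHON points at ./NLTK/nltk_env/bin/python.
    .config("spark.yarn.dist.archives", "nltk_env.zip#NLTK")
    .getOrCreate()
)
sc = spark.sparkContext

def tokenize(line):
    # Import inside the function so the import happens on the executors,
    # inside the shipped conda environment where nltk is installed.
    # Assumes the punkt tokenizer data was also installed into the
    # environment (e.g. by setup.sh), since word_tokenize depends on it.
    import nltk
    return nltk.word_tokenize(line)

rdd = sc.parallelize(["NLTK runs on every executor.",
                      "Each worker uses the shipped conda env."])
print(rdd.flatMap(tokenize).collect())
spark.stop()
```

Importing nltk inside the function rather than at the top of the file matters: top-level imports run on the driver, while the function body runs on the executors, which only have nltk available through the distributed environment.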

Key points for distributing Python packages with conda

  • Create a conda environment, install your packages into it, and zip it.
  • Set spark.yarn.appMasterEnv.PYSPARK_PYTHON to the Python interpreter inside your conda environment in spark-defaults.conf.
    • e.g. spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python
  • Set the environment variable PYSPARK_PYTHON=./NLTK/nltk_env/bin/python (the same settings can also be applied per job, as sketched below).
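
Entries in spark-defaults.conf apply cluster-wide; for a single application the equivalent options can be set in code. This is a sketch under the same paths as above; the archive name nltk_env.zip and the app name are assumptions:

```python
# Per-job equivalent of the spark-defaults.conf entries above (a sketch).
import os
from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext is created so executors pick it up.
os.environ["PYSPARK_PYTHON"] = "./NLTK/nltk_env/bin/python"

conf = (
    SparkConf()
    .setAppName("nltk-example")  # illustrative app name
    # Python interpreter used by the YARN application master.
    .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON",
         "./NLTK/nltk_env/bin/python")
    # Ship the zipped conda environment; "#NLTK" is the directory name
    # it is unpacked under in each container, matching the path above.
    .set("spark.yarn.dist.archives", "nltk_env.zip#NLTK")
)
sc = SparkContext(conf=conf)
```

Note that the relative path ./NLTK/nltk_env/bin/python only resolves inside YARN containers, where the shipped archive has been unpacked into the container's working directory.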
