This project is based largely on the work I did for my Splunk Yelp project not too long ago. The impetus came when a local tech company pinged me about the possibility of working for them, and I wanted to see what other people thought of their company. Thus, Splunk Glassdoor was born!
This app will tell you the following:
- Average rating and number of ratings over time
- Recent pros, cons, and advice to management
- Tag cloud of words from pros, cons, and advice to management
In real life, I've used this app to check out potential employers.
This app uses Splunk Lab, an open-source app I built to effortlessly run Splunk in a Docker container.
- The only requirement is Docker. Once that's installed, run this one-liner:
`SPLUNK_START_ARGS=--accept-license bash <(curl -s https://raw.githubusercontent.com/dmuth/splunk-glassdoor/master/go.sh ) ./urls.txt`
- The file `urls.txt` should contain one URL per line, and each URL should be a business's review page on Glassdoor.
- Since some businesses can have thousands of reviews, this script will pick up where it left off if interrupted.
- This grabs the HTML from the review pages, uses Beautiful Soup to parse the reviews, and then exports them to the `logs/` directory. (A rough sketch of that loop follows this list.) I looked into using Glassdoor's API, but when I went to the signup page, it was broken and mostly blank, so scraping it is. 🤷
- The script is single-threaded, but reasonably efficient (and I don't want to DoS Glassdoor's website). I've clocked downloads at 5,000 reviews in a little over 8 minutes, or about 600 reviews a minute.
- Go to https://localhost:8000/, log in with the password you set, and you'll see the Glassdoor Reviews Dashboard.
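For the curious, here's a rough sketch of what the download/parse loop looks like conceptually. This is not the actual script: the CSS selectors, field names, and the single `reviews.json` output file are illustrative assumptions, and Glassdoor's real markup is messier and changes over time.

```python
# Rough sketch of the download/parse loop -- NOT the actual script.
# Selectors, field names, and the output filename are illustrative assumptions.
import json
import pathlib

import requests
from bs4 import BeautifulSoup

LOG_DIR = pathlib.Path("logs")
LOG_DIR.mkdir(exist_ok=True)


def text_or_none(node, selector):
    """Return stripped text for a selector, or None if the element is missing."""
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else None


def parse_reviews(html):
    """Yield one dict per review found in a review page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for review in soup.select("li.empReview"):  # assumed selector
        yield {
            "rating": text_or_none(review, ".rating"),
            "pros": text_or_none(review, ".pros"),
            "cons": text_or_none(review, ".cons"),
            "advice": text_or_none(review, ".adviceMgmt"),
        }


def main():
    # urls.txt: one Glassdoor review-page URL per line, e.g. (hypothetical)
    # https://www.glassdoor.com/Reviews/Example-Co-Reviews-E12345.htm
    urls = pathlib.Path("urls.txt").read_text().splitlines()
    with (LOG_DIR / "reviews.json").open("a") as out:
        for url in urls:
            html = requests.get(url, timeout=30).text
            for review in parse_reviews(html):
                out.write(json.dumps(review) + "\n")


if __name__ == "__main__":
    main()
```

The actual script also resumes interrupted downloads, as noted above, which this sketch doesn't attempt.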
- Q: Dashboards show `Search is waiting for input...`
- A: You need to select a venue in the dropdown! If there are no items in the dropdown, that means no data was ingested. Did you run the command to download some Glassdoor reviews?
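A quick way to check that the download step produced anything at all, assuming the reviews land as files under `logs/` (the exact layout is my assumption here):

```python
# Quick sanity check: did the download step produce anything under logs/ ?
# (Assumes reviews are exported as files under logs/ -- the layout is an assumption.)
import pathlib

log_dir = pathlib.Path("logs")
files = [p for p in log_dir.glob("**/*") if p.is_file()]
print(f"{len(files)} review file(s) under {log_dir}/")
if not files:
    print("Nothing here yet -- run the download command first.")
```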
Mostly for my benefit, these are the scripts that I use to make my life easier:
- `./bin/build.sh` - Build the Python and Splunk Docker containers
- `./bin/push.sh` - Upload the Docker containers to Docker Hub
- `./bin/devel.sh` - Build and run the Splunk Docker container with an interactive shell
- `./bin/run-download-reviews.sh` - Run the script to download reviews directly
- `./bin/stop.sh` - Stop the Splunk container
- `./bin/clean.sh` - Stop Splunk and remove the data and logs
I'd like to thank Splunk for having such a kick-ass data analytics platform, and for the operational excellence it embodies.
Also:
- This text-to-ASCII-art generator, for the logo I used in the script.
- Excessive CPU Usage
- In Docker on OS X, if you have many thousands of files, Splunk persistently uses around 70% of the CPU. Not good. I think it's more a Docker thing than a Splunk thing, but I could write a workaround as follows:
- Download reviews to a SQLite database with SQLAlchemy
- When downloads are done, dump all reviews for that business to a single JSON file in the `logs/` directory (a rough sketch of this idea appears below)
- Workaround: Run `index=main earliest=-10y | stats count`, and when the number of events stops going up, stop Splunk, remove the contents of the `logs/` directory, and restart Splunk.
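For reference, here's a hypothetical sketch of what that SQLite-buffer idea could look like. None of this exists in the project today; the schema, the `reviews.db` filename, and the one-JSON-file-per-business dump are all assumptions.

```python
# Hypothetical sketch of the "SQLite first, one JSON file per business" idea.
# Schema, filenames, and paths are assumptions -- this is not implemented in the project.
import json
import pathlib

from sqlalchemy import Column, Integer, String, Text, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Review(Base):
    __tablename__ = "reviews"
    id = Column(Integer, primary_key=True)
    business = Column(String, index=True)
    rating = Column(String)
    pros = Column(Text)
    cons = Column(Text)


engine = create_engine("sqlite:///reviews.db")
Base.metadata.create_all(engine)


def save_review(business, rating, pros, cons):
    """Buffer one review in SQLite while the download runs."""
    with Session(engine) as session:
        session.add(Review(business=business, rating=rating, pros=pros, cons=cons))
        session.commit()


def dump_business(business, log_dir="logs"):
    """After downloading, write all of a business's reviews to a single JSON file."""
    out = pathlib.Path(log_dir) / f"{business}.json"
    with Session(engine) as session, out.open("w") as fp:
        for review in session.scalars(select(Review).where(Review.business == business)):
            fp.write(json.dumps({
                "business": review.business,
                "rating": review.rating,
                "pros": review.pros,
                "cons": review.cons,
            }) + "\n")
```

The appeal is that Splunk would only ever have to monitor one file per business instead of thousands of small ones, which appears to be what aggravates Docker on OS X.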
- Sometimes you'll see a yellow exclamation point with the text "Field 'words' does not exist in the data" on the Advice Tag Cloud. The underlying search appears to be executing normally, so I'm still trying to sort this one out.
Splunk is copyright Splunk. Apps within Splunk Lab are copyright their respective creators and made available under their respective licenses.