SoundCloud Comment Scraper

First ever coding project - a GUI for scraping and filtering SoundCloud comments. Needs a bit of work but does the job!

Built with Python, Selenium and Tkinter. It has a range of built-in and custom filters that can be stacked. Originally intended for music research, hence the GUI. It would have been more effective as a web app, but I learnt a lot about Tkinter in the process...
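
As a rough illustration of the stacking idea (the names below are invented, not the app's real classes), a filter can be treated as a predicate over a comment, and a stack of filters keeps only the comments that pass all of them:

```python
# Illustrative only - these names are made up, not the app's real classes.
from typing import Callable, Dict, List

Comment = Dict[str, str]                    # e.g. {"user": ..., "timestamp": ..., "body": ...}
CommentFilter = Callable[[Comment], bool]   # a filter is just a predicate


def apply_filters(comments: List[Comment], filters: List[CommentFilter]) -> List[Comment]:
    """Keep a comment only if it passes every filter in the stack."""
    return [c for c in comments if all(f(c) for f in filters)]


# Two small filters stacked together:
not_too_short = lambda c: len(c["body"]) >= 5
mentions_track = lambda c: "track" in c["body"].lower()

comments = [
    {"user": "a", "timestamp": "0:42", "body": "great track!"},
    {"user": "b", "timestamp": "1:10", "body": "<3"},
]
print(apply_filters(comments, [not_too_short, mentions_track]))
# -> only the first comment survives
```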

Setup

  1. Install and set up virtualenvwrapper: https://virtualenvwrapper.readthedocs.io/en/latest/
  2. Create a virtual env, activate it with workon, and install the project dependencies from requirements.txt
  3. Navigate to sc_comment_scraper/src
  4. Run the command add2virtualenv . (this adds the module to PYTHONPATH): https://virtualenvwrapper.readthedocs.io/en/latest/command_ref.html#add2virtualenv
  5. Change dir into sc_comment_scraper/src/app
  6. Run python main.py (or py main.py on Windows)

Example

  1. Run all the steps in the Setup section above - the GUI should pop up
  2. Select manual in URL Input Method
  3. Enter https://soundcloud.com/bonobo/bonobo-and-jacques-greene-fold in the URL Entry box (or another track URL that you know has comments)
  4. Check Merge csv files in Scraping Settings, and enter a filename (e.g. 'example')
  5. Check None in Filters
  6. Press Start
  7. Once completed, check the src/csv_exports directory and look for your CSV files
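
If you want to inspect the export outside the GUI, something like the following works when run from the package root; the exact filenames and columns depend on your scraping settings, so this just previews whatever CSVs it finds:

```python
# Preview whatever the scrape exported; filenames and columns depend on
# your settings, so nothing here is hard-coded to a particular export.
import csv
from pathlib import Path

export_dir = Path("src/csv_exports")
for path in sorted(export_dir.glob("*.csv")):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        print(path.name, "columns:", reader.fieldnames)
        for i, row in enumerate(reader):
            print(row)
            if i >= 4:  # show only the first five rows of each file
                break
```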

Help

This is an old project - my first one! There are plenty of bugs and badly written bits of code, so apologies if things don't always run smoothly...

If you get errors, the first thing to check is the advanced scraping settings. Try increasing the wait time to account for a slower network.

If you are still not getting the results you think you should be, try closing the app (Ctrl+C in the terminal) and starting over. Also try searching for the csv_exports folder in case it has been created somewhere other than the package root. Otherwise, try adjusting the other settings and filters, or choose a new URL!

If writing to CSV fails, the data may still be available via a backup UI in the terminal.

On macOS at least, screenshots will crash the app...

To Do

Refactoring

  • Replace error handling with robust testing/type checking and only targeted try/except blocks (no catch-alls)
  • Review OOP structure throughout: currently lots of large, tightly-coupled classes and functions with poor cohesion.
  • Fix Selenium waiting - time.sleep() is not the best way to go about this! Use Selenium's implicit/explicit waits instead.
  • Rewrite the scraping code to select timestamp, datetime and comment body all at the same time and bind the data in one object. This would prevent errors when the DOM data is incomplete (see the sketch after this list).
  • The filter system's use of eval() is insecure. This code could move into a method on the Filter class.
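
As a minimal sketch of where the two Selenium items are heading (the CSS selectors and field names below are placeholders, not the ones the app actually uses): an explicit WebDriverWait replaces time.sleep(), and each comment's timestamp, datetime and body are selected from the same element and bound into one object.

```python
# Direction sketch only - the CSS selectors and field names are placeholders.
from dataclasses import dataclass

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


@dataclass
class Comment:
    timestamp: str  # track position the comment is pinned to
    posted: str     # datetime string from the DOM
    body: str


def scrape_comments(url: str, wait_seconds: int = 10) -> list[Comment]:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Explicit wait instead of time.sleep(): block until at least one
        # comment element is present, or raise TimeoutException.
        rows = WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".comment"))
        )
        # Select all three fields from the same row so that an incomplete
        # comment fails here instead of silently misaligning the data.
        return [
            Comment(
                timestamp=row.find_element(By.CSS_SELECTOR, ".comment__timestamp").text,
                posted=row.find_element(By.CSS_SELECTOR, "time").get_attribute("datetime"),
                body=row.find_element(By.CSS_SELECTOR, ".comment__body").text,
            )
            for row in rows
        ]
    finally:
        driver.quit()
```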

Extensions

  • Complete the data analysis frame: analyse themes, spam, etc.
  • Add a 'files created here' label to the finished scroll bar
  • Validate the GUI inputs more: highlight empty inputs, add placeholders, etc. (see the sketch after this list). Maybe add a help button explaining a few things.
  • Make UHD mode work on platforms other than Windows (currently Windows-only)
  • Investigate why macOS doesn't seem to automatically find txt files in some cases
  • Make it into a web app?
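
For the input-validation item, something along these lines is the sort of thing I have in mind (the widget names are invented, not the app's real ones):

```python
# Rough idea for the "highlight empty inputs" item - widget names invented.
import tkinter as tk


def highlight_if_empty(entry: tk.Entry) -> bool:
    """Return True if the entry has text; otherwise outline it in red."""
    if entry.get().strip():
        entry.config(highlightthickness=0)
        return True
    entry.config(highlightthickness=2,
                 highlightbackground="red", highlightcolor="red")
    return False


root = tk.Tk()
url_entry = tk.Entry(root, width=50)
url_entry.pack(padx=10, pady=10)
tk.Button(root, text="Start",
          command=lambda: highlight_if_empty(url_entry)).pack(pady=(0, 10))
root.mainloop()
```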
