This school project was created for the PSZ course (Pronalaženje Skrivenog Znanja; English: Data Mining and Semantic Web), which is part of the Master's studies at the School of Electrical Engineering, University of Belgrade.
The project consisted of crawling the Discogs website to gather data about albums, artists and songs.
The raw data was then pre-processed and stored in a SQLite database (the psz_database.db file in the data folder), which was the first task of the project. The remaining four tasks centered on processing the data, visualizing it and running unsupervised learning algorithms (in my case, only clustering algorithms).
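As a rough illustration of the storage step, the sketch below writes a couple of pre-processed records into SQLite using Python's built-in sqlite3 module. The table layout, the column names and the `create_schema` / `insert_album` helpers are assumptions made for this example only; they are not taken from psz_database.db, and the inserted rows are placeholders rather than scraped data.

```python
import sqlite3


def create_schema(conn):
    """Create a small, hypothetical schema (not the project's real one)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS artist (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS album (
            id        INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            year      INTEGER,              -- may be unknown, hence nullable
            artist_id INTEGER NOT NULL REFERENCES artist(id)
        );
        """
    )


def insert_album(conn, artist_name, title, year):
    """Insert an album, creating its artist row on first use."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO artist (name) VALUES (?)", (artist_name,)
        )
        artist_id = conn.execute(
            "SELECT id FROM artist WHERE name = ?", (artist_name,)
        ).fetchone()[0]
        cursor = conn.execute(
            "INSERT INTO album (title, year, artist_id) VALUES (?, ?, ?)",
            (title, year, artist_id),
        )
        return cursor.lastrowid


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
create_schema(conn)
insert_album(conn, "Example Artist", "Example Album", 1985)
insert_album(conn, "Example Artist", "Second Album", None)

rows = conn.execute(
    """
    SELECT artist.name, COUNT(*)
    FROM album JOIN artist ON artist.id = album.artist_id
    GROUP BY artist.name
    """
).fetchall()
print(rows)
```

Parameterized queries (the `?` placeholders) keep scraped titles that contain quotes from breaking the SQL statements.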
The whole project statement (in Serbian) is located in the docs folder.
To run the code, you need Python 3.x and the following Python packages:
- requests
- beautifulsoup4
- fuzzywuzzy
- python-Levenshtein
- regex
- matplotlib
- numpy
- cyrtranslit
- scikit-learn
- bokeh
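To check quickly whether these packages are available, something like the sketch below may help. It maps each package name from the list above to the module name it is imported under (for example, beautifulsoup4 is imported as `bs4` and scikit-learn as `sklearn`) and reports the ones that cannot be found. The `REQUIRED` mapping and the `missing_packages` helper are hypothetical names for this example and are not part of the project.

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in `import` statements
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "fuzzywuzzy": "fuzzywuzzy",
    "python-Levenshtein": "Levenshtein",
    "regex": "regex",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "cyrtranslit": "cyrtranslit",
    "scikit-learn": "sklearn",
    "bokeh": "bokeh",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Suggest a single install command for everything that is absent.
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```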
Discogs pages are not structured uniformly, which caused some difficulties when scraping them. Below are some examples of pages with different structures.
- Albums
  - versions; tracklist with times but no credits
  - tracklist with some additional labels but no times
  - tracklist with credits but no times; album credits
  - tracklist with credits and times; album credits
  - tracklist with no additional data; album credits
  - tracklist with no link to songs but with credits
  - additional separators in the tracklist
- Artists
- Songs
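One way to cope with this kind of variation is to treat every field except the title as optional and to normalize each scraped row into a single shape. The sketch below illustrates the idea under assumptions: the `raw` dictionary keys (`title`, `position`, `duration`, `credits`, `song_url`, `is_heading`) and both helper functions are hypothetical and do not reflect the project's actual scraper.

```python
import re
from typing import Optional

# Accepts "m:ss" and "h:mm:ss"; anything else is treated as "no duration".
_DURATION_RE = re.compile(r"^(?:(\d+):)?(\d{1,3}):(\d{2})$")


def parse_duration(text: Optional[str]) -> Optional[int]:
    """Convert a duration string to seconds, or None if missing/malformed."""
    if not text:
        return None
    match = _DURATION_RE.match(text.strip())
    if match is None:
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)


def normalize_track(raw: dict) -> Optional[dict]:
    """Return a uniform track record, or None for separator/heading rows."""
    title = (raw.get("title") or "").strip()
    if not title or raw.get("is_heading"):
        return None  # e.g. a "Side A" separator inside the tracklist
    return {
        "position": (raw.get("position") or "").strip() or None,
        "title": title,
        "duration_s": parse_duration(raw.get("duration")),  # None if no times
        "credits": list(raw.get("credits") or []),          # [] if no credits
        "song_url": raw.get("song_url") or None,             # None if no link
    }


# Rows as a scraper might hand them over, with different fields missing.
scraped = [
    {"title": "Side A", "is_heading": True},
    {"position": "A1", "title": "  Intro ", "duration": "0:59"},
    {"position": "A2", "title": "Main Theme", "credits": ["Written-By: Someone"]},
]
tracks = [t for t in map(normalize_track, scraped) if t is not None]
print(tracks)
```

Downstream code can then rely on every key being present and only needs to check for `None` or an empty list.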