# Web Crawler based on Requests and Beautiful Soup
1. Open the baby-kingdom page.
2. Select the forum theme and note down the theme ID.
3. Modify `crawler.sh`:
   - Set the theme ID, the start page, and the end page (the end page must not be larger than the final page of the theme!).
   - Set the folder path (not a file name) where the results should be saved.
   A hedged sketch of how these parameters might drive the crawler is shown after this list.
4. It is recommended to run the code inside a tmux session, so that you can detach from it and terminate it cleanly if needed.
5. You may run two sessions concurrently from the same IP to increase throughput.
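
The snippet below is a minimal sketch, not the repository's actual code, of how the theme ID and page range might feed the Requests/Beautiful Soup loop. The base URL, query parameter names, and CSS selector are assumptions and will differ on the real site.

```python
# Sketch only: iterate the theme's listing pages and collect topic links.
# THEME_ID, the URL pattern, and the selector are placeholder assumptions.
import requests
from bs4 import BeautifulSoup

THEME_ID = "123"                 # noted from the forum theme page
START_PAGE, END_PAGE = 1, 5      # end page must not exceed the theme's final page
BASE = "https://www.baby-kingdom.com/forum.php"  # assumed URL pattern

session = requests.Session()
for page in range(START_PAGE, END_PAGE + 1):
    resp = session.get(BASE, params={"fid": THEME_ID, "page": page}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Collect topic links; this selector is a placeholder for the real one.
    for a in soup.select("a.topic-title"):
        print(a.get("href"), a.get_text(strip=True))
```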
- For websites that require scrolling down to load more content, it is better to use EasySpider, which is based on Selenium (see the sketch below).
- For websites with a relatively fixed structure where only text data is required, Scrapy or Beautiful Soup is more efficient.
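
For illustration, the scroll-and-load pattern that Selenium-based tools such as EasySpider automate looks roughly like the sketch below. The URL, scroll count, and wait time are placeholder assumptions, not part of this repository.

```python
# Illustrative only: scroll to the bottom repeatedly so lazily loaded
# content appears before the page source is read.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/infinite-feed")  # placeholder URL
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude wait; real tools poll for new elements instead
print(len(driver.page_source))
driver.quit()
```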
- The result file is updated after every topic, so an exception will not cause a complete loss of data.
- To see the number of entries and the time cost of each task, check `result.csv`. A sketch of this per-topic flushing and logging pattern follows.
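
This is a minimal sketch of how per-topic flushing and the `result.csv` summary might work; the file names, column layout, and the `crawl_topic` helper are assumptions, not the repository's actual format.

```python
# Sketch: write data after every topic and log entry count plus time cost,
# so a crash loses at most the topic currently being crawled.
import csv
import time

def crawl_topic(topic_url):
    """Hypothetical helper: fetch one topic and return its parsed rows."""
    return [{"url": topic_url, "text": "..."}]

topics = ["https://example.com/topic/1", "https://example.com/topic/2"]  # placeholders

with open("data.csv", "a", newline="", encoding="utf-8") as data_f, \
     open("result.csv", "a", newline="", encoding="utf-8") as log_f:
    data_writer = csv.writer(data_f)
    log_writer = csv.writer(log_f)
    for url in topics:
        start = time.time()
        rows = crawl_topic(url)
        for row in rows:
            data_writer.writerow([row["url"], row["text"]])
        data_f.flush()  # persist after every topic
        log_writer.writerow([url, len(rows), round(time.time() - start, 2)])
        log_f.flush()
```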
- Regular expressions and XPath are used to precisely select and clean the collected data.
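
As a minimal sketch of that pattern (Beautiful Soup itself does not support XPath, so `lxml` is assumed here), the XPath query and cleaning rules below are placeholders, not the repository's actual selectors:

```python
# Select post paragraphs with XPath, then clean the text with regexes.
import re
from lxml import html

page = "<div class='post'><p>Hello&nbsp;&nbsp;world!!   Visit http://spam.example</p></div>"
tree = html.fromstring(page)

for node in tree.xpath("//div[@class='post']//p"):
    text = node.text_content()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    print(text)
```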