- Open the baby-kingdom page
- Select the forum theme and note down the theme ID
- Modify `crawler.sh`:
  - Set the theme ID, the start page, and the end page (the end page must not exceed the theme's final page!)
  - Set the folder path (not a file name) where the results should be saved
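The configurable block at the top of `crawler.sh` might look like the sketch below. The variable names and the Discuz-style URL pattern are assumptions for illustration; the actual script may differ:

```shell
#!/bin/sh
# Hypothetical configuration block for crawler.sh (names are assumptions).
THEME_ID=123          # theme (fid) noted from the forum page URL
START_PAGE=1
END_PAGE=10           # must not exceed the theme's final page
OUT_DIR="./results"   # folder path only, not a file name

mkdir -p "$OUT_DIR"

# Build the list of page URLs to crawl (URL pattern is assumed).
for page in $(seq "$START_PAGE" "$END_PAGE"); do
  echo "https://www.baby-kingdom.com/forum.php?mod=forumdisplay&fid=${THEME_ID}&page=${page}"
done
```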
- It is suggested to run the crawler inside a tmux session, so the jobs can be detached from and terminated cleanly if needed
- You may run two sessions concurrently from the same IP to increase throughput
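Managing the two sessions could look like the following; the session names and the arguments passed to `crawler.sh` are arbitrary examples, not part of the actual script:

```shell
# Start two detached tmux sessions, each crawling a different page range
# (the argument convention here is hypothetical).
tmux new-session -d -s crawl1 'sh crawler.sh'
tmux new-session -d -s crawl2 'sh crawler.sh'

tmux ls                      # list running sessions
tmux attach -t crawl1        # watch a session's output (Ctrl-b d to detach)
tmux kill-session -t crawl1  # terminate a session if needed
```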
- For websites that require scrolling down to load more content, it is better to use EasySpider
- EasySpider is based on Selenium. For websites with a relatively fixed structure where only text data is required, Scrapy or BeautifulSoup are more efficient
- The result file is updated after every topic, so an exception will not cause a complete loss of data
- To see the number of entries and the time cost of each task, check result.csv
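The per-topic append that makes the run crash-safe can be sketched as follows; the column names are assumptions, since the actual result.csv layout is not shown here:

```shell
# Append one row per finished topic, so a crash loses at most the
# topic currently in progress. Column names are hypothetical.
RESULT="result.csv"
[ -f "$RESULT" ] || echo "topic_id,entries,seconds" > "$RESULT"

append_row() {
  # $1 = topic ID, $2 = number of entries, $3 = time cost in seconds
  printf '%s,%s,%s\n' "$1" "$2" "$3" >> "$RESULT"
}

append_row 10001 57 12
append_row 10002 31 8
```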
- Regular expressions and XPath are used to accurately select and clean the collected data
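As an illustration of the regex half of that cleaning step (the HTML fragment and pattern below are made up, not the real baby-kingdom markup; the XPath selection would typically happen earlier, inside the scraping library):

```shell
# Extract the numeric thread ID from a Discuz-style link using a
# regular expression. The markup here is illustrative only.
link='<a href="thread-12345-1-1.html">Some topic title</a>'

thread_id=$(echo "$link" | grep -oE 'thread-[0-9]+' | sed 's/thread-//')
echo "$thread_id"   # prints 12345
```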