LLM-Finetuning/crawler at main · Tonystark64/LLM-Finetuning

README.md
crawler.py
crawler.sh
records.csv
链接.txt

README.md

Web Crawler based on Requests and Beautiful Soup

Usage

Open the baby-kingdom page

Select the forum theme and note down the theme ID

Modify crawler.sh

Set the theme ID, start page, and end page (the end page must not exceed the last page of the theme)

Set the folder path (not a file name) where you want to save the results
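The settings listed above (theme ID, start page, end page, output folder) could be passed to a Python script roughly as follows. This is only a sketch: the flag names, the `parse_args` and `page_urls` helpers, and the `URL_TEMPLATE` placeholder are assumptions made for illustration, not the actual interface of crawler.py or crawler.sh.

```python
import argparse
from pathlib import Path

# Hypothetical URL template; the real forum URL pattern is not given in this README.
URL_TEMPLATE = "https://forum.example.com/theme/{theme_id}?page={page}"


def parse_args(argv=None):
    """Parse the four settings from the Usage steps (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Crawl one forum theme.")
    parser.add_argument("--theme-id", type=int, required=True)
    parser.add_argument("--start-page", type=int, default=1)
    parser.add_argument("--end-page", type=int, required=True)
    parser.add_argument("--out-dir", type=Path, required=True,
                        help="folder for the results, not a file name")
    args = parser.parse_args(argv)
    # Reject an empty or reversed range early instead of failing mid-crawl.
    if args.start_page < 1 or args.end_page < args.start_page:
        parser.error("need 1 <= start page <= end page")
    return args


def page_urls(theme_id, start_page, end_page):
    """Return one listing URL per page, inclusive of both ends."""
    return [URL_TEMPLATE.format(theme_id=theme_id, page=page)
            for page in range(start_page, end_page + 1)]
```

Note that the script cannot know the final page of a theme in advance, so the check above only guards against an invalid range; the upper bound still has to be read off the forum page by hand.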

Suggestions

Run the crawler inside a tmux session so that you can detach from it and terminate it later if needed

You may run two sessions concurrently from the same IP to increase throughput

For websites that load more content as you scroll down, Easy Spider is a better choice

Easy Spider is based on Selenium. For websites with a relatively fixed structure where only text data is required, Scrapy or Beautiful Soup is more efficient
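When running two sessions at once, each session needs its own page range so that no page is crawled twice. A minimal sketch of such a split is shown here; `split_pages` is a hypothetical helper, not part of this repository.

```python
def split_pages(start_page, end_page, sessions=2):
    """Split an inclusive page range into contiguous, non-overlapping chunks."""
    total = end_page - start_page + 1
    if total < 1 or sessions < 1:
        raise ValueError("need start_page <= end_page and sessions >= 1")
    sessions = min(sessions, total)  # never create an empty chunk
    base, extra = divmod(total, sessions)
    chunks = []
    first = start_page
    for i in range(sessions):
        # The first `extra` chunks take one additional page each.
        size = base + (1 if i < extra else 0)
        chunks.append((first, first + size - 1))
        first += size
    return chunks


print(split_pages(1, 100))  # [(1, 50), (51, 100)]
```

Each resulting pair can then be used as the start page and end page of one session.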

Features

The result file is updated after every topic, so an exception does not cause a complete loss of data

To see the number of entries and the time taken by each task, check result.csv

Regular expressions and XPath are used to accurately select and clean the collected data
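The two behaviours described so far (saving after every topic, and recording the entry count and time of each task) could be implemented along these lines. This is a sketch under assumptions: the column names, the per-task file name, and the `append_rows`, `crawl_task`, and `fetch_topic` names are hypothetical, and only the result.csv file name comes from this README.

```python
import csv
import time
from pathlib import Path


def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only when the file is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


def crawl_task(task_name, topics, fetch_topic, out_dir):
    """Save after every topic, then log the entry count and elapsed time."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    started = time.monotonic()
    entries = 0
    try:
        for topic in topics:
            rows = fetch_topic(topic)  # may raise; earlier topics are already on disk
            append_rows(out_dir / f"{task_name}.csv", ["topic", "text"], rows)
            entries += len(rows)
    finally:
        # The statistics row is written even if the task stopped on an exception.
        elapsed = round(time.monotonic() - started, 2)
        append_rows(out_dir / "result.csv", ["task", "entries", "seconds"],
                    [[task_name, entries, elapsed]])
    return entries
```

Opening the file in append mode for each topic is slower than keeping everything in memory, but it means that a crash on one topic leaves all previously fetched topics intact.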
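To illustrate how an XPath selection can be combined with regular-expression cleaning, here is a small standard-library sketch. It is not the code from crawler.py: the sample markup, the class names, and the patterns are invented for illustration, and it relies on the limited XPath subset of `xml.etree.ElementTree`, which only accepts well-formed markup. Beautiful Soup itself does not evaluate XPath, so a real crawler would typically use a library such as lxml for that step.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded forum page.
SAMPLE = """
<html><body>
  <div class="post"><p>  Hello world!   <br/> Posted on 2024-01-01 </p></div>
  <div class="ad"><p>Buy now</p></div>
  <div class="post"><p>Second
      post</p></div>
</body></html>
"""

WHITESPACE = re.compile(r"\s+")
DATE_STAMP = re.compile(r"Posted on \d{4}-\d{2}-\d{2}")


def extract_posts(markup):
    """Select post paragraphs with XPath, then clean the text with regexes."""
    root = ET.fromstring(markup.strip())
    posts = []
    # XPath: paragraphs directly inside <div class="post">, at any depth.
    for paragraph in root.findall(".//div[@class='post']/p"):
        text = "".join(paragraph.itertext())
        text = DATE_STAMP.sub("", text)             # drop the date stamp
        text = WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace
        if text:
            posts.append(text)
    return posts


print(extract_posts(SAMPLE))  # ['Hello world!', 'Second post']
```

The XPath expression decides which elements are kept (the advertisement block is skipped), while the regular expressions only tidy the text inside the selected elements.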
