Minecraft Wiki Crawler

Installation

We use Python 3.11 and have tested on macOS and Ubuntu 20.04. Follow the instructions below to run the crawler.

Install Requirements

pip install -r requirements.txt

Get Started

To start crawling, run python main.py.

In the main.py:

if __name__ == '__main__':
    # Directory where the collected page URLs are stored
    urls_dir = Path('crawler_data/rought_urls')
    base_url = 'https://minecraft.wiki'
    # Category pages to crawl; append more category URLs as needed
    urls = [
        'https://minecraft.wiki/w/Mob',
        'https://minecraft.wiki/w/Block',
        'https://minecraft.wiki/w/Item',
        'https://minecraft.wiki/w/Tutorials',
        'https://minecraft.wiki/w/Biome',
        'https://minecraft.wiki/w/Smithing',
        'https://minecraft.wiki/w/Structure'
    ]
    # Step 1: collect the URLs of all pages in the selected categories
    url_crawl(base_url=base_url, urls=urls, output_dir=urls_dir, rough=True)
    # Step 2: crawl the content of every collected page
    crawl(urls_dir=urls_dir, output_dir=Path('crawler_data/rough'))

    # Step 3: split oversized files in each crawled content directory
    content_dirs = [
        Path(path) for path in Path('crawler_data/rough').iterdir() if path.is_dir()
    ]
    print(content_dirs)
    split_content(content_dirs=content_dirs)
  • urls_dir: The directory where the crawled page URLs are saved.

  • base_url: The base URL of the Minecraft Wiki.

  • urls: The category pages to crawl. The wiki has 22 categories; select which ones to crawl by appending their URLs to this list.

  • url_crawl(): First collects the URLs of all pages in the selected categories and saves them to urls_dir.

  • crawl(): Crawls the content of every page whose URL is listed in urls_dir, including text, lists, and tables.

  • split_content(): Splits files whose word count exceeds the limit. Files are split at content-block boundaries so that, as far as possible, each resulting file stays within the limit.
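
To illustrate the splitting idea described for split_content(), here is a minimal sketch and not the repository's actual implementation. It assumes that content blocks are separated by blank lines and that the limit is a plain word count. The function name split_into_chunks and the parameter max_words are hypothetical.

```python
def split_into_chunks(text: str, max_words: int = 500) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks.

    Assumption (not from the project): a "content block" is any run of
    text delimited by blank lines. A single block that is longer than
    max_words is kept whole, so a chunk exceeds the limit only when one
    block alone is too long.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current: list[str] = []   # blocks collected for the chunk being built
    current_words = 0         # word count of the chunk being built

    for block in blocks:
        n_words = len(block.split())
        # Close the current chunk if adding this block would exceed the limit.
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(block)
        current_words += n_words

    # Flush whatever is left over as the final chunk.
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only at block boundaries keeps paragraphs, lists, and tables intact, which is why the limit is respected "as far as possible" rather than strictly: an oversized single block is never cut in the middle.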

TODO