A toolkit for moving IPFS Wikipedia articles to the Great Web.
Status: alpha
- This is very alpha software. It is highly recommended to create a dedicated account for this crawler.
- Keep `data/links.csv` and do not remove it, to avoid sending invalid transactions to the network.
Requirements:
- ipfs version 0.4.22
- python3
Setup:
- Clone this repo and go into it:
```bash
git clone https://github.com/SaveTheAles/wiki-crawler.git
cd wiki-crawler
```
- Install the Python packages:
```bash
pip3 install -r requirements.txt
```
- Fill `config.py` with your personal credentials.
- Fill `data/queries.txt` with the keywords you want to parse, one keyword per line (see the sketch after this list).
- Launch the IPFS daemon:
```bash
ipfs daemon
```
- Run:
```bash
python3 main.py
```
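For orientation, here is a minimal sketch of the two files you fill in above. The variable names in `config.py` are assumptions for illustration only; the actual file in this repo defines its own fields, so check it before editing:

```python
# config.py -- hypothetical field names, for illustration only;
# the real config.py defines its own variables.
ACCOUNT_ADDRESS = "cyber1..."          # address of the dedicated crawler account
ACCOUNT_SEED = "twelve word mnemonic"  # seed phrase; never commit this file
```

`data/queries.txt` is plain text, one keyword per line:

```
bitcoin
ipfs
cosmos
```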
The crawler reads keywords from `data/queries.txt`, searches Wikipedia for article titles matching each keyword, and creates cyberlinks:
```
query -> [titles]
[titles] -> query
```
It then fetches every article it found from the distributed Wikipedia by title and creates cyberlinks:
```
[titles] -> [articles]
```
Finally, it takes the links from the articles matching the query keyword and cyberlinks them too:
```
[articles] -> [links]
```
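As a minimal sketch of the title-search step, the snippet below queries the public MediaWiki `opensearch` endpoint; the actual `main.py` may query differently and additionally signs and submits the cyberlink transactions:

```python
import requests

def search_titles(query, limit=10):
    """Return Wikipedia article titles matching a keyword."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "opensearch", "search": query,
                "limit": limit, "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    # opensearch responses look like [query, [titles], [descriptions], [urls]]
    return resp.json()[1]

with open("data/queries.txt") as f:
    for query in (line.strip() for line in f):
        if query:
            # here the crawler would create the query -> [titles]
            # and [titles] -> query cyberlinks
            print(query, "->", search_titles(query))
```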
All the cyberlinks you create are stored in `data/links.csv`.
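This file doubles as your local record of what has already been submitted. Below is a sketch of how one might consult it before linking again, assuming each row holds a from/to pair in its first two columns; the actual column layout may differ, so inspect the file first:

```python
import csv

def load_existing_links(path="data/links.csv"):
    """Collect already-submitted (from, to) pairs from the local log."""
    with open(path, newline="") as f:
        return {(row[0], row[1]) for row in csv.reader(f) if len(row) >= 2}

existing = load_existing_links()
candidate = ("QmExampleFrom", "QmExampleTo")  # placeholder CIDs
if candidate in existing:
    print("skip: this cyberlink was already submitted")
```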
Tools:
- `cids.py`: extracts all CIDs you crawled to `data/cids.txt`. Useful if you need to pin your CIDs on a remote machine with an IPFS node or IPFS cluster (see the pinning sketch after this list).
- `rpc_check.py`: an extra check of whether your address has already created a given cyberlink. Use it to avoid invalid transactions on links that already exist.
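A sketch of the pinning step mentioned above: copy `data/cids.txt` to the remote machine and replay it against the local IPFS daemon via the CLI (an IPFS cluster setup would use `ipfs-cluster-ctl pin add` instead):

```python
import subprocess

with open("data/cids.txt") as f:
    for cid in (line.strip() for line in f):
        if cid:
            # `ipfs pin add <cid>` fetches the object and pins it locally
            subprocess.run(["ipfs", "pin", "add", cid], check=True)
```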
TODO:
- Move `wallet.py` and `transaction.py` to the cyber-py library and refactor.
- Add Mongo or another DB as local storage for cyberlinks.
- Include `rpc_check.py` as a parallel process.