Qualification for D-Orbit: developing a simple web crawler in Python.
Configuration of the web crawler can be modified in config.json
with the following parameters:
{
"max_concurrency": Max number of concurrent asyncio tasks,
"session_limit": Max number of concurrent requests in a session,
"timeout_session": Timeout for a session,
"max_retry": Number of retries for a single request,
"async_wait": Sleep time between async requests, to avoid overloading the server,
"skip_robots": If True, skip the robots.txt file and bypass all its rules,
"log_level": Log level: DEBUG, INFO, WARNING, ERROR or CRITICAL,
}
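For reference, a minimal sketch of how such a configuration could be loaded, with illustrative default values (the `load_config` helper and the numbers below are my own examples, not necessarily what the project ships):

```python
import json

# Illustrative defaults; the real project's values may differ.
DEFAULT_CONFIG = {
    "max_concurrency": 10,    # max concurrent asyncio tasks
    "session_limit": 5,       # max concurrent requests per session
    "timeout_session": 10,    # session timeout, seconds
    "max_retry": 3,           # retries for a single request
    "async_wait": 0.1,        # sleep between requests, seconds
    "skip_robots": False,     # bypass robots.txt rules when True
    "log_level": "INFO",      # DEBUG, INFO, WARNING, ERROR, CRITICAL
}

def load_config(path="config.json"):
    """Read config.json, falling back to the defaults above for missing keys."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {**DEFAULT_CONFIG, **json.load(fh)}
    except FileNotFoundError:
        return dict(DEFAULT_CONFIG)
```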
pip install -r requirements.txt
python .\webCrawler.py -s [URL_TO_START]
Example
python .\webCrawler.py -s https://docs.python.org/
Logs will be written to ./logs,
and the site map will be saved to ./site_map.csv
To build a simple web crawler as requested in the PDF, I need:
- a queue of URLs to visit, plus data structures for the visited URLs and for the site map
- a series of functions to GET a page, PARSE the HTML and EXTRACT/ANALYZE the links
- a concurrency manager for the async requests and for the HTML parsing that extracts the links
- a series of validations for the links: domain, robots.txt, already-visited URLs
- error handling for HTTP and for task scheduling
Output:
- each visited URL is printed together with the list of links found on the page
- the site map is saved as a CSV file at ./site_map.csv
- Async Queue for the URLs to_visit; it already implements a lock, so it is task safe, and get/put are O(1)
- Set for visited_urls; adding a URL and checking whether it was already visited are both O(1)
- Dict for site_map, with the URL as key and the list of links found there as value; only insertions are done, so it is O(1)
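A minimal sketch of these three structures working together (the names mirror the list above; the real modules may organise them differently):

```python
import asyncio

# FIFO queue of URLs still to crawl; asyncio.Queue is task safe, put/get are O(1).
to_visit = asyncio.Queue()

# URLs already seen; membership check is O(1) on average.
visited_urls = set()

# Site map: page URL -> list of links found on that page (insert only).
site_map = {}

async def record_page(url, links):
    """Store the crawled page in the site map and enqueue its unseen links."""
    site_map[url] = list(links)
    for link in links:
        if link not in visited_urls:
            visited_urls.add(link)
            await to_visit.put(link)
```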
One improvement could be a PriorityQueue for to_visit, to prioritize the order in which links are processed, for example visiting the shortest paths first so that a large number of links is retrieved at once, or some other logic. For now a simple FIFO queue is used.
The visited-URLs check could also be improved with something like a Bloom filter, to reduce the cost of the lookup.
The site map may look like a redundant data structure, since visited_urls already contains the visited links, but it is useful to keep the site map separate from the set of already-visited URLs. In a future implementation, visited_urls could be removed and site_map used for the check. More site and link rules could also be applied, for example removing unreachable links from site_map after the GET request.
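As an illustration only, the PriorityQueue idea mentioned above could look like this (not implemented in the current crawler; path depth as priority is just one possible heuristic):

```python
import asyncio
from urllib.parse import urlparse

# Possible future replacement for the FIFO queue: shorter paths come out first.
to_visit = asyncio.PriorityQueue()

async def enqueue(url):
    # Priority = number of path segments, so "/" beats "/a/b/c".
    depth = len([part for part in urlparse(url).path.split("/") if part])
    await to_visit.put((depth, url))   # tuples sort by depth, then by URL

async def dequeue():
    _, url = await to_visit.get()
    return url
```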
The web crawler is dominated by I/O (HTTP connections and parsing), so I chose the asyncio library. The main loop gets the next URL to visit from the queue (if the queue is not empty), and for each URL a task will:
- get the page, using the async HTTP client
- parse the page HTML into a set of links
- analyze the set of links and add them to the queue if they are valid and not yet visited
All tasks are awaited with asyncio.gather; errors are handled with try/except and logged.
The number of tasks is controlled by a semaphore, and a sleep time between requests is used to avoid overloading the server and to wait until all tasks have finished.
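A sketch of this scheduling pattern, assuming the function and constant names below (they are illustrative, not the project's real module layout), with the GET/parse/analyze steps stubbed out:

```python
import asyncio
import logging

MAX_CONCURRENCY = 10    # config: max_concurrency
ASYNC_WAIT = 0.1        # config: async_wait

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def handle_url(url):
    """One crawl task: GET -> parse -> enqueue new links (stubbed here)."""
    async with semaphore:                    # at most MAX_CONCURRENCY tasks at once
        try:
            await asyncio.sleep(0)           # placeholder for GET + parse + analyze
        except Exception as exc:             # never let one URL kill the crawl
            logging.error("task failed for %s: %s", url, exc)
        await asyncio.sleep(ASYNC_WAIT)      # small pause to spare the server

async def drain_batch(queue):
    """Schedule one task per queued URL and wait for the whole batch."""
    tasks = [asyncio.create_task(handle_url(await queue.get()))
             for _ in range(queue.qsize())]
    await asyncio.gather(*tasks)
```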
Page downloads are managed by a persistent HTTP session that limits the number of concurrent requests; every request has a timeout and a number of retries.
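A possible shape for that HTTP layer, assuming aiohttp (make_session and fetch are illustrative names; the limits come from the configuration):

```python
import asyncio
import aiohttp

SESSION_LIMIT = 5      # config: session_limit
TIMEOUT_SESSION = 10   # config: timeout_session, seconds
MAX_RETRY = 3          # config: max_retry

def make_session():
    """One persistent session, reused for every request."""
    return aiohttp.ClientSession(
        connector=aiohttp.TCPConnector(limit=SESSION_LIMIT),
        timeout=aiohttp.ClientTimeout(total=TIMEOUT_SESSION),
    )

async def fetch(session, url):
    """GET a page, retrying up to MAX_RETRY times; None means give up."""
    for attempt in range(1, MAX_RETRY + 1):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_RETRY:
                return None
            await asyncio.sleep(attempt)   # simple linear backoff
```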
HTML parsing uses the selectolax parser, an open source library that is roughly 4x faster than BeautifulSoup and 2x faster than lxml.
As the profiling results show, the bottleneck is HTML parsing, which is more time-consuming than the GET.
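A sketch of link extraction with selectolax (extract_links is an illustrative name; the real parser module may differ in details):

```python
from urllib.parse import urljoin
from selectolax.parser import HTMLParser

def extract_links(html, base_url):
    """Return the set of absolute links found in an HTML page."""
    links = set()
    for node in HTMLParser(html).css("a"):
        href = node.attributes.get("href")
        if href:
            links.add(urljoin(base_url, href))  # resolve relative URLs
    return links
```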
To analyze the links, a simple regex and urlparse are used to (a sketch follows the list):
- validate and check the URLs
- filter by domain
- check the robots.txt rules for the path
- skip static files in the path
- skip already-visited URLs
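A minimal version of these checks, assuming the rule set above (the extension list and the function name are illustrative; the robots.txt check is shown separately below):

```python
import re
from urllib.parse import urlparse

# Hypothetical list of static-file extensions to skip.
STATIC_EXT = re.compile(r"\.(css|js|png|jpe?g|gif|svg|ico|pdf|zip)$", re.IGNORECASE)

def is_valid_link(url, allowed_domain, visited):
    """Filter logic roughly matching the rules listed above."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):   # validate the URL
        return False
    if parsed.netloc != allowed_domain:          # stay on the same domain
        return False
    if STATIC_EXT.search(parsed.path):           # skip static files
        return False
    if url in visited:                           # skip already-visited URLs
        return False
    return True
```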
To deal with the robots.txt rules, the crawler downloads and parses the file and checks whether each URL is allowed to be visited.
The robots check can be skipped if skip_robots is set to True in the configuration.
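One way to implement this with the standard library's RobotFileParser, assuming the robots.txt body has already been downloaded (a sketch; the crawler may do it differently):

```python
from urllib.robotparser import RobotFileParser

SKIP_ROBOTS = False   # config: skip_robots

def build_robot_rules(robots_txt, robots_url):
    """Parse a robots.txt body that was already fetched (e.g. via the async client)."""
    parser = RobotFileParser(robots_url)
    parser.parse(robots_txt.splitlines())
    return parser

def allowed(parser, url, user_agent="*"):
    """True if the URL may be visited; bypasses all rules when skip_robots is set."""
    if SKIP_ROBOTS:
        return True
    return parser.can_fetch(user_agent, url)
```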
Each task completed in the range of 0.0x seconds, for a total throughput of 100-120 links per second.
Unit tests can be found in the /test folder.
Coverage results are in this folder:
Module | Coverage | Tests passed
---|---|---
helper.py | 91% | 5/5
http_manager.py | 44% | 2/3
logger.py | 83% | 1/1
parser.py | 85% | 3/3
spider.py | 44% | 1/1
I tested all the static functions and the main functions of the web crawler, but I still need to improve the test coverage of http_manager and spider.
I had some trouble testing http_manager because it is async code, so I would need to mock a server and test the async response; I will need to study this more to improve the coverage.
All the process and code behaviour was extensively tested through the log files, prints and many different sites. The results were as expected: all the code paths were exercised and the error handling was improved.
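For reference, one way the async HTTP code could be unit tested without a real server is with unittest's async test case and a fake response object (a sketch, not the tests actually shipped; the fetch below mirrors the earlier sketch rather than the real http_manager):

```python
import unittest
from unittest.mock import MagicMock

class FakeResponse:
    """Minimal stand-in for an aiohttp response used as `async with session.get(...)`."""
    async def text(self):
        return "<html><a href='/page'>link</a></html>"

    def raise_for_status(self):
        return None

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

async def fetch(session, url):
    """Stand-in for the real fetch, with the same shape as the sketch above."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

class TestFetch(unittest.IsolatedAsyncioTestCase):
    async def test_fetch_returns_html(self):
        session = MagicMock()
        session.get = MagicMock(return_value=FakeResponse())
        html = await fetch(session, "https://example.org")
        self.assertIn("href", html)

if __name__ == "__main__":
    unittest.main()
```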
EDIT:
I improved the test coverage of http_manager and spider; the new coverage is below. I did not use mocking, which would have been better, but I tested against a real site and the results were as expected.
Module | Coverage | Tests passed
---|---|---
helper.py | 94% | 5/5
http_manager.py | 77% | 7/7
logger.py | 83% | 1/1
parser.py | 85% | 3/3
spider.py | 71% | 3/3
I also fixed some minor bugs and improved the code coverage considerably. The new test coverage results can be found in this folder.
The code could be optimized further, but it is already fast. The bottlenecks are the HTML parsing, which takes about 30% of the time, and the link check with urlparse, which takes another 30%.
For further improvement a better parser would be needed; I already tried three and got significant gains moving from BeautifulSoup to lxml and finally to selectolax.
Another improvement could be to split the work into two groups of tasks, one for the HTTP GET requests and one for the parsing; this would improve performance but make the code more complex. The GET time is less than 5% of the total, so it would be possible to group all the GET requests into concurrent tasks, and then parse the HTML in another group of tasks, possibly with multiprocessing. With the awaited GET results, the parsing would run in its own tasks, and the cycle would repeat for the newly found links: that could improve the performance.
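A rough sketch of that two-stage idea (not implemented; fetch and parse_links are stubbed placeholders for the real GET and selectolax parsing):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fetch(url):
    """Placeholder GET; the real crawler would use the shared aiohttp session."""
    return "<html></html>"

def parse_links(html):
    """CPU-bound parsing step (selectolax in the real code); stubbed here."""
    return []

async def crawl_batch(urls):
    """Stage 1: all GETs concurrently. Stage 2: parse the bodies in worker processes."""
    pages = await asyncio.gather(*(fetch(u) for u in urls))        # I/O bound
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:                            # CPU bound
        parsed = await asyncio.gather(
            *(loop.run_in_executor(pool, parse_links, html) for html in pages)
        )
    return dict(zip(urls, parsed))
```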
Also, urlparse is an expensive operation and could be improved with a dedicated library. The URL needs to be decomposed into its parts; doing so instead of using a regex is safer, but more study is needed to improve the performance.
Logging goes to stdout and to a file, with the levels DEBUG, INFO, WARNING, ERROR and CRITICAL.
Logs can be found in the /logs folder, and the log level can be changed in the config.json file.
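A typical setup for that kind of dual logging (a sketch; the log file name and the setup_logger helper are my own examples, not necessarily the project's logger module):

```python
import logging
import os

def setup_logger(level="INFO", log_dir="./logs"):
    """Log to stdout and to a file, with the level taken from config.json."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger("webCrawler")
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(),
                    logging.FileHandler(os.path.join(log_dir, "crawler.log"))):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```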
Requirements can be found in the requirements.txt file:
aiohttp~=3.8.4
lxml~=4.9.3
asynctest~=0.13.0
selectolax~=0.3.19
I mostly used standard libraries, apart from a few specific ones:
- selectolax: HTML parser
- aiohttp: async HTTP client
- pylint: code quality
I had never used Pylint before, but as you suggested in the previous meeting, I used it to check the code quality and got the following results:
- Last version: Your code has been rated at 9.47/10
- 20240203: Your code has been rated at 9.27/10
- 20240202: Your code has been rated at 8.86/10
- 20240201: Your code has been rated at 7.78/10 - First run
Pylint helped me make the code more readable and clean, follow best practices and improve the documentation; it was also useful for finding some minor bugs and typos and for improving the overall code quality.
I tried BeautifulSoup, which was several times slower, and lxml, which was about 2x slower, so I settled on selectolax.
- A lot of Stack Overflow and the online docs of the libraries
- Old but good study
- A billion-pages crawler
- Theory
- Parser
- My old web parser and Telegram bot