Qualification for D-Orbit: developing a simple web crawler in Python.
Configuration of the web crawler can be modified in config.json
with the following parameters:
{
"max_concurrency": Max number of concurrent asyncio tasks,
"session_limit": Max number of concurrent requests in a session,
"timeout_session": Timeout for a session,
"max_retry": Number of retries for a single request,
"async_wait": Sleep time between async requests, to avoid overloading the server,
"skip_robots": If True, skip the robots.txt file and bypass all its rules,
"log_level": Log level: DEBUG, INFO, WARNING, ERROR or CRITICAL,
}
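For reference, a minimal sketch of how such a configuration could be loaded, with illustrative default values (the `load_config` helper and the numbers below are my own examples, not necessarily what the project ships):

```python
import json

# Illustrative defaults; the real project's values may differ.
DEFAULT_CONFIG = {
    "max_concurrency": 10,    # max concurrent asyncio tasks
    "session_limit": 5,       # max concurrent requests per session
    "timeout_session": 10,    # session timeout, seconds
    "max_retry": 3,           # retries for a single request
    "async_wait": 0.1,        # sleep between requests, seconds
    "skip_robots": False,     # bypass robots.txt rules when True
    "log_level": "INFO",      # DEBUG, INFO, WARNING, ERROR, CRITICAL
}

def load_config(path="config.json"):
    """Read config.json, falling back to the defaults above for missing keys."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {**DEFAULT_CONFIG, **json.load(fh)}
    except FileNotFoundError:
        return dict(DEFAULT_CONFIG)
```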
pip install -r requirements.txt
python .\webCrawler.py -s [URL_TO_START]
Example
python .\webCrawler.py -s https://docs.python.org/
Logs will be written to ./logs,
and the site map will be saved to ./site_map.csv
To build a simple web crawler as requested in the PDF, I need:
- a queue of URLs to visit, plus data structures for the visited URLs and for the site map
- a series of functions to GET a page, PARSE the HTML and EXTRACT/ANALYZE the links
- a concurrency manager for the async requests and for the HTML parsing that extracts the links
- a series of validations for the links: domain, robots.txt, already-visited URLs
- error handling for HTTP and for task scheduling
Output:
- each visited URL is printed together with the list of links found on the page
- the site map is saved as a CSV file at ./site_map.csv
- Async Queue for the URLs to_visit; it already implements a lock, so it is task safe, and get/put are O(1)
- Set for visited_urls; adding a URL and checking whether it was already visited are both O(1)
- Dict for site_map, with the URL as key and the list of links found there as value; only insertions are done, so it is O(1)
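A minimal sketch of these three structures working together (the names mirror the list above; the real modules may organise them differently):

```python
import asyncio

# FIFO queue of URLs still to crawl; asyncio.Queue is task safe, put/get are O(1).
to_visit = asyncio.Queue()

# URLs already seen; membership check is O(1) on average.
visited_urls = set()

# Site map: page URL -> list of links found on that page (insert only).
site_map = {}

async def record_page(url, links):
    """Store the crawled page in the site map and enqueue its unseen links."""
    site_map[url] = list(links)
    for link in links:
        if link not in visited_urls:
            visited_urls.add(link)
            await to_visit.put(link)
```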
One improvement could be a PriorityQueue for to_visit, to prioritize the order in which links are processed, for example visiting the shortest paths first so that a large number of links is retrieved at once, or some other logic. For now a simple FIFO queue is used.
The visited-URLs check could also be improved with something like a Bloom filter, to reduce the cost of the lookup.
The site map may look like a redundant data structure, since visited_urls already contains the visited links, but it is useful to keep the site map separate from the set of already-visited URLs. In a future implementation, visited_urls could be removed and site_map used for the check. More site and link rules could also be applied, for example removing unreachable links from site_map after the GET request.
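As an illustration only, the PriorityQueue idea mentioned above could look like this (not implemented in the current crawler; path depth as priority is just one possible heuristic):

```python
import asyncio
from urllib.parse import urlparse

# Possible future replacement for the FIFO queue: shorter paths come out first.
to_visit = asyncio.PriorityQueue()

async def enqueue(url):
    # Priority = number of path segments, so "/" beats "/a/b/c".
    depth = len([part for part in urlparse(url).path.split("/") if part])
    await to_visit.put((depth, url))   # tuples sort by depth, then by URL

async def dequeue():
    _, url = await to_visit.get()
    return url
```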
The web crawler is dominated by I/O (HTTP connections and parsing), so I chose the asyncio library. The main loop gets the next URL to visit from the queue (if the queue is not empty), and for each URL a task will:
- get the page, using the async HTTP client
- parse the page HTML into a set of links
- analyze the set of links and add them to the queue if they are valid and not yet visited
All tasks are awaited with asyncio.gather; errors are handled with try/except and logged.
The number of tasks is controlled by a semaphore, and a sleep time between requests is used to avoid overloading the server and to wait until all tasks have finished.
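A sketch of this scheduling pattern, assuming the function and constant names below (they are illustrative, not the project's real module layout), with the GET/parse/analyze steps stubbed out:

```python
import asyncio
import logging

MAX_CONCURRENCY = 10    # config: max_concurrency
ASYNC_WAIT = 0.1        # config: async_wait

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def handle_url(url):
    """One crawl task: GET -> parse -> enqueue new links (stubbed here)."""
    async with semaphore:                    # at most MAX_CONCURRENCY tasks at once
        try:
            await asyncio.sleep(0)           # placeholder for GET + parse + analyze
        except Exception as exc:             # never let one URL kill the crawl
            logging.error("task failed for %s: %s", url, exc)
        await asyncio.sleep(ASYNC_WAIT)      # small pause to spare the server

async def drain_batch(queue):
    """Schedule one task per queued URL and wait for the whole batch."""
    tasks = [asyncio.create_task(handle_url(await queue.get()))
             for _ in range(queue.qsize())]
    await asyncio.gather(*tasks)
```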
Page downloads are managed by a persistent HTTP session that limits the number of concurrent requests; every request has a timeout and a number of retries.
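A possible shape for that HTTP layer, assuming aiohttp (make_session and fetch are illustrative names; the limits come from the configuration):

```python
import asyncio
import aiohttp

SESSION_LIMIT = 5      # config: session_limit
TIMEOUT_SESSION = 10   # config: timeout_session, seconds
MAX_RETRY = 3          # config: max_retry

def make_session():
    """One persistent session, reused for every request."""
    return aiohttp.ClientSession(
        connector=aiohttp.TCPConnector(limit=SESSION_LIMIT),
        timeout=aiohttp.ClientTimeout(total=TIMEOUT_SESSION),
    )

async def fetch(session, url):
    """GET a page, retrying up to MAX_RETRY times; None means give up."""
    for attempt in range(1, MAX_RETRY + 1):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_RETRY:
                return None
            await asyncio.sleep(attempt)   # simple linear backoff
```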
HTML parsing uses the selectolax parser, an open source library that is roughly 4x faster than BeautifulSoup and 2x faster than lxml.
As the profiling results show, the bottleneck is HTML parsing, which is more time-consuming than the GET.
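A sketch of link extraction with selectolax (extract_links is an illustrative name; the real parser module may differ in details):

```python
from urllib.parse import urljoin
from selectolax.parser import HTMLParser

def extract_links(html, base_url):
    """Return the set of absolute links found in an HTML page."""
    links = set()
    for node in HTMLParser(html).css("a"):
        href = node.attributes.get("href")
        if href:
            links.add(urljoin(base_url, href))  # resolve relative URLs
    return links
```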
To analyze the links, a simple regex and urlparse are used to (a sketch follows the list):
- validate and check the URLs
- filter by domain
- check the robots.txt rules for the path
- skip static files in the path
- skip already-visited URLs
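A minimal version of these checks, assuming the rule set above (the extension list and the function name are illustrative; the robots.txt check is shown separately below):

```python
import re
from urllib.parse import urlparse

# Hypothetical list of static-file extensions to skip.
STATIC_EXT = re.compile(r"\.(css|js|png|jpe?g|gif|svg|ico|pdf|zip)$", re.IGNORECASE)

def is_valid_link(url, allowed_domain, visited):
    """Filter logic roughly matching the rules listed above."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):   # validate the URL
        return False
    if parsed.netloc != allowed_domain:          # stay on the same domain
        return False
    if STATIC_EXT.search(parsed.path):           # skip static files
        return False
    if url in visited:                           # skip already-visited URLs
        return False
    return True
```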
To deal with the robots.txt rules, the crawler downloads and parses the file and checks whether each URL is allowed to be visited.
The robots check can be skipped if skip_robots is set to True in the configuration.
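One way to implement this with the standard library's RobotFileParser, assuming the robots.txt body has already been downloaded (a sketch; the crawler may do it differently):

```python
from urllib.robotparser import RobotFileParser

SKIP_ROBOTS = False   # config: skip_robots

def build_robot_rules(robots_txt, robots_url):
    """Parse a robots.txt body that was already fetched (e.g. via the async client)."""
    parser = RobotFileParser(robots_url)
    parser.parse(robots_txt.splitlines())
    return parser

def allowed(parser, url, user_agent="*"):
    """True if the URL may be visited; bypasses all rules when skip_robots is set."""
    if SKIP_ROBOTS:
        return True
    return parser.can_fetch(user_agent, url)
```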
Each task completed in the range of 0.0x seconds, for a total throughput of 100-120 links per second.
Unit tests can be found in the /test folder.
Coverage results are in this folder:
Module | Coverage | Tests passed
---|---|---
helper.py | 91% | 5/5
http_manager.py | 44% | 2/3
logger.py | 83% | 1/1
parser.py | 85% | 3/3
spider.py | 44% | 1/1
I tested all the static functions and the main functions of the web crawler, but I still need to improve the test coverage of http_manager and spider.
I had some trouble testing http_manager because it is async code, so I would need to mock a server and test the async response; I will need to study this more to improve the coverage.
All the process and code behaviour was extensively tested through the log files, prints and many different sites. The results were as expected: all the code paths were exercised and the error handling was improved.
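For reference, one way the async HTTP code could be unit tested without a real server is with unittest's async test case and a fake response object (a sketch, not the tests actually shipped; the fetch below mirrors the earlier sketch rather than the real http_manager):

```python
import unittest
from unittest.mock import MagicMock

class FakeResponse:
    """Minimal stand-in for an aiohttp response used as `async with session.get(...)`."""
    async def text(self):
        return "<html><a href='/page'>link</a></html>"

    def raise_for_status(self):
        return None

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

async def fetch(session, url):
    """Stand-in for the real fetch, with the same shape as the sketch above."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()

class TestFetch(unittest.IsolatedAsyncioTestCase):
    async def test_fetch_returns_html(self):
        session = MagicMock()
        session.get = MagicMock(return_value=FakeResponse())
        html = await fetch(session, "https://example.org")
        self.assertIn("href", html)

if __name__ == "__main__":
    unittest.main()
```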
EDIT:
I improved the test coverage of http_manager and spider; the new coverage is below. I did not use mocking, which would have been better, but I tested against a real site and the results were as expected.
Module | Coverage | Tests passed
---|---|---
helper.py | 94% | 5/5
http_manager.py | 77% | 7/7
logger.py | 83% | 1/1
parser.py | 85% | 3/3
spider.py | 71% | 3/3
I also fixed some minor bugs and improved the code coverage considerably. The new test coverage results can be found in this folder.
The code could be optimized further, but it is already fast. The bottlenecks are the HTML parsing, which takes about 30% of the time, and the link check with urlparse, which takes another 30%.
For further improvement a better parser would be needed; I already tried three and got significant gains moving from BeautifulSoup to lxml and finally to selectolax.
Another improvement could be to split the work into two groups of tasks, one for the HTTP GET requests and one for the parsing; this would improve performance but make the code more complex. The GET time is less than 5% of the total, so it would be possible to group all the GET requests into concurrent tasks, and then parse the HTML in another group of tasks, possibly with multiprocessing. With the awaited GET results, the parsing would run in its own tasks, and the cycle would repeat for the newly found links: that could improve the performance.
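A rough sketch of that two-stage idea (not implemented; fetch and parse_links are stubbed placeholders for the real GET and selectolax parsing):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fetch(url):
    """Placeholder GET; the real crawler would use the shared aiohttp session."""
    return "<html></html>"

def parse_links(html):
    """CPU-bound parsing step (selectolax in the real code); stubbed here."""
    return []

async def crawl_batch(urls):
    """Stage 1: all GETs concurrently. Stage 2: parse the bodies in worker processes."""
    pages = await asyncio.gather(*(fetch(u) for u in urls))        # I/O bound
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:                            # CPU bound
        parsed = await asyncio.gather(
            *(loop.run_in_executor(pool, parse_links, html) for html in pages)
        )
    return dict(zip(urls, parsed))
```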
Also, urlparse is an expensive operation and could be improved with a dedicated library. The URL needs to be decomposed into its parts; doing so instead of using a regex is safer, but more study is needed to improve the performance.
Logging goes to stdout and to a file, with the levels DEBUG, INFO, WARNING, ERROR and CRITICAL.
Logs can be found in the /logs folder, and the log level can be changed in the config.json file.
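A typical setup for that kind of dual logging (a sketch; the log file name and the setup_logger helper are my own examples, not necessarily the project's logger module):

```python
import logging
import os

def setup_logger(level="INFO", log_dir="./logs"):
    """Log to stdout and to a file, with the level taken from config.json."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger("webCrawler")
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(),
                    logging.FileHandler(os.path.join(log_dir, "crawler.log"))):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```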
Requirements can be found in the requirements.txt file:
aiohttp~=3.8.4
lxml~=4.9.3
asynctest~=0.13.0
selectolax~=0.3.19
I mostly used standard libraries, apart from a few specific ones:
- selectolax: HTML parser
- aiohttp: async HTTP client
- pylint: code quality
I had never used Pylint before, but as you suggested in the previous meeting, I used it to check the code quality and got the following results:
- Last version: Your code has been rated at 9.47/10
- 20240203: Your code has been rated at 9.27/10
- 20240202: Your code has been rated at 8.86/10
- 20240201: Your code has been rated at 7.78/10 - First run
Pylint helped me make the code more readable and clean, follow best practices and improve the documentation; it was also useful for finding some minor bugs and typos and for improving the overall code quality.
I tried BeautifulSoup, which was several times slower, and lxml, which was about 2x slower, so I settled on selectolax.
- A lot of Stack Overflow and the online docs of the libraries
- Old but good study
- A billion-pages crawler
- Theory
- Parser
- My old web parser and Telegram bot