web-Crawler-d-orbit

Qualification for D-Orbit. Developing a simple web crawler in python

Configuration file

Configuration of the web crawler can be modified in config.json with the following parameters:

{
    "max_concurrency": maximum number of concurrent asyncio tasks,
    "session_limit": maximum number of concurrent requests in a session,
    "timeout_session": timeout for a session,
    "max_retry": number of retries for a single request,
    "async_wait": sleep time between async requests, to avoid overloading the server,
    "skip_robots": if true, skip the robots.txt file and bypass all its rules,
    "log_level": log level, one of DEBUG, INFO, WARNING, ERROR, CRITICAL
}
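
As an illustration, a filled-in config.json could look like the following (the values are placeholders, not the repository defaults):

{
    "max_concurrency": 20,
    "session_limit": 10,
    "timeout_session": 30,
    "max_retry": 3,
    "async_wait": 0.1,
    "skip_robots": false,
    "log_level": "INFO"
}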

How to run

pip install -r requirements.txt

python .\webCrawler.py -s [URL_TO_START]

Example

python .\webCrawler.py -s https://docs.python.org/ 

Output logs will be written to the ./logs folder and the site map will be saved to ./site_map.csv

Problem description

The problem statement is provided as a PDF. To build a simple web crawler as requested, I need:

  • a queue of URLs to visit, plus data structures for the visited URLs and for the site map
  • a series of functions to GET a page, PARSE the HTML, and EXTRACT/ANALYZE the links
  • a concurrency manager for the async requests and for the HTML parsing that extracts the links
  • a series of validations for the links: domain, robots.txt, already visited URLs
  • error handling for HTTP and for task scheduling

Output:

  • print each visited URL and the list of links found on the page; the site map is saved as a CSV file at ./site_map.csv

Solution

Crawler Schema

Data structure

  • an asyncio Queue of URLs to_visit; it already implements a lock, so it is safe for concurrent use, and get and put are O(1)
  • a set of visited_urls; checking whether a URL was already visited is O(1)
  • a dict for site_map, with the URL as key and the list of links found there as value; only insertions are done, so it is O(1)
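
A minimal sketch of these three structures, using the names above (the schedule helper is illustrative, not the exact code in spider.py):

import asyncio

to_visit = asyncio.Queue()   # FIFO queue of URLs still to crawl; safe for concurrent tasks
visited_urls = set()         # O(1) membership check for already seen URLs
site_map = {}                # url -> list of links found on that page

async def schedule(url):
    # illustrative helper: enqueue a URL only once, marking it as seen immediately
    if url not in visited_urls:
        visited_urls.add(url)
        await to_visit.put(url)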

One improvement could be to use a PriorityQueue for to_visit to prioritize the order of the link tasks, for example visiting URLs with shorter paths first to retrieve a large number of links early, or some other heuristic. For now, a simple FIFO queue is used.

The visited_urls check could also be improved with a Bloom filter, to make the membership test cheaper on very large crawls.

The site map may look like a redundant data structure, since visited_urls already contains the visited links, but it is useful to keep the site map separate from the set of already visited URLs. In a future implementation, visited_urls could be removed and site_map used for the check. More site and link rules could also be applied, such as removing unreachable pages from site_map after the GET request.

Concurrency

A web crawler is dominated by I/O operations (HTTP connections and parsing), so I chose the asyncio library. The main task gets the next URL to visit from the queue (if the queue is not empty) and will:

  • get the page, using the async HTTP client
  • parse the page HTML into a set of links
  • analyze the set of links and add each one to the queue if it is valid and not yet visited

All tasks are awaited with asyncio.gather; errors are handled with try/except and logged.

The number of tasks is controlled by a semaphore, and a sleep time between requests is used to avoid overloading the server and to wait until all tasks are finished.

Page downloads are managed by a persistent HTTP session with a limit on concurrent requests; every request has a timeout and a number of retries, as sketched below.
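
A sketch of the download side, assuming aiohttp and the config keys above (function names and the CONFIG values are illustrative, not the actual http_manager code):

import asyncio

import aiohttp

CONFIG = {"max_concurrency": 20, "session_limit": 10,
          "timeout_session": 30, "max_retry": 3, "async_wait": 0.1}

semaphore = asyncio.Semaphore(CONFIG["max_concurrency"])

async def fetch(session, url):
    # one bounded, retried GET; returns the HTML text or None on failure
    async with semaphore:
        for _ in range(CONFIG["max_retry"]):
            try:
                async with session.get(url) as response:
                    response.raise_for_status()
                    return await response.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                await asyncio.sleep(CONFIG["async_wait"])  # back off before retrying
        return None

async def fetch_all(urls):
    # persistent session shared by all requests, with a connection limit and a timeout
    connector = aiohttp.TCPConnector(limit=CONFIG["session_limit"])
    timeout = aiohttp.ClientTimeout(total=CONFIG["timeout_session"])
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

The aiohttp pieces (TCPConnector limit, ClientTimeout, the retry loop) map to the session_limit, timeout_session, and max_retry parameters of config.json.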

HTML parsing uses the selectolax parser, an open-source library that is faster than BeautifulSoup (about 4x) and lxml (about 2x).
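
For reference, link extraction with selectolax looks roughly like this (a sketch, not the exact parser.py code):

from selectolax.parser import HTMLParser

def extract_links(html):
    # collect all href values from anchor tags; a set deduplicates repeated links
    links = set()
    for node in HTMLParser(html).css("a"):
        href = node.attributes.get("href")
        if href:
            links.add(href)
    return links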

As the profiling results show, the bottleneck is HTML parsing, which is more time-consuming than the GET requests.

To analyze the links, I used a simple regex and urlparse (a sketch of the filter follows this list) to:

  • validate and check the URLs
  • filter by domain
  • check the robots.txt rules for the path
  • skip static files in the path
  • skip already visited URLs
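
A sketch of such a link filter with urllib.parse (names and the static-file extension list are illustrative assumptions; the robots check is shown separately below):

from urllib.parse import urljoin, urlparse

STATIC_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")

def normalize_link(base_url, href, visited_urls):
    # returns a normalized absolute URL if the link should be crawled, else None
    url = urljoin(base_url, href)                       # resolve relative links
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):           # drop mailto:, javascript:, etc.
        return None
    if parts.netloc != urlparse(base_url).netloc:       # stay inside the start domain
        return None
    if parts.path.lower().endswith(STATIC_EXTENSIONS):  # skip static files
        return None
    url = parts._replace(fragment="").geturl()          # strip #fragments
    if url in visited_urls:                             # skip already visited URLs
        return None
    return url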

To respect the robots.txt rules, the crawler downloads and parses the file and checks whether each URL is allowed to be visited. The robots check can be skipped by setting skip_robots to True in the configuration.
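
One way to implement this check is the standard library urllib.robotparser; a sketch under that assumption (the actual crawler may fetch and parse robots.txt differently, e.g. through the async HTTP client):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def load_robots(start_url, skip_robots=False):
    # returns a callable that says whether a URL may be crawled
    if skip_robots:
        return lambda url: True
    parser = RobotFileParser(urljoin(start_url, "/robots.txt"))
    parser.read()  # downloads and parses robots.txt (synchronously, for simplicity)
    return lambda url: parser.can_fetch("*", url)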

Task completion time was on the order of 0.0x seconds, for a total throughput of 100-120 links per second.

Test coverage

Unit tests can be found in the /test folder; coverage results are in the same folder.

Test              Coverage   Tests passed
helper.py         91%        5/5
http_manager.py   44%        2/3
logger.py         83%        1/1
parser.py         85%        3/3
spider.py         44%        1/1

I tested all the static functions and the main functions of the web crawler, but the test coverage of http_manager and spider still needs to be improved.

I had some trouble testing http_manager because its functions are async, so I need to mock a server and test the async responses. I will need to study this further to improve the test coverage.
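
One possible approach is aiohttp's built-in test utilities, which spin up a local test server so the async client code can be exercised without touching the network; a sketch under that assumption (the handler and assertions are illustrative, not existing tests):

from aiohttp import web
from aiohttp.test_utils import AioHTTPTestCase

class FetchPageTest(AioHTTPTestCase):

    async def get_application(self):
        # minimal fake site served locally for the test
        async def index(request):
            return web.Response(text='<a href="/page">page</a>', content_type="text/html")
        app = web.Application()
        app.router.add_get("/", index)
        return app

    async def test_get_returns_html(self):
        async with self.client.get("/") as response:
            assert response.status == 200
            assert "/page" in await response.text()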

All processes and code behavior were extensively tested using the log files and print statements, and against many different sites. The results were as expected, the code paths were covered, and error handling was improved. The total coverage report is in the same folder.

EDIT:

I improved the test coverage of http_manager and spider; the coverage is now as follows. I did not use mocking, which would have been better, but tested against a real site and the results were as expected.

Test              Coverage   Tests passed
helper.py         94%        5/5
http_manager.py   77%        7/7
logger.py         83%        1/1
parser.py         85%        3/3
spider.py         71%        3/3

I also fixed some minor bugs and improved the code coverage considerably. The new test coverage results can be found in the same folder.

Profiling of the code

Profiling

The code could be optimized further, but it is already fast. The bottlenecks are HTML parsing, which takes about 30% of the time, and link checking with urlparse, which takes another 30%.
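
For reference, one way to reproduce such a profile with the standard library cProfile module (the output file name is just an example):

python -m cProfile -o crawler.prof .\webCrawler.py -s https://docs.python.org/
python -c "import pstats; pstats.Stats('crawler.prof').sort_stats('cumulative').print_stats(15)"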

For further improvement, a faster parser would be needed; I already tried three and saw significant gains going from BeautifulSoup to lxml and finally to selectolax.

Another improvement could be to split the work into two groups of tasks, one for the HTTP GET requests and one for the parsing. This would improve performance but make the code more complex. GET takes less than 5% of the time, so it is possible to group all GET requests into concurrent tasks and then parse the HTML in other tasks, perhaps with multiprocessing. With the awaited GET results, the parsing can run in async tasks, repeating the cycle for each new link found; this could improve performance. A rough sketch of the idea follows.
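
A self-contained sketch of that two-stage idea, downloading with aiohttp in the event loop and parsing with selectolax in a process pool (illustrative only, not part of the current code):

import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp
from selectolax.parser import HTMLParser

def parse_links(html):
    # CPU-bound work, safe to run in a worker process
    return [n.attributes.get("href") for n in HTMLParser(html).css("a")
            if n.attributes.get("href")]

async def crawl_round(urls):
    # stage 1: download all pages concurrently in the event loop
    async with aiohttp.ClientSession() as session:
        async def get(url):
            async with session.get(url) as resp:
                return await resp.text()
        pages = await asyncio.gather(*(get(u) for u in urls), return_exceptions=True)
    # stage 2: parse the downloaded HTML in separate processes
    # (on Windows, call this from a module guarded by if __name__ == "__main__")
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        jobs = [loop.run_in_executor(pool, parse_links, page)
                for page in pages if isinstance(page, str)]
        return await asyncio.gather(*jobs)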

Also, urlparse is an expensive operation and could be improved with a dedicated library. The URL needs to be decomposed into its parts; doing this properly instead of using a regex is safer, but more study is needed to improve the performance.

Profile time

Logging

Logs go to stdout and to a file, with the levels DEBUG, INFO, WARNING, ERROR, and CRITICAL. Logs can be found in the /logs folder, and the log level can be changed in the config.json file.
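
A minimal sketch of such a logger setup with the standard logging module (the file name and format string are assumptions, not the actual logger.py):

import logging
import os

def setup_logger(level="INFO", log_file="./logs/crawler.log"):
    # log to stdout and to a file; the level comes from config.json ("log_level")
    os.makedirs(os.path.dirname(log_file), exist_ok=True)
    logger = logging.getLogger("webCrawler")
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger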

Some test results

Libraries used

Requirements can be found in the requirements.txt file

aiohttp~=3.8.4
lxml~=4.9.3
asynctest~=0.13.0
selectolax~=0.3.19

Apart from these specific async and parsing libraries, I used only standard libraries.

Pylint

I had never used Pylint before, but as you suggested in the previous meeting, I used it to check the code quality and got the following results:

  • Last version: Your code has been rated at 9.47/10
  • 20240203: Your code has been rated at 9.27/10
  • 20240202: Your code has been rated at 8.86/10
  • 20240201: Your code has been rated at 7.78/10 - First run

Pylint helped me make the code more readable and cleaner, follow best practices, and improve the documentation; it was also useful for finding some minor bugs and typos and improving the overall code quality.

New parser

selectolax

I tried BeautifulSoup and lxml: BeautifulSoup was several times slower and lxml about 2x slower, so I used selectolax.

Resources used
