Given a list of configured URLs and one or more regular expressions to check for each URL, this program performs the checks asynchronously.
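For illustration, here is a minimal sketch of the core check, assuming the regular expressions are matched against each response body; `URLS` and `PATTERNS` are placeholder names, not the actual configuration used by final_program.py:

```python
import asyncio
import re

import aiohttp

URLS = ["https://example.com"]        # placeholder for the configured URL list
PATTERNS = [re.compile(r"Example")]   # placeholder for the configured regexes

async def check(session, url):
    # Fetch the page and test every configured pattern against the body.
    async with session.get(url) as resp:
        body = await resp.text()
    for pattern in PATTERNS:
        print(url, pattern.pattern, bool(pattern.search(body)))

async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(check(session, url) for url in URLS))

asyncio.run(main())
```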
- dataset_generator.py: Used to generate the dataset on which final_program.py is tested.
- test_final.py: Contains simple unit tests for final_program.py.
- final_program.py: Contains the code, built on asyncio, aiohttp, and aiofiles, that performs the task described in the introduction.
- ae.csv: Contains the list of URLs that dataset_generator.py uses to build the dataset.
- Clone the repository or download the zip file and change into the directory.
- Create a Python 3 virtual environment.
python3 -m venv .cgi-env
- Activate the virtual environment.
source .cgi-env/bin/activate
- Install the requirements.
pip install -r requirements.txt
- Run dataset_generator.py to generate the dataset.
python dataset_generator.py
- Run test_final.py to run the unit tests.
python test_final.py
- Run final_program.py to start performing the checks!
python final_program.py
- Initially, it seemed that the question asked for a crawler, but after reading that the URLs need to be checked periodically, I decided against it: a crawler's URL frontier would grow drastically, so I stuck to what was clearly asked.
- URL normalization is not performed because aiohttp performs it automatically.
- I tried using a semaphore to control the number of requests, but it resulted in a "too many open files" exception. So, to control throughput, I went with a queue: the producer delay makes sure that not all URLs are in the queue at once, and the number of consumer tasks controls the number of concurrent GET requests. To increase throughput, reduce the delay and add more consumers; to decrease it, do the opposite. (A sketch of this pattern follows this list.)
- aiofiles, aiohttp, and asyncio are used to implement asynchronous execution.
- To test cleanup via signals, just comment out the latter part.
- Especially the unit tests!
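The following is a rough sketch of the queue-based throughput control described above, not the actual implementation; `DELAY` and `NUM_CONSUMERS` are assumed knobs corresponding to the producer delay and the number of GET workers:

```python
import asyncio

import aiohttp

DELAY = 0.1         # producer delay between enqueued URLs (assumed value)
NUM_CONSUMERS = 10  # number of concurrent GET workers (assumed value)

async def producer(queue, urls):
    # Feed URLs slowly so the queue never holds everything at once.
    for url in urls:
        await queue.put(url)
        await asyncio.sleep(DELAY)

async def consumer(queue, session):
    # Each consumer performs one GET at a time, so NUM_CONSUMERS bounds
    # the number of simultaneous requests (and open sockets).
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                await resp.text()
        except aiohttp.ClientError:
            pass  # a real implementation would record the failure
        finally:
            queue.task_done()

async def main(urls):
    queue = asyncio.Queue()
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(consumer(queue, session))
                   for _ in range(NUM_CONSUMERS)]
        await producer(queue, urls)
        await queue.join()  # wait until every queued URL is processed
        for w in workers:
            w.cancel()

asyncio.run(main(["https://example.com"]))
```

Raising DELAY or lowering NUM_CONSUMERS throttles the program; doing the opposite raises throughput.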
- Implement a priority queue and use freshness to decide which pages to fetch next; a freshness criterion would also need to be chosen. But this goal, again, is not mentioned in the task. (A sketch follows this list.)
- Can add random user agents to requests to avoid getting blocked by the server (sketched after this list).
- If the URL list contains pages from the same domains, politeness needs to be implemented: if t is the response time, the next page from that domain can be fetched after 5t. In this program, some politeness can be achieved by increasing the producer's delay. (A per-domain sketch follows this list.)
- Working on graceful cleanup; I may need to look into aiohttp in more depth. (A signal-handling sketch follows this list.)
- Can add logging.
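On the priority-queue idea above: a hedged sketch, assuming the simplest possible freshness criterion (refetch each URL after a fixed interval); the real criterion is an open design decision, as noted in the list:

```python
import asyncio
import time

REFRESH_INTERVAL = 60.0  # assumed freshness criterion: refetch after 60 s

async def scheduler(urls):
    # The queue orders (due_time, url) pairs, so the stalest URL comes first.
    queue = asyncio.PriorityQueue()
    for url in urls:
        queue.put_nowait((time.monotonic(), url))  # everything due now
    while True:  # runs until interrupted
        due_at, url = await queue.get()
        delay = due_at - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)  # nothing is due yet
        print("fetch", url)  # a real version would issue the GET here
        queue.put_nowait((time.monotonic() + REFRESH_INTERVAL, url))

asyncio.run(scheduler(["https://example.com"]))
```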
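On random user agents: a minimal sketch; the header strings are placeholders, not a curated list:

```python
import random

# Placeholder agent strings; a real list would use genuine browser UAs.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) placeholder/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) placeholder/2.0",
]

def random_headers():
    return {"User-Agent": random.choice(USER_AGENTS)}

# usage: session.get(url, headers=random_headers())
```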
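On politeness: a sketch of the 5t rule, measuring the response time of each request; note it is not safe under concurrency (a real version would need a per-domain lock):

```python
import asyncio
import time
from urllib.parse import urlsplit

next_allowed = {}  # domain -> earliest time the domain may be hit again

async def polite_get(session, url):
    domain = urlsplit(url).netloc
    wait = next_allowed.get(domain, 0.0) - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)  # respect the per-domain delay
    start = time.monotonic()
    async with session.get(url) as resp:
        body = await resp.text()
    t = time.monotonic() - start                      # response time t
    next_allowed[domain] = time.monotonic() + 5 * t   # next fetch after 5t
    return body
```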
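On graceful cleanup: a sketch of signal-driven shutdown along the lines of the roguelynn article in the references below; it cancels outstanding tasks on SIGINT/SIGTERM (Unix only):

```python
import asyncio
import signal

async def shutdown(loop):
    # Cancel everything except ourselves, wait for the cancellations
    # to settle, then stop the loop.
    tasks = [t for t in asyncio.all_tasks(loop)
             if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()

def install_handlers(loop):
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, lambda: asyncio.create_task(shutdown(loop)))

# usage:
#   loop = asyncio.new_event_loop()
#   install_handlers(loop)
#   loop.create_task(some_main_coroutine())
#   loop.run_forever()
```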
1. https://realpython.com/async-io-python/
2. https://www.roguelynn.com/words/asyncio-graceful-shutdowns/