Given a list of configured URLs and one or more regular expressions to check for each URL, this program performs the checks asynchronously.
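For illustration, here is a minimal sketch of the core check, assuming the regular expressions are matched against each response body; `URLS` and `PATTERNS` are placeholder names, not the actual configuration used by final_program.py:

```python
import asyncio
import re

import aiohttp

URLS = ["https://example.com"]        # placeholder for the configured URL list
PATTERNS = [re.compile(r"Example")]   # placeholder for the configured regexes

async def check(session, url):
    # Fetch the page and test every configured pattern against the body.
    async with session.get(url) as resp:
        body = await resp.text()
    for pattern in PATTERNS:
        print(url, pattern.pattern, bool(pattern.search(body)))

async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(check(session, url) for url in URLS))

asyncio.run(main())
```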
- dataset_generator.py: Used to generate the dataset on which final_program.py is tested.
- test_final.py: Contains simple unit tests for final_program.py.
- final_program.py: Contains the code, built on asyncio, aiohttp, and aiofiles, that performs the task described in the introduction.
- ae.csv: Contains the list of URLs that dataset_generator.py uses to build the dataset.
- Clone the repository or download the zip file and change into the directory.
- Create a Python 3 virtual environment.
python3 -m venv .cgi-env
- Activate the virtual environment.
source .cgi-env/bin/activate
- Install the requirements.
pip install -r requirements.txt
- Run dataset_generator.py to generate the dataset.
python dataset_generator.py
- Run test_final.py to run the unit tests.
python test_final.py
- Run final_program.py to start performing the checks!
python final_program.py
- Initially, it seemed that the question asked for a crawler, but after reading that the URLs need to be checked periodically, I decided against it: a crawler's URL frontier would grow drastically, so I stuck to what was clearly asked.
- URL normalization is not performed because aiohttp performs it automatically.
- I tried using a semaphore to control the number of requests, but it resulted in a "too many open files" exception. So, to control throughput, I went with a queue: the producer delay makes sure that not all URLs are in the queue at once, and the number of consumer tasks controls the number of concurrent GET requests. To increase throughput, reduce the delay and add more consumers; to decrease it, do the opposite. (A sketch of this pattern follows this list.)
- aiofiles, aiohttp, and asyncio are used to implement asynchronous execution.
- To test cleanup via signals, just comment out the latter part.
- Especially the unit tests!
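The following is a rough sketch of the queue-based throughput control described above, not the actual implementation; `DELAY` and `NUM_CONSUMERS` are assumed knobs corresponding to the producer delay and the number of GET workers:

```python
import asyncio

import aiohttp

DELAY = 0.1         # producer delay between enqueued URLs (assumed value)
NUM_CONSUMERS = 10  # number of concurrent GET workers (assumed value)

async def producer(queue, urls):
    # Feed URLs slowly so the queue never holds everything at once.
    for url in urls:
        await queue.put(url)
        await asyncio.sleep(DELAY)

async def consumer(queue, session):
    # Each consumer performs one GET at a time, so NUM_CONSUMERS bounds
    # the number of simultaneous requests (and open sockets).
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                await resp.text()
        except aiohttp.ClientError:
            pass  # a real implementation would record the failure
        finally:
            queue.task_done()

async def main(urls):
    queue = asyncio.Queue()
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(consumer(queue, session))
                   for _ in range(NUM_CONSUMERS)]
        await producer(queue, urls)
        await queue.join()  # wait until every queued URL is processed
        for w in workers:
            w.cancel()

asyncio.run(main(["https://example.com"]))
```

Raising DELAY or lowering NUM_CONSUMERS throttles the program; doing the opposite raises throughput.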
- Implement a priority queue and use freshness to decide which pages to fetch next; a freshness criterion would also need to be chosen. But this goal, again, is not mentioned in the task. (A sketch follows this list.)
- Can add random user agents to requests to avoid getting blocked by the server (sketched after this list).
- If the URL list contains pages from the same domains, politeness needs to be implemented: if t is the response time, the next page from that domain can be fetched after 5t. In this program, some politeness can be achieved by increasing the producer's delay. (A per-domain sketch follows this list.)
- Working on graceful cleanup; I may need to look into aiohttp in more depth. (A signal-handling sketch follows this list.)
- Can add logging.
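On the priority-queue idea above: a hedged sketch, assuming the simplest possible freshness criterion (refetch each URL after a fixed interval); the real criterion is an open design decision, as noted in the list:

```python
import asyncio
import time

REFRESH_INTERVAL = 60.0  # assumed freshness criterion: refetch after 60 s

async def scheduler(urls):
    # The queue orders (due_time, url) pairs, so the stalest URL comes first.
    queue = asyncio.PriorityQueue()
    for url in urls:
        queue.put_nowait((time.monotonic(), url))  # everything due now
    while True:  # runs until interrupted
        due_at, url = await queue.get()
        delay = due_at - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)  # nothing is due yet
        print("fetch", url)  # a real version would issue the GET here
        queue.put_nowait((time.monotonic() + REFRESH_INTERVAL, url))

asyncio.run(scheduler(["https://example.com"]))
```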
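On random user agents: a minimal sketch; the header strings are placeholders, not a curated list:

```python
import random

# Placeholder agent strings; a real list would use genuine browser UAs.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) placeholder/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) placeholder/2.0",
]

def random_headers():
    return {"User-Agent": random.choice(USER_AGENTS)}

# usage: session.get(url, headers=random_headers())
```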
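On politeness: a sketch of the 5t rule, measuring the response time of each request; note it is not safe under concurrency (a real version would need a per-domain lock):

```python
import asyncio
import time
from urllib.parse import urlsplit

next_allowed = {}  # domain -> earliest time the domain may be hit again

async def polite_get(session, url):
    domain = urlsplit(url).netloc
    wait = next_allowed.get(domain, 0.0) - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)  # respect the per-domain delay
    start = time.monotonic()
    async with session.get(url) as resp:
        body = await resp.text()
    t = time.monotonic() - start                      # response time t
    next_allowed[domain] = time.monotonic() + 5 * t   # next fetch after 5t
    return body
```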
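On graceful cleanup: a sketch of signal-driven shutdown along the lines of the roguelynn article in the references below; it cancels outstanding tasks on SIGINT/SIGTERM (Unix only):

```python
import asyncio
import signal

async def shutdown(loop):
    # Cancel everything except ourselves, wait for the cancellations
    # to settle, then stop the loop.
    tasks = [t for t in asyncio.all_tasks(loop)
             if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()

def install_handlers(loop):
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, lambda: asyncio.create_task(shutdown(loop)))

# usage:
#   loop = asyncio.new_event_loop()
#   install_handlers(loop)
#   loop.create_task(some_main_coroutine())
#   loop.run_forever()
```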
1. https://realpython.com/async-io-python/
2. https://www.roguelynn.com/words/asyncio-graceful-shutdowns/