Create a command-line application that, given a list of website URLs as input, visits them and finds, extracts and outputs the websites’ logo image URLs and all phone numbers (e.g. mobile phones, landlines, fax numbers) present on the websites.
- Download the repository zip file and extract the folder named "dbassign-master" to your desktop.
- Open Command Prompt, PowerShell, or Bash and navigate to the "dbassign-master" folder.
- Build the Docker image using the following command:
docker build -t dnb .
- Run the Docker image, redirecting the input file to standard input:
docker run -i dnb < [PATH_TO_TXT_FILE]
- Alternatively, pipe the file into the container:
cat [PATH_TO_TXT_FILE] | docker run -i dnb
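Both run commands deliver the text file to the application on standard input. A minimal sketch of how such input could be consumed is shown below; it assumes one URL per line, and the helper name `read_urls` is hypothetical rather than taken from this repository.

```python
from typing import Iterable, List


def read_urls(lines: Iterable[str]) -> List[str]:
    """Return the non-empty lines of the input, stripped of surrounding whitespace."""
    urls = []
    for line in lines:
        url = line.strip()
        if url:  # skip blank lines
            urls.append(url)
    return urls
```

In the application itself this would be called as `read_urls(sys.stdin)`, since `docker run -i` keeps standard input open for the container.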
The application supports two optional arguments:
- -v, --verbose
  - Print logs to stderr while the application runs. For example: `docker run -i [IMAGE_NAME] -v < [PATH_TO_TXT_FILE]`
- -t, --threads
  - Specify how many threads to use. The default is 16. For example: `docker run -i [IMAGE_NAME] -t [NUM_OF_THREADS] < [PATH_TO_TXT_FILE]`
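The two arguments could be wired up roughly as follows. This is a sketch only: it assumes Python's `argparse` and a `ThreadPoolExecutor` worker pool, and the names `build_parser`, `process_all`, and `handler` are hypothetical, not the repository's actual code.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Define the -v/--verbose and -t/--threads options described above."""
    parser = argparse.ArgumentParser(
        description="Extract logo URLs and phone numbers from websites."
    )
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print logs to stderr during the run")
    parser.add_argument("-t", "--threads", type=int, default=16,
                        help="number of worker threads (default: 16)")
    return parser


def process_all(urls: Sequence[str], handler: Callable[[str], str],
                threads: int) -> List[str]:
    """Apply handler to every URL using a pool of `threads` workers.

    executor.map returns results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=threads) as executor:
        return list(executor.map(handler, urls))
```

Here `handler` stands in for whatever function visits a single website; keeping it as a parameter makes the thread count easy to change without touching the per-site logic.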
While researching the assignment, I found two main methods for extracting a website's logo:
- Search for the `<link>` tag inside the `<head>` tag that stores the favicon, and use it as the website's logo.
- Search for the first `<img>` or `<svg>` tag that appears inside the `<body>` tag, assuming that in the vast majority of cases the first image is the website's logo.
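The two methods can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the repository's implementation: the names `LogoCandidateParser` and `find_logo` are hypothetical, "inside `<head>`" is approximated as "before `<body>` starts", and inline `<svg>` elements are ignored because they carry no URL of their own.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class LogoCandidateParser(HTMLParser):
    """Collect the favicon href (method 1) and the first <img> src in <body> (method 2)."""

    def __init__(self) -> None:
        super().__init__()
        self.in_body = False
        self.favicon: Optional[str] = None
        self.first_image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "body":
            self.in_body = True
        elif tag == "link" and not self.in_body and self.favicon is None:
            # Method 1: a <link> whose rel contains "icon", seen before <body>.
            rel_tokens = (attributes.get("rel") or "").lower().split()
            if "icon" in rel_tokens and attributes.get("href"):
                self.favicon = attributes["href"]
        elif tag == "img" and self.in_body and self.first_image is None:
            # Method 2: the first <img> with a src inside <body>.
            if attributes.get("src"):
                self.first_image = attributes["src"]


def find_logo(html: str, base_url: str, method: int = 2) -> Optional[str]:
    """Return an absolute logo URL using method 1 or 2, or None if nothing is found."""
    parser = LogoCandidateParser()
    parser.feed(html)
    candidate = parser.favicon if method == 1 else parser.first_image
    return urljoin(base_url, candidate) if candidate else None
```

Resolving the candidate against `base_url` matters because both favicon and image paths are often relative to the page.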
After considerable deliberation, I chose the second method, having compared my results against the example output attached to the assignment description.