Tool to scrape emails from a list of websites. Takes the list of websites as a parameter, in TSV (tab-separated values) format, e.g.:
WEBUID WEBSITE URL
abc123 Website 1 http://websitename1.url
azk988 Website 2 http://websitename2.url
gju386 Website N http://websitenameN.url
Usage example: ./extract_mail_from_url.sh list_of_websites_and_urls_in_tbs_format.tbs
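A minimal sketch of how such a list could be read in shell, assuming a header row and exactly three tab-separated columns as in the example above. The function name read_website_list is hypothetical and is not taken from the tool:

```shell
# Hypothetical helper: print "WEBUID<TAB>URL" for every data row of the list.
# Assumes the first line is a header and the columns are WEBUID, WEBSITE, URL.
TAB=$(printf '\t')

read_website_list() {
  # tail -n +2 skips the header line. Splitting on tabs only keeps
  # website names that contain spaces ("Website 1") in a single field.
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  tail -n +2 "$1" | while IFS="$TAB" read -r uid name url || [ -n "$uid" ]; do
    # Skip blank or incomplete rows.
    [ -n "$uid" ] && [ -n "$url" ] || continue
    printf '%s\t%s\n' "$uid" "$url"
  done
}
```

Calling read_website_list list_of_websites_and_urls_in_tbs_format.tbs would then print one WEBUID and URL pair per line, ready to be passed to the download step.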
The tool downloads each website using wget and then searches for different email address formats with regular expressions:
- text@text.domain
[a-z0-9.-]@[a-z0-9.-].[a-z]
- text (at) text.domain
[a-z0-9.-] (at) [a-z0-9.-].[a-z]
- text(at)text.domain
[a-z0-9.-](at)[a-z0-9.-].[a-z]
- text[at]text[dot]domain
[a-z0-9.-][at][a-z0-9.-][dot][a-z]
- text[ät]text.domain
[a-z0-9.-][ät][a-z0-9.-].[a-z]
- text [at] text.domain
[a-z0-9.-] [at] [a-z0-9.-].[a-z]
- text [at] text [punkt] domain
[a-z0-9.-] [at] [a-z0-9.-] [punkt] [a-z]
- text(at)text(dot)domain
[a-z0-9.-](at)[a-z0-9.-](dot)[a-z]
- text at text.domain
[a-z0-9.-] at [a-z0-9.-].[a-z]
- text [at] text [dot] domain
[a-z0-9.-] [at] [a-z0-9.-] [dot] [a-z]
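As a rough illustration of the idea (not the tool's actual code), the obfuscated forms above can first be rewritten to the plain text@text.domain form and then matched with a single expression. The function names, the + quantifiers and the escaped dot are assumptions added for this sketch. The bare "text at text.domain" rule can also match ordinary prose, so some false positives are to be expected:

```shell
# Hypothetical helper: rewrite obfuscated separators to "@" and ".".
# Input is lower-cased first because the patterns only cover [a-z0-9.-].
normalize_obfuscated() {
  tr '[:upper:]' '[:lower:]' |
  sed -E \
    -e 's/ ?[[(] ?(at|ät) ?[])] ?/@/g' \
    -e 's/ ?[[(] ?(dot|punkt) ?[])] ?/./g' \
    -e 's/([a-z0-9.-]) at ([a-z0-9.-]+\.[a-z])/\1@\2/g'
}

# Hypothetical helper: read page text on stdin, print unique addresses.
extract_emails() {
  normalize_obfuscated |
  # -o prints only the matching part of each line, -E enables extended regex.
  grep -oE '[a-z0-9.-]+@[a-z0-9.-]+\.[a-z]+' |
  sort -u
}
```

For example, extract_emails < downloaded_page.html would list every address found in a downloaded page, where downloaded_page.html stands for whatever file wget wrote.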
Configurable variables in the tool:
- TIMEOUT: default 180 seconds; can be passed as an environment variable. It doesn't work on macOS, since timeout is not a standard command in Darwin. If you want this option to work on macOS, read https://gist.github.com/dasgoll/7b1a796d6e42cb66508bc504bb518f82
- RETRIES: default 3 times; number of times the tool will try to download the website.
- FILTER_LIST_FILE: default filter_list; name of the filter list file with optional words to exclude from the email addresses scraped by the tool.
- TMP_FILE: default "website_"${WEBSITE_LIST_FILE}; temporary file where the website is downloaded. It is deleted after being processed for email scraping.
- OUTPUT_FILE: default ${WEBSITE_LIST_FILE}"_WITH_MAILS.tsv"; file where the extraction tool writes the results of the scraping.
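The following sketch shows one way these defaults, the timeout fallback and the filter list could be wired together in shell. It is an illustration under assumptions, not the tool's source: the function names are hypothetical, the README only confirms that TIMEOUT can come from the environment (the other variables are made overridable here for symmetry), and the filter list is assumed to hold one word per line.

```shell
# Hypothetical helper: set each variable unless the caller already exported it.
# $1 is the website list file, as in the usage example.
apply_defaults() {
  WEBSITE_LIST_FILE="$1"
  TIMEOUT="${TIMEOUT:-180}"
  RETRIES="${RETRIES:-3}"
  FILTER_LIST_FILE="${FILTER_LIST_FILE:-filter_list}"
  TMP_FILE="${TMP_FILE:-website_${WEBSITE_LIST_FILE}}"
  OUTPUT_FILE="${OUTPUT_FILE:-${WEBSITE_LIST_FILE}_WITH_MAILS.tsv}"
}

# Hypothetical helper: run a command under timeout when the command exists,
# and without a time limit otherwise (for example on a stock macOS install).
run_with_timeout() {
  if command -v timeout >/dev/null 2>&1; then
    timeout "$TIMEOUT" "$@"
  else
    "$@"
  fi
}

# Hypothetical helper: drop addresses containing any word from the filter list.
# Assumes one word per line; a blank line in the file would match everything.
filter_emails() {
  if [ -s "$FILTER_LIST_FILE" ]; then
    # -v inverts the match, -i ignores case, -F takes the words literally,
    # -f reads them from the file.
    grep -viFf "$FILTER_LIST_FILE"
  else
    cat
  fi
}
```

With these in place, a run such as TIMEOUT=60 ./extract_mail_from_url.sh list_of_websites_and_urls_in_tbs_format.tbs would pick up the shorter timeout, and a download step could look like run_with_timeout wget --quiet --tries="$RETRIES" --output-document="$TMP_FILE" "$url". Mapping RETRIES to wget's --tries option is an assumption; the tool may implement retries differently.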