A simple and streamlined Python script to extract and filter links from a remote HTML resource.

GrabLinks

Synopsis

grablinks.py is a simple and streamlined Python 3 script to extract and filter links from a remote HTML resource.

Requirements

An installation of Python 3 (version 3.5 or later). Additionally, the third-party Python modules requests and beautifulsoup4 are required. Both modules can easily be installed with Python's package manager pip, e.g.:

pip install --user requests
pip install --user beautifulsoup4
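
To confirm both modules are available before running the script, a quick check could look like this (a throwaway snippet, not part of grablinks.py; note that the beautifulsoup4 package is imported under the name bs4):

```python
import importlib.util

# The beautifulsoup4 package is imported under the module name "bs4".
for module in ("requests", "bs4"):
    if importlib.util.find_spec(module) is None:
        print(f"{module} is missing -- install it with pip")
    else:
        print(f"{module} is available")
```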

Usage

usage: grablinks.py [-h] [-V] [--insecure] [-f FORMATSTR] [--fix-links]
                    [--images] [-c CLASS] [-s SEARCH] [-x REGEX]
                    URL

Extracts, and optionally filters, all links (`<a href=""/>') from a remote
HTML document.

positional arguments:
  URL                   a fully qualified URL to the source HTML document

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version number and exit
  --insecure            disable verification of SSL/TLS certificates (e.g. to
                        allow self-signed certificates)
  -f FORMATSTR, --format FORMATSTR
                        a format string to wrap in the output: %url% is
                        replaced by found URL entries; %text% is replaced with
                        the text content of the link; other supported
                        placeholders for generated values: %id%, %guid%, and
                        %hash%
  --fix-links           try to convert relative and fragmental URLs to
                        absolute URLs (after filtering)
  --images              extract `<img src=""/>' instead of `<a href=""/>'.

filter options:
  -c CLASS, --class CLASS
                        only extract URLs from href attributes of <a>nchor
                        elements with the specified class attribute content.
                        Multiple classes, separated by spaces, are evaluated
                        with a logical OR, so any <a>nchor that has at least
                        one of the classes will match.
  -s SEARCH, --search SEARCH
                        only output entries from the extracted result set, if
                        the search string occurs in the URL
  -x REGEX, --regex REGEX
                        only output entries from the extracted result set, if
                        the URL matches the regular expression

Report bugs, request features, or provide suggestions via
https://github.com/the-real-tokai/grablinks/issues
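
For reference, the core technique behind these options can be sketched in a few lines of Python with beautifulsoup4. This is a simplified illustration of the approach, not the actual grablinks.py implementation; the function name and parameters are made up for the example:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url=None, search=None, css_class=None):
    """Return href values of <a> elements, optionally filtered.

    css_class mimics --class (space-separated classes, logical OR),
    search mimics --search (substring filter on the URL), and
    base_url mimics --fix-links (resolve relative URLs).
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # --class: keep anchors that carry at least one of the given classes
        if css_class and not set(css_class.split()) & set(a.get("class", [])):
            continue
        url = a["href"]
        # --search: only keep URLs containing the search string
        if search and search not in url:
            continue
        # --fix-links: resolve relative URLs against the document's URL
        if base_url:
            url = urljoin(base_url, url)
        links.append(url)
    return links
```

For example, `extract_links(html, base_url="https://a.example/", search="wiki")` would return only the Wikipedia-style links from a fetched page, with relative paths turned into absolute URLs.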

Usage Examples

# extract wikipedia links from 'www.example.com':
$ grablinks.py 'https://www.example.com/' --search 'wikipedia'
https://ja.wikipedia.org/wiki/仲間由紀恵
https://ja.wikipedia.org/wiki/黒木華
https://ja.wikipedia.org/wiki/清野菜名
…
# extract download links from 'www.example.com', create a shell script
# on-the-fly and pass it along to sh to fetch things with wget:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' --format 'wget "%url%"' | sh
# Note: Do not do that at home. It is dangerous! 😱
# alternatively just pass to wget directly:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' | sort -u | wget -i-
# extract and handle links like
# <a href="https://example.com/a-cryptic-ID">proper-filename.ext</a>
$ grablinks.py 'https://www.example.com/' --format 'wget '\''%url%'\'' -O '\''%text%'\' > fetchfiles.sh
$ sh fetchfiles.sh
# Note: %text% is not sanitized by grablinks.py for safe shell usage. It is
#       recommended to verify this before executing things automatically
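
If you do want to automate this, one option is to quote the fields yourself before writing the shell script. Python's standard shlex.quote handles shell-safe quoting; this is a sketch of that approach (wget_command is a hypothetical helper, not a feature of grablinks.py):

```python
import shlex


def wget_command(url, text):
    """Build a wget command line with both fields safely quoted for the shell."""
    return f"wget {shlex.quote(url)} -O {shlex.quote(text)}"


# Even a filename containing quotes or shell metacharacters is rendered
# as a single, harmless argument:
print(wget_command("https://example.com/a-cryptic-ID", "file; rm -rf ~.ext"))
```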

History

1.9   28-Dec-2024   Identify with proper user agents for remote requests.
                    --fix-links: update the input/response URL in case of redirections.
                    --fix-links: improved handling of some path edge cases.
                    Avoid unnecessary (re-)encoding (treat all loaded data as bytes).
                    Added basic support for 'file://' URIs.
1.8   21-Nov-2024   Added support for '<img src="">' via '--images'.
1.7   21-Jan-2024   Disable urllib3 warnings when '--insecure' is used.
1.6   2-Dec-2023    Added '--insecure' argument to disable SSL/TLS certificate verification.
                    Added support for the '%text%' placeholder in format strings (<a>text</a>).
1.5   24-Nov-2022   Added a (fixed) timeout to the remote request.
1.4   30-May-2022   Improved handling of passing multiple classes to '--class'.
1.3   6-Feb-2021    Fix: handling of common edge cases when '--fix-links' is used.
1.2   16-Aug-2020   Fix: in some cases links from '<a>' tags without a 'class' attribute were not part of the result.
1.1   7-Jun-2020    Initial public source code release.
