A simple and streamlined Python script to extract and filter links from a remote HTML resource.

GrabLinks

Synopsis

grablinks.py is a simple and streamlined Python 3 script to extract and filter links from a remote HTML resource.

Requirements

An installation of Python 3 (version 3.5 or later). Additionally, the third-party Python modules requests and beautifulsoup4 are required. Both modules can easily be installed with Python's package manager pip, e.g.:

pip install --user requests
pip install --user beautifulsoup4
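
To confirm both modules are available before running the script, a quick check could look like this (a throwaway snippet, not part of grablinks.py; note that the beautifulsoup4 package is imported under the name bs4):

```python
import importlib.util

# The beautifulsoup4 package is imported under the module name "bs4".
for module in ("requests", "bs4"):
    if importlib.util.find_spec(module) is None:
        print(f"{module} is missing -- install it with pip")
    else:
        print(f"{module} is available")
```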

Usage

usage: grablinks.py [-h] [-V] [--insecure] [-f FORMATSTR] [--fix-links]
                    [--images] [-c CLASS] [-s SEARCH] [-x REGEX]
                    URL

Extracts, and optionally filters, all links (`<a href=""/>') from a remote
HTML document.

positional arguments:
  URL                   a fully qualified URL to the source HTML document

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show version number and exit
  --insecure            disable verification of SSL/TLS certificates (e.g. to
                        allow self-signed certificates)
  -f FORMATSTR, --format FORMATSTR
                        a format string to wrap in the output: %url% is
                        replaced by found URL entries; %text% is replaced with
                        the text content of the link; other supported
                        placeholders for generated values: %id%, %guid%, and
                        %hash%
  --fix-links           try to convert relative and fragmental URLs to
                        absolute URLs (after filtering)
  --images              extract `<img src=""/>' instead of `<a href=""/>'.

filter options:
  -c CLASS, --class CLASS
                        only extract URLs from href attributes of <a>nchor
                        elements with the specified class attribute content.
                        Multiple classes, separated by spaces, are evaluated
                        with a logical OR, so any <a>nchor that has at least
                        one of the classes will match.
  -s SEARCH, --search SEARCH
                        only output entries from the extracted result set, if
                        the search string occurs in the URL
  -x REGEX, --regex REGEX
                        only output entries from the extracted result set, if
                        the URL matches the regular expression

Report bugs, request features, or provide suggestions via
https://github.com/the-real-tokai/grablinks/issues
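
For reference, the core technique behind these options can be sketched in a few lines of Python with beautifulsoup4. This is a simplified illustration of the approach, not the actual grablinks.py implementation; the function name and parameters are made up for the example:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url=None, search=None, css_class=None):
    """Return href values of <a> elements, optionally filtered.

    css_class mimics --class (space-separated classes, logical OR),
    search mimics --search (substring filter on the URL), and
    base_url mimics --fix-links (resolve relative URLs).
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # --class: keep anchors that carry at least one of the given classes
        if css_class and not set(css_class.split()) & set(a.get("class", [])):
            continue
        url = a["href"]
        # --search: only keep URLs containing the search string
        if search and search not in url:
            continue
        # --fix-links: resolve relative URLs against the document's URL
        if base_url:
            url = urljoin(base_url, url)
        links.append(url)
    return links
```

For example, `extract_links(html, base_url="https://a.example/", search="wiki")` would return only the Wikipedia-style links from a fetched page, with relative paths turned into absolute URLs.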

Usage Examples

# extract wikipedia links from 'www.example.com':
$ grablinks.py 'https://www.example.com/' --search 'wikipedia'
https://ja.wikipedia.org/wiki/仲間由紀恵
https://ja.wikipedia.org/wiki/黒木華
https://ja.wikipedia.org/wiki/清野菜名
…
# extract download links from 'www.example.com', create a shell script
# on-the-fly and pass it along to sh to fetch things with wget:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' --format 'wget "%url%"' | sh
# Note: Do not do that at home. It is dangerous! 😱
# alternatively just pass to wget directly:
$ grablinks.py 'https://www.example.com/' --search 'download.example.org' | sort -u | wget -i-
# extract and handle links like
# <a href="https://example.com/a-cryptic-ID">proper-filename.ext</a>
$ grablinks.py 'https://www.example.com/' --format 'wget '\''%url%'\'' -O '\''%text%'\' > fetchfiles.sh
$ sh fetchfiles.sh
# Note: %text% is not sanitized by grablinks.py for safe shell usage. It is
#       recommended to verify this before executing things automatically
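
If you do want to automate this, one option is to quote the fields yourself before writing the shell script. Python's standard shlex.quote handles shell-safe quoting; this is a sketch of that approach (wget_command is a hypothetical helper, not a feature of grablinks.py):

```python
import shlex


def wget_command(url, text):
    """Build a wget command line with both fields safely quoted for the shell."""
    return f"wget {shlex.quote(url)} -O {shlex.quote(text)}"


# Even a filename containing quotes or shell metacharacters is rendered
# as a single, harmless argument:
print(wget_command("https://example.com/a-cryptic-ID", "file; rm -rf ~.ext"))
```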

History

1.9   28-Dec-2024   Identify with proper user agents for remote requests.
                    --fix-links: update the input/response URL in case of redirections.
                    --fix-links: improved handling of some path edge cases.
                    Avoid unnecessary (re-)encoding (treat all loaded data as bytes).
                    Added basic support for 'file://' URIs.
1.8   21-Nov-2024   Added support for '<img src="">' via '--images'.
1.7   21-Jan-2024   Disable urllib3 warnings when '--insecure' is used.
1.6   2-Dec-2023    Added '--insecure' argument to disable SSL/TLS certificate verification.
                    Added support for the '%text%' placeholder in format strings (<a>text</a>).
1.5   24-Nov-2022   Added a (fixed) timeout to the remote request.
1.4   30-May-2022   Improved handling of passing multiple classes to '--class'.
1.3   6-Feb-2021    Fix: handling of common edge cases when '--fix-links' is used.
1.2   16-Aug-2020   Fix: in some cases links from '<a>' tags without a 'class' attribute were not part of the result.
1.1   7-Jun-2020    Initial public source code release.
