Declutters URLs in a lightning fast and flexible way, for improving input for web hacking automations such as crawlers and vulnerability scans.
godeclutter is a very simple tool that will take a list of URLs, clean those URLs, and output the unique URLs that are present. This will reduce the number of requests you will have to make to your target website, and also filter URLs that are most likely uninteresting.
godeclutter will perform the following steps on your URL host section:
- Clean http:// URLs pointing to the default SSL port (443) and vice-versa, since they are mostly CDN error pages.
- Clean port notation of URLs pointing to the default protocol ports, since those ports are already implied by the protocol scheme. (such as :443 and :80)
- Clean http:// URLs if a https:// to the same host and port is present, since 99,9% of those cases will just be a redirect to https://.
- Remove uninteresting media extensions such as png, jpg, css, etc. (This one will keep .js files since those are sometimes interesting)
- Remove urls with uninteresting words such as bootstrap, jquery, node_modules, etc.
- Sort query parameters
- Lowercase all schemes and hostnames, since upper-casing is irrelevant for those.
- Replace all lower-case URI encoding escapes to upper-case, to maintain a standard.
- Decode unnecessary escapes for characters that are not special in the URL context (i.e http://example.com/%41 ).
- Remove empty query strings (i.e http://example.com/? )
- Remove trailing slashes, this is rather aggressive but filters a lot of un-interesting duplicates on the majority of the cases. (i.e http://host/path/ -> http://host/path)
- Normalize dot segments, also rather aggressive but useful when working with dirty sources. (http://host/path/./a/b/../c -> http://host/path/a/c)
go install github.com/c3l3si4n/godeclutter@HEAD
You can send URLs by sending them to stdin.
user@arch ~/D/g/godeclutter (main)> cat test_urls.txt
https://1.1.1.1:443/
http://1.1.1.1/
http://1.1.1.1:80/
https://1.1.1.1:443/
https://1.1.1.1:80/
https://1.1.1.1:8443/
https://1.1.1.1:8443/
https://1.1.1.1:443/
https://1.1.1.1:8443/
https://1.1.1.1:8443/
https://1.1.1.1:443/?
http://1.1.1.1:443/?
https://1.1.1.1:80/?a
https://1.1.1.1:443/?
https://1.1.1.1:443/?
https://1.1.1.1:443/?1=1
https://1.1.1.1:443/?a=a&b=1
https://1.1.1.1:443/?a=a&b=1
https://1.1.1.1:443/a.js?a=a&b=1
https://1.1.1.1:443/fiqef.html?a=a&b=1
https://1.1.1.1:443/fmef.jpg?b=1&a=a
https://1.1.1.1:443/?b=1&a=a
https://1.1.1.1:443/a.js?
https://1.1.1.1:443/a.jpg
https://1.1.1.1:8443/
https://1.1.1.1:443/
https://1.1.1.1:443/node_modules/
https://1.1.1.1:443/path/scripts/jquery.js
user@arch ~/D/g/godeclutter (main)> cat test_urls.txt | ./godeclutter -bw -be -c -p
https://1.1.1.1/
https://1.1.1.1:8443/
https://1.1.1.1/?1=1
https://1.1.1.1/?a=a&b=1
https://1.1.1.1/a.js?a=a&b=1
https://1.1.1.1/fiqef.html?a=a&b=1
https://1.1.1.1/a.js
$> ./godeclutter -h
Usage of ./godeclutter:
-be
Blacklist Extensions - clean some uninteresting extensions. (default true)
-bec string
Blacklist Extensions - Specify additional extensions separated by commas to be cleared along the default ones.
-bw
Blacklist Words - clean some uninteresting words. (default true)
-bwc string
Blacklist Words - Specify additional words separated by commas to be cleared along the default ones.
-bwl string
Blacklist Words - Defines the level of word blocking. Values can be: minimal,aggressive (default "minimal")
-c Clean URLs - Aggressively clean/normalize URLs before outputting them. (default true)
-p Prefer HTTPS - If there's a https url present, don't print the http for it. (since it will probably just redirect to https)
- @s0md3v for making uro
- @PuerkitoBio for making the amazing purell go library
- @ameenmaali for making urldedupe which was one of the first tools for doing that kind of work.