leakStandardizer
LeakStandardizer is a tool that helps you get a clean text file out of a leak full of junk. Most leaks contain many lines with no password or no email, lines with invalid emails (non-standard or modified), lines with non-ASCII or non-printable characters, and so on.
LeakStandardizer will help you extract 99% of the credentials from your initial leak. Nothing is ever deleted: lines that are not considered valid are kept aside so that you can inspect them and process them further.
Usage : python3 leakStandardizer.py <leakFile> <cleanOutput> <errorOutput>
The previous command line writes the lines that match the expected credential format to the cleanOutput file. Lines that do not match are written to the errorOutput file.
When launched, leakStandardizer expects you to provide a regular expression. Lines matching this regular expression are considered valid. The regular expression must specify where the email is, and where the hash or the plain password is, in each line it parses. This is done through the following named capturing groups : 'email', 'hash' and 'plain'.
Eg : (?P<email>.+?):(?P<hash>.+?):(?P<plain>.*)
The previous regular expression covers lines containing an email, a hash and a plain password separated by colons. As this syntax is quite tedious, some shortcuts have been implemented :
- $fast_email$ : matches an email 99% posix compliant and quite fast : (?P<email>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
- $posix_email$ : matches an email, official posix regex but slower : (?P<email>(?:[a-zA-Z0-9!#$%&\'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'\.*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")\.?@(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]))
- $sha1$ : matches a SHA1 hash : (?P<hash>[0-9a-fA-F]{40})
- $md5$ : matches an MD5 hash : (?P<hash>[0-9a-fA-F]{32})
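The shortcuts listed above can be read as plain text substitution performed before the pattern is compiled. The snippet below is a minimal sketch of that idea, not the tool's actual code: the names SHORTCUTS and expand_shortcuts are hypothetical, and the use of fullmatch (the whole line must match) is an assumption.

```python
import re

# Hypothetical shortcut table. The expansions are the ones listed above
# ($posix_email$ is left out only to keep the sketch short).
SHORTCUTS = {
    "$fast_email$": r"(?P<email>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)",
    "$sha1$": r"(?P<hash>[0-9a-fA-F]{40})",
    "$md5$": r"(?P<hash>[0-9a-fA-F]{32})",
}


def expand_shortcuts(pattern):
    """Replace every shortcut found in the pattern with its full regex."""
    for name, regex in SHORTCUTS.items():
        pattern = pattern.replace(name, regex)
    return pattern


# A pattern for lines made of an email and an MD5 hash separated by a colon.
compiled = re.compile(expand_shortcuts("$fast_email$:$md5$"))

# Assumption: the whole line has to match the pattern.
match = compiled.fullmatch("alice@example.com:5f4dcc3b5aa765d61d8327deb882cf99")
email = match.group("email")
md5_hash = match.group("hash")
```

With this reading, a pattern such as $fast_email$:$md5$ is equivalent to writing both named groups by hand, only shorter.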
Lines matching your regular expression will go in your cleanOutput file, under the following format : email:hash:plain, one credential per line.
Lines that do not match your regular expression, and lines containing non-UTF-8 characters, will go in your errorOutput file for later use.
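The sorting rule described above (matching lines to the clean output, everything else to the error output) can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the function name standardize is hypothetical, lines are read as bytes so that invalid UTF-8 can be detected, and a missing 'hash' or 'plain' group is written as an empty field.

```python
import re


def standardize(raw_lines, pattern):
    """Split raw byte lines into clean 'email:hash:plain' strings and rejects."""
    regex = re.compile(pattern)
    clean, errors = [], []
    for raw in raw_lines:
        raw = raw.rstrip(b"\r\n")
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: keep the original bytes for later inspection.
            errors.append(raw)
            continue
        match = regex.fullmatch(line)
        if match is None:
            # Valid text, but not in the expected format.
            errors.append(raw)
            continue
        groups = match.groupdict()
        # Assumption: a group that is absent or unmatched becomes an empty field.
        fields = [groups.get(name) or "" for name in ("email", "hash", "plain")]
        clean.append(":".join(fields))
    return clean, errors


sample = [
    b"bob@example.com:abc123:hunter2\n",
    b"no separator here\n",
    b"\xff\xfe:broken:bytes\n",
]
clean, errors = standardize(sample, r"(?P<email>.+?):(?P<hash>.+?):(?P<plain>.*)")
```

Nothing is dropped in this sketch: every input line ends up in exactly one of the two lists.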
The only restriction to take into account is that leakStandardizer is designed to process files made of several lines, each line representing one credential. If your leak does not respect this format (a MySQL dump for example, where credentials are inside a one-line INSERT statement), you will have to edit it manually (sed/awk) to get back to an easily parsable file : one credential per line.
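As an alternative to sed/awk, the same preparation can be sketched in Python. The function split_insert_rows below is hypothetical and deliberately naive: it assumes an uppercase VALUES keyword and values that contain no parentheses, so a real dump may need a proper SQL parser instead.

```python
import re


def split_insert_rows(statement):
    """Return the content of each (...) row that follows VALUES."""
    # Keep only what comes after the VALUES keyword (assumed uppercase).
    _, _, values = statement.partition("VALUES")
    # Naive assumption: a row is a parenthesised group without nested parentheses.
    return re.findall(r"\(([^()]*)\)", values)


statement = (
    "INSERT INTO users VALUES "
    "(1,'a@example.com','pw1'),(2,'b@example.com','pw2');"
)
rows = split_insert_rows(statement)

# One credential per line, ready to be written to a file for leakStandardizer.
one_per_line = "\n".join(rows)
```

Each resulting line can then be handled by a pattern that describes the comma-separated, quoted fields.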
A common process with this tool is to concatenate your initial leak files into a single file and run leakStandardizer with a first pattern (regular expression), then run leakStandardizer again on the error file with a second pattern, and so on. Each iteration reduces the number of error lines produced. The final goal is an error file that contains only lines with no password, or with an invalid or missing email.
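The iterative process described above can be sketched as a loop in which each pattern is tried only on the lines rejected by the previous ones. The function run_passes and the two patterns are hypothetical, chosen only for illustration; in practice each pass would be a separate run of the tool on the previous error file.

```python
import re


def run_passes(lines, patterns):
    """Apply each pattern in turn to the lines still unmatched."""
    clean, remaining = [], list(lines)
    for pattern in patterns:
        regex = re.compile(pattern)
        still_unmatched = []
        for line in remaining:
            match = regex.fullmatch(line)
            if match is None:
                still_unmatched.append(line)
                continue
            groups = match.groupdict()
            # Assumption: absent groups are written as empty fields.
            clean.append(
                ":".join(groups.get(k) or "" for k in ("email", "hash", "plain"))
            )
        # The next pass only sees what this pass rejected.
        remaining = still_unmatched
    return clean, remaining


lines = ["a@example.com:pw1", "b@example.com;pw2", "garbage"]
patterns = [
    r"(?P<email>[^:;]+@[^:;]+):(?P<plain>.*)",  # first pass: colon separator
    r"(?P<email>[^:;]+@[^:;]+);(?P<plain>.*)",  # second pass: semicolon separator
]
clean, remaining = run_passes(lines, patterns)
```

After the last pass, the remaining lines play the role of the final error file.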