Duplicated entries found on WPA-Length wordlists #28

Closed
sirblackjack opened this issue Aug 20, 2017 · 8 comments

@sirblackjack

There are duplicated entries for some words in the Top 31 Million, Top 102 Million and Top 1.8 Billion files. As an example, the word 'password' can be found on both line 1 and line 11,853,466 of the files.

I am not good with Unix commands, but the files can be fixed easily using SQL / MySQL. I already fixed them with the code I am sharing below, which also removes words with a length of 7 characters. For the 31 Million file, 302,363 entries were removed after cleaning. The code below targets the 31 Million wordlist, but the same code can be used for the other wordlists by just changing the name of the txt file:

/* Creates a Database named 'WPA' */
CREATE DATABASE WPA;
USE WPA;

/* Creates a table named 'Top31MillionWPA'
with two columns: a unique auto-incrementing 'id' to keep the
popularity order and 'word' containing the text.
Uses utf8_bin to compare strings case-sensitively */

CREATE TABLE Top31MillionWPA (
    id BIGINT NOT NULL AUTO_INCREMENT,
    word VARCHAR(255),
    PRIMARY KEY (id),
    INDEX IX_word (word)
) AUTO_INCREMENT=1 COLLATE utf8_bin;

/* Temporary settings to speed up loading of the text file */
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;

/* Loads the text file into the table's 'word' column.
   The 'id' column is populated automatically.
   ////// Change directory and filename accordingly ////// */
LOAD DATA INFILE '/tmp/Top31Million-probable-WPA.txt' INTO TABLE Top31MillionWPA(word);
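/* Note: LOAD DATA INFILE defaults to LINES TERMINATED BY '\n'.
   If the source file had \r\n line endings, each loaded word would keep
   a trailing \r; adding LINES TERMINATED BY '\r\n' above handles that case. */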

/* Back to default settings*/
set unique_checks = 1;
set foreign_key_checks = 1;
set sql_log_bin=1;

/* This keeps only the first entry of each duplicate in a new table;
   this is faster than deleting the duplicates in place (at the cost of storage space) */
CREATE TABLE Top31MillionWPAclean SELECT Top31MillionWPA.* FROM Top31MillionWPA
LEFT OUTER JOIN(
	SELECT MIN(id) AS FirstID, word
	FROM Top31MillionWPA
	GROUP BY word
	) AS KeepFirst ON
	Top31MillionWPA.id = KeepFirst.FirstID
	WHERE KeepFirst.FirstID IS NOT NULL;

/* Delete original MySQL table  */
DROP TABLE Top31MillionWPA;
/* CREATE Primary Key on new table to speed up the query */
ALTER TABLE Top31MillionWPAclean
ADD PRIMARY KEY (id);
 
/* Creates the clean text file, keeping the popularity order.
   Only words with a length >= 8 characters are written.
   ////// Change directory and filename accordingly ////// */
SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
FROM Top31MillionWPAclean WHERE LENGTH(word)>=8 ORDER BY id ASC;

/* Delete the MySQL table  */
DROP TABLE Top31MillionWPAclean;
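As an optional sanity check (a rough sketch against the same Top31MillionWPA table, run before the DROP), two quick counts show how much of the cleanup comes from duplicates versus the length filter; the two sets overlap wherever a short word is also a duplicate:

/* Rows that are exact (case-sensitive) duplicates of an earlier row */
SELECT COUNT(*) - COUNT(DISTINCT word) AS duplicate_rows FROM Top31MillionWPA;

/* Rows shorter than 8 characters, which the final export filters out */
SELECT COUNT(*) AS short_rows FROM Top31MillionWPA WHERE LENGTH(word) < 8;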
berzerk0 self-assigned this Aug 20, 2017
@berzerk0
Owner

I'll look into this. This was one of the first issues of the project and is the difference between Rev 1 and 1.1/2

I wonder if there is some difference that bash is able to detect that the method you are using doesn't see, or if somehow we gained some \r line endings in the last commit.

It is also possible, however, that I simply goofed.

@sirblackjack
Author

I just checked the file on Microsoft SQL Server 2014 instead of MySQL, and the file is fine! No duplicates found.

For some strange reason MySQL is not treating some words/characters properly. As an example, SQL Server shows line 11,853,466 as "pAsSwOrD", while MySQL incorrectly shows "password", even though it should treat the words as case-sensitive with those settings.

Anyway, I will look into what MySQL is doing wrong, but SQL Server is fine.

I can delete the words with length = 7 very quickly and easily with SQL Server and share the resulting files if you want.
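One way to narrow this down from the MySQL side (a sketch reusing the Top31MillionWPA table from the script above, and assuming the auto-increment id follows the file's line order) is to inspect the raw bytes stored for that row and to count rows that still end in a carriage return:

/* Show exactly what MySQL stored for the reported line */
SELECT word, HEX(word), LENGTH(word) FROM Top31MillionWPA WHERE id = 11853466;

/* Count rows that still carry a trailing \r from the load */
SELECT COUNT(*) AS cr_rows FROM Top31MillionWPA WHERE word LIKE CONCAT('%', CHAR(13));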

@berzerk0
Owner

Phew, good to hear it was just a capitalization issue on MySQL's side.

It would be great if you could delete the 7-character lines, but I'll admit I'm a bit hesitant to use a different method to clean up the files. The original duplicate issue came from a mix-up of \r and \n line endings. Can you control what kind of line endings the output would have?

I'd predict, not being familiar with SQL, that Microsoft SQL would use \r line endings by default.
It would be possible for me to replace the \r endings with \n like I've done before in bash, but if I had to do that, I might as well remove the 7-character lines using bash as well.

Is that something you can control?

@sirblackjack
Author

Yes, in Microsoft SQL you can choose between {CR}{LF}, {CR}, or {LF} as the row delimiter when exporting to a text file.
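MySQL's export offers the same control: SELECT ... INTO OUTFILE accepts a LINES TERMINATED BY clause, so the export step from the script above could be written as follows to force plain \n endings (a sketch, reusing the cleaned table name from earlier):

SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
LINES TERMINATED BY '\n'
FROM Top31MillionWPAclean WHERE LENGTH(word) >= 8 ORDER BY id ASC;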

@berzerk0
Owner

@sirblackjack, if you were to create those, what would be the best way to deliver them to me? Since they'd be so large, I don't think a push from your end would work, and I've got other files to wrap up into a torrent.

I suppose you could 7zip them to a much smaller size and use Mega Upload

@sirblackjack
Author

Done! I uploaded the files with 7z compression to Mega:
https://mega.nz/#F!OVBn0YwR!3U4D0-2_mZiWasu5JWl4rA

The only thing I did was output the words with LENGTH >= 8.
The files use Line Feed {LF} line endings.

I ended up doing everything using SQLite on a CentOS VPS, which turned out to be much faster for this task than MySQL (I tried both) and also doesn't have any problem with case sensitivity.
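Roughly, the first-occurrence dedup in SQLite looks like the sketch below (the one-column words table and the import step are assumptions; rows are assumed to be loaded one word per line in file order, so rowid preserves the popularity order):

/* Assumed one-column table populated in file order */
CREATE TABLE words(word TEXT);

/* Keep only the first occurrence of each word, drop anything shorter than
   8 characters, and preserve the original popularity order */
SELECT word FROM words
WHERE rowid IN (SELECT MIN(rowid) FROM words GROUP BY word)
  AND length(word) >= 8
ORDER BY rowid;

SQLite's default BINARY collation compares text byte by byte, which is why the case-sensitivity problem seen in MySQL does not appear here.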

I have Microsoft SQL Server Express Edition on my desktop, but it has a 10 GB database limit, which was a problem for the largest file, and my upload speed is also quite slow. Downloading/uploading directly between my VPS and Mega was a much faster option for me.

@berzerk0
Owner

berzerk0 commented Sep 2, 2017

Looks good! No \r's! Thanks, I'll wrap this up into a release

@berzerk0
Owner

In some of the Release 2 files, a blankspace character appeared at the end of every line. In those cases, I removed the final blankspace character from all lines. However, some files were not consistent about beginning or ending with blankspace characters. In those instances, I left them in place, since I had reason to believe the blankspaces were part of the data.

However, the '\r' vs '\n' issue that initially caused this problem is fixed in Release 2.
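For reference, that kind of trailing-blankspace check and cleanup can also be done on the SQL side (a hypothetical sketch; the wordlist table and word column are placeholder names, not part of the actual release process):

/* Count lines that end in a blank space */
SELECT COUNT(*) FROM wordlist WHERE word LIKE '% ';

/* Strip trailing blank spaces where present */
UPDATE wordlist SET word = TRIM(TRAILING ' ' FROM word) WHERE word LIKE '% ';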
