Duplicated entries found on WPA-Length wordlists #28

Closed
sirblackjack opened this issue Aug 20, 2017 · 8 comments

@sirblackjack

There are duplicated entries for some words in the Top 31 Million, Top 102 Million and Top 1.8 Billion files. As an example, the word 'password' can be found on both line 1 and line 11,853,466 of the files.

I am not good with Unix commands, but the files can be fixed easily using SQL / MySQL. I already fixed them with the code I am sharing below, which also removes words with a length of 7 characters. For the 31 Million file, 302,363 entries were removed after cleaning. The code below targets the 31 Million wordlist, but the same code can be used for the other wordlists by just changing the name of the txt file:

/* Creates a Database named 'WPA' */
CREATE DATABASE WPA;
USE WPA;

/* Creates a table named 'Top31MillionWPA'
with two columns: a unique auto-incrementing 'id' to keep the
popularity order and 'word' containing the text.
Uses utf8_bin to compare strings case-sensitively */

CREATE TABLE Top31MillionWPA (
    id BIGINT NOT NULL AUTO_INCREMENT,
    word VARCHAR(255),
    PRIMARY KEY (id),
    INDEX IX_word (word)
) AUTO_INCREMENT=1 COLLATE utf8_bin;

/* Temporary settings to speed up loading of the text file */
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;

/* Loads the text file into the table's 'word' column.
   The 'id' column is populated automatically.
   ////// Change directory and filename accordingly ////// */
LOAD DATA INFILE '/tmp/Top31Million-probable-WPA.txt' INTO TABLE Top31MillionWPA(word);
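/* Note: LOAD DATA INFILE defaults to LINES TERMINATED BY '\n'.
   If the source file had \r\n line endings, each loaded word would keep
   a trailing \r; adding LINES TERMINATED BY '\r\n' above handles that case. */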

/* Back to default settings*/
set unique_checks = 1;
set foreign_key_checks = 1;
set sql_log_bin=1;

/* This keeps only the first entry of each duplicate in a new table;
   this is faster than deleting the duplicates in place (at the cost of storage space) */
CREATE TABLE Top31MillionWPAclean SELECT Top31MillionWPA.* FROM Top31MillionWPA
LEFT OUTER JOIN(
	SELECT MIN(id) AS FirstID, word
	FROM Top31MillionWPA
	GROUP BY word
	) AS KeepFirst ON
	Top31MillionWPA.id = KeepFirst.FirstID
	WHERE KeepFirst.FirstID IS NOT NULL;

/* Delete original MySQL table  */
DROP TABLE Top31MillionWPA;
/* CREATE Primary Key on new table to speed up the query */
ALTER TABLE Top31MillionWPAclean
ADD PRIMARY KEY (id);
 
/* Creates the clean text file, keeping the popularity order.
   Only words with a length >= 8 characters are written.
   ////// Change directory and filename accordingly ////// */
SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
FROM Top31MillionWPAclean WHERE LENGTH(word)>=8 ORDER BY id ASC;

/* Delete the MySQL table  */
DROP TABLE Top31MillionWPAclean;
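As an optional sanity check (a rough sketch against the same Top31MillionWPA table, run before the DROP), two quick counts show how much of the cleanup comes from duplicates versus the length filter; the two sets overlap wherever a short word is also a duplicate:

/* Rows that are exact (case-sensitive) duplicates of an earlier row */
SELECT COUNT(*) - COUNT(DISTINCT word) AS duplicate_rows FROM Top31MillionWPA;

/* Rows shorter than 8 characters, which the final export filters out */
SELECT COUNT(*) AS short_rows FROM Top31MillionWPA WHERE LENGTH(word) < 8;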
berzerk0 self-assigned this Aug 20, 2017
@berzerk0
Owner

I'll look into this. This was one of the first issues of the project and is the difference between Rev 1 and 1.1/2

I wonder if there is some difference that bash is able to detect that the method you are using doesn't see, or if somehow we gained some \r line endings in the last commit.

It is also possible, however, that I simply goofed.

@sirblackjack
Author

I just checked the file on Microsoft SQL Server 2014 instead of MySQL, and the file is fine! No duplicates found.

For some strange reason MySQL is not treating some words/characters properly. As an example, SQL Server shows line 11,853,466 as "pAsSwOrD", while MySQL incorrectly shows "password", even though it should treat the words as case-sensitive with those settings.

Anyway, I will look into what MySQL is doing wrong, but SQL Server is fine.

I can delete the words with length = 7 very quickly and easily with SQL Server and share the resulting files if you want.
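One way to narrow this down from the MySQL side (a sketch reusing the Top31MillionWPA table from the script above, and assuming the auto-increment id follows the file's line order) is to inspect the raw bytes stored for that row and to count rows that still end in a carriage return:

/* Show exactly what MySQL stored for the reported line */
SELECT word, HEX(word), LENGTH(word) FROM Top31MillionWPA WHERE id = 11853466;

/* Count rows that still carry a trailing \r from the load */
SELECT COUNT(*) AS cr_rows FROM Top31MillionWPA WHERE word LIKE CONCAT('%', CHAR(13));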

@berzerk0
Owner

Phew, good to hear it was just a capitalization issue on MySQL's side.

It would be great if you could delete the 7-character lines, but I'll admit I'm a bit hesitant to use a different method to clean up the files. The original duplicate issue came from a mix-up of \r and \n line endings. Can you control what kind of line endings the output would have?

I'd predict, not being familiar with SQL, that Microsoft SQL would use \r line endings by default.
It would be possible for me to replace the \r endings with \n like I've done before in bash, but if I had to do that, I might as well remove the 7-character lines using bash as well.

Is that something you can control?

@sirblackjack
Author

Yes, in Microsoft SQL you can choose between {CR}{LF}, {CR}, or {LF} as the row delimiter when exporting to a text file.
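MySQL's export offers the same control: SELECT ... INTO OUTFILE accepts a LINES TERMINATED BY clause, so the export step from the script above could be written as follows to force plain \n endings (a sketch, reusing the cleaned table name from earlier):

SELECT word INTO OUTFILE '/tmp/Top31Million-probable-WPA-clean.txt'
LINES TERMINATED BY '\n'
FROM Top31MillionWPAclean WHERE LENGTH(word) >= 8 ORDER BY id ASC;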

@berzerk0
Owner

@sirblackjack, if you were to create those, what would be the best way to deliver them to me? Since they'd be so large, I don't think a push from your end would work, and I've got other files to wrap up into a torrent.

I suppose you could 7zip them to a much smaller size and use Mega Upload

@sirblackjack
Author

Done! I uploaded the files with 7z compression to Mega:
https://mega.nz/#F!OVBn0YwR!3U4D0-2_mZiWasu5JWl4rA

The only thing I did was output the words with LENGTH >= 8.
The files use Line Feed {LF} line endings.

I ended up doing everything using SQLite on a CentOS VPS, which turned out to be much faster for this task than MySQL (I tried both) and also doesn't have any problem with case sensitivity.
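Roughly, the first-occurrence dedup in SQLite looks like the sketch below (the one-column words table and the import step are assumptions; rows are assumed to be loaded one word per line in file order, so rowid preserves the popularity order):

/* Assumed one-column table populated in file order */
CREATE TABLE words(word TEXT);

/* Keep only the first occurrence of each word, drop anything shorter than
   8 characters, and preserve the original popularity order */
SELECT word FROM words
WHERE rowid IN (SELECT MIN(rowid) FROM words GROUP BY word)
  AND length(word) >= 8
ORDER BY rowid;

SQLite's default BINARY collation compares text byte by byte, which is why the case-sensitivity problem seen in MySQL does not appear here.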

I have Microsoft SQL Server Express Edition on my desktop, but it has a 10 GB database limit, which was a problem for the largest file, and my upload speed is also quite slow. Downloading/uploading directly between my VPS and Mega was a much faster option for me.

@berzerk0
Owner

berzerk0 commented Sep 2, 2017

Looks good! No \r's! Thanks, I'll wrap this up into a release

@berzerk0
Owner

In some of the Release 2 files, a blankspace character appeared at the end of every line. In those cases, I removed the final blankspace character from all lines. However, some files were not consistent about beginning or ending with blankspace characters. In those instances, I left them in place, since I had reason to believe the blankspaces were part of the data.

However, the '\r' vs '\n' issue that initially caused this problem is fixed in Release 2.
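For reference, that kind of trailing-blankspace check and cleanup can also be done on the SQL side (a hypothetical sketch; the wordlist table and word column are placeholder names, not part of the actual release process):

/* Count lines that end in a blank space */
SELECT COUNT(*) FROM wordlist WHERE word LIKE '% ';

/* Strip trailing blank spaces where present */
UPDATE wordlist SET word = TRIM(TRAILING ' ' FROM word) WHERE word LIKE '% ';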
