Duplicated entries found on WPA-Length wordlists #28
Comments
I'll look into this. This was one of the first issues of the project, and fixing it is the difference between Rev 1 and 1.1/2. I wonder if there is some difference that bash is able to detect that the method you are using doesn't see, or if we somehow gained some \r line endings in the last commit. It is also possible, however, that I simply goofed.
I just checked the file on Microsoft SQL Server 2014 instead of MySQL, and the file is fine! No duplicates found. For some strange reason MySQL is not handling some words/characters properly. For example, SQL Server shows line 11,853,466 as "pAsSwOrD", while MySQL incorrectly shows "password", even though with those settings it should treat the words as case-sensitive. Anyway, I will look into what MySQL is doing wrong, but SQL Server is fine. I can delete the words with length = 7 very quickly/easily with SQL and share them if you want.
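For context, this kind of behavior is usually a collation setting rather than corrupted data: MySQL's default string collations are case-insensitive, so comparisons and grouping treat 'password' and 'pAsSwOrD' as the same value unless a binary collation or cast is used. A minimal sketch, assuming a hypothetical `wordlist` table with a single `word` column:

```sql
-- Hypothetical table/column names (wordlist.word), for illustration only.

-- With MySQL's default case-insensitive collation, these two words compare equal:
SELECT 'password' = 'pAsSwOrD';          -- returns 1 ("equal")

-- Casting to binary makes the comparison case-sensitive:
SELECT BINARY 'password' = 'pAsSwOrD';   -- returns 0 ("different")

-- The same applies to duplicate detection: grouping on the raw column collapses
-- case variants, grouping on a binary cast does not.
SELECT CAST(word AS BINARY) AS w, COUNT(*) AS n
FROM wordlist
GROUP BY w
HAVING n > 1;
```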
Phew, good to hear the capitalization issue was on MySQL's end. It would be great if you could delete the 7-character lines, but I'll admit I'm a bit hesitant to use a different method to clean up the files. The original duplicate issue came from a mix-up of \r and \n line endings. Can you control what kind of line endings the output would have? Not being familiar with SQL, I'd predict that Microsoft SQL would use \r line endings by default. Is that something you can control?
Yes, in Microsoft SQL Server you can choose between {CR}{LF}, {CR}, or {LF} as the row delimiter when exporting to a text file.
@sirblackjack, if you were to create those, what would be the best way to deliver them to me? Since they'd be so large, I don't think a push from your end would work, and I've got other files to wrap up into a torrent. I suppose you could 7-Zip them down to a much smaller size and use Mega Upload.
Done! I uploaded the files with 7z compression to Mega: The only thing I did is output the words with LENGTH >= 8. I ended up doing everything with SQLite on a CentOS VPS, which turned out to be much faster for this task than MySQL (I tried both) and also doesn't have any problem with case sensitivity. I have Microsoft SQL Server Express Edition on my desktop, but it has a 10 GB database limit, which was a problem for the largest file, and my upload speed is also quite slow. Downloading/uploading directly between my VPS and Mega was a much faster option for me.
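For reference, the SQLite side of that workflow can be sketched roughly as below. Table, column, and file names are placeholders rather than the actual files, and the dot-commands shown in comments are sqlite3 shell commands, not SQL:

```sql
-- Placeholder names throughout; a rough sketch, not the exact commands used.
CREATE TABLE words (word TEXT);

-- In the sqlite3 shell, the raw list can be loaded with a dot-command, e.g.:
--   .import wordlist.txt words

-- Output only the WPA-valid entries (8 characters or more), keeping the
-- original probability ordering via the insertion rowid. SQLite compares
-- strings byte-for-byte, so nothing is collapsed by case here.
SELECT word
FROM words
WHERE LENGTH(word) >= 8
ORDER BY rowid;

-- In the shell, the output can be sent to a file by issuing
--   .once filtered.txt
-- immediately before the SELECT.
```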
Looks good! No \r's! Thanks, I'll wrap this up into a release.
In some of the Release 2 files, a blank-space character was at the end of every line. In those cases, I removed the final blank-space character from every line. Some files, however, were not consistent about beginning or ending with blank-space characters; in those instances I left them in place, since I had reason to believe the spaces were part of the data. In any case, the '\r' vs '\n' problem that originally caused this issue is fixed in Release 2.
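Purely as an illustration of that trim (the release cleanup itself was not necessarily done this way), stripping exactly one trailing space from every entry could look like this against the same kind of hypothetical words(word) table:

```sql
-- Illustration only; assumes a hypothetical words(word) table.
-- Removes exactly one trailing space per entry, unlike RTRIM, which would
-- strip every trailing space and could over-trim entries where the spaces
-- are part of the data.
UPDATE words
SET word = SUBSTR(word, 1, LENGTH(word) - 1)
WHERE word LIKE '% ';
```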
There are duplicated entries for some words in the Top 31 Million, Top 102 Million and Top 1.8 Billion files. As an example, the word 'password' can be found on both line 1 and line 11,853,466 of the files.
I am not good with Unix commands, but the files can easily be fixed using SQL / MySQL. I already fixed them with the code I am sharing below, which also removes words with a length of 7 characters. For the 31 Million file, 302,363 entries were removed after cleaning. The code is an example for the 31 Million wordlist, but the same code can be used for the other wordlists by just changing the name of the .txt file:
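The SQL itself was not captured in this thread, but a rough MySQL sketch of the cleanup being described might look like the following. Table, file, and column names are placeholders, the binary collation is there to avoid the case-sensitivity surprise discussed above, and this is an illustration, not the original script:

```sql
-- Placeholders throughout; a sketch of the described cleanup, not the original script.
CREATE TABLE wordlist (
  word VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;   -- binary collation keeps comparisons case-sensitive

-- Load the raw wordlist, one word per line (placeholder file name).
LOAD DATA LOCAL INFILE 'Top31Million.txt'
INTO TABLE wordlist
LINES TERMINATED BY '\n'
(word);

-- Write out unique words of 8+ characters, dropping the 7-character entries,
-- with \n line endings (placeholder output path; INTO OUTFILE is subject to
-- the server's secure_file_priv setting).
SELECT DISTINCT word
INTO OUTFILE '/tmp/Top31Million-clean.txt'
LINES TERMINATED BY '\n'
FROM wordlist
WHERE CHAR_LENGTH(word) >= 8;
```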