De-duplicate items #2

Closed
Miserlou opened this issue Apr 26, 2017 · 14 comments

@Miserlou

Miserlou commented Apr 26, 2017

Looks like there could be quite a few dupes in here. For instance, "password" is at 1 and 19: https://github.com/berzerk0/Probable-Wordlists/blob/master/Real-Passwords/WPA-Length/Top76-probable-WPA.txt

@Miserlou
Author

Good project though!

Would love to see a list of WPA-formatted passwords that come just from router/wifi sources, not user-passwords.

@berzerk0 self-assigned this Apr 26, 2017
@berzerk0 added this to the Running as we speak milestone Apr 26, 2017
@berzerk0
Owner

berzerk0 commented Apr 26, 2017

Duplication - this is me getting caught by the classic invisible line-ending difference between Windows and Linux.
Rev 1.1 will have this fixed in the main files; the Chunk files will take longer.
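
To see why such entries slip past de-duplication, here is a minimal sketch, assuming a made-up three-line sample in a throwaway temp file (not taken from the actual lists). A line ending in a Windows carriage return (`\r`) is a different byte string from the same word without one, so a byte-wise unique pass keeps both.

```shell
# Build a tiny list where the second "password" carries a Windows CR (\r).
tmpdir=$(mktemp -d)
printf 'password\n123456\npassword\r\n' > "$tmpdir/mixed.txt"

# Under LC_ALL=C, sort -u compares raw bytes, so "password" and
# "password\r" both survive as separate entries.
raw_count=$(LC_ALL=C sort -u "$tmpdir/mixed.txt" | wc -l | tr -d ' ')

# Deleting every CR first makes the two entries identical.
clean_count=$(tr -d '\r' < "$tmpdir/mixed.txt" | LC_ALL=C sort -u | wc -l | tr -d ' ')

echo "unique lines before stripping CR: $raw_count"
echo "unique lines after stripping CR:  $clean_count"
```

The `tr -d '\r'` step removes every carriage return in the file, which is only safe if no entry is meant to contain one.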

WPA-formatted sources - I have found Wordlists that include "WPA" in the title, but that isn't much of a guarantee that they exclusively come from router/wifi sources.

It is also possible (and equally not possible, as I am asserting this with zero evidence) that the trends for common passwords do not change dramatically if they are used for a Router or for an email address. It seems just as likely to me that people see it as a generic "password" rather than "the Wifi password."

I'll see if I can find some sources with more background, but I have doubts.

EDIT
Of course, today I went somewhere where the Guest Wifi password was "wireless guest"

@WiseNerd

WiseNerd commented Apr 26, 2017

An easy fix for the dupes that worked for me was issuing :%s/\r\+$// in vim to kill the trailing whitespace artifacts from Windows, and then running uniq -u passfile.txt > cleanpassfile.txt. Cool project.
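
One caveat worth checking before running this on the full lists: `uniq -u` prints only the lines that are not repeated, so it drops every copy of an adjacent duplicate instead of keeping one, whereas plain `uniq` keeps a single copy. A minimal sketch, assuming a made-up sample written to a temp file (the name `passfile.txt` is borrowed from the command above):

```shell
tmpdir=$(mktemp -d)
# Made-up sample with one adjacent duplicate.
printf 'password\npassword\n123456\n' > "$tmpdir/passfile.txt"

# Plain uniq collapses adjacent repeats down to one copy.
keep_one=$(uniq "$tmpdir/passfile.txt")

# uniq -u prints only lines that are NOT repeated, so "password" vanishes.
only_unrepeated=$(uniq -u "$tmpdir/passfile.txt")

printf 'uniq:\n%s\n' "$keep_one"
printf 'uniq -u:\n%s\n' "$only_unrepeated"
```

For a ranked password list, losing the most common entries entirely is probably not the intended result, so plain `uniq` (or an order-preserving filter) looks like the safer choice.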

@ghost

ghost commented Apr 26, 2017

@WiseNerd So if you already fixed it, why not make a PR?

@iancnorden

PR from me shortly for de-dupe. Great work.

@berzerk0
Owner

@iancnorden You're gonna beat me to the punch!
I have the desktop chugging away, but won't be back to upload changes for a half day or so

@iancnorden

Now it's a race! I had not realized the size; git clone is still chugging away!

@WiseNerd

@blobgo Well, my MacBook's limited DDR2 memory would be neutered by sanitizing that entire thing, so I only fixed a small part, mostly out of curiosity. But I was hoping to save somebody some time nonetheless :)

@iancnorden

De-dupes still running.

@berzerk0
Owner

berzerk0 commented Apr 26, 2017

Initial De-Dupes (up to ~30 Million Non-Spec and WPA) are done, looks like I can't do the big ones in parallel - probably done by tomorrow.

Or so I thought, they didn't come out right.

@WiseNerd I was using

awk '!seen[$0]++' hasDupes > doesntHaveDupes 

which I assumed started at the top and worked its way down, but then for one of the files it popped "password" out of the 2nd slot. No way.

uniq only works if two identical lines are next to one another, unfortunately.
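
A minimal sketch of combining the two steps discussed here, assuming the line-ending problem is stray carriage returns: strip them first, then apply the `awk` first-occurrence filter, which keeps lines in input order. The file names are borrowed from the command above; the sample contents are made up.

```shell
tmpdir=$(mktemp -d)
# Made-up sample: "password" reappears later with a trailing CR,
# and "123456" reappears as a plain, non-adjacent duplicate.
printf 'password\n123456\nqwerty\npassword\r\n123456\n' > "$tmpdir/hasDupes"

# uniq alone changes nothing here: no two identical lines are adjacent.
uniq_lines=$(uniq "$tmpdir/hasDupes" | wc -l | tr -d ' ')

# Strip CRs, then keep only the first occurrence of each line, in input order.
tr -d '\r' < "$tmpdir/hasDupes" | awk '!seen[$0]++' > "$tmpdir/doesntHaveDupes"

echo "lines after uniq: $uniq_lines"
cat "$tmpdir/doesntHaveDupes"
```

Note that `awk` holds every distinct line in memory, which may matter for the largest files.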

I might just have to compile again from sources - unless @iancnorden's experience turns up a solid de-duping method

@iancnorden

iancnorden commented Apr 26, 2017

Chewing on the folder with Top2Bill*

164/958 completed, started around 1400 eastern.

If curious, thanks to https://github.com/ltdenard ... and this will have to continue overnight at this rate.

for f in ./*; do [ -f "$f" ] || continue; sort -u "$f" > /tmp/tmp1 && mv /tmp/tmp1 "$f"; done

@berzerk0 changed the title from "De-deplicate items" to "De-duplicate items" Apr 27, 2017
@palexhorse

Can all unique combinations be put into a new file, or do you just want the duplicates removed?

@berzerk0
Owner

For Rev 1.1 we aim to just remove the duplicates while otherwise preserving order.
The "duplicates" are likely illusory: invisible line-ending characters probably make otherwise identical entries count as separate ones.
Removing them will have some effect on overall accuracy.
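
The order requirement matters when choosing a tool. A minimal sketch with a made-up sample, ranked most-common first: `sort -u` removes duplicates but returns the lines sorted, while the `awk '!seen[$0]++'` filter mentioned earlier in the thread keeps the first occurrence of each line in its original position.

```shell
tmpdir=$(mktemp -d)
# Made-up ranked list: most common entry first, with one repeat.
printf 'password\n123456\npassword\n' > "$tmpdir/ranked.txt"

# sort -u de-duplicates but re-orders (byte order under LC_ALL=C).
sorted=$(LC_ALL=C sort -u "$tmpdir/ranked.txt")

# awk keeps the first occurrence of each line and leaves the ranking intact.
ranked=$(awk '!seen[$0]++' "$tmpdir/ranked.txt")

printf 'sort -u:\n%s\n' "$sorted"
printf 'awk:\n%s\n' "$ranked"
```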

Rev 2.0 will have the newlines weeded out at the source, so this problem will not carry over.

@berzerk0
Owner

De-Duped Rev 1.1 is live now, but does not contain the largest files.

Rev 1.2 will, in torrents with compression.

Closing this in light of the release of 1.1 and the impending release of 1.2.

@berzerk0 modified the milestones: Closed, In process May 15, 2017