Request to Carefully Look Through The Domains #362

Closed
sr093906 opened this issue Feb 16, 2023 · 21 comments

@sr093906

sr093906 commented Feb 16, 2023

https://github.com/zakird/crux-top-lists

The dataset adheres as closely as possible to user-initiated pageloads (e.g., it excludes traffic from iframes).

So, please treat them as domains visited by real humans. Based on that assumption, quite a few of them can or should be treated as false positives (FPs).

The list is generated by downloading the repo's latest CSV file and stripping the http:// and https:// prefixes.

After that, entries seen in Fake, Threat Intelligence Feeds, DoH/VPN/TOR/Proxy Bypass (complete edition), Safesearch not supported, Dynamic DNS, Badware Hoster and Personal are removed.

Finally, common entries between the processed file and the raw domain version of ultimate blacklist are listed.
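
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the `origin` column name is an assumption about the CSV layout, and the helper names (`strip_scheme`, `load_crux_hosts`, `candidate_false_positives`) and the sample domains are hypothetical.

```python
import csv
import io
from typing import Iterable, List, Set


def strip_scheme(origin: str) -> str:
    """Drop a leading http:// or https:// and any trailing slash."""
    for prefix in ("https://", "http://"):
        if origin.startswith(prefix):
            origin = origin[len(prefix):]
            break
    return origin.rstrip("/").lower()


def load_crux_hosts(csv_text: str) -> Set[str]:
    """Read the toplist CSV (assumed to have an 'origin' column) into a set of hosts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {strip_scheme(row["origin"]) for row in reader}


def candidate_false_positives(
    crux_hosts: Iterable[str],
    excluded_lists: Iterable[Iterable[str]],
    blocklist: Iterable[str],
) -> List[str]:
    """Remove hosts found in the excluded lists, then keep those also on the blocklist."""
    remaining = set(crux_hosts)
    for excluded in excluded_lists:
        remaining -= set(excluded)
    return sorted(remaining & set(blocklist))


if __name__ == "__main__":
    # Made-up sample data; real inputs would be the downloaded files.
    sample_csv = (
        "origin,rank\n"
        "https://blog.example.com,1000\n"
        "http://tracker.example.net,5000\n"
        "https://phish.example.org,10000\n"
    )
    hosts = load_crux_hosts(sample_csv)
    threat_feed = {"phish.example.org"}  # stands in for the excluded lists
    ultimate = {"blog.example.com", "phish.example.org", "other.example"}
    print(candidate_false_positives(hosts, [threat_feed], ultimate))
```

Note that this compares exact hostnames only; a blocklist entry for a parent domain would not match its subdomains here.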

There are some betting and porn sites there, of course. As for the others, some are clearly FPs, like those starting with blog., login., and so on.

5.txt

@sr093906 sr093906 added the question Further information is requested label Feb 16, 2023
@hagezi
Owner

hagezi commented Feb 16, 2023

Skimming the list, I found some trackers, crappy hosts, ad domains, popup porn ads, subdomains of main domains that are blocked for good reason, and so on. Analyzing this list would take days to weeks. At the moment I don't have the time ...

@durablenapkin

Interesting dataset - might be handy for TLD discovery but other than that it's almost impossible to analyze/test

@sr093906
Author

Given the scant interest and negative feedback, I will look through the list myself and pick out 'a few' FPs as my mood allows. Stay tuned.

@hagezi
Owner

hagezi commented Feb 17, 2023

Thanks for the help.

@hagezi hagezi closed this as completed Feb 17, 2023
@sr093906
Author

Please re-open it. I haven't finished it.

@sr093906
Author

I haven't even finished those beginning with 'a'.

@hagezi
Owner

hagezi commented Feb 17, 2023

Ok, sorry.

You can also post the domains in this issue, no need to open a new one for each. But, as you like ...

@hagezi hagezi reopened this Feb 17, 2023
@martijk

martijk commented Feb 18, 2023

Some numbers for those interested:

| List | Domains in top 1M |
|---|---|
| oisd small  | 597 |
| oisd big  | 2481 |
| HaGeZi multi light  | 2200 |
| HaGeZi multi normal  | 3034 |
| HaGeZi multi pro | 3542 |
| HaGeZi multi pro++ | 4871 |
| HaGeZi multi ultimate | 6511 |

Of course being part of the top 1 million most visited websites doesn't mean that it's a legit domain, so be careful with jumping to conclusions.
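
How these numbers were computed isn't stated in the thread. One plausible way to get comparable counts, assuming each blocklist and the toplist are plain one-domain-per-line files and that only exact matches are counted, is sketched below. The function names and sample data are hypothetical.

```python
from typing import Dict, Set


def parse_domain_list(text: str) -> Set[str]:
    """Parse a one-domain-per-line list, skipping blank lines and '#' comments."""
    domains = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip().lower()
        if line:
            domains.add(line)
    return domains


def overlap_counts(blocklists: Dict[str, Set[str]], toplist: Set[str]) -> Dict[str, int]:
    """Count, per blocklist, how many of its domains also appear in the toplist."""
    return {name: len(domains & toplist) for name, domains in blocklists.items()}


if __name__ == "__main__":
    # Made-up sample data standing in for the real list files.
    toplist = parse_domain_list("example.com\n# comment\nshop.example\n\nTracker.Example.NET\n")
    lists = {
        "small": {"tracker.example.net"},
        "big": {"tracker.example.net", "shop.example", "x.invalid"},
    }
    for name, count in overlap_counts(lists, toplist).items():
        print(f"| {name} | {count} |")
```

Counting wildcard or parent-domain matches instead would give different (likely higher) numbers.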

I applaud your efforts, by the way. Maybe a small subset of this list can be used to check whether a list is fit for inclusion, same as what oisd does. For example, some legit looking webshops are loaded from the NoTracking list, which in turn got them from here, which with all due respect looks like a pretty obscure and not frequently updated list. That raises the question whether NoTracking has a strict enough inclusion policy (and in return HaGeZi as well).

@hagezi
Owner

hagezi commented Feb 18, 2023

> I applaud your efforts, by the way. Maybe a small subset of this list can be used to check whether a list is fit for inclusion, same as what oisd does. For example, some legit looking webshops are loaded from the NoTracking list, which in turn got them from here, which with all due respect looks like a pretty obscure and not frequently updated list. That raises the question whether NoTracking has a strict enough inclusion policy (and in return HaGeZi as well).

Thanks for the advice, I'll see how I can get a handle on this.

@notracking: How do you see that? I would think about removing the source mentioned.

hagezi added a commit that referenced this issue Feb 18, 2023
@hagezi
Owner

hagezi commented Feb 19, 2023

@sr093906
First of all, thanks for all your effort. Before you start the next wave, please wait, I'm just straightening a few things. When the build of the new lists is online I'll let you know. Then please continue testing against the new Ultimate. It will be ready in a few hours. Thanks ...

@hagezi hagezi mentioned this issue Feb 19, 2023
@sr093906
Author

I hope the adjustments will save me much time.

@hagezi
Owner

hagezi commented Feb 19, 2023

@sr093906 The "cleaned" Ultimate is online. You should find fewer now ...

@sr093906
Author

@hagezi Thanks for the notification. I will continue the check later.

https://github.com/MISP/misp-warninglists/

Whitelist resources. Maybe some lists will be helpful.

@hagezi
Owner

hagezi commented Feb 20, 2023

@sr093906 I've done more cleanup, the build is running now and will be through in a few hours. I'll let you know ...

@hagezi
Owner

hagezi commented Feb 21, 2023

@sr093906 Update is live, cleaned pro to ultimate.

FYI:

Toplists: https://github.com/hagezi/dns-data-collection/tree/main/top

toplist.txt - Umbrella
toplist.tranco.txt - Tranco
toplist.chrome.txt - Chrome

@sr093906
Author

Thanks for letting me know. I will check.

@hagezi
Owner

hagezi commented Feb 23, 2023

@sr093906 STOP posting potential phishing domains for whitelisting. Check the phishing sources and report them there. If they are removed from the phishing lists, they disappear from my lists too!

Thanks,
Gerd

@hagezi
Owner

hagezi commented Feb 23, 2023

@sr093906 Please spare me these crap sites from the lower ranks of the Chrome toplist. For my TIF, I use the Umbrella/Tranco toplists as a whitelist, so if a host you reported is blocked by my TIF, it is on neither of those toplists. Report them upstream if you think they are false positives.

Thanks,
Gerd

@hagezi
Owner

hagezi commented Feb 23, 2023

I have now spent hours on these issues. I cleaned up the lists using the Chrome Toplist. Everything that was safe to remove was removed.

Done.

@hagezi hagezi closed this as completed Feb 23, 2023
@notracking

> > I applaud your efforts, by the way. Maybe a small subset of this list can be used to check whether a list is fit for inclusion, same as what oisd does. For example, some legit looking webshops are loaded from the NoTracking list, which in turn got them from here, which with all due respect looks like a pretty obscure and not frequently updated list. That raises the question whether NoTracking has a strict enough inclusion policy (and in return HaGeZi as well).
>
> Thanks for the advice, I'll see how I can get a handle on this.
>
> @notracking: How do you see that? I would think about removing the source mentioned.

Well, Stonecrushers list is basically a scraped version of:
https://www.watchlist-internet.at/unserioese-webseiten/
https://www.watchlist-internet.at/about-us/
https://www.oiat.at/

I will remove/disable it anyway, because it should at least have excluded their "Problematische Online-Shops" ("problematic online shops") list, which mostly contains shops with bad service (based on user reports).

@hagezi
Owner

hagezi commented Mar 5, 2023

Thanks!

5 participants