Donate crawl data to the Internet Archive #353

Open
upintheairsheep opened this issue Dec 22, 2023 · 1 comment
Comments

upintheairsheep commented Dec 22, 2023

Hello, this is more related to Brave Search itself, but could you get in contact with the Internet Archive and donate crawl data to the Wayback Machine? Alexa Internet did that until its dissolution by Amazon in 2020. The Wayback Machine is an extremely useful resource used all over the world by researchers, journalists, and basically anyone on YouTube investigating something online, such as the origin of an urban legend. Since you already have the Wayback Machine integrated into the browser, donating the crawl data should reduce the chance of a link being completely lost to time. The exchange could also go the other way: ask the Archive staff for a list of all URLs archived on the Wayback Machine, deduplicate them, and add the links that Brave has not crawled and that are still up to the search results, helping make Brave a third search engine to rival Google and Bing. Other good sources of links could be https://ODCrawler.xyz and many AI image datasets.
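For illustration only, the deduplicate-and-filter step described above could look roughly like the sketch below. Everything in it is an assumption rather than anything Brave or the Internet Archive actually provides: `normalize`, `candidate_urls`, and the `is_live` callback are hypothetical names, the URL lists are assumed to already be available as plain strings, and the normalization is deliberately simplistic (lowercased scheme and host, fragment dropped, empty path treated as `/`).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Reduce a URL to a simple comparison key (hypothetical, simplistic rules)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"  # treat "http://host" and "http://host/" as equal
    # Lowercase scheme and host, keep the query string, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def candidate_urls(archived, crawled, is_live):
    """Return archived URLs that are unique, not already crawled, and still up.

    `archived` and `crawled` are iterables of URL strings; `is_live` is a
    caller-supplied function (for example, an HTTP check) returning True
    when a URL still responds.
    """
    crawled_keys = {normalize(u) for u in crawled}
    seen = set()
    result = []
    for url in archived:
        key = normalize(url)
        if key in seen or key in crawled_keys:
            continue  # duplicate, or already present in the crawl
        seen.add(key)
        if is_live(key):
            result.append(key)
    return result
```

Passing the liveness check in as a function keeps the sketch free of network code; a real pipeline would also need rate limiting and far more careful URL canonicalization.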

@andreas-hartmann
I also think the dataset generated by this project should become public or be linked to an existing public web crawler project instead of creating another walled garden index.
