Search engine #8
Comments
+1 static indexes (and any objects computed out of the archives) should be static and versioned in merkledag just like the archives themselves. dependency: file metadata ipfs/ipfs#36 |
@rht I'm leaning towards the first one too. I added the latter two mainly because I believe YaCy works that way. The two main problems that (I think) need to be solved for static indexes are: discovering which nodes provide them (ipfs/notes#15), and how to aggregate them all. Index aggregation could either be done a) entirely client-side (easy to implement but inefficient, especially once the network gets large), or b) through cooperation between index nodes (more difficult to implement, but also more efficient for clients). @jbenet suggested that CRDT could be used to achieve (b), but I'm not familiar enough with it to make any intelligent comments :). I do know that trie merging is commutative (unlike some other search trees or hash tables), so it at least fits within CRDT's assumptions. The main thing to keep in mind is that these indexes are likely to become quite large, so you'd want to coordinate the index merging without nodes having to replicate much of each other's indexes (not sure how this would work within CRDT). One possible way of doing this is to partition prefixes across the nodes (since you only need to look at shared prefixes when performing the merge), but I'm not sure how this would work within IPFS. |
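A minimal sketch of the commutativity point above, assuming a plain in-memory character trie whose nodes hold sets of document IDs. The names `TrieNode`, `insert`, `lookup`, `merge` and `to_dict` are hypothetical and not taken from any IPFS code; this only illustrates why merge order should not matter.

```python
from typing import Dict, Set


class TrieNode:
    """One trie node: child links by character, plus the documents indexed under this exact term."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.docs: Set[str] = set()


def insert(root: TrieNode, term: str, doc_id: str) -> None:
    """Record that doc_id contains term."""
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.docs.add(doc_id)


def lookup(root: TrieNode, term: str) -> Set[str]:
    """Return the documents indexed under term (empty set if the term is absent)."""
    node = root
    for ch in term:
        node = node.children.get(ch)
        if node is None:
            return set()
    return set(node.docs)


def merge(a: TrieNode, b: TrieNode) -> TrieNode:
    """Merge two tries into a new one without mutating either input.

    Only prefixes present in BOTH tries are recursed into; a subtree that
    exists on one side only is reused as-is (shared, not copied).
    """
    out = TrieNode()
    out.docs = a.docs | b.docs  # set union is commutative
    for ch in a.children.keys() | b.children.keys():
        if ch in a.children and ch in b.children:
            out.children[ch] = merge(a.children[ch], b.children[ch])
        else:
            out.children[ch] = a.children[ch] if ch in a.children else b.children[ch]
    return out


def to_dict(node: TrieNode) -> dict:
    """Plain-dict view of a trie, used only to compare two tries for equality."""
    return {
        "docs": sorted(node.docs),
        "children": {ch: to_dict(child) for ch, child in node.children.items()},
    }
```

Because each node's result is just a set union plus a per-character recursion, `merge(a, b)` and `merge(b, a)` describe the same trie. The fact that the recursion only descends into shared prefixes is also what makes the prefix-partitioning idea plausible: two nodes would only need to exchange the parts of their tries that overlap.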
Looks like @krl is working on search functionality for starlog: https://github.com/krl/aolog (IRC) |
If you go with one of the first two options, you lose flexibility -- mainly because the IPFS executable is determining what the index should look like. If you go with option 3, then you can have any number of indices with different structures storing different things. I don't think the search functionality should be integrated into the IPFS structure itself, but should instead be built on top as a web service. The real question you should be asking is not what an index should look like, but what a database built on top of the IPFS system should look like. |
@nbingham1 None of them necessarily need to be integrated into ipfs itself, I'm just assuming nodes have some kind of indexing service running, which could be an entirely separate executable (and you could have multiple different ones running if you wanted). In terms of what a database on top of ipfs would look like, I'm in favor of one big json object transparently distributed over ipfs. You could then easily build radix trees and such on top of that. |
@nbingham1 @davidar indeed. very much indeed. |
Ah, ok, that makes much more sense. So the nodes run the indexing service and update the index whenever that node adds a file, but the index is still stored in a database format in the IPFS network. I was misinterpreting the options (Sorry! I just found out about IPFS yesterday!). |
@nbingham1 No need to apologise, clarifying questions are good :) |
Merging the tries via gossip is really viable: essentially, each node just shares the "head node" of the trie it stores on ipfs with its peers. Each peer performs a recursive merge of the index (which normally has the complexity of the smaller index, as you can re-use unaltered entries) and then propagates the new index. Then we can implement search in js by pointing to that ipfs node's search index by /ipns/local/searchindexroot or somesuch. Then anybody who wants can implement a search engine on client side with whatever UI or features they want. This falls into the classic Byzantine Generals problem of people messing with the index. A proof of accuracy based block-chain actually looks like a decent solution to this problem. |
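The "re-use unaltered entries" part of the merge described above can be sketched as follows, assuming trie nodes are stored as immutable, content-addressed JSON objects. The in-memory `store` dict is only a stand-in for IPFS, and `put`, `make_node`, `build` and `merge` are hypothetical helpers, not an existing API.

```python
import hashlib
import json

# Stand-in for IPFS: maps a content hash to an immutable trie node.
store = {}


def put(node: dict) -> str:
    """Store a node under the SHA-256 of its canonical JSON form and return that hash."""
    data = json.dumps(node, sort_keys=True).encode()
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = node
    return digest


def make_node(docs, children) -> str:
    """Create a node; children maps one character to the hash of the child node."""
    return put({"docs": sorted(set(docs)), "children": children})


def build(entries: dict) -> str:
    """Build a trie from {term: [doc ids]} and return the hash of its root ("head node")."""
    def rec(prefix: str) -> str:
        next_chars = {t[len(prefix)] for t in entries
                      if t.startswith(prefix) and len(t) > len(prefix)}
        return make_node(entries.get(prefix, []),
                         {ch: rec(prefix + ch) for ch in next_chars})
    return rec("")


def merge(ha: str, hb: str) -> str:
    """Merge two tries given their root hashes and return the merged root hash."""
    if ha == hb:
        return ha  # identical subtrees: reuse without descending
    a, b = store[ha], store[hb]
    children = {}
    for ch in set(a["children"]) | set(b["children"]):
        ca, cb = a["children"].get(ch), b["children"].get(ch)
        if ca is None:
            children[ch] = cb  # only in b: reuse b's subtree by hash
        elif cb is None:
            children[ch] = ca  # only in a: reuse a's subtree by hash
        else:
            children[ch] = merge(ca, cb)
    return make_node(a["docs"] + b["docs"], children)
```

Since the merged node is serialised canonically before hashing, two peers that merge the same pair of head nodes in either order should arrive at the same root hash, which is the value they would then gossip onwards.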
@BrendanBenshoof cool, sounds good. Do we actually need a blockchain though? I was thinking all of the (non-malicious) nodes would agree on the merged root node after the gossiping finishes, after which you should just be able to do a simple popular vote? |
We could rotate it into web of trust, but straight democracy is highly vulnerable to sybil attacks. We just wandered into the big open issue in distributed systems: how to deal with consensus with attackers. I'm not a fan of proof of work blockchains as a general solution to this problem. I like the idea of a block-chain here because there is a clear "proof of truthiness", in that nodes can sanity check a "sub-index" against the contents of the file it indexes to ensure it is at least accurate. Once all the added "sub-indexes" have been inspected, you can also confirm the deterministic merging of them into the "primary index" and agree on the new "root" hash. My only worry is that it is too wasteful in processing power and too complex. It might be more efficient just to deal with crap in the database. A block would look like:
Basic use dissemination would look like:
This also has the side effect of aggressively caching newly indexed files by all the search engine participants. It also feeds into my preference that indexing be an active choice by a content producer rather than a spider. |
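The "sanity check a sub-index against the contents of the file" step might be sketched like this, assuming every participant runs the same deterministic indexer. The tokenizer (lowercased alphanumeric runs) and the term-count index format are assumptions made for illustration; `build_sub_index` and `verify_sub_index` are hypothetical names.

```python
import re
from collections import Counter


def build_sub_index(text: str) -> dict:
    """Deterministically index one file: map each lowercased term to its occurrence count."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def verify_sub_index(text: str, claimed: dict) -> bool:
    """Check a claimed sub-index by rebuilding it from the file contents and comparing."""
    return build_sub_index(text) == claimed
```

A verifier has to fetch the file in order to run this check, which fits the observation above that participants end up caching newly indexed files. It also makes the cost concern concrete: every checker repeats the full indexing work.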
@BrendanBenshoof OK, I think I see what you mean, but distributed systems and associated attacks aren't really my area, and it sounds like you're more familiar with them than I am. Just to clarify what my thoughts were (though, as you say, I haven't thought about possible attacks on this): suppose you have two nodes, and they swap information in order to agree on the merge of both their indexes. They then both sign this hash, so that everyone else can verify that they're satisfied this merged index contains their own subindexes. You then do a similar thing with merged indexes, etc, so that you end up with a global root signed by everyone with a stake in it. Thinking about it some more, I agree popular vote wouldn't work in general. I guess what I meant was a vote from nodes that you trust (e.g. the bootstrap nodes), which you assume aren't malicious, but some of which might make mistakes. But like I said, I have no idea how well these things will work in practice ☺ @jbenet @whyrusleeping thoughts? |
Maybe cothorities are relevant here? Warning: I came across this recently and I'm still trying to make sense of it. To summarize from the slides: Collective authorities want to make tamper-evident public logs while solving the forking and freshness problems without falling back to a proof-of-work scheme, the risk of temporary forks, or a 51% attack vulnerability. I imagine IPFS could offer this, and individual projects could use it to sign their latest versions and indexes? |
@cryptix I haven't read the paper fully. Regardless, from its scope and aim, it can be used in 2 ways:
Also, since it is not crucial to get the timestamps of indexes (of immutable content) in order, I don't think a full-blown blockchain (be it proof-of-work or not) is necessary. There should be a much faster way than method 2 above. An alternative to fingerprinting the index is to verify the code used to build the index, provided that it can be verified to run in a sandbox with known parameters. (worth looking at: SCP, which uses quorum slices, which enable more organic growth of signing nodes) |
(ok, the fingerprint check is basically @BrendanBenshoof's "proof-of-truthiness", whose drawbacks have been mentioned. I favor the weak ordering method described in the CRDT discussion, if there is a protocol for this that has been fleshed out) |
I think it would make sense for index providers to sign the aggregated index once they've verified it contains their own index. That way "verification" is simply checking that it's been signed by the provider of the part of the index you're interested in. |
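A rough sketch of that check: a client accepts an aggregated root only if the provider it cares about has endorsed that exact hash. Python's standard library has no public-key signatures, so HMAC is used here purely as a stand-in. A real design would use an asymmetric scheme so that verifiers never hold the provider's secret. `sign` and `endorsed_by` are hypothetical names.

```python
import hashlib
import hmac


def sign(secret: bytes, root_hash: str) -> str:
    """Provider side: endorse an aggregated index root.

    HMAC-SHA256 is a symmetric stand-in for a real digital signature.
    """
    return hmac.new(secret, root_hash.encode(), hashlib.sha256).hexdigest()


def endorsed_by(secret: bytes, root_hash: str, signature: str) -> bool:
    """Client side: check that the endorsement matches this exact root hash."""
    return hmac.compare_digest(sign(secret, root_hash), signature)
```

With this shape, a client interested in one provider's part of the index only checks that provider's endorsement of the current root, rather than re-verifying the whole aggregate.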
I made a prototype search tool here (also available here if it's unavailable on ipfs). The search index is built from scraping json search results from a public searx instance (which itself aggregates results from major search providers). This means the only things being searched are page title, URL, and a snippet of the page's text. It also means the search index only has about 1100 items. The client-side search is done with lunr.js. (search source, scraper source) It's nothing impressive but was fun to hack together! |
@doesntgolf Very nice to see progress here ;) |
I've just launched an early alpha of a proper search engine, based on elasticsearch. http://ipfs-search.com |
@dokterbob thanks for sharing, looks very cool :) Maybe it's something you can submit a PR to add to https://github.com/ipfs/awesome-ipfs ? |
On a related vertical note, for the GA4GH Cancer Gene Trust, an IPFS based distributed genomic store, I've implemented an elastic search based crawler with web UI showing the network and allowing for searching by gene and clinical features. |
Nobody working on a distributed search engine? Something where searching costs money. |
@magik6k was experimenting with pre-computed sharded search for Wikipedia dumps. Magik, any input in this? |
There is https://github.com/magik6k/distributed-wiki-search. The code there isn't great, though it should give some idea how it may be implemented. |
I think this is a critical point if there is to be an HTML page-web (hyperlinked information space ... the whole of HTML-based web pages out there in the HTTP world). I don't know if there is going to be a web layer on top of IPFS, but this is going to be a tough nut to crack... I must admit that I discovered IPFS fairly recently, but I did not find it stated anywhere that search is a critical point for the plans to build the web 3.0, which it is... |
Having a decentralized search engine is definitely a difficult challenge, and I think this will be the job of a company rather than the IPFS core team. Google search already works fine with IPFS through gateways, so centralized search engines should be able to function just fine while a decentralized solution is developed by the community. |
As mentioned by dokterbob above, we are currently developing an ipfs search engine that indexes within IPFS rather than searching through a gateway. Our search engine listens to DHT chatter to index content. We just revived the search engine, and it is in an early (let's say) PoC mode. In general it works, but there's a lot to do. |
@chainify-io Would you be interested in another collaborator in your efforts? I've been writing down ideas for a distributed search for web pages and it'd be cool to discuss and see if we can work together. |
Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:
Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?