#Web search for a planet: Google cluster architecture
- Commodity hardware
- Optimize for best aggregate throughput
- Replicate and parallelize to achieve that goal
- Price/performance beats peak performance in distributed systems
- Multiple clusters of 1000+ nodes around the world
- DNS load balancing for user geoproximity
- processing of the query is entirely local to a cluster
- GWS (Google Web Servers) receive the query and format output into HTML
- Consult inverted index that maps words with sets of documents (hit list)
- Determine set of relevant documents by combining all hit lists and giving scores to the documents
- Scale of multiple TBs of data for both hit lists and the index
- MapReduce used to parallelize search by dividing the inde into multiple shards
- If shard replica is down ignore it and try to bring it back
- Result: list of docids, sorted
- Using list of docids, compute URIs and titles
- Document servers, called docservers fetch docs directly from disk
- MapReduce this operation
- Docservers effectively have access to a low latency version of the entire WWW
- System is optimized so that read-only is the only needed operation basically
- Divide query stream into multiple streams
- Because individual shards need not to communicate with each other:
- more shards means faster service, almost linearly