by Kragen Sitaker
Distributed posting list joins (aka inverted list intersections) are
the biggest unsolved problem for a geographically-distributed
full-text web-search engine (according to the Nutch FAQ, if I read it
correctly, in the section
entitled “Will Nutch be a distributed, P2P-based search engine?”).
Here are some thoughts which I think include a workable solution.
I talked about posting list colocation before, in a post entitled
“distributed peer-to-peer full-text web searching”, in February 2004,
on kragen-tol.
Posting list colocation seems similar to the coincidence problem for
pub-sub rendezvous discussed in
Consistent Hashing Multicast
— I wonder if there’s a way to cut down the numbers in a similar
way? Maybe as discussed in Loo, Huebsch, et al.’s [http://iptps04.cs.ucsd.edu/papers/loo-hybrid.pdf “The Case for a
Hybrid P2P Search Infrastructure”] you could avoid
moving large posting lists around by moving queries to them instead of
moving them to queries? Hmm…
That paper also mentions something about join indexes, which sound
like they could be a useful optimization for frequent or expensive
joins.
In the same conference, Shi, Yang, et al. published a paper entitled
[http://iptps04.cs.ucsd.edu/papers/shi-keysearch.pdf “Making Peer-to-Peer Keyword Searching Feasible Using Multi-level
Partitioning”].
This brilliant paper proposes a solution to the distributed
posting-list join problem which looks to me like a major advance: it
should make peer-to-peer keyword searching of filenames work
noticeably better. But it seems to me that it doesn’t yet solve the
problem for full-text web search.
Basically it just puts each group of documents in its own group of
nodes (called a level-L subgroup in the paper) which is assumed to be
tightly connected. Unfortunately, the second-largest posting list in
a query must be transmittable across the group’s network within the
required response time. Putting some realistic numbers on
this, “http” and “www” might each have 16GB of total posting-list
data, each group of nodes might have 1Mbps of bandwidth available to
any member, and you might want 1-second response time. This means you
must divide your document space into at least 128000 groups, and send
any query to all 128000 groups. (The paper includes a tree multicast
protocol that makes it possible to do this with reasonable latency.)
This unfortunately is comparable to the total number of nodes that can
be expected to participate in such a system, so when you try to use
this technique to replace Google, it becomes indistinguishable from
the partition-by-document technique used by Gnutella, and requires
five orders of magnitude too much work from the network as a whole for
most queries. This will cause it to perform poorly under many
simultaneous queries, which doesn’t appear to be considered in the
paper.
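
Here’s that group-count arithmetic as a few lines of Python, in case
you want to fiddle with the guesses; the 16GB, 1Mbps, and one-second
figures are the same rough guesses as above, not measurements:

    posting_list_bytes = 16e9    # total "http" or "www" posting-list data
    group_bandwidth_bps = 1e6    # bandwidth available to any group member
    response_time_s = 1.0        # desired response time

    # Each group can move at most this many bytes within the deadline:
    bytes_per_group = group_bandwidth_bps / 8 * response_time_s   # 125 kB

    groups_needed = posting_list_bytes / bytes_per_group
    print(int(groups_needed))    # 128000; every query must reach all of them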
The paper actually claims that its result is practical for full-text
web search, but its performance results are simulated. They find on
the order of 30GB of total data transferred for a search of 3 * 10^9
documents in a network of 16384 nodes, or about 1.8MB transferred per
node on average. On a 1Mbps link, that’s 15 seconds. Accordingly,
the latency numbers presented stop around 2 seconds per search with
L=4 and 100 million documents. (The paper is ambiguous, but I believe
the 30GB is per search, not a total over several queries.)
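
Here’s that check in Python; the 30GB and 16384-node figures are the
paper’s simulated numbers as I read them, and the 1Mbps link is my
assumption:

    total_transfer_bytes = 30e9   # ~30GB moved per search (simulated)
    nodes = 16384
    link_bps = 1e6                # 1Mbps per node

    per_node_bytes = total_transfer_bytes / nodes        # ~1.8MB
    seconds_at_1mbps = per_node_bytes * 8 / link_bps     # ~15 seconds
    print(per_node_bytes, seconds_at_1mbps)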
So here’s a slight tweak on this idea which, I think, makes it
practical: the aggregate bandwidth available to a particular posting
list should be proportional to the size of the posting list. We
achieve this by distributing posting lists for more common terms over
a larger number of nodes.
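
Here’s a sketch of that placement rule in Python; the half-megabit,
five-bytes-per-posting, and one-second figures are the assumptions I
use in the worked example below, not properties of any real network:

    import math

    BYTES_PER_POSTING = 5
    NODE_BANDWIDTH_BPS = 0.5e6    # half a megabit per second each way
    RESPONSE_TIME_S = 1.0

    def nodes_for_term(num_postings):
        """How many nodes a term's posting list must be spread across
        so that each node's slice is transmittable within the deadline."""
        slice_budget_bytes = NODE_BANDWIDTH_BPS / 8 * RESPONSE_TIME_S  # 62500
        postings_per_node = slice_budget_bytes / BYTES_PER_POSTING     # 12500
        return math.ceil(num_postings / postings_per_node)

    print(nodes_for_term(1_000_000))    # "autism": 80 nodes
    print(nodes_for_term(18_000_000))   # "criteria": 1440 nodes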
My “more on distributed peer-to-peer full-text web searching” post in
February 2004 gives a challenge problem: 18 million postings for
“criteria”, one million postings each for “autism” and “eye-contact”,
an intersection of only 4200 documents, and the whole join to be
computed in under a second.
Suppose your chosen strategy was to distribute the 2M postings for
“autism” and “eye-contact” to the machines where the “criteria”
postings lived, then have those machines do the merge and send all the
remaining results to a single collection point, in under a second; and
suppose that all the machines involved have only half a megabit per
second available in each direction. Each posting is perhaps five
bytes, so we have 10 megabytes of posting data to scatter/gather in a
second, or 80 megabits; at half a megabit per second per sending
machine, that takes 160 senders, so the “autism” and “eye-contact”
postings must each be scattered among 80 machines, 12500 postings per
machine. Each of those machines then generates a
half-megabit spike of traffic for a second sending its load to the
“criteria” machines (of which there are 1440, so each sender must send
to 18-20 destinations); those machines filter the incoming packets in
a streaming fashion against their local hash tables of “criteria”
postings, leaving only two small lists of autism&criteria and
eye-contact&criteria that must be merged.
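
Here’s a minimal sketch in Python of the filtering step a single
“criteria” machine would do; the function and names are made up for
illustration, and it assumes a posting is just a (term, docid) pair:

    def streaming_filter(local_criteria_docids, incoming_postings):
        """local_criteria_docids: this node's slice of the "criteria" list.
        incoming_postings: iterable of (term, docid) pairs scattered here
        by the "autism" and "eye-contact" senders.  Yields only postings
        whose docid also appears in the local "criteria" slice."""
        local = set(local_criteria_docids)      # ~12500 entries when full
        for term, docid in incoming_postings:
            if docid in local:
                yield (term, docid)             # e.g. part of autism&criteria

    # Tiny example: only docid 7 appears in both lists.
    hits = list(streaming_filter({3, 7, 42},
                                 [("autism", 7), ("autism", 9),
                                  ("eye-contact", 7)]))
    print(hits)   # [('autism', 7), ('eye-contact', 7)]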
If all these machines must maintain in-RAM hash tables of all posting
list parts they hold, and they’re not allowed to hold more than 12500
postings of any particular posting list (so they can transmit them
over the network in a second), then each hash table might be 12500 *
20 bytes (8 bytes of pointer table + 4 bytes of next-pointer + 8
bytes of docid) = 250 kB. This means that a 1GB machine can be responsible
for 4000 full-sized posting-list parts, or 50 million postings.
(Maybe use a sorted list instead and save a factor of 4?)
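
Here’s that memory arithmetic in Python, including the sorted-list
alternative; the 20-bytes-per-posting hash overhead and the 5-byte
packed posting are the same estimates as above:

    POSTINGS_PER_PART = 12500
    HASHED_BYTES_PER_POSTING = 8 + 4 + 8   # pointer table + next-pointer + docid
    RAM_BYTES = 1e9

    part_bytes = POSTINGS_PER_PART * HASHED_BYTES_PER_POSTING   # 250 kB
    parts_per_node = RAM_BYTES / part_bytes                     # 4000
    postings_per_node = parts_per_node * POSTINGS_PER_PART      # 50 million

    # The parenthetical alternative: a packed sorted list at about 5
    # bytes per posting instead of 20, a factor of 4 smaller.
    sorted_part_bytes = POSTINGS_PER_PART * 5                   # 62.5 kB

    print(part_bytes, parts_per_node, postings_per_node, sorted_part_bytes)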
For reference, my current mailbox contains 300MB, 1.2 million terms,
14 million postings, and 36000 documents (mail messages). If those
numbers are typical, this scheme needs one network node per 130
thousand documents or 1GB of text. This scales linearly with RAM on
those machines, but it still means we can only index roughly as much
text as those machines have RAM. This limitation doesn’t change with
available bandwidth.
Unfortunately, you also need enough network nodes that the longest
posting list can be distributed at 12500 postings per node, and
practically speaking, I think that means that you need one node per
12500 documents. This limitation doesn’t change with available RAM,
but it does change with available bandwidth. You’d need about 5
megabits per acceptable response time each way (that is, 5 megabits
per second for one-second responses, or 2.5 megabits per second for
two-second responses) for this limit to equal the limitation of 1GB
RAM.
So that’s an interesting pair of limitations. In this scheme, we can
only effectively use about 200MB of RAM per megabit of bandwidth, or
vice versa, and either one buys us about 25000 documents or 200MB
indexed.
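
Here’s a sketch in Python that derives both limits from the mailbox
ratios, assuming (questionably) that my mailbox is typical of the web:

    POSTINGS_PER_DOC = 14e6 / 36000       # ~390 postings per document
    TEXT_BYTES_PER_DOC = 300e6 / 36000    # ~8.3 kB of text per document

    # RAM limit: 50 million postings of hash table fit in 1GB.
    docs_per_gb_ram = 50e6 / POSTINGS_PER_DOC                 # ~130,000 docs
    text_per_gb_ram = docs_per_gb_ram * TEXT_BYTES_PER_DOC    # ~1GB of text

    # Bandwidth limit: the longest posting list caps each node at 12500
    # documents per half-megabit (per second of response time).
    docs_per_half_megabit = 12500
    megabits_to_match_ram = docs_per_gb_ram / docs_per_half_megabit * 0.5

    print(int(docs_per_gb_ram), text_per_gb_ram / 1e6, megabits_to_match_ram)
    # roughly 128571 docs, ~1071 MB of text, ~5.1 megabits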
1GB RAM now retails around $200, or about $40 per 200MB.
The Shi et al. paper discusses ways to dramatically reduce bisection
bandwidth requirements; these are not discussed here on the theory
that the planet-wide internet probably has sufficient bisection
bandwidth. However, their geographic clustering technique does apply
to the design discussed here.
I’m guessing that right now core bandwidth costs about $20 per
bidirectional megabit per second per month. (This means that every
two months, the bandwidth cost equals the memory cost, and that’s only
going to get shorter.) Indexing the 40 terabytes of data on the Web in
this fashion would require 40 000 one-gigabyte-RAM, five-megabit
machines, which would cost about $100 each per month to run if they
were doing searches flat-out all the time. Burst bandwidth is cheaper
than sustained bandwidth, though.
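
Here’s the cost arithmetic in Python, using the prices above and
ignoring the burst-versus-sustained distinction:

    WEB_TEXT_BYTES = 40e12          # 40 terabytes of web text
    TEXT_PER_NODE_BYTES = 1e9       # each 1GB-RAM node indexes ~1GB of text
    BANDWIDTH_PER_NODE_MBPS = 5
    DOLLARS_PER_MBPS_MONTH = 20
    RAM_DOLLARS_PER_NODE = 200      # 1GB of RAM at ~$200

    nodes = WEB_TEXT_BYTES / TEXT_PER_NODE_BYTES                       # 40,000
    bandwidth_cost_per_node = BANDWIDTH_PER_NODE_MBPS * DOLLARS_PER_MBPS_MONTH
    months_until_bandwidth_equals_ram = (RAM_DOLLARS_PER_NODE /
                                         bandwidth_cost_per_node)

    print(int(nodes), bandwidth_cost_per_node, months_until_bandwidth_equals_ram)
    # 40000 nodes, $100/month each, RAM cost equals 2 months of bandwidth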
It’s very reasonable to expect 40 000 machines, or even ten times that
number, to participate in a global peer-to-peer network.
If you were going to build this system today, the user interface
should probably be in Korean, then translated into Japanese, then
possibly translated into Chinese and English; although there are more
potential participants in China and English-speaking countries, they
tend to lag far behind Korean and Japanese participants in available
bandwidth.