## Social recommendation networks

Often I’m interested in the same web pages my friends read. Sometimes, reading their blogs tells me what they read; del.icio.us helps, too. However, those approaches take some work per page: you have to explicitly post a link to each page to tell me you liked it. That’s nearly the same amount of work as telling me about cool web pages when we’re hanging out at the water cooler. Consequently, we never publish links to the majority of the pages we visit, and our friends never find out we liked them.

## Collaborative filtering: statistical recommendations

Several services, such as Alexa’s “what’s related” service and Amazon’s “similar items” service, take advantage of a central point of control in a system to track many people’s reading. Then, they predict what I will be interested in, from statistical profiles of people like me. These services provide many of the benefits of social recommendation networks, but they don’t require as much effort on the part of the users. They are known as collaborative filtering systems.

## Cobrowsing: publishing your URL history

In 1997, in a web page entitled Communal Web Browsing, I suggested that you could discourage children from browsing porn sites by just publishing a list of what they were browsing in real time, where their parents or other community members could watch; this avoids the problems Solid Oak Software has so vividly illustrated with censorware. Off-the-shelf software packages such as VNC, BackOrifice, and Groove support sharing a web browser. Most Americans, however, would prefer not to publicize their web-browsing habits indiscriminately. Even public-access computer terminals in public libraries commonly have some sort of privacy shield on the screen to restrict viewing to the person sitting at the machine.

If I were to build a piece of software that published every URL I visited on the Web, my friends could easily see which pages I had visited, with no extra effort on my part. If my friends also installed this software, I could see which pages had been most frequently visited by my friends — presumably the most popular ones would also be of interest to me. My friends might not be willing to install that software, though, because they might prefer a greater measure of privacy, such as that granted by software that requires an explicit action to publish each link.

## Bloom filters

There are these things called Bloom filters; they’re basically compact, lossy representations of sets of hash values. You feed a Bloom filter a hash, and it returns “yes” or “no”. “Yes” means that the set it represents may or may not contain the hash value; “no” means that it definitely does not. The probability of a “yes” for a value the set doesn’t contain is adjustable, and it’s lower for larger Bloom filters.
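A minimal sketch of such a filter in Python (the sizes and the MD5-slicing trick below are illustrative choices, not part of any particular implementation):

```python
import hashlib

class BloomFilter:
    """Compact, lossy set representation: 'no' is definite, 'yes' is probabilistic."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _hashes(self, item):
        # Derive k indices from one MD5 digest by slicing it into 4-byte chunks.
        digest = hashlib.md5(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, 'big') % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = 1

    def __contains__(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[h] for h in self._hashes(item))

bf = BloomFilter()
bf.add("http://example.com/")
print("http://example.com/" in bf)   # True
print("http://example.org/" in bf)   # False (almost certainly)
```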

Schachter and Ceglowski’s LOAF, a pun on FOAF, uses Bloom filters to help you prioritize your email by letting you recognize which incoming email comes from your friends’ correspondents — with an adjustable probability of a false positive. If that probability were 50%, then you’d expect a new correspondent to appear to be a correspondent of about 10 out of 20 friends whose LOAF files you have. The probability is about 1% that, by chance, they will appear to be a correspondent of more than 15, and only about 5% that they will appear to be a correspondent of more than 13. But if they actually are a correspondent of 7 out of your 20 friends, the probability is about 50% that they will appear to be a correspondent of more than 13.
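These figures can be checked with a quick binomial-tail calculation (a sketch; 20 friends and a 50% per-filter false-positive rate, as in the example):

```python
from math import comb

def p_more_than(n, p, t):
    """P(Binomial(n, p) > t): probability of more than t hits among n trials."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(t + 1, n + 1))

# A stranger matches each of 20 friends' filters by chance with p = 0.5.
print(round(p_more_than(20, 0.5, 15), 4))  # ~0.0059: "about 1%" in the text
print(round(p_more_than(20, 0.5, 13), 4))  # ~0.0577: "about 5%"

# A true correspondent of 7 friends: 7 certain hits, plus chance hits
# from the remaining 13 filters; more than 13 total means more than 6 by chance.
print(round(p_more_than(13, 0.5, 13 - 7), 4))  # 0.5: "about 50%"
```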

As the number of friends whose LOAF files you have increases, much smaller proportions of shared correspondents become reliably detectable.

This protects the privacy of your friends’ address books, since you can’t tell for sure whether any particular friend has an address in their address book — because of the false positive probability. But it still gives you an aggregate answer about your community’s familiarity with any particular email address.

## Decentralized collaborative filtering by publishing URL histories through Bloom filters to your social network

Suppose we apply this to URLs instead. Each day I’ll publish an updated Bloom filter containing all the URLs I’ve seen in the last few weeks, with a relatively high false-positive rate. Every few weeks, I’ll clear it out, and new URLs will accumulate for a few weeks. Now all my friends have a hint, whenever they see a URL, whether or not I also saw it. They can preferentially follow links that several of their friends thought were interesting.

Because I actually only visit a few hundred URLs per week, even a relatively low false-positive rate suffices to protect my privacy. If there are four billion URLs on the public web, and my Bloom filter’s false-positive rate is 0.1%, then four million URLs match the filter without me having visited them (in the last few weeks), while only a few hundred match it because I have visited them. So the fact that a particular URL matches my filter is pretty weak evidence that I have actually visited it. Practically speaking, a false-positive rate closer to 10% might be better, since some URLs are much more popular than others, so the prior probability might be high enough to matter.
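The arithmetic in this paragraph can be sketched as follows (all numbers are the illustrative ones from the text, and the uniform-prior assumption is the same simplification the paragraph itself qualifies):

```python
# Rough privacy arithmetic: how much evidence does a filter match carry?
public_web = 4_000_000_000      # candidate URLs on the public web
visited = 300                   # URLs I actually visited in the last few weeks
fp_rate = 0.001                 # filter false-positive rate

false_matches = (public_web - visited) * fp_rate
# Probability I actually visited a uniformly chosen URL, given that it matches:
p_visited_given_match = visited / (visited + false_matches)
print(int(false_matches))               # ~4,000,000 spurious matches
print(round(p_visited_given_match, 6))  # ~0.000075: very weak evidence
```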

A URL-recommending engine can accumulate these different Bloom filters over time and compare them against my browser history to see which ones are most closely correlated with it, in order to figure out which ones are the best predictors of where I want to go next. A naive Bayesian reasoner could simply compute the probability, given that a URL does or does not appear in a particular Bloom filter, that it appears in my history; then it can combine all these probabilities to come up with a predicted interest level. A smarter engine might discover pairwise or even more complex dependencies between the filters and discount those results — if Adam and Rohit always go to the same URLs, I’d like to not count their two filters as if they were independent.
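A naive Bayesian combiner along these lines might look like this (a sketch: sets stand in for Bloom filters, and the per-friend conditional probabilities are hypothetical placeholders that would really be estimated from my history):

```python
def predicted_interest(url, filters, p_hit_given_interesting, p_hit_given_not,
                       prior=0.01):
    """Combine per-filter hit/miss evidence under a naive independence assumption.

    filters: objects supporting `url in f` (Bloom filters, or sets in this toy)
    p_hit_given_interesting[i], p_hit_given_not[i]: hypothetical estimates of
    how often friend i's filter matches URLs I did / did not find interesting.
    """
    odds = prior / (1 - prior)
    for f, p_i, p_n in zip(filters, p_hit_given_interesting, p_hit_given_not):
        if url in f:
            odds *= p_i / p_n                    # hit: evidence for interest
        else:
            odds *= (1 - p_i) / (1 - p_n)        # miss: evidence against
    return odds / (1 + odds)

# Toy example with sets standing in for Bloom filters:
friends = [{"http://a/", "http://b/"}, {"http://a/"}, {"http://c/"}]
p_hit_int = [0.6, 0.5, 0.4]   # hypothetical: hit rate on URLs I liked
p_hit_not = [0.1, 0.1, 0.1]   # hypothetical: hit rate on URLs I didn't
print(predicted_interest("http://a/", friends, p_hit_int, p_hit_not) >
      predicted_interest("http://z/", friends, p_hit_int, p_hit_not))  # True
```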

The question remains how to acquire URLs to recommend in the first place. You can’t simply pull them out of the filters — that’s the point. You could look at the stream of links in HTML pages I look at, or the stream of links in a bunch of RSS feeds, or the output of Technorati, del.icio.us, Syndic8, or pubsub.com, or my email, IRC, and IM channels.

Then you have the question of how to present the results. You could display them on an HTML page, sorted with the most relevant links first, or you could rewrite HTML pages on the way in to my browser with icons or colors representing popularity.

## Related topics

Other useful things to do with a big pile of URL Bloom filters:

- given a seed URL and a list of candidate URLs, order the candidate URLs according to their co-occurrence with the seed URL.
- compute the pairwise correlations of the filters, perhaps with a list of URLs and perhaps without.
- cluster a list of URLs into clusters that tend to co-occur in the same filters — for example, to find the “principal components” of a pile of search engine results, so you can present a few URLs from each “component”.
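The first item might be sketched like this (sets stand in for Bloom filters; the scoring is a plain hit count among the filters that contain the seed):

```python
def rank_by_cooccurrence(seed, candidates, filters):
    """Order candidate URLs by how often they co-occur with the seed URL."""
    with_seed = [f for f in filters if seed in f]

    def score(url):
        # Number of seed-containing filters that also contain this URL.
        return sum(1 for f in with_seed if url in f)

    return sorted(candidates, key=score, reverse=True)

# Toy example with sets standing in for Bloom filters:
filters = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"d"}]
print(rank_by_cooccurrence("a", ["c", "b", "d"], filters))  # ['b', 'c', 'd']
```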

## Attacks

An attack due to Dan Gould: to find out whether someone has visited a particular URL, instead of testing that URL alone, test many URLs that tend to be associated with it — for example, pages it links to, or pages that link to it, or several pages in a sequence. Then test to see whether a statistically unlikely number of these URLs are “hits” in the filter — it will tell you whether the person has visited some subset of those URLs, but not which ones. But that may be enough.
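The attack reduces to a simple significance test (a sketch; the probe count, false-positive rate, and threshold below are made-up numbers, and the attacker is assumed to know the filter’s per-URL false-positive rate):

```python
from math import comb

def suspicious(filter_hits, n_probes, fp_rate, threshold=0.01):
    """Flag a filter whose hit count among n correlated probe URLs is
    statistically unlikely under the false-positive rate alone."""
    # One-sided binomial tail: P(at least this many hits by chance).
    p_chance = sum(comb(n_probes, j) * fp_rate**j * (1 - fp_rate)**(n_probes - j)
                   for j in range(filter_hits, n_probes + 1))
    return p_chance < threshold

# Probe 30 URLs correlated with the target page, 10% false-positive rate:
print(suspicious(filter_hits=10, n_probes=30, fp_rate=0.1))  # True: ~3 expected
print(suspicious(filter_hits=3, n_probes=30, fp_rate=0.1))   # False: chance level
```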

## Privacy-Enabling Collaborative Filtering

John Canny of UC Berkeley has a couple of papers on this: “Collaborative Filtering with Privacy” and “Collaborative Filtering with Privacy via Factor Analysis”. More recently, a group of researchers from Haifa published a paper titled “Privacy-Enhanced Collaborative Filtering” in the proceedings of PEP05.

## Bloom Filter Resources

- Math

Pei Cao has a good mathematical exploration of Bloom filters, including a quite useful table showing the false-positive rate under various m/n and k combinations. According to Cao’s calculations, when m/n = 4, k = 3 reaches the minimal false-positive rate of 0.147; for counting Bloom filters, “it seems that 4 bits per count would be amply sufficient”.

- Applications of Bloom filters

Andrei Broder (IBM Research) and Michael Mitzenmacher (Harvard) have a nice survey paper on Network Applications of Bloom Filters.

More recently, a paper by researchers from UT-Austin and IBM Research discusses how to refine web search results by removing near-duplicate documents, detected by using Bloom filters to compute document similarity.

Steve Bellovin (AT&T Labs Research) and William Cheswick (Lumeta) proposed a new search scheme based on Bloom filters and Pohlig-Hellman encryption.

- Coding Bloom filters and tricks

Maciej Ceglowski has a good post on Perl.com

Mark Fischer’s post on flipcode.com

- Notes on Bloom filters

The origin:

In his seminal 1970 paper “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Burton Bloom introduced an efficient hash-coding method with allowable errors, along with a formula for calculating the trade-offs among three computational factors: the allowable fraction of errors (false-positive rate), space (bits of hash area), and time (average time to reject a message as a member of the given set). This data structure came to be known as the “Bloom filter”.

The design of BF:

• Lee Gremillion, in his 1982 paper “Designing a Bloom Filter for Differential File Access”, demonstrates a simulation approach to the design of a BF: use simulation to determine the actual values of the different parameters (e.g., the size of the BF).

• James Mullin, in his 1983 paper “A Second Look at Bloom Filters”, proposes an analytic, probability-based approach to BF design. His work has been widely quoted in later literature on BFs.

• Ramakrishna, in his 1989 paper “Practical Performance of Bloom Filters and Parallel Free-Text Searching”, focuses on the issue of choosing the “right” transformations/hash functions. He examines a class of hash functions mapping A into B:

H_{c,d}(x) = ((cx + d) mod p) mod m, with H1 = {H_{c,d} | 0 < c < p, 0 < d < p},

where the keys are assumed to be integers drawn from a universe A = {1, 2, …, p − 1}, p a prime, and B = {0, 1, …, m − 1} denotes the range of hash addresses.

Based on the results of his simulations, he concludes that “by choosing hash transformations randomly from the class H1 the theoretical error rate can be achieved in practice”.
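Drawing a hash function from this class is easy to sketch (the prime p and table size m below are illustrative choices):

```python
import random

def make_hash(p, m, rng=random):
    """Draw H_{c,d}(x) = ((c*x + d) mod p) mod m at random from the class H1."""
    c = rng.randrange(1, p)   # 0 < c < p
    d = rng.randrange(1, p)   # 0 < d < p
    return lambda x: ((c * x + d) % p) % m

p, m = 2_147_483_647, 1024    # p prime (2^31 - 1), m the bit-array size
h = make_hash(p, m)
print(0 <= h(42) < m)  # True: hash addresses fall in B = {0, ..., m - 1}
```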

• The hash functions used should ideally be perfectly random, meaning that they map each item in the universe to a number uniform over the range {1, …, m}, where m is the length of the Bloom filter. Which hash functions to use in practice remains an open question; currently MD5 is widely used. (Also take a look at Rabin’s fingerprints; they appear to hash words.)

Different versions/variants of BF:

• Standard BF:

A Bloom filter for representing a set S = {x1 , x2 , . . . , xn } of n elements is described by an array of m bits, initially all set to 0. A Bloom filter uses k independent hash functions h1 , . . . , hk with range {1, . . . , m}. For each element x ∈ S, the bits hi (x) are set to 1 for 1 ≤ i ≤ k. A location can be set to 1 multiple times, but only the first change has an effect.

To check whether an item y is in S, we check whether all hi (y) are set to 1. If not, then clearly y is not a member of S. If all hi (y) are set to 1, we assume that y is in S, although we are wrong with some probability (the false-positive rate).

Some cool properties of BF:

1. A Bloom filter that represents the union of two sets can be obtained by taking the OR of the two bit vectors of the original Bloom filters.

2. Bloom filters can easily be halved in size, allowing an application to dynamically shrink a Bloom filter. Suppose that the size of the filter is a power of 2. To halve the size of the filter, just OR the first and second halves together. When hashing to do a lookup, the highest order bit can be masked.

3. Bloom filters can also be used to approximate the intersection between two sets. The inner product of the two bit vectors can be used as a measure of their similarity.
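These three properties can be demonstrated directly on bit vectors; a sketch using Python integers as bit vectors (the example filters are made up):

```python
# Bit-vector tricks corresponding to the three properties above.
m = 16                      # filter size, a power of 2

a = 0b0000_0101_0010_0001   # Bloom filter bits for set A
b = 0b0000_0100_1000_0001   # Bloom filter bits for set B

# 1. Union: OR the two bit vectors.
union = a | b

# 2. Halving: OR the first and second halves together; on lookup,
#    mask off the highest-order address bit.
half = (a >> m // 2) | (a & ((1 << m // 2) - 1))

# 3. Intersection estimate: the inner product of the bit vectors is
#    the number of common 1 bits, a rough similarity measure.
similarity = bin(a & b).count("1")
print(bin(union), bin(half), similarity)  # similarity is 2
```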

• Another version of standard BF:

Each hash function has a range of m/k consecutive bit locations disjoint from all the others. The total number of bits is still m, but the bits are divided equally among the k hash functions. According to Andrei Broder and Michael Mitzenmacher’s survey paper, this version has the same asymptotic performance as the one above. Although its false-positive rate is slightly higher, this version may be useful for implementation reasons: for example, dividing the bits among the hash functions makes it easier to parallelize the array accesses.

The minimum value of the false-positive rate f is reached when p, the probability that a given bit is still 0, equals 1/2, or equivalently when k = ln 2 · (m/n). In this case the false-positive rate f is (1/2)^k ≈ (0.6185)^(m/n). In practice, k must be an integer, and a smaller, sub-optimal k might be preferred, since this reduces the number of hash functions that have to be computed. They also derive that the lower bound on the false-positive rate f is (0.5)^(m/n).
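For concreteness, here is a sketch tabulating the standard approximation f ≈ (1 − e^(−kn/m))^k around the optimal k for m/n = 8:

```python
from math import exp, log

def false_positive_rate(m, n, k):
    """Standard approximation f = (1 - e^(-kn/m))^k."""
    return (1 - exp(-k * n / m)) ** k

m_over_n = 8
k_opt = log(2) * m_over_n               # ~5.55; must be rounded to an integer
for k in (4, 5, 6, 7):
    print(k, round(false_positive_rate(m_over_n, 1, k), 4))
# The minimum sits near k = 5 or 6, close to (0.6185)**8 ≈ 0.0216.
```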

• Counting Bloom Filter:

If you have a set that is subject to change, inserting a new element into a BF is easy (hash it k times and set the corresponding bits to 1), but deleting an element is tricky. You cannot simply set the k bits resulting from hashing the element back to 0, because the same bits may have been set by other elements.

To prevent this problem, Li Fan et al. proposed the idea of counting Bloom Filter, where each entry in the Bloom filter is a small counter instead of a single bit. When an item is inserted, the corresponding counters are incremented; when an item is deleted, the corresponding counters are decremented. To avoid counter overflow, we choose sufficiently large counters (4 bits per counter is usually large enough).
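A sketch of the idea (the hashing scheme here is illustrative; real implementations use fixed hash functions and pack the counters into 4 bits each):

```python
class CountingBloomFilter:
    """Bloom filter variant supporting deletion via small per-slot counters."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indices(self, item):
        # Illustrative hashing: Python's hash(), salted per function index.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indices(item):
            self.counters[i] += 1   # real implementations cap at 15 (4 bits)

    def remove(self, item):
        for i in self._indices(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indices(item))

cbf = CountingBloomFilter()
cbf.add("http://example.com/")
print("http://example.com/" in cbf)  # True
cbf.remove("http://example.com/")
print("http://example.com/" in cbf)  # False
```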

• Compressed BF:

Michael Mitzenmacher (Harvard) proposed compressed Bloom filters, which improve performance when BFs are passed as messages. In our case, a BF is not only a data structure containing a user’s history of URLs but also a message being published and transmitted over the Internet, so the idea of compressing the BF as a message (to reduce its transmission size) is a natural fit. The quantity to optimize is then not the in-memory size of the BF but its compressed size: a BF is usually sparse, so it compresses well. In his 2002 paper “Compressed Bloom Filters”, Mitzenmacher shows that a larger but sparser Bloom filter can yield the same false-positive rate with a smaller number of transmitted bits; the tradeoff is between the size of the BF and the transmission size. Alternatively, one can keep the same number of transmitted bits but improve the false-positive rate, or find a more suitable tradeoff between the two.
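The tradeoff can be sketched with an ideal (entropy-limited) compressor standing in for the coding Mitzenmacher analyzes; the two designs below are chosen to have roughly the same ~2% false-positive rate:

```python
from math import exp, log2

def false_positive_rate(m_over_n, k):
    """Standard approximation f = (1 - e^(-kn/m))^k."""
    return (1 - exp(-k / m_over_n)) ** k

def entropy_bits_per_element(m_over_n, k):
    """Ideal compressed size: m/n bits per element, each bit 1 with
    probability p = 1 - e^(-kn/m), costing H(p) bits after compression."""
    p = 1 - exp(-k / m_over_n)
    h = -p * log2(p) - (1 - p) * log2(1 - p)
    return m_over_n * h

# Dense design: m/n = 8, k = 6 (near-optimal k, bits ~half ones).
print(false_positive_rate(8, 6), entropy_bits_per_element(8, 6))    # ~8.0 bits/elem
# Sparse design: m/n = 46, k = 1 — bigger in memory, mostly zeros.
print(false_positive_rate(46, 1), entropy_bits_per_element(46, 1))  # ~6.9 bits/elem
```

Both designs sit near a 2% false-positive rate, but the larger, sparser filter costs fewer bits after ideal compression, which is exactly the effect the paper exploits.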