CommerceNet Workshop on Decentralized Commerce


20 June 2005


Wharton West, San Francisco


co-sponsored by CommerceNet and Supernova

The world of e-commerce today is dominated by centralized business models, blown up to immense proportions by the reduced friction and increased velocity of the Internet: the world’s largest bookstore, largest garage sale, largest music store, largest advertising agency, and so on. The next intriguing phase supports decentralized commerce, allowing myriad small players across entire industries to connect directly.

Contents


Program

See [1] for updated information from the Supernova website.

Prediction Markets

8:30 AM – 12 PM: Prediction markets provide more reliable forecasts of election results than polls. They can also be used to improve forecasts of sales levels and project completion dates, or to gather more detailed data on consumer preferences than focus groups provide. We’ll discuss how they are being applied in business contexts to incorporate divergent views into business plans. We’ll introduce the idea and its history in the first session, then proceed to current examples and tools, and discuss the future of this new approach.

  • Session I: Foundations
    • David Pennock, a researcher at Yahoo! Labs who is currently running the large-scale Tech Buzz market experiment in conjunction with O’Reilly.
    • Robin Hanson, the inventor of the term Idea Futures, will be speaking on the latest developments in prediction markets.
  • Session II: Tools & Applications
    • Bernardo Huberman, the editor of the seminal text on market-oriented architecture, ‘The Ecology of Computation’, will be speaking about his team’s latest work at HP Labs.
    • Emile Servan-Schreiber, President and CEO of NewsFutures, an independent company providing prediction markets to corporations and media websites.
    • Walter Yuan, Technical Manager of the Social Science Experimental Laboratory (SSEL) at Caltech, will talk about their new jMarkets software.

As background, please see the recent success of the DIMACS Workshop on Markets as Predictive Devices (our notes).

Microformats

1 PM – 2:30 PM: Microformats are a set of simple, open data-format standards for more and better structured blogging and for web microcontent publishing in general. In contrast to dedicated XML schemas, microformats for calendar events or business cards are familiar XHTML with a light layer of semantic markup on top. This session will bring together top speakers from several leading blogging companies to unveil support for the latest microformats to join the stable. We’ll have some very exciting news to share there!

  • Introduction: Why CommerceNet sponsors Microformats.org. Rohit Khare & Adam Rifkin.
  • Tutorial: What are Microformats? Tantek Çelik.
  • Panel: Implementations & Implications
    • Michael Sippey, Six Apart
    • Kevin Marks, Technorati / Tag Tuesday
    • Kragen Sitaker, CommerceNet zLab
    • Marty Tenenbaum, CommerceNet
    • Tantek Çelik, Technorati / Microformats.org

Co-located Sessions

2:30 PM – 4 PM : Chris Anderson, the writer who recently unleashed the meme of the Long Tail of Electronic Commerce.

4 PM – 5:30 PM: “If, as the Cluetrain Manifesto declares, markets are conversations, how can strategists and marketers effectively tap into those conversations in an authentic way?”

Call for Participation

Some portion of this shift is due to technological changes, such as service-oriented architectures (SOA) and business services networks (BSNs). A larger measure of credit, though, goes to cultural changes that encourage creative new mechanisms for solving problems whose answers are scattered across many people in different organizations.

Prediction markets are one example: by creating artificial financial securities that pay off under specified conditions, communities can aggregate the dispersed knowledge pertaining to forecasting quarterly sales or product development schedules. Another example is the innovative use of Information markets to gauge demand for different combinations of features for products not yet built. Web advertising is a third: now that some giants have shown that auctioning keyword advertising a nickel at a time can add up to billions of dollars, new entrepreneurs are developing new ad networks that free independent content providers to control the content of ads that appear next to their copy and to share revenues in new ways.

This workshop is being co-sponsored by CommerceNet at Wharton’s West Coast Campus as part of Supernova, the renowned executive conference on decentralization in the high-technology industry at large. It will be an intimate, interactive setting to discuss the challenges and opportunities facing innovative models for decentralized commerce.

Submission Information

Submissions are now closed. For more information, please contact Chris Hibbert for markets-related initiatives, and Rohit Khare for other aspects, including decentralized marketing and advertising.

Registration

Workshop registration, fees, and lodging information can be found at the Supernova site.

by Rohit Khare; see also Publications

A new kind of information space: the Personal Web.

Two-player economic model: the user (god) spends attention on “stuff”; the robot fetches and organizes stuff to maximize user-wealth within constraints on storage, bandwidth, and archiving requirements (such as ‘keep everything directly to or from me’).

The robot can find other bots to trade with. The robot can only make promises of effort it will undertake in the future, so an additional constraint on “good” players is that they not over-commit – but the problem is determining the credibility of those outsiders to begin with. In other words, everyone is born with a worthless currency, but might be willing to extend credit denominated in it.

So what is there to trade? The system is initially quiescent. The first change is that someone’s user says something. That utterance is now a bargaining chip that the robot can use – modulo the costs of transporting it (mainly opportunity costs of abandoned latency; in theory, you can always buy more bandwidth).

The immediately consequent question is, how does the “buyer” robot know how much to pay for this new item? We could turn this system on its head and say that the only form of progress is for buyers to make offers first, and that merely sending bits is anti-social and risks shutting off your credit.

Suppose the basic HELO message is “my name is robot-# and I’m interested in <matcher>s” where <matcher> is a list of URLs, or public-keys (to check signatures), or a regular-expression, or a prefix of a hash of a URL or a bloom-filter, or…? Well, that’s not quite enough, I’d like to give a sense of the relative value of those predicates, so it might be a matcher-price vector (btw, is an item that matches multiple predicates worth the sum? The max?).
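
To make this concrete, here is a minimal Python sketch of what such an announcement might carry; the field names, the matcher kinds, and the take-the-max pricing rule are illustrative assumptions, not a defined protocol.

# Hypothetical sketch of a HELO / demand announcement. Field names and
# matcher kinds are illustrative assumptions, not a defined protocol.
from dataclasses import dataclass
from typing import List

@dataclass
class Matcher:
    kind: str    # e.g. "url-list", "pubkey", "regex", "hash-prefix", "bloom"
    value: str   # the serialized predicate itself

@dataclass
class Offer:
    matcher: Matcher
    price: float  # in XY-bucks: what X will credit Y for a matching item

@dataclass
class Helo:
    robot_id: str
    offers: List[Offer]  # the matcher-price vector, most valuable first

    def price_for(self, item_matches) -> float:
        """Price for an item that matches several predicates: here the max
        (one could argue for the sum instead); item_matches is a predicate
        over Matcher objects."""
        prices = [o.price for o in self.offers if item_matches(o.matcher)]
        return max(prices, default=0.0)

# Example: robot-42 will pay 3 XY-bucks for items signed by a given key
# and 1 XY-buck for anything whose text matches "recipe".
hello = Helo("robot-42", [
    Offer(Matcher("pubkey", "ed25519:ab12..."), 3.0),
    Offer(Matcher("regex", r"recipe"), 1.0),
])
print(hello.price_for(lambda m: m.kind == "regex"))   # -> 1.0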

But since time is the enemy – latency being the variable to minimize while maximizing user-wealth – there’s a countervailing impulse to keep it short and snappy; you might want to guess what parts of your “order book” are most worth revealing to that other robot. Similarly, we’d probably need a built-in mechanism for deferred transmission: “I’ll accept any item up to 1K, but longer than that please substitute a URL where I can go fetch the content myself once I decide to.” But then again, that can be safely ignored for now as a ‘mere matter of representation’ – there are lower-level protocols to rely on to ration bandwidth amongst competing concurrent item downloads.

So the price is denominated in a pairwise currency that, I believe, has a lifespan longer than a TCP session (thus, the robots need longer-lived identities). That pairwise currency, “XY-bucks”, represents some unit of effort Y will expend on behalf of X in some distant future (YX-bucks being the reverse). This means that for robot-X to make a wise decision about where to invest, it will need very sophisticated models of the “inflation rate” of these other currencies; the level to which there may or may not be inter-market learning about the reliability of XY-bucks based on rumors of the value of XZ-bucks and ZY-bucks; and more basic risk-management principles limiting how many XY-bucks X trusts to leave ‘on deposit’ in the ledger of the Central Bank of Y.

So here’s the game:

Robot X serves a god, user U. 

Robot X is endowed with a finite amount of storage, a finite bandwidth to the internet, and faces a vector of constraints of the form <other-robot-ip-address, max-bandwidth, avg-latency>. 

User U gives Robot X a vector of <matcher, utility> orders that indicate U's current demand for information. 

Robot X can connect to any other Robot Y and announce a (portion of the) demand vector of <matcher, XY-utility> that indicate X's current demand. 

Robot X can send any item of information to Robot Y; if and only if that item passes a matcher test previously announced by Y, it can assume it gained YX-bucks of utility as previously offered (and, yes, there needs to be a periodic announcement of balances to prevent misunderstandings accumulating - and, in fact, someday the Central Bank of Y may be well-served to publish a complete accounting of all of its balances to all of its peers to show that it is not inflating away the currency to worthlessness - because while there are notionally separate YX and YZ currencies, it seems that the money supply of Y is highly correlated in those markets - it's limited by Y's users' utility at best.)

And that's about it - though there are surely wrinkles around the edges like how X can "disconnect" from Y (or, more effectively, firewall its packets out from consideration entirely); or how Y can warn X of what languages it will accept <matchers> in; and so on. 

So to reduce this problem into more practicable form, assume that all bandwidth limits are expressed as bin sizes per unit time (say, 1 second). Also, for now let's assume that the time it takes to evaluate a <matcher> on an <item> is negligible - though that could later be allocated agorically too. 

For each timeslice {
    Make_most_profitable_move()
}

which could be refined into:

While (clock-is-ticking) {
    // constraints:
    update_total_bandwidth();
    update_pairwise_bandwidth();
    update_pairwise_exchange_rate_estimates();
    update_user_demand_curve();

    foreach (trader T) {
        response_basket = pick_items_from_inventory_with_max_price(T's_demand_curve, T's_max_bandwidth);
        request_basket  = pick_items_from_demand_curve_with_max_price(U's_demand_curve, T's_max_bandwidth);

        // decision: do I want to stack up more XT-bucks or spend them?
        if (Receivables[T] - price_of(request_basket) >
            Payables[T] - price_of(response_basket) * exchange_rate_estimates[T])
        then proposed_action = response_basket;
        else proposed_action = request_basket;
    }

    execute(solve_bin_packing_problem(all_proposed_actions, total_bandwidth));
}

Scenario

Vicki, a vegetarian blogger, is a controversial figure, avidly read only by a small community of professionals and activists. Occasionally, though, she publishes a tasty recipe that gets widely syndicated around the Web.

Ralph, a cook, only blogs about interesting recipes. He has a much broader audience because he test-cooks and reviews recipes of all sorts, from all kinds of diets.

Lou, a nutritionist, primarily advocates low-carb diets and only occasionally considers vegetarian recipes, and then mainly to mock them.

Tony’s Tomatoes, an advertiser, wants to entice readers to his site. He’s willing to pay a commission on his sales. To approximate that effect, he’s willing to pay (at most) P cents per click-through.

PotatoHead, another advertiser, has an informational site advocating the health-benefits of carbohydrates in tuber form. It’s an ‘astroturf’ blog run by a lobbying group, Potato Farmers Underground. This organization is willing to pay-per-impression to promote their brand.

Keyword Advertising

Keyword-driven “Sponsored Search Auctions” are the norm today. These permit centralized ad networks to dynamically insert ads alongside publishers’ contents (or, as one satirical site would have it, replace the content entirely).

The primary (publicly-acknowledged) criteria for insertion are:

  1. Keyword matching against previously-crawled & indexed text of the page. (n.b. the ad keywords need not occur in the text of the ad);
  2. Click-Through Rate (CTR), as a proxy for how effective/helpful the ad copy is; and
  3. Prices, typically according to a second-price sealed-bid auction.
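
As a concrete reading of these three criteria, here is a small Python sketch of ranking by expected revenue (bid times estimated CTR) with a generalized-second-price style payment; it is my own illustration of the mechanism described above, not any ad network’s actual code, and the bids and CTRs are invented.

# Illustrative ranking by expected revenue (bid * estimated CTR) with a
# generalized-second-price payment rule; the numbers are invented.

def rank_ads(candidates):
    """candidates: list of dicts with 'name', 'bid' ($/click), 'ctr' (estimate)."""
    ranked = sorted(candidates, key=lambda a: a["bid"] * a["ctr"], reverse=True)
    results = []
    for i, ad in enumerate(ranked):
        if i + 1 < len(ranked):
            nxt = ranked[i + 1]
            # smallest bid that still beats the next ad's expected revenue
            price = nxt["bid"] * nxt["ctr"] / ad["ctr"]
        else:
            price = 0.0  # or a reserve price
        results.append((ad["name"], round(price, 4)))
    return results

print(rank_ads([
    {"name": "Tony's Tomatoes", "bid": 0.50, "ctr": 0.04},
    {"name": "PotatoHead", "bid": 2.00, "ctr": 0.002},
]))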

Obviously, this does not account for taste: the model does not treat advertising as an endorsement by the content-publisher. While in traditional media, this sort of “church/state separation” between Editors and Publishers is a hallowed goal, it can be counterproductive when taken to extremes. Ultimately, editorial content cannot be separated from advertising content.

Today’s systems include black-lists and white-lists that let publishers limit placement to acceptable advertisers, but these mechanisms are quite clumsy. The fundamental “flaw” is that sponsored search auctions seek only to maximize revenue, not relevance.

Some Problems with Sponsored Search Auctions (SSA)

Ralph is pretty much the only happy player: he panders to the broadest possible audience and gets paid roughly in proportion to traffic he aggregates.

  1. PotatoHead is out of luck. Their boring brand advertising doesn’t get clicked-through on very much, so they have to set a cost-per-click-through that’s dramatically high to compensate for that.
  2. Lou is upset. Unfortunately, setting such a high price for potato ads means that even the most marginal attack on tubers from Lou’s articles will summon up a “Tasty Spuds!” offer to his readers.
  3. Vicki is paid too little. She doesn’t make as much money as she ‘deserves’ to: her recipes get read far and wide, but when her tomato-and-potato-pie shows up on Ralph’s site, she gets nothing.
  4. Tony’s Tomatoes pays too much. When tomato ads show up on Lou’s site, his carnivorous readers just click through for fun — they know they’re running up Tony’s bills without any intention of buying.

Some Opportunities for Sell-Side Ads (SSA)

This scenario is contrived to highlight a few underlying issues: click fraud, keyword mismatch, transactional vs. brand advertising… but most of all, the lack of informed consent between publishers and advertisers.

Now, the virtue of Web advertising today is its scalability; it’s absurd to expect an independent publisher to review millions of ads, or an independent advertiser to select from millions of publishers. However, there are other techniques for harnessing the ‘wisdom of crowds’ — a workable decentralized solution suggests sharing reputation and tags.

Even before tackling scalability, though, we should consider the “how society works” part of our motto to “build software that works the way society works.” In real life, Vicki is a very important person and ought to be well-compensated for her effort. That’s because many readers like Ralph are so moved by her work that it informs their own intellectual efforts: Vicki has influence.

Transferring Power to Sellers

Placement. In an ad auction, the process of matching ads to content is largely automated. The choice of which ad will be placed is only partially determined by a publisher’s choice of keywords (itself a weak proxy for intent). To move the power to the player that’s selling ad space, publishers ought to directly control which advertisements are placed on their sites.

Pricing. In an ad auction, the actual clearing price of a given ad is determined by the competitive actions of other advertisers. If payoffs are instead calculated from a fixed budget divided by the ultimate number of hits, the price of an ad is determined by the competitive actions of other publishers. This is reminiscent of pari-mutuel betting (in both its strengths and weaknesses).
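
A minimal sketch of that pari-mutuel-style pricing, assuming the advertiser commits a fixed budget and each publisher is paid pro rata for the hits it delivered (the names and numbers are invented):

# Pari-mutuel style payout sketch: a fixed advertiser budget is split among
# publishers in proportion to the hits each delivered, so the effective
# price per hit depends on the other publishers' volume.

def parimutuel_payouts(budget, hits_by_publisher):
    total_hits = sum(hits_by_publisher.values())
    if total_hits == 0:
        return {pub: 0.0 for pub in hits_by_publisher}, 0.0
    price_per_hit = budget / total_hits
    payouts = {pub: hits * price_per_hit
               for pub, hits in hits_by_publisher.items()}
    return payouts, price_per_hit

payouts, price = parimutuel_payouts(
    budget=100.00, hits_by_publisher={"Vicki": 8000, "Ralph": 12000})
print(payouts, price)  # every hit ends up worth half a cent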

The problem that emerges from these two power shifts, however, is that such a scheme is completely impractical at scale.

Instead, what if we could decentralize the task of checking which ads are reputable for each publisher — what if, rather than wading through the vast universe of possible ads for ingredients in all of Ralph’s recipes, he could just wait for Vicki to sort them out and recommend just the right ad to run when he syndicates her recipe?… and of course, pay Vicki for her trouble, say 20% of his revenue?

Our Demo Storyline

  • Tony’s Tomatoes would like to limit their financial exposure to a fixed amount over a fixed period of time. This implies the price depends on the supply of competing publishers’ ad space (which is by no means a homogeneous supply curve — each publisher has different opportunity costs of running other ad campaigns).
  • Tony offers Vicki first crack at running this ad. If she agrees, she adds the ad to her feed.
  • Ralph quotes Vicki’s blog and also decides he can sell more tomatoes, too, and copies the ad to his site as well. Now, he’s getting a share of the ultimate payout for the ad — but Vicki gets credit, too. The referral ‘tax’ encourages Vicki to get others to run the ad for her.
  • Lou, who does not want to endorse vegetables, can simply run veal ads instead.
  • PotatoHead, who wants brand advertising to a broad audience, can directly approach Ralph (or vice versa) and propose an effective-CPM price.

Obvious flaws? [Note: this is not strictly true with a fixed budget: a dishonest Vicki would run the ad precisely once and hide it from the world… Vicki needs to be in competition with other “roots,” too]

What the server does

Unlike content that gets cut-and-pasted and spindled-and-mutilated, an ad must be served up identically each time. Thus, an advertiser’s first preference would be to run their ‘own’ ad server. The server also has to provide a browseable catalog of current ad campaigns to join. This is an admittedly centralized component, and could only be eliminated by a widely-trusted peer-to-peer web hosting service with hit tracking.

Our server needs to credit each ‘hit’ for serving an ad (and potentially, for each click-through, too) to the proper publisher’s account. It also needs to create new “downline” ads when new publishers join the network to trace their provenance.

Finally, the server needs to implement one or more payoff algorithms to allocate the pot amongst the participating publishers. At a 0% referral fee, this becomes a cost-per-click/hit model; at 100% you have Ross Mayfield’s extreme CPI model where only bloggers who get quoted get paid. In between, we look forward to experimenting with varying ratios; another proposal of Ross’s is to use 1/PageRank as the rate…
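
To make the referral split concrete, here is a toy Python sketch in which each publisher keeps most of the revenue credited to its own hits and passes a share up the syndication chain to whoever it copied the ad from; the 20% rate and the account names are assumptions for illustration.

# Toy referral-"tax" allocation: each publisher keeps (1 - rate) of the
# revenue credited to its own hits and passes the rest up the chain to the
# publisher it copied the ad from. The 20% rate is an assumed example.

def allocate(revenue_by_publisher, upstream, rate=0.20):
    """upstream maps each publisher to the one it syndicated the ad from
    (None for the root); returns total payout per publisher."""
    payout = {pub: 0.0 for pub in revenue_by_publisher}
    for pub, revenue in revenue_by_publisher.items():
        share, node = revenue, pub
        while node is not None:
            keep = share if upstream[node] is None else share * (1 - rate)
            payout[node] += keep
            share -= keep
            node = upstream[node]
    return payout

# Ralph syndicated the ad from Vicki, who got it directly from the advertiser.
print(allocate({"Vicki": 40.0, "Ralph": 60.0},
               {"Vicki": None, "Ralph": "Vicki"}))
# Vicki keeps her own $40 plus 20% of Ralph's $60.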

Applications

What could you do with such a device?

  • send email to other users of similar devices, as voice (perhaps encoded at 16 kilobits per second).
    • essentially this lets you talk to friends and family even when they’re not nearby.
  • send it over the Internet as well, perhaps using Bluetooth to a cellphone.
  • keep notes to yourself — prices, to-do items, etc.
  • compose and publish vocal essays, prayers, songs, etc.
  • listen to the essays, etc. of others
  • be notified of events happening out of your sight but within multihop radio range
  • use voice synthesis software to read existing text, such as email or ebooks
    • a small camera with OCR would help here

Most of these applications would run over a dynamically-routed mesh network that was mostly disconnected most of the time, so they would probably need to do store-and-forward flood-fill — transmit each piece of information to many more nodes than the shortest path, in hopes that one of them will eventually be able to relay it to its destination.

The rest of this page tries to estimate the feasibility of such a device.

Existing similar devices and related work

Perhaps the machines built for the Berkeley Motes project (the Intel iMotes and the Atmel ATmega-based MICA motes from Crossbow) and similar sensor-network projects could be adapted for personal-terminal use. In particular, the MICA motes’ CPUs consume 24 milliwatts when running at 4MHz and 50 microwatts when asleep; they have 512K of flash, measure 25x25x6mm (not including battery), and include a 40-kbps radio (30mW receive, 75mW transmit) and a 10-bit ADC.

Jason Hill built a smaller Berkeley mote version called Spec that’s a 2×2.5mm ASIC (in .35-micron), requiring only external oscillator, power, and antennas; the materials cost of that whole system is estimated at $0.62 plus antenna, and it includes:
an AVR-like RISC core… 3K of memory, 8 bit On-chip ADC, FSK radio transmitter, Paged memory system, communication protocol accelerators, register windows, 32 Khz oscillator, SPI programming interface, RS232 compatible UART, 4-bit input port, 4-bit output port, encrypted communication hardware support, memory-mapped active messages, FLL based frequency synthesizer, Over-sampled communication synchronization,…

Power sources

There are several possibilities here: disposable batteries, mechanical generators, grid power, solar cells, and car-battery (or motorcycle-battery) power. Since this is a popular problem, there are several other solutions on the horizon, including methanol-powered fuel cells, body-heat-powered thermocouples, piezoelectric ambient vibration harvesters, MEMS internal-combustion engines, and beta-particle-harvesting nuclear “batteries”.

Disposable batteries

These are widely available and can supply a lot of power — AA cells can supply 100mA to 500mA (according to eyespyvideo’s AA and AAA battery tests; otherpower confirms they can supply well over 60mA; Duracell claims its alkaline AA batteries can supply around 300 mA at 1.1V) but cost a substantial amount of money — comparable to the up-front device cost. One of the sales-points of the Freeplay wind-up radio is that it doesn’t have expensive removable batteries that can be stolen or hoarded by higher-ranking family members.

Connections to disposable batteries are very simple and cheap, costing a few cents, but they suffer from mechanical wear and breakage.

Mechanical generators

Trevor Baylis’s Freeplay foundation’s Lifeline radio has a wind-up spring that powers an electrical alternator to power the radio. Even more than disposable batteries, these can practically supply large amounts of power, up to several watts; however, they also wear out and break, unlike the other power sources discussed here.

I don’t know how much useful mechanical generators cost, but I think they can be manufactured for $1 or less, since they are simply electrical motors. However, the Freeplay radios’ pricing seems to start at $60, or maybe $52.

You could even consider putting only the coil of the generator on the device, perhaps rectified with a germanium bridge, and allow users to charge it by changing the magnetic flux in its environment. Any sufficiently-strong permanent magnet would work. This design has no moving parts; a variant of it is the Forever Flashlight, which contains the permanent magnet inside of it. While the magnet is a moving part, its surface can wear a great deal without making it less useful, rendering this system essentially maintenance-free, if heavy and noisy.

Grid power

In places where grid power is available at least once a week or so, it’s possible simply to charge the device’s internal batteries from it then. However, this requires power-supply electronics to step down the mains power to the low voltages used by computing devices, which cost money — I don’t know how much — and does limit its applicability and convenience. Also, it may impose a tight power budget, since charging happens only rarely.

Solar cells

A half-square-foot solar cell provides 1.8 watts and costs $57.

0.2mm-thick PowerFilm amorphous power cells supply 22 mA at 3V in 50x37mm, or 25 mA in 100x25mm, or 50mA in 100x37mm, or at 6V, supply 100mA in 100x150mm. These numbers are all around 30-40 microwatts per square millimeter, and all refer to direct sunlight. Also, they seem to weigh about 8 milligrams per (peak) milliwatt, and cost about $4 plus 5.6 cents per peak milliwatt ($8.89 for the 22mA/3V, $7.89 for the 25mA, $13.89 for the 50mA, and $33.89 for the 6V/100mA). I’m guessing that the $4 per item cost is mostly the cost of buying items in quantity 1 for retail. Other vendors offered 3V/50mA 114x37mm thin-film panels for $2.75 in quantity, which is about 1.8 cents per peak milliwatt, as of five years ago, or 3V/50mA at $5.95 in quantity one.

Since the solar cells only charge during the day, they would put the device on a tighter power budget than the other power-source candidates, except possibly grid power — 150mW solar peak permits perhaps 25mW average usage.

New flexible solar cells producing almost three times as much energy are in development at the University of Toronto, but designing a device based on these is risky.

Car or motorcycle batteries

In much of the world, car and motorcycle batteries are much more ubiquitous than grid power (e.g. Design that Matters claims that “the most common source to be found in Mali is a 12VDC lead acid battery, typically used in car/truck or motorcycle applications”), and they are also closer to the low-voltage DC power needed by portable electronics. Perhaps connecting to one of these batteries periodically could serve to recharge such a device with minimal additional charging circuitry — perhaps a cheap DC-DC converter.

In a digital-pony-express system like the Cambodian Motoman system , email moves from town to town by motorcycle anyway; in the absence of any other power source, you could get your charge at the same time as your email, particularly if you can charge quickly.

CPU

There are many CPUs that can operate under 25 milliwatts at a few MIPS; Atmel’s AVR-based FPSLIC (AT94K05/10/40AL) claims 100 microamps standby plus 2-3 milliamps per MHz, at close to one MIPS per MHz, at 3.0-3.6V — so that’s 0.3 milliwatts plus 6-10 milliwatts per MIPS, plus a 5K-40K-gate FPGA, in a 16x16mm TQFP package.

FPGAs might be interesting for audio applications if they can do codecs more power-efficiently than a CPU, which seems plausible. Certainly codec ASICs use much less power than CPUs running the same codecs in software.

FPGA-less CPUs like the ATtiny11 can use still less power: the ATtiny11 runs at 4MHz on 6.6mW, idles at 1.5mW, has a power-down mode of under 3 microwatts, and can be throttled up to 8MHz, all in an 8-pin PDIP/SOIC. These CPUs cost around $1.50 in quantity one from Jameco.

The CPUs mentioned so far are rather cramped; they’re amply fast enough to handle basic control applications, but their RAM is tiny. It seems that it’s hard to find CPUs with more than a few kiB of on-die RAM; but moving up to the high end, Atmel’s ARM920T-based, feature-packed AT91RM9200 still uses only 60mW at 200MIPS, in a 31x31mm package. (I can’t find pricing data on it, though.) The similar EP9302 from Cirrus is supposed to cost $10 in volume, but uses dramatically more power, around 500mW. Other low-power Atmel ARM microcontrollers cost $9-$25 each, even in volume, from Digikey; the cheapest ($9) is the AT91M40800-33AI-ND, which does 40MHz; static power consumption ranges from 12.5 to 250 microamps at 3.6V, so under 1mW (possibly dramatically under), and dynamic consumption is around 1-5 milliwatts per MHz, plus up to about another milliwatt per MHz for peripherals. It’s been available since 2000, so I suspect that today’s CPUs could do considerably better than 1 MIPS per milliwatt, perhaps 4 MIPS per milliwatt. But maybe they would cost more.

Ideally you’d be able to run off-the-shelf software for as many features as possible; I think this MMUless ARM CPU can run uCLinux, but I’m not sure.

Energy storage

The quantities of energy required to be stored for such a device might occupy a wide range; solar pagers might need to store only enough energy for night operation, which might be quite minimal, while pagers powered off the grid might need to store enough energy for a week’s operation, and at the extreme end, a device the user can re-power by shaking (like the Forever Flashlight) might only need a few minutes of power consumption. It appears that 25mW power consumption when active may be a feasible power budget; perhaps the devices might be active for an hour or two per day, yielding 2mW average power consumption, plus idle, which might be another 2mW for a total of 4mW, or 1.3 mA at 3V.

So an energy storage system might usefully store anything from 1 mAh (if the user can easily add more energy) to 20 mAh (for a solar-powered system) to 110mAh (for a weekly-grid-charged system).

These numbers are on the very low end for rechargeable batteries; in fact, it’s down in the range where carbon aerogel “supercapacitors” can work. Commonly these cost less than $1, can be charged to 2.8V, and have a capacitance of a farad or so, which means they hold one coulomb (ampere-second) of charge per volt. An ampere-second is about 280 µAh, so while discharging from 2.8V to 2.0V, such a capacitor delivers about 220 µAh, or 530 µWh.
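
A quick arithmetic check of those figures, assuming (as above) a 1 F capacitor discharged from 2.8 V to 2.0 V:

# Arithmetic check for a 1 F capacitor discharged from 2.8 V to 2.0 V.
C, V1, V2 = 1.0, 2.8, 2.0                # farads, volts

charge_As = C * (V1 - V2)                # 0.8 ampere-seconds (coulombs)
charge_uAh = charge_As * 1e6 / 3600      # ~222 microampere-hours
energy_J = 0.5 * C * (V1**2 - V2**2)     # ~1.92 joules
energy_uWh = energy_J * 1e6 / 3600       # ~533 microwatt-hours

print(round(charge_uAh), round(energy_uWh))   # -> 222 533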

Capacitors weigh more, cost more, and take up more space per unit of energy storage than batteries; their advantages are that they can be charged very quickly (typically they can be charged at 30A, so they can be fully charged from empty in 93ms) and they can survive more charge cycles than a battery (they are commonly used in applications where they are partly charged and discharged tens or hundreds of times per second, and survive for many years.) Clearly the ability to fully charge in well under a second is more important if you’re charging from an itinerant motorcycle than if you’re charging from a solar cell.

(to do: insert actual cost and size figures in this section)

Volatile memory

I think DRAM is far too power-hungry for such a device, so static CMOS RAM may be the only usable alternative. A few megabytes would be nice in order to support a wide range of applications, but only a few kilobytes are needed for the basic networking and voice-mail logic. However, if the CPU is running audio codecs, it will probably need at least a few hundred kilobytes to support that.

(to do: insert actual cost, power consumption, and size figures in this section)

Nonvolatile Memory

There are only two presently feasible alternatives for nonvolatile memory in such a cheap and low-power device: “flash” EEPROM, and nothing. The “volatile” memory might be sufficiently nonvolatile under some circumstances; if it can survive for weeks or months on the residual energy stored in the energy storage device when the device has too little power to run the CPU, particularly if recharging is quick and easy (as with the mechanical generator and solar types), the advantage of “flash” may be small.

(to do: insert actual cost, power consumption, and size figures in this section.)

Radio

Communication costs a lot of power, with each bit transmitted costing at least as much as a thousand CPU instructions. Typical medium-range radios consume 1-20 milliwatts and have ranges from meters to tens of meters.

It would be useful if the radios were compatible with other widely-deployed devices, such as Bluetooth phones, to provide Internet connections without further NRE efforts and deployment risks. Bluetooth data transfers practically run at a few hundred kbps, so could transfer voice messages coded at 16kbps around 10x-50x real-time; an hour of voicemail might transfer in 2-5 minutes.

The Mote devices use their own low-speed radio protocol to transfer data between them at a few tens of kilobits per second. In general, you can get longer range by either decreasing bit-rate or increasing power usage.

One of the power-saving features of the Motes is that they run their radios in receive mode (costing 30mW) only a small fraction of the time, when they expect to be receiving data. As long as their peers know when to expect them to be listening, they can schedule data transfers for that time.

If our average power budget is on the order of 4mW, and listening on radio costs 30mW, we probably need to keep its duty cycle under about 1%.

TDM power control

Most of the following techniques were explained to me by Barbara Hohlt from the Berkeley Motes project, although the system described is not the same as the one she designed — only inspired by it.

If you listen at certain predetermined times for new conversations, you can achieve this low duty cycle. I’ll refer to the period of time you’re listening as your “timeslice”. Your timeslice recurs at regular intervals; I’ll call this interval the “cycle time”. And, finally, you need to transmit a heartbeat or “beacon” packet every few cycles that contains your node id, clock reading, and timeslice assignment, for synchronization and network topology maintenance; I’ll call this the “heartbeat interval”.

In real TDMA systems, you are only allowed to transmit or receive during your assigned timeslices, and you have to allocate them dynamically through a process of negotiation. In this system, none of this is necessary; you simply assign yourself a timeslice and periodically tell your neighbors when it is, and they will wait until that timeslice to initiate conversations with you. Multiple-access interference can be handled simply by retransmitting unacknowledged frames a small random number of cycles later.

The timeslice-length we choose is limited on the low end by clock drift considerations, and on the high end by latency and promiscuity considerations.

Clock-drift considerations merely say that your timeslice must be large enough to accommodate your neighbor’s relative clock drift to your own. Typically quartz clocks have accuracy close to a part in a million, so your timeslices need to be no smaller than perhaps 1/100 000 of the heartbeat interval.

Data-rate considerations might seem to say that our timeslice must be long enough to receive a useful amount of data, but since the timeslice allocation is only used to allow neighbors to initiate data transfers, it does not. The transmission can continue as long after your timeslice as desired, as long as the conversation remains active.

Periodically, in order to connect with new neighbors, we will want to listen through a whole heartbeat time in order to detect unexpected beacon transmissions from new neighbors, which I term “promiscuous reception”. This will require orders of magnitude more energy than our regularly scheduled listening, and so should happen orders of magnitude less frequently. But it must happen often enough to guarantee the discovery of new neighbors with acceptable latency.

Gossip can tell you about new neighbors that your neighbors already know about, but when two previously disconnected groups connect, it’s desirable to discover this relatively rapidly. For example, when the farmer comes home from working alone in the field, it is desirable that his messages to his friends and family be transmitted within a few minutes.

So at least every few minutes, you have to promiscuously listen through a full heartbeat interval. Keeping the duty cycle low means that a full heartbeat interval should be no more than 1/1000 of a few minutes, so a few hundred milliseconds at most.

The number of cycles per heartbeat

If you were to tell each of your neighbors your clock reading every interval, then your time-slice could be 1/100 000 of the interval. For example, you could listen for 50 microseconds out of each five-second interval. This gives your radio a receive duty cycle (in idle) of 1/100 000, which is great, but also a transmit duty cycle of perhaps 10/100 000, if you have ten neighbors on average, and that costs perhaps 20 to 40 times as much. (You can reverse this by broadcasting your clock reading during your own time-slice and listening to your neighbors’ broadcasts during their time-slices — now you have a transmit duty cycle of 1/100 000 and a receive duty cycle of 10/100 000. But there’s a better solution.)

If, instead, you transmit your own clock reading every six intervals, you must use a larger timeslice: 6/100 000 or so — and that becomes your idle receive duty cycle. But now your transmit duty cycle is only about 10/600 000, or roughly 1.7/100 000. (This assumes you can transmit your entire clock reading within your window, which may be unreasonable.)

Practically speaking, idle duty cycles of less than 1/1000 probably provide no additional benefit, since reducing your 30-microwatt-average idle radio to 15 microwatts won’t make a significant difference in the average power consumption of the device. So you could listen for 100 microseconds out of each 100 milliseconds and transmit your clock reading once every 500 milliseconds.
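
The duty-cycle bookkeeping above can be summarized in a few lines; this sketch follows the simplified accounting in this section (ten neighbours, a clock-drift guard of 1/100 000 per heartbeat interval, 30 mW receive, 75 mW transmit), which is an assumption rather than a measured design.

# Duty-cycle bookkeeping for the TDM scheme described above, under the
# simplifying assumptions used in the text: each transmission of a clock
# reading occupies one basic (1/100 000) slot, and there are ten neighbours.

def duty_cycles(neighbours=10, cycles_per_heartbeat=1, drift_guard=1e-5):
    # The listening slot must cover drift accumulated over a whole
    # heartbeat interval, so it grows with cycles_per_heartbeat.
    rx_duty = cycles_per_heartbeat * drift_guard
    # You send your clock reading to each neighbour once per heartbeat.
    tx_duty = neighbours * drift_guard / cycles_per_heartbeat
    return rx_duty, tx_duty

for k in (1, 6):
    rx, tx = duty_cycles(cycles_per_heartbeat=k)
    idle_uW = rx * 30_000 + tx * 75_000   # idle radio power in microwatts
    print(k, rx, tx, round(idle_uW, 2))
# k=1: rx 1/100000, tx 10/100000; k=6: rx 6/100000, tx ~1.7/100000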

There’s another alternative to listening through a full heartbeat interval: random occasional transmission and reception.

Audio Codec Hardware

Specialized codec hardware can often use two orders of magnitude less time and, I believe, less energy than general-purpose von Neumann machines running the same codecs in software.

There are two other alternatives: FPGAs, which I think use more power than ASICs (perhaps an order of magnitude?) and less than CPUs to perform such functions, and DSPs, which are general-purpose CPUs specialized for such applications (so they can do them in fewer clock cycles — I don’t know about less memory). In addition, CPUs tend to require substantial amounts of memory for such codecs.

(to do: insert real cost, power consumption, and size figures here.)

Audio Input and Output

A 10- or 12-bit ADC sampling at 8kHz is quite adequate for recording voice, although more doesn’t hurt; many microcontrollers include such devices in the chips themselves, and can be connected directly to a small dynamic microphone input with a DC offset. Typically DAC output in such devices is handled by PWM plus filtering, and I think the outputs on such microcontrollers are sufficient to drive earphones. I don’t know how well this works for voice.

External ADCs and DACs along these lines are relatively cheap and low-power as well, and at least the DAC can be entirely powered down much of the time.

If audio is the only user input into the device, however, it needs to be listening enough of the time to know when you call its name. Perhaps you could start by detecting sufficiently-loud sounds with passive analog circuitry (a band-pass filter to pull out sounds in the range of voice, a rectifier to convert to DC, and a very simple low-pass filter to ignore sufficiently short sounds) and use that to wake up the CPU; or use some kind of mechanical or capacitive switch to get the CPU’s attention.

Perhaps, though, you could run the device continuously at low speed; can you run a modern DSP or FPGA at 10000 MACs per second for a fraction of a milliwatt in order to do voice processing constantly? You can run 100 000 cycles on a non-DSP each second; on the Atmel chips mentioned previously this would cost 0.1 to 0.5mW. Presumably a chip designed for this could do it for less.

All of this attention-processing stuff would be unnecessary when the device had run too low on power to run the CPU anyway, so it doesn’t count against the power cost of retaining SRAM as NVRAM.

(to do: include real power costs of ADC on the Atmel chips, external ADC/DACs, and size and monetary costs of the latter as well)

Conclusions

Credits

Kragen Sitaker wrote the first draft of this document; it came from discussions with Rohit Khare and Smruti Vidwans, and Dan Gould provided some useful insights as well. Barbara Hohlt told me about TDM power control.

I was reading Dean and Ghemawat’s MapReduce paper this morning. It describes a way to write
large-grain parallel programs to be executed on Google’s large
clusters of computers in a simple functional style, in C++. It
occurred to me that the system as described might be well-suited to
computational cooperation between mutually untrusting entities,
computation in what we at CommerceNet refer to as a “decentralized”
system.

Decentralized computational cooperation

There are five major problems with regard to computational
cooperation, by which I mean outsourcing some amount of your
computation to servers run by people you don’t trust completely:

Cracking 
The server trusts the client not to break the server.
Storage 
The client trusts the server to store the client’s data safely.
Correctness 
The client trusts the server to execute the client’s code accurately and produce results quickly.
Confidentiality 
The client trusts the server not to disclose the client’s data against the client’s wishes.
Payment 
The server usually trusts the client to pay them for the service provided.

Cracking can be dealt with by known methods: metered resource usage
and well-known isolation techniques.

Payment can be dealt with in any number of ways; it should be noted
that it is not entirely separable from cracking.

Storage can be dealt with by replicating or erasure-coding data across
a number of administratively-independent servers.

This leaves the problems of correctness and confidentiality; I think
the MapReduce approach can help with correctness.

Correctness

The “map” function converts a record from an input file into a set of
records in an intermediate file; typically, each input record is
replicated in several places on the cluster, and the “map” function is
deterministic. The “reduce” function converts the set of all
intermediate records sharing the same key (gathered from the many
map-stage output files) into a set of output records, which are
written to an output file; typically the “reduce” function is also
deterministic.

In the usual case, where these functions are deterministic, they can
be executed on two administratively-independent servers, and the
results (which, in the Google case, are merely files) can be compared.
If they differ, the same results can be recomputed on more
administratively-independent servers to see which ones were correct.

(It may be worthwhile to compare Merkle tree hashes of the output
files, rather than the output files themselves, since moving the
output files over the network may entail significant expense.)
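
A minimal sketch of that comparison step, assuming each server can hash its own output file locally; the chunk size and hash function here are arbitrary illustrative choices.

# Compare two servers' output files by a Merkle root over fixed-size
# chunks, so only hashes need to cross the network. Chunk size and hash
# are arbitrary choices for illustration.
import hashlib

def merkle_root(chunks):
    level = [hashlib.sha256(c).digest() for c in chunks]
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:                # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def chunked(data, size=64 * 1024):
    return [data[i:i + size] for i in range(0, len(data), size)]

def outputs_agree(output_a, output_b):
    return merkle_root(chunked(output_a)) == merkle_root(chunked(output_b))

assert outputs_agree(b"key1\t3\nkey2\t7\n", b"key1\t3\nkey2\t7\n")
assert not outputs_agree(b"key1\t3\n", b"key1\t4\n")

Comparing only the roots is enough to detect a mismatch; keeping the intermediate hashes would let the client narrow down which chunk differs without re-fetching whole output files.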

This prevents any single broken or dishonest machine or administrative
entity from affecting the correctness of the overall computation.
Higher levels of redundancy can be used to defend against stronger
attackers at relatively modest cost.

Some threats can be defeated by even weaker means with negligible
computational cost. Computing each function for a randomly selected
1% of input or intermediate records, then comparing the results, may
provide an acceptable probability of catching faults or compromises if
they are expected to affect a significant proportion of the output.
In the ten-billion-record performance tests mentioned in the Google
paper, a 1% sample is about 100 million randomly selected records; a
corrupt “map” function would have to affect fewer than about 69 of the
input records to have even a 50% chance of escaping detection by that
sample. A corrupt “map” function that affected 5000 input records —
only one out of every two million — would have a vanishingly small
chance (well under one in a billion) of not being caught.
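
The arithmetic behind those figures, under the simple assumption that each corrupted record independently lands in the verified sample with probability equal to the sampling rate:

# Probability that a corruption affecting n records escapes a spot-check
# that re-executes a randomly chosen fraction p of the records, assuming
# each corrupted record is independently included in the sample.

def p_undetected(n, p=0.01):
    return (1 - p) ** n

print(p_undetected(69))     # ~0.50: about 69 records for even odds of escaping
print(p_undetected(5000))   # ~1.5e-22: essentially certain to be caught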

This probably deals adequately with machine failures and gross
attacks, but a careful attack might corrupt the output for only a
single input record — and would have only a 1% chance of being
caught. This may still be enough if the problem is a result of a
deliberate attack and the attacker is vulnerable to sufficiently
severe penalties.

Confidentiality

Confidentiality is a more difficult problem; computing with
confidential data on hardware that belongs to someone you don’t trust
requires that you compute with encrypted data. In the general case,
this is a very difficult problem.

(Perhaps this could be written with a real example.)

by Kragen Sitaker; see also Publications

Distributed posting list joins (aka inverted list intersections) are
the biggest unsolved problem for a geographically-distributed
full-text web-search engine (according to the Nutch FAQ, if I read it
correctly, in the section
entitled “Will Nutch be a distributed, P2P-based search engine?”).
Here are some thoughts which I think include a workable solution.

I talked about posting list colocation before, in a post entitled
“distributed peer-to-peer full-text web searching”, in February 2004,
on kragen-tol.
Posting list colocation seems similar to the coincidence problem for
pub-sub rendezvous discussed in
Consistent Hashing Multicast
— I wonder if there’s a way to cut down the numbers in a similar
way? Maybe as discussed in Loo, Huebsch, et al.'s "The Case for a
Hybrid P2P Search Infrastructure"
(http://iptps04.cs.ucsd.edu/papers/loo-hybrid.pdf), you could avoid
moving large posting lists around by moving queries to them instead of
moving them to queries? Hmm…

That paper also mentions something about join indexes, which sound
like they could be a useful optimization for frequent or expensive
joins.

In the same conference, Shi, Yang, et al. published a paper entitled
"Making Peer-to-Peer Keyword Searching Feasible Using Multi-level
Partitioning" (http://iptps04.cs.ucsd.edu/papers/shi-keysearch.pdf).
This brilliant paper proposes a solution to the distributed
posting-list join problem, and this solution appears to me to be a
major advance that should make peer-to-peer keyword searching of
filenames work noticeably better, but it seems to me that it doesn’t
yet solve the problem for full-text web search.

Basically it just puts each group of documents in its own group of
nodes (called a level-L subgroup in the paper) which is assumed to be
tightly connected. Unfortunately the second-largest posting list in a
query must be able to be transmitted over the network within the group
within the required response time. Putting some realistic numbers on
this, “http” and “www” might each have 16GB of total posting-list
data, each group of nodes might have 1Mbps of bandwidth available to
any member, and you might want 1-second response time. This means you
must divide your document space into at least 128000 groups, and send
any query to all 128000 groups. (The paper includes a tree multicast
protocol that makes it possible to do this with reasonable latency.)
This unfortunately is comparable to the total number of nodes that can
be expected to participate in such a system, so when you try to use
this technique to replace Google, it becomes indistinguishable from
the partition-by-document technique used by Gnutella, and requires
five orders of magnitude too much work from the network as a whole for
most queries. This will cause it to perform poorly under many
simultaneous queries, which doesn’t appear to be considered in the
paper.

The paper actually claims that its result is practical for full-text
web search, but its performance results are simulated. They find on
the order of 30GB of total data transferred for a search of 3 * 10^9
documents in a network of 16384 nodes, or about 1.8MB transferred per
node on average. On a 1Mbps link, that’s 15 seconds. Accordingly,
the latency numbers presented stop around 2 seconds per search with
L=4 and 100 million documents. (The paper is ambiguous, but I believe
the 30GB is per search, not a total over several queries.)

So here’s a slight tweak on this idea which, I think, makes it
practical: the aggregate bandwidth available to a particular posting
list should be proportional to the size of the posting list. We
achieve this by distributing posting lists for more common terms over
a larger number of nodes.

My “more on distributed peer-to-peer full-text web searching” post in
February 2004 gives a challenge problem: 18 million postings for “criteria”, one
million postings each for “autism” and “eye-contact”, with an
intersection of only 4200, in under a second.

Suppose your chosen strategy was to distribute the 2M postings for
“autism” and “eye-contact” to the machines where the “criteria”
postings lived, then have those machines do the merge and send all the
remaining results to a single collection point, in under a second; and
suppose that all the machines involved have only half a megabit per
second available in each direction. Each posting is perhaps five
bytes, so we have 10 megabytes of posting data to scatter/gather in a
second, or 80 megabits; that means that the “autism” and “eye-contact”
postings must be scattered among 80 machines each, for a total of
12500 postings per machine. Each of those machines then generates a
half-megabit spike of traffic for a second sending its load to the
“criteria” machines (of which there are 1440, so each sender must send
to 18-20 destinations); those machines filter the incoming packets in
a streaming fashion against their local hash tables of “criteria”
postings, leaving only two small lists of autism&criteria and
eye-contact&criteria that must be merged.

If all these machines must maintain in-RAM hash tables of all posting
list parts they hold, and they’re not allowed to hold more than 12500
postings of any particular posting list (so they can transmit them
over the network in a second), then each hash table might be 12500 *
(8 (pointer table bytes) + 4 (nextpointer bytes) + 8 (docid bytes) =
20 bytes) = 250 kB. This means that a 1GB machine can be responsible
for 4000 full-sized posting-list parts, or 50 million postings.
(Maybe use a sorted list instead and save a factor of 4?)
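
The sizing in the last two paragraphs can be reproduced with a few lines of arithmetic; all of the figures are the assumptions stated above (5-byte postings, half a megabit per second each way, a one-second budget, at most 12500 postings of any one list per machine, 20 bytes of hash table per posting).

# Back-of-the-envelope sizing for the autism / eye-contact / criteria join,
# using the assumptions stated in the text.
import math

POSTING_BYTES = 5
LINK_BPS = 0.5e6        # half a megabit per second, each direction
BUDGET_S = 1.0

def machines_needed(postings):
    """Machines a posting list must be spread over so each machine's share
    can cross its link within the time budget."""
    bits = postings * POSTING_BYTES * 8
    return math.ceil(bits / (LINK_BPS * BUDGET_S))

print(machines_needed(1_000_000))    # 80 each for "autism" and "eye-contact"
print(1_000_000 // 80)               # 12500 postings per machine
print(machines_needed(18_000_000))   # 1440 machines holding "criteria"
print(12_500 * (8 + 4 + 8))          # 250000 bytes of hash table per list part
print(1_000_000_000 // 250_000)      # 4000 parts (~50M postings) per 1 GB machine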

For reference, my current mailbox contains 300MB, 1.2 million terms,
14 million postings, and 36000 documents (mail messages). If those
numbers are typical, this scheme needs one network node per 130
thousand documents or 1GB of text. This scales linearly with RAM on
those machines, but that still means we can only index the documents
that fit in these machines’ RAM. This limitation doesn’t change with
available bandwidth.

Unfortunately, you also need enough network nodes that the longest
posting list can be distributed at 12500 postings per node, and
practically speaking, I think that means that you need one node per
12500 documents. This limitation doesn’t change with available RAM,
but it does change with available bandwidth. You’d need about 5
megabits (per acceptable response time, e.g. per second, or per two
seconds) each way for it to equal the limitation of 1GB RAM.

So that’s an interesting pair of limitations. In this scheme, we can
only effectively use about 200MB of RAM per megabit of bandwidth, or
vice versa, and either one buys us about 25000 documents or 200MB
indexed.

1GB RAM now retails around $200, or about $40 per 200MB.

The Shi et al. paper discusses ways to dramatically reduce bisection
bandwidth requirements; these are not discussed here on the theory
that the planet-wide internet probably has sufficient bisection
bandwidth. However, their geographic clustering technique does apply
to the design discussed here.

I’m guessing that right now core bandwidth costs about $20 per
bidirectional megabit per second per month. (This means that every
two months, the bandwidth cost equals the memory cost, and that’s only
going to get shorter.) Indexing the 40 terabytes of data on the Web in
this fashion would require 40 000 one-gigabyte-RAM, five-megabit
machines, which would cost about $100 each per month to run if they
were doing searches flat-out all the time. Burst bandwidth is cheaper
than sustained bandwidth, though.

It’s very reasonable to expect 40 000 machines, or even ten times that
number, to participate in a global peer-to-peer network.

If you were going to build this system today, the user interface
should probably be in Korean, then translated into Japanese, then
possibly translated into Chinese and English; although there are more
potential participants in China and English-speaking countries, they
tend to lag far behind Korean and Japanese participants in available
bandwidth.

EC’05-SSA (Sponsored Search Auctions) Workshop Notes

These are Rohit Khare’s notes on the SSA Workshop; links to each of the papers can be found on the SSA Workshop Agenda Page.

Pennock: History

In the beginning: exact match, rank by bid price, pay per click, human editors – worked, if ad hoc.

Today & tomorrow:

  • AI match
  • Rank by expected revenue
  • 2nd price (proxy bid)
  • pay-per-click, -impression, -conversion
  • budget optimization
  • contextual
  • automating editorial filtering
  • “exploration vs. exploitation”

See also Mehta, Saberi, Vazirani & Vazirani: AdWords and generalized on-line matching.
This workshop nearly doubles the academic literature on the topic.
Interesting how interested MSR is: 4 papers at this workshop alone!

Y!RL has a data sharing policy in place…

Kursad’s Welcome. He has two papers in this area so far: Internet ad pricing models, “A theory of bidding in search phrase auctions: Can bidding wars be collusive?”

Introductions (20ppl)

Tuomas Sandholm 
first automation of combinatorial bidding for computational resources (1991). CMU. Founder/CTO of CombineNet (rel=Parkes). Also interested in multi-agent learning. New to this sub-field.
Michael Ostrovsky 
Harvard Econ graduate, game theory, micro. About to start at Stanford in July.
Jim Jansen 
Penn State IST – information-seeking perspective
Siva Viswanathan 
UMd Business school: intersection of game theory and economics. Online intermediaries.
Jan Pedersen 
Chief scientist at Yahoo!Search.
David Radicheck 
grad student Berkeley, Y!RL summer intern. Applied mechanism design.
Samueli Angelov 
TU Eindhoven, automatic-contracting, looking to test it in this field.
Vandana Ramachandran 
2nd year UMd PhD with Siva. Asymmetry, how market makers are changing that.
Animesh Animesh 
thesis student at UMd as well, looking to learn more about this area.
? Chen 
UT Austin, grad student, mechanism design in online auctions
Chris Meek 
MSR, new to this area, it’s exciting to learn more. Machine-learning person by background.
Brendan Kitts 
iProspect Inc, developed a bidding system over the years, currently managing $3M at a time.
Joshua Goodman 
MSR, worked on spam (email), started in Machine Learning, moved to economic mechanisms. Now interested in all the bad things you can do: clickfraud is a great challenge area. He’s started a new group (2ppl so far) interested in spam, running the anti-spam conf this year (not the MIT one)
Qi Lu 
Y! VP of Engineering, search marketplace
Madhu Therani 
U Az. IS dept, got into this area 6 mos ago, contacted by an SEO looking for advice.
Badrui Sarwar 
research scientist in personalization at Shopping.com (now eBay). {Formerly of Outride, Sehda.com, part of their new-ish research unit}.

…two other students joined in later.

Goodman: Pay-percent-of-impressions.

09:00a - 09:30a Pay-Per-Percentage of Impressions: An Advertising Method that is Highly Robust to Fraud
                Joshua Goodman, Microsoft

Pie analogy: found a fake plastic pie in his cafeteria – why is this fraud going on? Attacker is trying to reduce the price of pies… but now they’re using sophisticated fake-pie detection algorithms, but now they’re adding scent to the plastic ones, etc, etc.

Fix? Offer to buy 10% of all the pies for $1. So I don’t care that they deliver some fake pies, I got the one slice I wanted.

Cute detached-head-shot of Reyes, GOOG CFO: “the biggest threat to the Internet economy… something has to be done about this really, really quickly, because potentially, it threatens our business model.” As early as 1999, hackers on IRC colluded to click each others’ ads.

  • broad match does not work / prefix match does
  • Importance of being truly random
  • Suggest it’s a great system to use in combination with other methods (e.g. for the most fraud-susceptible terms)

Click Fraud: 1) click on a competitor to tap them out. This means you pay less if you join later, and you reduce their ROI. Pay-per-conversion is one mooted way to tackle this, but then you get into advertiser-fraud, so they’ll minimize their conversion. Another example: he buys ads on other email conferences, so those organizers might click me to get me off of “their” term. Type 2) syndication click fraud, clicking on ads running on your own site. {Side discussion ensues about how much money GOOG makes from syndication by content-sensing} {Kursad: doesn’t type-1 clicking increase your competitor’s CTR, raising their competitive placement and perhaps canceling out? No, not enough.}

Impression Fraud: A little harder to do, since you don’t want impressions on your own ads (one version: impression-fraud everyone, click-fraud yourself, then even have the chutzpah to demand a refund!). Goal is to lower competitors’ CTR until they fall off.

Solution: pay for share – {people adding crap to the mix just falls out of the math} except for the new kind of fraud: Misinformation Fraud. This attack would be to run up lots of fakes and get a novice buyer to pay a lot for a “highly-trafficked” term – in this model, advertisers must have enough info to make their own decisions about what %age is probably not fraudulent when bidding. Best predictor will be their own previous experience.

-> haven’t we just shifted the detection problem to the advertiser?? ‘‘Yes’‘! Shifting the burden back to them is a good thing, gives them more options, but most of all delinks the incentives from the all-powerful ad network.

Other advantages: it seems easier to determine the aggregate %age fake than which clicks are fake. It’s important, though, that you get a genuinely random subsample for your 10%. Don’t settle for 1/10th of the day, or 1/10th of the geography, etc – you need a non-gameable subset. {Reminds me of cut-and-choose rules in cryptography}
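
{Not from the talk – a minimal sketch of what a genuinely random split could look like: each impression is assigned independently, so there is no time-of-day or geography slice for an attacker to target. The buyer names and fractions are invented.}

import random

def assign_impression(share_buyers, rng=random.SystemRandom()):
    """Assign one incoming impression either to a share buyer or to the
    regular pay-per-click auction.
    share_buyers: dict mapping buyer id -> fraction of impressions bought,
    e.g. {"buyer_a": 0.10}; fractions must sum to <= 1.0."""
    r = rng.random()
    cumulative = 0.0
    for buyer, fraction in share_buyers.items():
        cumulative += fraction
        if r < cumulative:
            return buyer
    return "ppc_auction"   # everything else goes to the usual auction

# Example: one buyer bought 10% of all impressions on this term.
counts = {}
for _ in range(100_000):
    winner = assign_impression({"buyer_a": 0.10})
    counts[winner] = counts.get(winner, 0) + 1
print(counts)   # roughly 10% buyer_a, 90% ppc_auction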

Broad match: can’t oversell. The market for “digital” and “camera” overlap. Historical data can’t disambiguate, b/c that retrospective also has fraud in it.

Fix: use prefix matching. Property that must be maintained? “The probability that I get any real lead is independent of the actions of any other advertisers.”

Ultimately, we’d like an auction algorithm that unifies all of these (PP%, prefix-PP%, PPC, exact-PPC). First, normalize all ads to prefix-PP%; create pseudo-markets by aggregating prefixes (and running hierarchical sub-auctions).

Concluding thought: may seem wacky, but this is what we do in ‘old media’ – you buy %ages of time, and large ad buyers may be happy with this again.

Q: how do you account for positioning-within-pages? See paper: there’s a point scheme for making those distinct markets. Q: how continuous? A: paper argues for 1 week windows. Q (Jan): if the issue is to randomize impressions to get a share of the true impressions, can we just apply this to budget constraints… {actually, I didn’t capture this point accurately}

Meek: Stochastic and contingent-payment auctions

09:30a - 10:00a Stochastic and Contingent-Payment Auctions
               Christopher Meek, Microsoft Research
               David Maxwell Chickering, Microsoft Research
               David Bruce Wilson, Microsoft Research

Definition: auction for items where payments are contingent on some future event. Today: allocation is by-bid/by-expected-revenue, with “Vickrey-like” pricing {?}

Problem 1: Vindictive bidding: a spoiler that raises the price to just below the max (i.e. incentive-incompatible if you actually have the goal of increasing competitors’ expenditures…)
Problem 2: Uncertainty – low CTR estimates are reinforcing, so such hypotheses need to be re-tested (and that “costs” the auctioneer)

His proposal: allocate and price items stochastically. Properties: not as efficient, but can limit vindictive bidding, increase # participants, learn probabilities, etc.

Allocation rule is a function from the bid-vector to a distribution over allocations (in general). In the one-item case, it simplifies to the “marginal allocation” function: the probability that the i-th bidder wins with a bid at price p_i. Suppose bids were $8/$1/$1 – a proportional rule gives the $8 bidder an 80% share, but if you exponentiate the shares you can tune it towards winner-take-all; this exponent Beta lets us tune between these.
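
{My own toy sketch of the one-item allocation rule as I understood it – the exact rule is in their paper: win probability proportional to bid^Beta.}

def allocation_probabilities(bids, beta):
    """Stochastic allocation for a single item: bidder i wins with
    probability proportional to bid_i ** beta.  beta = 1 gives proportional
    shares; large beta approaches winner-take-all; beta near 0 approaches a
    uniform lottery."""
    weights = [b ** beta for b in bids]
    total = sum(weights)
    return [w / total for w in weights]

bids = [8.0, 1.0, 1.0]
for beta in (0.0, 1.0, 4.0, 20.0):
    print(beta, [round(p, 3) for p in allocation_probabilities(bids, beta)])
# beta=1 reproduces the 80%/10%/10% split; beta=20 is nearly winner-take-all.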

“Condex {conditional expectation} pricing rule” – price charged the winner is less than the expected value of the ‘second-price’ bidder.

Go back to the 8/1/1 case: with a very high beta, anyone who bids over 1 would win, so winner should pay nearly-$1 – what the $8 player would have paid if they “bid” a density function rather than the $8 “point”. He shows one example curve, where the $8 player ‘spread their bids’ and ended up paying, on average, $2.

Next step is transforming this into a “workable” scheme, one that uses a random number assigned to each bidder (a ‘coin’). This factor is used to discount their effective bid, but also uses others’ factors to discount what the winner actually pays (!)

For Problem 1 (vindictive), the spoiler bears some risk of winning – and they’d lose real money. For Problem 2, this is in the setting of a repeated game. The auctioneer faces a classic exploration-vs-exploitation tradeoff; recent literature looks at this very old problem with a ‘reinforcement-learning’ perspective. E.g. interval estimation, where we “fuzz upward” CTRs to the 95% level of the PDF. Or the 1933-era Thompson strategy of drawing randomly from the two PDFs.
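
{Again my sketch, not theirs: the two exploration heuristics in code, using a normal-approximation upper bound in place of the exact 95% quantile for “interval estimation”, and the stdlib Beta sampler for Thompson. The click counts are made up.}

import math
import random

def ucb_ctr(clicks, impressions, z=1.645):
    """Interval-estimation-style 'fuzz upward': one-sided ~95% upper bound
    on CTR under a normal approximation."""
    n = max(impressions, 1)
    p = clicks / n
    return p + z * math.sqrt(p * (1 - p) / n + 1e-9)

def thompson_ctr(clicks, impressions):
    """Thompson (1933) sampling: draw a CTR from the Beta posterior
    Beta(clicks + 1, impressions - clicks + 1) and rank ads by the draw."""
    return random.betavariate(clicks + 1, impressions - clicks + 1)

ads = {"new_ad": (2, 50), "incumbent": (400, 10_000)}   # (clicks, impressions)
for name, (c, n) in ads.items():
    print(name, round(c / n, 4), round(ucb_ctr(c, n), 4), round(thompson_ctr(c, n), 4))
# Either way the under-explored ad gets a boosted effective CTR, so a low
# estimate keeps being re-tested instead of becoming self-reinforcing.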

Q (Kitts): observation that Goog and others seem, in effect, to rotate ads in and out randomly, for some secret reasons. A: they may appear to be very precise in their language, but the fudging is around CTR. A (Kitts): we found this out because we raise our clients’ budgets to $250K/day, and then our agents dynamically monitor pricing/usage to override Goog budget control (and thus our clients’ ads don’t get rotated out of #1).

Sandholm (with Parkes): Optimize-and-Dispatch

10:00a - 10:30a Optimize-and-Dispatch Architecture for Expressive Ad Auctions
               David C. Parkes, CombineNet, Inc. and Harvard University
               Tuomas Sandholm, CombineNet, Inc. and Carnegie Mellon University

Bit more of a framework, not really a particular auction design. Expressiveness has its advantages in other markets, but the challenge is combinatorial explosion in the ad domain. Historically with OR-matching; Sandholm added XOR in 98/99, then OR-of-XORs; Nisan added XOR-of-ORs in 2000, and OR* (a general graph, more compact; Nisan and Fujishima). Other new ones: recursive, side constraints (linear expressions). A lot of experience was gained in the B2B procurement world. CombineNet provides (and has provided): package bids, conditional discounts, capacity constraints, multi-attribute bidding, ‘detailed cost structures.’

Why? Improving economic (allocation) efficiency in the supply chain (Pareto optimality). Without it, having to bid incrementally for each atomic item means that risk exposure reduces prices. {More disclosure == market designs that do better, in the same way that ASTs let compilers do a better job at runtime?}. Also reduces the need to pre-package bundles. Collaborative bidding – sharing backhaul transport, etc. CombineNet has cleared $11B, saving $1.5B in new value (not merely zero-sum moving of savings along the chain).

{he offers a useful first-stab at mathematically formalizing the Google SSA model – should check the literature to see if there’s better ones in the literature yet}

What’s unique to SSA? 1) performance: the auction has to run in real-time, which seems to rule out optimization; 2) on-line: it must be solved without waiting for future knowledge to reduce uncertainty. The framework he’s proposing to address this is a near-line optimizer that runs “often” and whose output parameterizes a fast dispatcher.

Expressiveness encompasses local & global (namely, constraints that apply to a particular auction, or must be calculated over a collection). Examples of global: volume SLAs, interdependencies (per-slot, or not-alongside-competitors), bonuses (threshold & incremental payouts); and price schedules (price-adjustment) – note that this subcategory of global constraint can be implemented/enforced by the dispatcher using local info & its past dispatch log.

Examples of local: grammar (core+optional good words, phrases, {thesaurus}, hits (by URLs), even query expansion (how targeted would a refinement be if I added my own terms?)); user profiling (geoIP, history, etc); price-adjustments (bonus solely to trump competitor C, or reduce their rank, etc). {Refers to some old notable paper on selling nukes that yields unbounded revenue – wonder which one he’s talking about?}

OK, so what’s the information flow between the optimizer and dispatcher? They seem to operate at different time-horizons (i.e. opt=month, disp=day). See paper for details.
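
{My guess at the shape of that information flow, with made-up weights and a placeholder “optimization” – just to make the two time-scales concrete, not their actual algorithm.}

import random

def optimize(bids, delivered, volume_target=1000):
    """Slow path (run, say, weekly): solve the expressive allocation problem
    offline and boil the answer down to simple per-ad dispatch weights.
    The 'solver' here is a placeholder: weight by bid, discounted by how far
    each advertiser already is through an assumed volume commitment."""
    return {ad: bid * max(0.1, 1 - delivered.get(ad, 0) / volume_target)
            for ad, bid in bids.items()}

def dispatch(eligible_ads, weights, rng=random.Random()):
    """Fast path (run per query): pick among eligible ads using only the
    precomputed weights -- no optimization at query time."""
    scored = {ad: weights.get(ad, 0.0) for ad in eligible_ads}
    total = sum(scored.values()) or 1.0
    r, cum = rng.random() * total, 0.0
    for ad, w in scored.items():
        cum += w
        if r < cum:
            return ad
    return next(iter(eligible_ads))

weights = optimize({"ad1": 2.0, "ad2": 1.5}, delivered={"ad1": 800})
print(dispatch({"ad1", "ad2"}, weights))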

Q: Transparency: how much of these expressive bids should be public? A: We’ve built 190+ markets, and it’s varied quite a bit, from r/t feedback to never to occasional, etc – how to shape/improve your bids. Expect to see the quoting function become important – the system needs to offer some feedback to new bidders in the form of a quote with expected yield.

Q: How well could these strategies be imposed unilaterally? What if I informed my bidding agent, and then it went into a single-event market and dealt with it dynamically? A: No, the optimal market outcome requires access to all of the competitive bids… the point is that a clearing engine must assess global information.

{Q: how to audit this? Q: how to build actual experiments with this? – maybe bring this back up at the panel?}

Edelman, Ostrovsky: Bidder Strategy

11:00a - 11:30a Strategic Bidder Behavior in Sponsored Search Auctions
               Benjamin Edelman, Harvard University
               Michael Ostrovsky, Harvard University

Ostrovsky spoke (he will be teaching Economics at Stanford this year). Fundamental question: do bidders modify their behavior based on the actions of ‘‘other bidders’‘?

{Nobody from Google here, btw.} Google’s official line that its auction is Vickrey and hence strategy-proof is frankly wrong – the argument doesn’t hold b/c there are multiple objects (3 ad slots per page, say). Quick example: if there are only two bidders and three slots, you should shave your price down and take the second spot for less than the first.

Key data set: Overture data from 2002-3 when (like eBay) they offered two strategies to advertisers: first-price fixed price, or proxy-bidding (effectively second-price). They can detect auto-bidding from the transaction logs by looking for very-fast cycles between pairs of bidders (note that the log timestamps were only 15-min accurate).

Comment by Sandholm: do you see preference curves that ‘price’ the tradeoff in ranking?

Second data set: a week of watching Y!/Goog for top rankings for ads on some well-known keywords every 7 minutes – and once again, saw cycling between warring bidders who wanted top-rankings. ‘‘The median streak length was 14 minutes!’‘

Punchline? How much money is left on the table? Compare the proxy-bid data against a counterfactual dataset where you ran a true second-price auction, using each bidder’s highest bid of the week as their putative private value. The data shows gaps of 60% (in the paper) to 22% (newer data set, median) to 48% (top decile of competitive words).

Theoretical wrapup: the Vickrey-Clarke-Groves auction actually states that the winning bidder pays the externalities imposed on other bidders, which reduces to a second-price auction for a single good – but not in a multiple-good scenario. Instead, you need to account separately for the value of an ad placed at a rank and the tradeoff costs of dropping a rank, etc.
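
{To make that concrete, here’s a small worked sketch of VCG payments for a slot auction with position-dependent CTRs – textbook VCG, not anything specific to their paper; the numbers are invented.}

def vcg_slot_payments(values, ctrs):
    """VCG payments for a slot auction.
    values: per-click values, sorted descending (bidder i takes slot i).
    ctrs:   slot click-through rates, sorted descending.
    Each winner pays the click value the bidders below would gain by moving
    up one slot if the winner were absent (their externality)."""
    n_slots = min(len(ctrs), len(values))
    ctrs = list(ctrs[:n_slots]) + [0.0]          # virtual empty slot below the last
    payments = []
    for i in range(n_slots):
        p = 0.0
        for j in range(i + 1, min(len(values), n_slots + 1)):
            # bidder j would move up from slot j to slot j-1 without bidder i
            p += values[j] * (ctrs[j - 1] - ctrs[j])
        payments.append(p)
    return payments

# Three slots with CTRs 0.10 / 0.06 / 0.03 and four bidders:
print(vcg_slot_payments(values=[10, 8, 5, 2], ctrs=[0.10, 0.06, 0.03]))
# -> [0.53, 0.21, 0.06]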

So rerunning the counterfactual process against VCG: it does poorly in thin markets (3.9 competitors on average), but adds 8% on average (and a minimum of 1%) in liquid markets (10+ bidders).

Conclusions: there is strategy (it’s not incentive-compatible today); there’s real money at stake; and a correctly implemented VCG would increase revenue for competitive keywords while making life easier for bidders (though, admittedly, if most revenue comes from thin markets, it might not).

Ramachandran (UMd): Empirical data on bidding strategy

11:30a - 12:00p Online Advertisers' Bidding Strategies for Search, Experience, and Credence Goods: An Empirical Investigation
               Animesh Animesh, University of Maryland
               Vandana Ramachandran, University of Maryland
               Siva Vaswanathan, University of Maryland

Trying to compare behavior across the spectrum of Search (e.g. CDs, no uncertainty about their value) – Experience (e.g. flowers, you at least learn quality upon delivery) – Credence (used cars, even after using it, you’re uncertain of its value) goods. Hypotheses: in Search goods, high-quality firms should advertise more, while in Experience/Credence goods, advertising is itself a signal, making it even possible that a low-quality firm should advertise. Empirical findings are inconclusive.

Approach: regression of observed ad rank against Alexa data about site traffic. Findings: in Search goods, advertisers are stable and place a high value on top rankings. Credence goods have the most rank-volatility.

Isobar/iProspect: Everything you ever wanted to know about clickfraud but were afraid to ask…

12:00p - 12:30p A Formal Analysis of Search Auctions Including Predictions on Click Fraud and Bidding Tactics
               Brendan Kitts, Isobar Communications Corporation
               Parameshvyas Laxminarayan, Isobar Communications Corporation
               Benjamin LeBlanc, Isobar Communications Corporation
               Ryan Meech, Isobar Communications Corporation

Media would have you believe that fraud is 10-20%. Overall, across their client base, 12.5% of clicks are being rebated (presumed fraud). They model 16.9% — so it roughly lines up. How many clicks with no conversion in a row can there be? Using the binomial distribution at a confidence of 0.01, up to 0.8% of public IPs fail. Their models do line up with the rebates, too – but the outliers are quite intriguing.
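
{My reconstruction of that streak test – the thresholds are theirs, but the arithmetic is just the zero-successes term of a binomial; the conversion rate below is invented.}

def prob_no_conversions(clicks, conversion_rate):
    """Probability of `clicks` clicks from one IP with zero conversions, if
    each click converts independently at `conversion_rate`."""
    return (1 - conversion_rate) ** clicks

def flag_ip(clicks, conversion_rate, alpha=0.01):
    """Flag an IP when a zero-conversion streak this long would occur with
    probability below alpha under honest clicking."""
    return prob_no_conversions(clicks, conversion_rate) < alpha

# Example: at a 2% conversion rate, how long a streak before we flag?
for clicks in (100, 200, 230, 300):
    print(clicks, round(prob_no_conversions(clicks, 0.02), 4), flag_ip(clicks, 0.02))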

Grandiloquently-titled “Ryan’s Theorem” is that ‘‘if’‘ clickfraud affects all advertisers {in a market} equally, nobody cares. Well, no advertiser or engine – but content-publishing partners are playing a zero-sum game with fraudsters and, in extremis, the adjusted price could fall below the minimum floor.

There will still be a need for third parties to arbitrage budgets amongst search engines.

Aggarwal, Hartline: Knapsack Auctions

02:00p - 02:30p Knapsack Auctions
               Gagan Aggarwal, Stanford University
               Jason D. Hartline, Microsoft

Basic problem: sell a single ad-slot on a web page over the course of a day. C pageviews; advertiser i requires c_i impressions per click (1/CTR); and each advertiser will pay v_i for a single clickthrough.

Batched auctions are easier to make “temporally strategyproof” but admittedly not practical yet. On the good side, this maps directly to the knapsack problem: how many advertisers to admit into the mix to maximize revenue?
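
{A toy version of that mapping, not theirs: treat each advertiser as a knapsack item whose “weight” is the impressions they want and whose “value” is what they would pay, then solve by standard dynamic programming. All numbers are invented.}

def knapsack_admit(advertisers, capacity):
    """0/1 knapsack by DP: choose the subset of advertisers whose total
    impression demand fits in `capacity` pageviews and whose total payment
    is maximal.  advertisers: list of (name, impressions_wanted, payment)."""
    best = [(0, [])] * (capacity + 1)   # best[c] = (value, names) within c impressions
    for name, imps, pay in advertisers:
        new_best = list(best)
        for c in range(imps, capacity + 1):
            cand = best[c - imps][0] + pay
            if cand > new_best[c][0]:
                new_best[c] = (cand, best[c - imps][1] + [name])
        best = new_best
    return best[capacity]

ads = [("a", 6000, 90), ("b", 4000, 70), ("c", 5000, 60)]
print(knapsack_admit(ads, capacity=10_000))      # -> (160, ['a', 'b'])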

CMU: SSA design via Machine learning

02:30p - 03:00p Sponsored Search Auction Design via Machine Learning
               Maria-Florina Balcan, Carnegie Mellon University
               Avrim Blum, Carnegie Mellon University
               Jason D. Hartline, Microsoft
               Yishay Mansour, Tel-Aviv University

Aiming to reduce revenue-maximization problems to well-known algorithms. The reduction is based on random sampling.

Typical analysis of independent auctions does not account for competition between advertisers bidding on similar keywords. How to optimize when the semantic similarity information is private?

Trying to create clusters of terms to form interlinked markets and thus offer a fixed price in each market.

UT Austin: Divisible goods

03:00p - 03:30p Designing Share Structure in Auctions of Divisible Goods
               Jianqing Chen, University of Texas at Austin
               De Liu, University of Kentucky
               Andrew B. Whinston, University of Texas at Austin

Bid for shares, not just positions.

Penn/FIU: Searcher perceptions

04:00p - 04:30p Examining Searcher Perceptions of and Interactions with Sponsored Results
               Bernard J. Jansen, Pennsylvania State University
               Marc Resnick, Florida International University

Clever hack: controlled experiments where they swapped the ads and the top hits to see what difference it made (testing the value of editorial integrity, what a nifty J-school koan!). 82% of users go first to the organic links, 6% to ads (according to self-reported “talk-out-loud” recordings); only 27% of users never looked at the ads at all. Of those who viewed ads, 56% could dismiss an ad by its title alone, but 67% also read the summary to decide whether it was relevant.

by Kragen Sitaker, Friday October 15; see also Publications

(Detailed notes that supplement our notes on Google Desktop Search’s WinSock integration.)

For Firefox, it looks like what goes over the wire has a <!--tro2--> in
the appropriate place, which somewhere gets replaced by a <p
class=e><table …ID=”Google Desktop Search”>…</table>. So somewhere
between where the bytes come in over the wire and where they go into
Firefox’s “view source” and rendering engine, this HTML comment gets
rewritten with the local desktop search results.

No clues yet on how that happens. Maybe a Firefox plugin? Maybe it
patched the TCP/IP stack?…

Also of note:

  • the User-Agent string is Mozilla/5.0 (Windows; U… Google-TR-1) Gecko/20041001 Firefox/0.10.1 — so clearly somewhere in the request path the desktop search code is munging the user-agent.
  • GoogleDesktopNetwork1.dll and GoogleDesktopNetwork2.dll are loaded into Firefox, according to CurrProcess
  • GoogleDesktopNetwork2.dll says in its .rdata section, some in UTF-16 and some in ASCII:
Layered Hidden Window
Layered WS2 Provider
Content-length: %d
<!--tro2-->
<!--trh2-->
QUERY
SEARCH
WEBSERVER
<!--trl2-->
:
content-length:
text/html
content-type:
gzip
content-encoding:
chunked
transfer-encoding:
HTTP/1.0
HTTP/
charset=
X-TR: 2
X-TR: 1
<body
Content-Type: text/html
http://
GET http://
toolbarqueries.google
.google.

and other related stuff.

  • objdump -p says GoogleDesktopNetwork2.dll calls only a few functions

from WS2_32.dll: WPUCompleteOverlappedRequest, WSCGetProviderPath,
and WSCEnumProtocols (plus five anonymous entry points), and provides
only a few functions: WSPAccept, WSPAddressToString, and 28 more WSP*
functions. Plus DoNothing.

It appears that WSCEnumProtocols gives you a bunch of protocols, each of
which contains a protocol chain: a list of Winsock SPI transport
providers (“Winsock 2 layered service provider”), each layered on top of
the next. The WSP* functions are those defined in the SPI interface.

In the registry, HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services
\WinSock2\Parameters\Protocol_Catalog9\Catalog_Entries contains a bunch
of subkeys with values pointing to the GoogleDesktopNetwork1.dll — via
different pathnames. The old versions (conveniently saved under a
nearby key) point to “mswsock.dll” instead.

So apparently Microsoft Windows arranges to load the Google DLLs into
anything that accesses the network, and they have a chance to rewrite
HTTP streams before the application sees them.
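
If you want to poke at this on your own machine, something like the following (Windows-only, Python) dumps the catalog entries and greps each binary blob for “.dll” strings – it doesn’t assume any particular value name, since I haven’t checked what the values are called.

import winreg

CATALOG = (r"SYSTEM\ControlSet001\Services\WinSock2\Parameters"
           r"\Protocol_Catalog9\Catalog_Entries")

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CATALOG) as catalog:
    i = 0
    while True:
        try:
            sub = winreg.EnumKey(catalog, i)
        except OSError:
            break                                  # no more catalog entries
        i += 1
        with winreg.OpenKey(catalog, sub) as entry:
            j = 0
            while True:
                try:
                    name, value, _ = winreg.EnumValue(entry, j)
                except OSError:
                    break                          # no more values in this entry
                j += 1
                blob = value if isinstance(value, bytes) else str(value).encode()
                if b".dll" in blob.lower():
                    print(sub, name, blob[:120])   # which DLL backs this entry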

Microsoft Word doesn’t seem to get the desktop search results, though,
even though CurrProc says it has the DLL loaded. Turns out
GoogleDesktopNetwork2.dll has a list of executable files:

WAOL.EXE
OPERA.EXE
NP.EXE
NETSCP6.EXE
NETSCP.EXE
NEOPLANET.EXE
NEOPLA~1.EXE
MSN6.EXE
MSN.EXE
MOZILLAFIREBIRD.EXE
MOZILLA.EXE
IEXPLORE.EXE
FIREFOX.EXE
AIM.EXE
AVANT.EXE
AHTTP.EXE

And, in fact, if we rename FireFox’s executable to a different name, the
desktop search results stop showing up at the top of Google search
pages. (After we exit and restart the browser, anyway.)

Also, the string “Google-TR-1” occurs in the GoogleDesktopNetwork2.DLL.

Introduction

Ångströ is that most dreaded of software artifacts, a multi-purpose platform: infrastructure for storing, retrieving, sharing, and searching Atom entries with semi-structured contents.

  • Using Ångströ as a stable, secure remote store can dramatically increase the power of user scripts and AJAX-style user interfaces.
  • Using Ångströ as a notebook to collect ‘factoids’ while surfing allows easy searching — and graph-based measures for ranking relevancy.
  • Using Ångströ as a notification service enables synchronization of multiple instances on occasionally-connected devices.

The central insight is that Atom entries are close to becoming a lowest-common-denominator for interchange between all of the application-layer protocols we are familiar with today; and that the REST architectural style (and its derivatives) are more appropriately used to explain how Web applications behave at a granularity finer than an HTML page.

One source of inspiration is the old world of PC database applications such as Access, FileMaker, and dBase – power users used to be capable of whipping together small applications using simple abstractions of forms, tables, and reports. We’d just like to use the World Wide Web as the database.

Scenario

A scenario I’d like to enable is the development of a so-called ‘long-tail application,’ something as seemingly frivolous as, say, a frequent-flier’s console:

  1. review the status of every flight I’ve ever taken, on any carrier (“scrape feeds”)
  2. allow me to print boarding passes or activate bonus offers (“invoke services”)
  3. display the highest awards I am currently eligible for on each carrier – and how many miles to go until the next milestone (“calculation”)
  4. and sort incoming offers for promotional mileage bonuses on routes/carriers that are likely to apply (“news ranking”)

From there, it would be fun to add all sorts of additional features like a map of where I’ve been, where I can transfer miles from credit cards to “top-up” accounts, analyses of how often I choose carriers on a citypair, get upgraded, or even integration with other applications like expense-reporting. However, the four core features above are already enough to evaluate whether an Ångströ would be helpful.

Netscrape

There are several sources of flight history information in my digital life: carriers’ FF sites, online travel agencies, emails from both, and random historical scraps such as Excel files.

The common tool I’d like to use on all of these sources is a ‘semantic highlighter’ to train a scraping bot by example. By “turning on a macro recorder” and logging into an FF-site, generating a report, and then pointing at relevant table columns for Origin / Destination / Date / Class / Flight / Reservation, I should be able to create an Atom entry with microformatted XHTML in my new ‘rohit-flight-info’ model for each paragraph, row, or email.
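
To make that concrete, here is a rough sketch (in Python, with invented field names – ‘rohit-flight-info’ is not a defined microformat) of the kind of Atom entry the highlighter might emit for one flight row:

from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", ATOM)
ET.register_namespace("xhtml", XHTML)

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "UA 863 SFO-SYD 2005-03-14"
ET.SubElement(entry, f"{{{ATOM}}}id").text = "tag:example.org,2005:flight/ua863-20050314"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2005-03-20T00:00:00Z"
content = ET.SubElement(entry, f"{{{ATOM}}}content", type="xhtml")
div = ET.SubElement(content, f"{{{XHTML}}}div")
div.set("class", "rohit-flight-info")                 # hypothetical microformat root
for cls, text in [("origin", "SFO"), ("destination", "SYD"), ("date", "2005-03-14"),
                  ("class", "W"), ("flight", "UA 863"), ("reservation", "ABC123")]:
    span = ET.SubElement(div, f"{{{XHTML}}}span")     # one field per class name
    span.set("class", cls)
    span.text = text

print(ET.tostring(entry, encoding="unicode"))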

Microservices

Having vacuumed up the data from multiple sites, I’d like to present it for myself. While atoms may have been stored by source-site and date (of acquisition, not the date-of-flight), I need to query for all ‘rohit-flight-info’ across all of them.

Turning to display, the EJAX approach for connecting data-sources to client-side templates is to simply describe the “report” format I want in HTML and indicate in passing which fields I want bound to which elements. Think of it as a very lightweight XSLT, perhaps again trained by examples.

For flights that are in the future and have valid locator-codes, it should also be possible to add HTML forms to the display that link directly to carriers’ services for printing boarding passes, changing seats/classes, etc.

Calculation

With a proper microformat, it seems reasonable to expect Ångströ to parse numbers, dates, and even units – but not to learn more-esoteric rules about bonus-miles, promotions, and fare bonuses. It should be as easy as it might be in Excel to write up all of those rules and come up with a bottom-line on how many miles I have on each carrier.

The second part of the demo is a lookup: comparing those mileage levels to each carrier’s (constantly changing) award charts. This presumes that another process is extracting a meaningful feed from the carrier’s site for each award; it would be straightforward to numerically compare the mileage levels to filter eligible awards.
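
A toy version of both steps – totaling miles per carrier and filtering an award chart – just to show that this layer really is spreadsheet-grade logic (all carriers, mileages, and awards below are invented):

flights = [
    {"carrier": "UA", "miles": 7417},   # e.g. SFO-SYD
    {"carrier": "UA", "miles": 7417},
    {"carrier": "UA", "miles": 2565},
    {"carrier": "AA", "miles": 2454},
]
award_chart = {
    "UA": [("Domestic saver award", 25_000), ("Confirmed upgrade", 15_000)],
    "AA": [("Domestic saver award", 25_000)],
}

totals = {}
for f in flights:
    totals[f["carrier"]] = totals.get(f["carrier"], 0) + f["miles"]

for carrier, miles in totals.items():
    eligible = [name for name, cost in award_chart.get(carrier, []) if miles >= cost]
    print(carrier, miles, "eligible for:", eligible or "nothing yet")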

A scarily-dynamic descendant of this application ought to even tap into additional feeds about my credit-card spending and loyalty-points accumulated in other programs to take advantage of WebFlyer’s conversion planning services to game out various scenarios for how much it would take to reach the next rung of awards.

Ranking

The mileage-junkie’s grail is to have the system also parse all of the promotional email I get from all these partners and recommend strategies for maximizing the portfolio. Imagine popping up an alternative flight to Dulles that instead gets double-points for going through BWI and locks up another award for $25 extra (there are lots of services that will calculate the minimum cash fare – not maximizing total value).

There are some email newsletters I’d like turned into feeds, and websites that feature these offers that could also be turned into feeds. At a minimum, I’d like to be able to filter related offers while viewing my “flight cockpit,” but at best, it could see which offers my friends are taking advantage of (collaborative filtering) or would fit my future travel plans (events evdb recommends…).

Architecture

  1. Fission
  2. Diffusion
  3. Fusion

Microformats: Splitting Pages into Atoms

TPd: Managing a Pile of Atoms

Storage

Search

Ranking according to graphs

Notification

@@you gotta store the state if you’re going to detect deltas… but that doesn’t mean you have to archive it over the long-term.

XGI: Fusing Atoms back into Pages

… dependency-driven recalculation…

…intermediate output states of an XGI script constitute a feed of its own…

MiniML

aka MFML

Evaluation

Do all of these moving parts in concert make CRUD tables easy? Or will it be too complicated to uncover that functionality?

Alternative (motivational) scenarios

“turning the web into a real-time database”

“prioritizing the most-relevant results according to what you’re doing at the moment”

it’s a

  • new kind of search engine
  • a lowercase-semantic-web knowledge source
  • a source for awareness tools like WoWbar

Next step: a compelling storyline/demo that shows off how all of these pieces interact

Applications

Three applications we could build on what we have so far:

  • Microsearch — how could we build a “search engine” for all the microformatted data in the world?
  • Automagix — how could we write Internet “agents” that react to changes in (microformatted) data, like Apple’s Automator workflow?
  • LI Times — how could a newspaper driven off of one’s linkedin profile keep (proactively!) in touch with what’s up in your network?

Social recommendation networks

Often I’m interested in the same web pages my friends read. Sometimes, reading their blogs tells me what they read; del.icio.us helps, too. However, those approaches take some work per page: you have to explicitly post a link to each page to tell me you liked it. It’s nearly the same amount of work as telling me about cool web pages when we’re hanging out at the water cooler. As a result, we never publish links to the majority of pages we visit, so our friends never find out we liked them.

Collaborative filtering: statistical recommendations

Several services, such as Alexa’s “what’s related” service and Amazon’s “similar items” service, take advantage of a central point of control in a system to track many people’s reading. Then, they predict what I will be interested in, from statistical profiles of people like me. These services provide many of the benefits of social recommendation networks, but they don’t require as much effort on the part of the users. They are known as collaborative filtering systems.

Cobrowsing: publishing your URL history

In 1997, in a web page entitled Communal Web Browsing, I suggested that you could discourage children from browsing porn sites by just publishing a list of what they were browsing in real time where their parents or other community members could watch; this avoids the problems Solid Oak Software has so vividly illustrated with censorware. Off-the-shelf software such as VNC, BackOrifice, and Groove support sharing of a web browser. Most Americans, however, would prefer not to publicize their web-browsing habits indiscriminately. Even public-access computer terminals in public libraries commonly have some sort of privacy shield on the screen to restrict viewing to the person sitting at the machine.

If I were to build a piece of software that published every URL I visited on the Web, my friends could easily see which pages I had visited, with no extra effort on my part. If my friends also installed this software, I could see which pages had been most frequently visited by my friends — presumably the most popular ones would also be of interest to me. My friends might not be willing to install that software, though, because they might prefer a greater measure of privacy, such as that granted by software that requires an explicit action to publish each link.

Bloom filters

There are these things called Bloom filters; they’re basically compact, lossy representations of sets of hash values. You feed a Bloom filter a hash, and it returns “yes” or “no”. “Yes” means that the set it represents may or may not contain the hash value; “no” means that it definitely does not. The probability of a “yes” for a value the set doesn’t contain is adjustable, and it’s lower for larger Bloom filters.

Schachter and Ceglowski’s LOAF, a pun on FOAF, uses Bloom filters to help you prioritize your email by letting you recognize which incoming email comes from your friends’ correspondents — with an adjustable probability of a false positive. If that probability were 50%, then you’d expect a new correspondent to appear to be a correspondent of about 10 out of 20 friends whose LOAF files you have. The probability is about 1% that, by chance, they will appear to be a correspondent of more than 15, and only about 5% that they will appear to be a correspondent of more than 13. But if they actually are a correspondent of 7 out of your 20 friends, the probability is about 50% that they will appear to be a correspondent of more than 13.

As the number of your LOAF-using correspondents increases, you can reliably detect a much smaller proportion of correspondents.

This protects the privacy of your friends’ address books, since you can’t tell for sure whether any particular friend has an address in their address book — because of the false positive probability. But it still gives you an aggregate answer about your community’s familiarity with any particular email address.

Decentralized collaborative filtering by publishing URL histories through Bloom filters to your social network

Suppose we apply this to URLs instead. Each day I’ll publish an updated Bloom filter containing all the URLs I’ve seen in the last few weeks, with a relatively high false-positive rate. Every few weeks, I’ll clear it out, and new URLs will accumulate for a few weeks. Now all my friends have a hint, whenever they see a URL, whether or not I also saw it. They can preferentially follow links that several of their friends thought were interesting.

Because I actually only visit a few hundred URLs per week, even a relatively low false-positive rate suffices to protect my privacy. If there are four billion URLs on the public web, and my Bloom filter’s false-positive rate is 0.1%, then four million URLs match the filter without me having visited them (in the last few weeks), while only a few hundred match it because I have visited them. So the fact that a particular URL matches my filter is pretty weak evidence that I have actually visited it. Practically speaking, a false-positive rate closer to 10% might be better, since some URLs are much more popular than others, so the prior probability might be high enough to matter.

A URL-recommending engine can accumulate these different Bloom filters over time and compare them against my browser history to see which ones are most closely correlated to it, in order to figure out which ones are the best predictors of where I want to go next. A naive Bayesian reasoner could simply compute the probability, given that a URL does or does not appear in a particular Bloom filter, that it appears in my history; then it can adjust all these probabilities together to come up with a predicted interest level. A smarter engine might discover pairwise or even more complex dependencies between the filters and discount those results — if Adam and Rohit always go to the same URLs, I’d like to not count their two filters as if they were independent.
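
A naive-Bayes scorer along those lines might look like this (a sketch under the assumption that, for each friend’s filter, I have already estimated from my own history how often interesting vs. uninteresting URLs hit it; all numbers below are invented):

import math

def naive_bayes_score(candidate_hits, per_filter_stats, prior=0.001):
    """Combine per-filter evidence about one candidate URL.
    candidate_hits: {filter_id: True/False} -- does the URL match each
        friend's Bloom filter?
    per_filter_stats: {filter_id: (p_hit_given_interesting, p_hit_given_not)}
        estimated from how often URLs I visited (or didn't) matched it.
    Treats the filters as independent (the 'naive' part) and returns a
    posterior probability that I'll find the URL interesting."""
    log_odds = math.log(prior / (1 - prior))
    for fid, hit in candidate_hits.items():
        p_int, p_not = per_filter_stats[fid]
        if hit:
            log_odds += math.log(p_int / p_not)
        else:
            log_odds += math.log((1 - p_int) / (1 - p_not))
    return 1 / (1 + math.exp(-log_odds))

stats = {"adam": (0.60, 0.05), "rohit": (0.40, 0.05)}      # invented estimates
print(naive_bayes_score({"adam": True, "rohit": True}, stats))
print(naive_bayes_score({"adam": False, "rohit": False}, stats))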

The question remains how to acquire URLs to recommend in the first place. You can’t simply pull them out of the filters — that’s the point. You could look at the stream of links in HTML pages I look at, or the stream of links in a bunch of RSS feeds, or the output of Technorati, del.icio.us, Syndic8, or pubsub.com, or my email, IRC, and IM channels.

Then you have the question of how to present the results. You could display them on an HTML page, sorted with the most relevant links first, or you could rewrite HTML pages on the way in to my browser with icons or colors representing popularity.

Related topics

Other useful things to do with a big pile of URL Bloom filters:

  • given a seed URL and a list of candidate URLs, order the candidate URLs according to their co-occurrence with the seed URL.
  • compute the pairwise correlations of the filters, perhaps with a list of URLs and perhaps without.
  • cluster a list of URLs into clusters that tend to co-occur in the same filters — for example, to find the “principal components” of a pile of search engine results, so you can present a few URLs from each “component”.

Attacks

An attack due to Dan Gould: to find out whether someone has visited a particular URL, instead of testing that URL alone, test many URLs that tend to be associated with it — for example, pages it links to, or pages that link to it, or several pages in a sequence. Then test to see whether a statistically unlikely number of these URLs are “hits” in the filter — it will tell you whether the person has visited some subset of those URLs, but not which ones. But that may be enough.
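
A sketch of the attack as a statistical test (my framing of it – the false-positive rate and neighborhood size are arbitrary):

import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def neighborhood_attack(filter_contains, neighborhood, fp_rate, alpha=0.01):
    """Test a whole neighborhood of URLs associated with the target (the page,
    what it links to, what links to it).  If the hit count is unlikely under
    pure false positives, the person probably visited some subset of them."""
    hits = sum(1 for url in neighborhood if filter_contains(url))
    p_value = binom_sf(hits, len(neighborhood), fp_rate)
    return hits, p_value, p_value < alpha

# 50 related URLs against a 10%-false-positive stand-in "filter":
def fake_filter(url):
    return hash(url) % 10 == 0
urls = ["http://example.com/page%d" % i for i in range(50)]
print(neighborhood_attack(fake_filter, urls, fp_rate=0.10))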

Privacy-Enabling Collaborative Filtering

John Canny from UC-Berkeley has a couple of papers on that: Collaborative filtering with Privacy and Collaborative Filtering with Privacy via Factor Analysis. More recently, a group of researchers from Haifa published a paper titled “Privacy-Enhanced Collaborative filtering” in PEP05’s proceedings.

Bloom Filter Resources

  • Math

Pei Cao has a good mathematical exploration of Bloom filters. It also provides a quite useful table showing the false positive rate under various m/n and k combinations; according to Pei’s calculation, when m/n = 4, k = 3 reaches the minimal false positive rate of 0.147. (Pei also notes, regarding counting filters, that “it seems that 4 bits per count would be amply sufficient”.)

  • Applications of Bloom filters

Andrei Broder (IBM Research) and Michael Mitzenmacher (Harvard) have a nice survey paper on Network Applications of Bloom Filters.

More recently, a paper by researchers from UT-Austin and IBM Research talks about how to refine web search results by removing near-duplicate documents detected by using Bloom filters as a means of computing similarities of documents.

Steve Bellovin (AT&T Labs Research) and William Cheswick (Lumeta) proposed a new search scheme based on Bloom filters and Pohlig-Hellman encryption.

  • Coding Bloom filter and tricks

Maciej Ceglowski has a good post on Perl.com

Some Motley Bloom Tricks

Mark Fischer’s post on flipcode.com

MIT netcom

Rice FreePastry

  • Notes on BF

The origin:
In his seminal 1970 paper “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Burton Bloom introduced a new, efficient hash-coding method with allowable errors, along with a formula for calculating the trade-offs among three computational factors: allowable fraction of errors (false positive rate), space (bits of hash area), and time (average time to reject a message as a member of the given set). This data structure came to be known as the “Bloom filter”.

The design of BF:
• Lee Gremillion, in his 1982 paper “Designing a Bloom Filter for Differential File Access”, demonstrates a simulation approach to the design of a BF. The argument there is basically to use simulation to determine the actual values of the different parameters (e.g., size of the BF).
• James Mullin, in his 1983 paper “A Second Look at Bloom Filters”, proposes an analytic, probability-based approach to BF design. His work has been widely quoted in later literature about BF.
• Ramakrishna, in his 1989 paper “Practical Performance of Bloom Filters and Parallel Free-Text Searching”, focuses on the issue of choosing the “right” transformations/hash functions. He examines a class of hash functions mapping A into B:

H_{c,d}(x) = ((cx + d) mod p) mod m,  with  H1 = { H_{c,d}(·) | 0 < c < p, 0 < d < p },

where the keys are assumed to be integers drawn from a universe A = {1, 2, …, p − 1}, p a prime, and B = {0, 1, …, m − 1} denotes the range of hash addresses.
Based on the results of his simulations, he concludes that “by choosing hash transformations randomly from the class H1 the theoretical error rate can be achieved in practice”.
• Ideally, the hash functions used should be perfectly random, meaning that they map each item in the universe to a number uniform over the range {1, …, m}, where m is the length of the Bloom filter. While which hash functions to use in practice remains an open question, MD5 is currently widely used. (Also take a look at Rabin fingerprints; it seems they hash words.)

Different versions/variants of BF:
• Standard BF:
A Bloom filter for representing a set S = {x_1, x_2, …, x_n} of n elements is described by an array of m bits, initially all set to 0. A Bloom filter uses k independent hash functions h_1, …, h_k with range {1, …, m}. For each element x ∈ S, the bits h_i(x) are set to 1 for 1 ≤ i ≤ k. A location can be set to 1 multiple times, but only the first change has an effect.
To check if an item y is in S, we check whether all h_i(y) are set to 1. If not, then clearly y is not a member of S. If all h_i(y) are set to 1, we assume that y is in S, although we are wrong with some probability (the false positive rate).
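
(A minimal implementation of the above, to have something concrete to refer to below – the parameters and the salted-MD5 hashing are my own choices, not prescribed anywhere:)

import hashlib

class BloomFilter:
    """Standard Bloom filter: m bits and k hash functions, the latter derived
    here by salting MD5, since MD5 is what the notes above mention."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # "no" is definite; "yes" may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=5)
bf.add("http://commerce.net/")
print("http://commerce.net/" in bf, "http://example.org/" in bf)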

Some cool properties of BF:
1. A Bloom filter that represents the union of two sets can be obtained by taking the OR of the two bit vectors of the original Bloom filters.
2. Bloom filters can easily be halved in size, allowing an application to dynamically shrink a Bloom filter. Suppose that the size of the filter is a power of 2. To halve the size of the filter, just OR the first and second halves together. When hashing to do a lookup, the highest order bit can be masked.
3. Bloom filters can also be used to approximate the intersection between two sets. The inner product of the two bit vectors can be used as a measure of their similarity.
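
(Continuing the BloomFilter sketch above, the three properties in code; halving assumes m is a power of two, so that re-hashing mod m/2 is the same as masking the highest-order bit:)

def union(a, b):                      # property 1: OR the bit vectors
    out = BloomFilter(a.m, a.k)
    out.bits = bytearray(x | y for x, y in zip(a.bits, b.bits))
    return out

def halve(a):                         # property 2: OR the first and second halves
    out = BloomFilter(a.m // 2, a.k)
    out.bits = bytearray(x | y for x, y in zip(a.bits[:a.m // 2], a.bits[a.m // 2:]))
    return out

def inner_product(a, b):              # property 3: a crude similarity measure
    return sum(x & y for x, y in zip(a.bits, b.bits))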

• Another version of standard BF:
Each hash function has a range of m/k consecutive bit locations disjoint from all the others. The total number of bits is still m, but the bits are divided equally among the k hash functions. According to Andrei Broder and Michael Mitzenmacher’s survey paper, this version shares the same asymptotic performance with the one above. Although its false positive rate is slightly higher than the previous version’s, this version might be useful for implementation reasons: e.g., dividing the bits among the hash functions makes it easier to parallelize the array accesses.

The minimum value for the false positive rate f is reached when p = 1/2 (p being the probability that any given bit is still 0), or equivalently k = ln 2 · (m/n). In this case the false positive rate f is (1/2)^k ≈ (0.6185)^(m/n); but in practice k must be an integer, and a smaller, sub-optimal k might be preferred, since this reduces the number of hash functions that have to be computed. They also derived that the lower bound on the false positive rate f is (0.5)^(m/n).
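
(Sketch of where 0.6185 comes from – the standard derivation, written in LaTeX for precision:)

\[
p = \Bigl(1 - \tfrac{1}{m}\Bigr)^{kn} \approx e^{-kn/m},
\qquad
f = (1 - p)^k \approx \bigl(1 - e^{-kn/m}\bigr)^k .
\]
Minimizing over \(k\) gives \(k = \tfrac{m}{n}\ln 2\) (equivalently \(p = \tfrac12\)), so
\[
f_{\min} = \Bigl(\tfrac12\Bigr)^{(m/n)\ln 2} \approx (0.6185)^{m/n}.
\]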

• Counting Bloom Filter:
If you have a set that is subject to change, inserting a new element into a BF is easy (hash it k times and set corresponding bits to 1), but deleting an element is tricky. You cannot set all the k bits resulting from hashing the element to 0, because the bits might be set by other elements.
To prevent this problem, Li Fan et al. proposed the idea of counting Bloom Filter, where each entry in the Bloom filter is a small counter instead of a single bit. When an item is inserted, the corresponding counters are incremented; when an item is deleted, the corresponding counters are decremented. To avoid counter overflow, we choose sufficiently large counters (4 bits per counter is usually large enough).
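
(And a counting variant, continuing the BloomFilter sketch above – the 4-bit cap mirrors the “4 bits per counter” rule of thumb:)

class CountingBloomFilter(BloomFilter):
    """Counting Bloom filter: small counters instead of bits, so deletion
    becomes possible.  Reuses the salted-MD5 _positions() from the sketch."""

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = min(self.bits[pos] + 1, 15)   # 4-bit counters

    def remove(self, item):
        for pos in self._positions(item):
            if self.bits[pos] > 0:
                self.bits[pos] -= 1

cbf = CountingBloomFilter(m=1024, k=5)
cbf.add("http://commerce.net/")
cbf.remove("http://commerce.net/")
print("http://commerce.net/" in cbf)   # False again after deletion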

• Compressed BF:
Michael Mitzenmacher (Harvard) proposed Compressed Bloom filters, which improve the performance when BFs are passed as messages. In our case, BF is not only a data structure of containing users’ history URLs, but also a message being published and thus transmitted on the Internet, so the idea of compressing BF as a message (reduce the transmission size) is straightforward. The focus of optimization is not on the size of a BF but rather the number of 1’s in a BF because BF is usually sparse and we would only transmit the 1’s. So, in his 2002 paper “Compressed Bloom Filter”, Mitzenmacher illustrates that using a larger, but sparser, Bloom filter can yield the same false positive rate with a smaller number of transmitted bits. So, the tradeoff here is between the size of BF and the transmission size. Alternatively, one can use the same number of transmitted bits but improve the false positive rate, or find a more suitable tradeoff between the two.