2004 Anti-Spam Conference Trip Report
Here are my observations of the 15 Jan 2004 (anti-)spam conference held at MIT. This is what I found interesting, anyway -- your mileage may vary. This page reflects my interpretation of what I, Kaitlin Duck Sherwood, think I heard. I think these are pretty close to true, but have no guarantees. If you see something that is incorrect or incomplete, please send email to ducky at webfoot.com.
It was COLD. It was so cold outside that the lecture hall got cold. I'm used to women getting cold indoors, but it got so cold that men started putting on coats at about noon and gloves by about 3 PM.
Terry Sullivan looked at spam volatility, and reported that while spam does change, it changes slowly. (Personal note: I'm a bit skeptical. It might be that the majority of spam doesn't change quickly, but it seems like a significant piece changes very rapidly. Miles Libby from Yahoo said that it takes spammers only about two hours to start sending messages that get around a new anti-spam feature.)
Shlomo Hershkop (of Columbia, IIRC) talked about how correspondence patterns are useful for distinguishing between spam and non-spam: how often do I get email like this from these people, how often do I respond, how often do I get email from this person, how often am I CC'd, etc.
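To make the idea concrete, here is a minimal sketch of correspondence-pattern features -- my own reconstruction of the concept, not Hershkop's actual code; all the field names are hypothetical:

```python
def correspondence_features(history, sender):
    """Score how 'familiar' a message is, given past message history.

    history: list of dicts with keys "from", "cc" (list), "replied" (bool).
    Returns counts/rates that could feed into a spam filter as features.
    """
    me = "ducky@webfoot.com"  # hypothetical address of the filter's owner
    from_sender = [m for m in history if m["from"] == sender]
    replied = [m for m in from_sender if m["replied"]]
    ccd = [m for m in from_sender if me in m["cc"]]
    n = len(from_sender)
    return {
        "msgs_from_sender": n,
        "reply_rate": len(replied) / n if n else 0.0,
        "cc_rate": len(ccd) / n if n else 0.0,
    }
```

A message from someone you often reply to would score as very ham-like; a message from a never-seen sender scores zero on every feature.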
There were two lawyers, Jonathan Praed from the Internet Law Group and Matthew Prince from Unspam. They had very different takes on how the fight against spammers is going: roughly speaking, Praed is optimistic and Matthew Prince is pessimistic.
Praed notes that in his experience, spammers are just one degree of separation apart. Two random spammers will buy lists from the same harvesters, get IP addresses from the same providers, use the same spam software, or use the same defense counsel. (Personal note: I infer that this means that taking out one spammer potentially helps you take out others.)
He notes that government is finally getting involved. The government still has a steep learning curve and limited resources -- making them still heavily dependent upon private parties -- but at least they are starting to get involved. He noted that the Virginia Attorney General got the very first arrest that was purely for spam, without other underlying crimes (like fraud, forgery, trespass to chattels, or identity theft) underneath.
Praed says that the federal CAN-SPAM law has had little impact so far, except to reveal unequivocally that spammers don't give a hoot about legality. He notes that CAN-SPAM will accelerate the run offshore and increase the market for IP addresses. (He sees these as good things: this makes it more expensive for the spammers to do business.) While CAN-SPAM does pre-empt state laws, there are significant "carve-outs" for state laws.
Praed also commented that the growth in blocked mail is phenomenal. (Personal note: That comment was backed up later in the day. Geoff Hulten of Microsoft, for example, said later that Hotmail blocks over three billion messages per day. At dinner, Miles Libby said that one of the other major ISPs (AOL, if I recall correctly) had also publicly stated that they block 2 or 3 billion messages/day.)
Matthew Prince of Unspam was much more pessimistic. He showed a graph of the amount of spam with an overlay of number of state laws vs. time, and noted that spam has grown enormously despite various laws. He noted that the federal CAN-SPAM law was based on the state laws, which date back to 1997, when spam wasn't nearly as big a problem. He did note that the McCain amendment was one good and overlooked improvement: you don't have to catch the sender, you only have to catch the people who profit from it. (Personal note: That might be nice, but it makes me worry that if someone didn't like me, they could send out a bunch of spam promoting my book.... and get me in serious trouble. Uh-oh.)
Prince was very pessimistic about laws because of jurisdictional issues. He noted that the state of Washington has been the most successful at prosecuting spam lawsuits (with a whoppin' four) because it has a Washington state "do not spam" registry -- which allows the state to establish jurisdiction.
Prince noted that establishing the identity (and residence?) of the *recipient* is important for establishing jurisdiction. If the spammer can find out your home jurisdiction, then it's much harder for them to claim that they aren't bound by laws in your jurisdiction.
Prince also suggested using some provisions of the DMCA to nail address harvesters. Courts have ruled that people visiting a web site are bound by its terms of service, so he suggests putting something in the meta tags that says "do not harvest email addresses from this site" (he has boilerplate license text at http://www.unspam.com), asserting that the email addresses on the site are trade secrets, and nailing harvesters with the DMCA. (The audience was VERY uncomfortable about using the DMCA for anything....)
(Personal note: Presumably it's hard to prove who harvested your email address. I suggested to a few representatives of major ISPs that they put up a page with the unspam licensing terms and a link that says "If you follow this link and harvest this address, we will sue you", then on the next page, have an automatically-generated email address that encodes the IP address of the requestor (or simply keeps track of the address-IP pair). Then, if (when) that address starts getting spam, you can go after the IP address that harvested it.)
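A minimal sketch of how such a trap address might work -- entirely hypothetical (the domain, secret, and address format are all made up), but it shows that encoding the harvester's IP needs no server-side database if you tag the address with a keyed hash:

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # hypothetical; known only to the mail server

def trap_address(ip, domain="example.com"):
    """Generate a per-visitor address that embeds the requesting IP,
    plus an HMAC tag so a later claim can be verified."""
    tag = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:8]
    local = "trap-" + ip.replace(".", "-") + "-" + tag
    return local + "@" + domain

def verify_trap(address):
    """Recover the harvester's IP from a trap address, or None if forged."""
    local = address.split("@")[0]
    _, a, b, c, d, tag = local.split("-")
    ip = ".".join([a, b, c, d])
    expect = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:8]
    return ip if hmac.compare_digest(tag, expect) else None
```

When spam arrives at one of these addresses, the address itself tells you which IP fetched the page, and the HMAC tag stops a spammer from framing an arbitrary IP.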
Geoff Hulten of Microsoft talked a bit about Microsoft's anti-spam fight with some statistics:
Hotmail has gotten great value out of their users -- there are a set of users that have signed up to give feedback about what is spam and what isn't.
Hulten said that it's hard to get really high accuracy on spam because there's a very big (3-5%) category where users disagree on what spam is. (I heard several other people echo that during the course of the day, both in talks and in hallway conversations.) For example, users disagreed on legitimate commercial messages and even on off-topic mailing list messages.
Bill Yerazunis talked about training methods. In one test, he found these error counts:
Train On Errors (TOE): 149
Train Everything (TEFT): 69
Train Until No Errors (TUNE): 54
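To make the difference between the first two regimes concrete, here is a minimal sketch of TOE versus TEFT around a generic train/classify interface -- the interface is my own placeholder, not CRM114's actual API:

```python
def train_everything(filt, messages):
    """TEFT: train on every message, whether the filter got it right or not."""
    for text, is_spam in messages:
        filt.train(text, is_spam)

def train_on_errors(filt, messages):
    """TOE: train only when the filter misclassifies a message."""
    for text, is_spam in messages:
        if filt.classify(text) != is_spam:
            filt.train(text, is_spam)
```

(TUNE then repeats TOE-style passes over the corpus until no errors remain.)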
He noted that "forgetting" is good (which was echoed a few times throughout the day) because features can change polarity. However, you should forget as little as possible. He found that randomly deleting a few words gave a 3x improvement in accuracy.
He noted that using a collaborative filtering scheme (which he called "inoculation") didn't always work if the check was done at the time a message was received -- because the filter might not have had time to hear about that particular message. He argued that one should filter at the SMTP level for the easy stuff, then a second time right before the user looks at the message. (Personal note: while it's conceptually nice that the MTA and the MUA are distinct and separate things, the two have access to different information useful for spam filtering: the MUA has the address book and message history, but the MTA has envelope information and knows how many people got that message.)
There was more stuff that was technical enough that I can't recreate it adequately from my sparse notes -- I would need to go look at his full paper.
Marty Lamb talked about TarProxy -- a program that slows down spammers.
Using an external spam filter to decide at SMTP time whether or not a message is spam, an MTA can flat-out reject the message with a 5xx message:
554 I don't need any Viagra. Go away.

or a 4xx "tempfail" message:

451 I'm tired of this. Spam me later.

or "tar pit" it by giving repeated 4xx- messages (the dash means "more information coming"):

451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.

(This is the moral equivalent of putting a telemarketer on hold.)
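A toy sketch of the tar-pit trick (this is just the SMTP multiline-reply format the technique exploits, not TarProxy's actual code): emit a stream of "451-" continuation lines, each after a delay, and only then the final "451 " line that ends the reply.

```python
import time

def tarpit_lines(n, delay=0.0):
    """Produce n slow 451- continuation lines, then the terminating 451 line.

    In a real tar pit, delay would be several seconds per line, tying up
    the spammer's sending connection the whole time.
    """
    lines = []
    for _ in range(n):
        lines.append("451-Your spam is important to us. Please stay on the line.")
        time.sleep(delay)
    lines.append("451 Try again later.")  # no dash: reply is finally complete
    return lines
```

The sender's SMTP client must keep the connection open until the dash-less final line arrives, which is exactly what makes this expensive for spammers.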
Lamb shared what he'd learned:
He noted that this does increase network overhead, and even non-spam is slowed down a little bit. The internal SMTP server must disallow relaying from (I can't read my notes here).
Lamb said that he doesn't handle STARTTLS and other SMTP extensions, nor POP-before-SMTP. (He's not too worried about some of those because spammers don't use them.)
TarProxy is open source (GPL'd), and he'd love help on code/docs/testing. See http://www.martiansoftware.com/tarproxy
Ken Schneider of Brightmail was up next.
Schneider noted that over 98% of spam has a URL to click on. (Otherwise, how will they get their money out of you?) He showed a lot of ways to obfuscate URLs. While many of those are easy for a computer to decipher and look up on a blacklist, using redirects like rd.yahoo.com makes it difficult to tell where the redirect is going. (Miles Libby said later that Yahoo is aware of the problem and working on fixing it, but there are other open redirects out there.)
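Two of the simpler obfuscations are easy to undo mechanically; here is a hedged sketch (my own illustration, not Brightmail's method) that unwinds percent-encoding and the user@host trick to expose the real hostname. Note it does nothing about redirectors like rd.yahoo.com, where the true target hides in the path or query string -- which is exactly why redirects are the hard case:

```python
from urllib.parse import unquote, urlsplit

def real_host(url):
    """Return the hostname a browser would actually connect to,
    after undoing percent-encoding and the user@host trick."""
    parts = urlsplit(unquote(url))
    # In "http://www.paypal.com@evil.test/", everything before the @
    # is userinfo; the browser actually goes to evil.test.
    return parts.hostname

real_host("http://www%2Eexample%2Ecom/buy")    # 'www.example.com'
real_host("http://www.paypal.com@evil.test/")  # 'evil.test'
```

The recovered hostname can then be looked up on a URL blacklist.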
Schneider mentioned that porn was about 16-20% of spam, up from about 5% a few years ago.
Jonathan Zdziarski talked about using pairs of words instead of single words. He found two-word couplets (e.g. "unsubscribe from") to be better indications of spam than single words (e.g. "unsubscribe" and "from"). (Personal note: IIRC, this contradicts what the spambayes folks have found.)
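The single-word versus word-pair split is just a tokenization choice; a minimal sketch (generic tokenization, not his actual implementation):

```python
def unigrams(text):
    """Single-word features."""
    return text.lower().split()

def bigrams(text):
    """Two-word couplet features: each adjacent pair of words."""
    words = unigrams(text)
    return [a + " " + b for a, b in zip(words, words[1:])]

bigrams("Unsubscribe from this list")
# ['unsubscribe from', 'from this', 'this list']
```

A filter then counts couplet frequencies in spam versus ham exactly as it would count single words; the couplets just carry more context per feature.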
He also talked about a proposed standard for message inoculation (collaborative filtering) and said that in a small group, they were able to catch 20 extra spams in a month (though he said that they'd probably sent ten "inoculation" messages in that month). (Personal note: this didn't seem like a very high payoff:effort ratio to me.)
Miles Libby of Yahoo talked about Yahoo's spam problem and approach, which wasn't that different from what had been said earlier so I won't recap it. (Sorry, Miles.) He did mention "DomainKeys", a Yahoo proposal to authenticate using PKI. (When asked about revocation, he replied that each key would be given a time-to-live, and that domains could specify multiple keys.)
Eric Kidd posited basically that whitelisting was easy and cheap to do, and buys you an enormous amount. He thinks you should give high hamminess scores to
Then do the same for domains, where the "goodness" of a domain is a function of the (# of good messages from a domain) / (total # messages from that domain)
If you still can't figure out whether something is spam or ham, then run it through a regular spam filter.
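The domain-goodness ratio above is simple enough to sketch directly (the data shape is my own invention):

```python
def domain_goodness(counts, domain):
    """Goodness of a domain = (# good messages from it) / (total # from it).

    counts maps domain -> (num_good, num_total).
    Returns None for a never-seen domain, since there is nothing to go on.
    """
    good, total = counts.get(domain, (0, 0))
    return good / total if total else None

# Hypothetical history: one all-ham domain, one almost-all-spam domain.
counts = {"webfoot.com": (40, 40), "spamco.example": (1, 200)}
```

A near-1.0 score justifies whitelisting the domain outright; a near-0.0 score or an unknown domain falls through to the regular spam filter, as Kidd suggests.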
He notes that many of the Bayesian filters end up with To/From/CC addresses as high-value features, and wonders if these complex machine learning filters might be discovering what are in fact very simple rules.
Victor Mendez talked about putting CRM114 on a server. I had looked at CRM114's source at one point last year and dismissed it as unintelligible, so I wasn't terribly interested in this talk and didn't take good notes.
John Graham-Cumming probably took the prize for wittiest presentation again, despite giving it via videotape. (He opened with what was apparently a parody of the 'all your base are belong to us' videogame meme.)
He gave stats on the 254 spams that made it through POPFile last year:
The RTF ones were listed as Content-Type of text/plain. The challenges were fake challenges, which just cements his opinion more firmly that challenge/response is a bad idea.
For the spam masquerading as non-deliverable mail and the anti-spam products, once he trained POPFile on them, they went away.
He also talked about using adaptive learning techniques to learn what his "hammie" words were. By doing iterating with different "word salad" messages (with words chosen from a dictionary) and giving feedback about which messages made it through, he was able to learn pretty quickly his "kryptonite" -- the words which had the highest "good" ratings. This means that it is vitally important to prevent spammers from getting feedback on which messages got through and which didn't. This means no automatic display of external images (web bugs), for example.
(Personal note: And this means that places like Yahoo, where spammers can get exact feedback, are screwed.)
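A deterministic toy version of the probing attack he described (my own illustration; he used random word-salad messages, while this small-scale sketch enumerates salads exhaustively so the result is reproducible). The "oracle" stands in for any feedback channel -- web bugs, a test account at an ISP -- that tells the spammer which messages got through:

```python
from collections import Counter
from itertools import combinations

def probe(oracle, vocabulary, salad_size=3):
    """Estimate each word's 'hamminess' by sending word salads and
    watching which ones the oracle says were delivered."""
    delivered, sent = Counter(), Counter()
    for words in combinations(vocabulary, salad_size):
        for w in words:
            sent[w] += 1
        if oracle(" ".join(words)):  # True means the message got through
            for w in words:
                delivered[w] += 1
    # Words with delivery rates near 1.0 are the filter's "kryptonite".
    return {w: delivered[w] / sent[w] for w in vocabulary}
```

Against a filter that (say) always passes messages containing one particular hammy word, that word's delivery rate comes out at exactly 1.0 while every other word's rate is diluted -- which is the feedback loop Graham-Cumming warns must be denied to spammers.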
Thede Loder et al. from the University of Michigan talked about an economic model for spam. They showed a bunch of graphs and charts about the utility of spam, but it all came down to "if the receiver doesn't want your message, the sender pays".
There was a little handwaving about how these micropayments would actually work. They seemed to have the attitude that it was a small matter of programming. (Personal note: I think that getting the infrastructure right for micropayments is going to be extremely difficult, and that there are some Nigerians who are going to get fabulously wealthy while the bugs are getting worked out.)
Eric Johansson talked about Camram, a variant of sender-pays where the sender pays in computational time instead of cold hard cash: the sender performs some computationally expensive (a few seconds) operation whose result is computationally cheap to verify. He didn't talk in detail about a particular algorithm to use (or if he did, I zoned out during that portion), but clearly had one in mind.
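A hashcash-style sketch of the expensive-to-make, cheap-to-check asymmetry (simplified; I don't know the exact algorithm Camram uses): find a counter such that the hash of stamp-plus-counter starts with a given number of zero bits. Minting takes on the order of 2^bits hash operations; checking takes exactly one.

```python
import hashlib

def mint(stamp, bits=16):
    """Expensive: search for a counter whose hash has `bits` leading zero bits."""
    counter = 0
    while True:
        digest = hashlib.sha1(f"{stamp}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") >> (160 - bits) == 0:
            return counter
        counter += 1

def check(stamp, counter, bits=16):
    """Cheap: one hash verifies the sender did the work."""
    digest = hashlib.sha1(f"{stamp}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") >> (160 - bits) == 0
```

Raising `bits` by one doubles the minting cost without changing the checking cost, which is how the "crank up the complexity over time" knob he mentions would work.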
One of the advantages he listed over paying in cash is that there is no central authority, making it difficult for spammers, corporations, or governments to corrupt it. He also pointed out that it would take only minor extensions to some of the current infrastructure to make it useful, has protection against double-spending, is tamper-resistant, and preserves anonymity.
He saw Camram as being used in conjunction with whitelisting: that you would only create a stamp when the message was to someone you hadn't corresponded with before. "Strangers cost, friends fly free."
He recognized that Moore's law means that for a stamp to keep representing a meaningful amount of computational time, you'd have to slowly crank up the computational complexity of the work. He had an idea for a peer-to-peer rate-setting system where you'd take into account what everyone else was paying.
(Personal notes on Camram: I didn't think he made it terribly clear that a camram stamp could be just one more feature that fed into a statistical anti-spam filter. I presume that there's a way to find out just how much a stamp "cost", and the hamminess improvement that the stamp would buy you ought to be proportional to its cost. One can imagine that then people would crank up the cost by hand if they found that their messages weren't getting through, and so you wouldn't need a peer-to-peer network.
I also presume that this would need to be built in to the client (otherwise people won't use it) and that it would be nice to be able to set by hand how much work you wanted to spend on a stamp. I can imagine that if I wanted to get the attention of a stranger who I *really* wanted to get through to, I'd like to be able to send proof of hours worth of work.
Oh, and I personally think that "proof-of-work" is a much more workable solution than "proof-of-payment".)
Peter Kay of Titan Key groused about the incredibly cold weather, saying that he had worked hard to keep warm when he lived in Chicago, then finally realized he could stop spending so much effort by moving to Hawai'i. He felt similarly about spam: instead of spending lots of effort filtering spam once it gets into our systems, we shouldn't let it into our systems in the first place.
He has a variant of disposable email addresses/single-use email addresses he calls "KeyMail". To simplify a little bit, you use an address until it starts getting spam. As soon as that account gets its first piece of spam, you "lock" it so that anybody who has sent you mail to that address in the past can keep using that address; anyone else gets rejected at the server.
There are some extra frills on that: he lets you set various "policies" on email addresses, including shelf life, quantity, and scope (first N users).
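A minimal sketch of the lock-on-first-spam rule as I understood it (my own reconstruction, not Titan Key's actual code, and it skips the shelf-life/quantity/scope policies):

```python
class KeyAddress:
    """One disposable address: open until its first spam, then locked
    to the senders already seen on it."""

    def __init__(self):
        self.seen = set()
        self.locked = False

    def accept(self, sender, is_spam=False):
        """Decide at the server whether to accept a message to this address."""
        if self.locked and sender not in self.seen:
            return False              # stranger + locked address: reject
        if is_spam:
            self.locked = True        # first spam locks the address...
            return False              # ...and the spammer isn't grandfathered in
        self.seen.add(sender)
        return True
```

Past correspondents keep working addresses forever; the harvester who leaked the address gets nothing.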
Other advantages include
He said that what we needed now were:
(Personal comments: I loved this idea at first glance, but upon further inspection I got a little nervous about a few things:
Eric Raymond talked about the IETF's Anti-Spam Research Group (ASRG), CAN-SPAM, and SPF (an authentication scheme).
The ASRG is an anti-spam organization that he described as a clearinghouse for anti-spam ideas. (Personal note: I'm on the ASRG mailing list, and it has a very low signal-to-noise ratio. Someone described it as being a "write-only" mailing list. However, it *does* mean that there is something definable as a community, it does mean that people are talking. It -- along with the Boston spam conference, JamSpam, and the spambayes project -- is a part of why anti-spam is in so much better shape than it was two years ago.)
CAN-SPAM, the US federal legislation, has language buried in section 11.2 that says that the FCC (or was it the FTC?) is supposed to ask the IETF what standards exist. There being none written down, Eric decided to write some down. His spec must reflect existing practice, so there won't be any surprises. (Look for ADV: and Adult and Bulk, for example.) The spec is not useful if it has provisions that would be found to conflict with the U.S. Constitution, so it has to have weak labeling, not strong labeling.
SPF -- Sender Permitted From -- is a gentle authentication scheme that ASRG has come up with. If you are a domain, you advertise in your DNS record what IP addresses (or blocks of addresses) you use. When a message arrives at someone else's SMTP server allegedly from your domain, that server can go check and see if the IP address is on your "good" list.
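A much-simplified sketch of that check (real SPF records live in DNS TXT records and have a richer syntax, "v=spf1 ip4:... -all"; this toy skips DNS entirely and the domain/IP values are made up):

```python
import ipaddress

def spf_pass(published, domain, connecting_ip):
    """Does the connecting IP appear in the domain's published sender list?

    published maps domain -> list of CIDR blocks the domain sends from.
    """
    addr = ipaddress.ip_address(connecting_ip)
    return any(addr in ipaddress.ip_network(block)
               for block in published.get(domain, []))

# Hypothetical: webfoot.com advertises that it only sends from 192.0.2.0/24.
published = {"webfoot.com": ["192.0.2.0/24"]}
```

A receiving SMTP server would do this lookup when a message claims to be from webfoot.com, and treat a miss as evidence of forgery.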
There are some problems with this.
Tony Yu of Mailshell had a presentation that probably would have been a lot more interesting at 10 AM than in the late afternoon -- I was getting tired by this point. He talked about all the components of Mailshell's anti-spam product, which had mostly very familiar components. The one thing they did that was a bit interesting was that they had an algorithm for determining "sameness" of messages; this helped in collaborative filtering.
Richard Jowsey from Death2spam.net had the bad luck to be last on the schedule. He was a very engaging speaker and had lots of very interesting-looking charts with bimodal probability distributions. He talked about various ways that you could manipulate the graphs to spread the humps, to model overlap, to normalize curves, etc, but he didn't do a very good job of explaining _why_ he was doing these manipulations or what results he got when he did these manipulations.
More general personal comments: