2004 Anti-Spam Conference Trip Report
Here are my observations of the 15 Jan 2004 (anti-)spam conference held at MIT. This is what I found interesting, anyway -- your mileage may vary. This page reflects my interpretation of what I, Kaitlin Duck Sherwood, think I heard. I think these are pretty close to true, but have no guarantees. If you see something that is incorrect or incomplete, please send email to ducky at webfoot.com.
It was COLD. It was so cold outside that the lecture hall got cold. I'm used to women getting cold indoors, but it got so cold that men started putting on coats at about noon and gloves by about 3 PM.
Terry Sullivan looked at spam volatility, and reported that while spam does change, it changes slowly. (Personal note: I'm a bit skeptical. It might be that the majority of spam doesn't change quickly, but it seems like a significant piece changes very rapidly. Miles Libby from Yahoo said that it takes spammers only about two hours to start sending messages that get around a new anti-spam feature.)
Shlomo Hershkop (of Columbia, IIRC) talked about how correspondence patterns are useful for distinguishing between spam and non-spam: how often do I get email like this from these people, how often do I respond, how often do I get email from this person, how often am I CC'd, etc.
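To make the idea concrete, here is a minimal sketch of correspondence-pattern features -- my own reconstruction of the concept, not Hershkop's actual code; all the field names are hypothetical:

```python
def correspondence_features(history, sender):
    """Score how 'familiar' a message is, given past message history.

    history: list of dicts with keys "from", "cc" (list), "replied" (bool).
    Returns counts/rates that could feed into a spam filter as features.
    """
    me = "ducky@webfoot.com"  # hypothetical address of the filter's owner
    from_sender = [m for m in history if m["from"] == sender]
    replied = [m for m in from_sender if m["replied"]]
    ccd = [m for m in from_sender if me in m["cc"]]
    n = len(from_sender)
    return {
        "msgs_from_sender": n,
        "reply_rate": len(replied) / n if n else 0.0,
        "cc_rate": len(ccd) / n if n else 0.0,
    }
```

A message from someone you often reply to would score as very ham-like; a message from a never-seen sender scores zero on every feature.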
There were two lawyers, Jonathan Praed from the Internet Law Group and Matthew Prince from Unspam. They had very different takes on how the fight against spammers is going: roughly speaking, Praed is optimistic and Matthew Prince is pessimistic.
Praed notes that in his experience, spammers are just one degree of separation apart. Two random spammers will buy lists from the same harvesters, get IP addresses from the same providers, use the same spam software, or use the same defense counsel. (Personal note: I infer that this means that taking out one spammer potentially helps you take out others.)
He notes that government is finally getting involved. The government still has a steep learning curve and limited resources -- making them still heavily dependent upon private parties -- but at least they are starting to get involved. He noted that the Virginia Attorney General got the very first arrest that was purely for spam, without other underlying crimes (like fraud, forgery, trespass to chattels, or identity theft) underneath.
Praed says that the federal CAN-SPAM law has had little impact so far, except to reveal unequivocally that spammers don't give a hoot about legality. He notes that CAN-SPAM will accelerate the run offshore and increase the market for IP addresses. (He sees these as good things: this makes it more expensive for the spammers to do business.) While CAN-SPAM does pre-empt state laws, there are significant "carve-outs" for state laws.
Praed also commented that the growth in blocked mail is phenomenal. (Personal note: That comment was backed up later in the day. Geoff Hulten of Microsoft, for example, said later that Hotmail blocks over three billion messages per day. At dinner, Miles Libby said that one of the other major ISPs (AOL, if I recall correctly) had also publicly stated that they block 2 or 3 billion messages/day.)
Matthew Prince of Unspam was much more pessimistic. He showed a graph of the amount of spam with an overlay of number of state laws vs. time, and noted that spam has grown enormously despite various laws. He noted that the federal CAN-SPAM law was based on the state laws, which date back to 1997, when spam wasn't nearly as big a problem. He did note that the McCain amendment was one good and overlooked improvement: you don't have to catch the sender, you only have to catch the people who profit from it. (Personal note: That might be nice, but it makes me worry that if someone didn't like me, they could send out a bunch of spam promoting my book.... and get me in serious trouble. Uh-oh.)
Prince was very pessimistic about laws because of jurisdictional issues. He noted that the state of Washington has been the most successful at prosecuting spam lawsuits (with a whoppin' four) because it has a Washington state "do not spam" registry -- which allows the state to establish jurisdiction.
Prince noted that establishing the identity (and residence?) of the *recipient* is important for establishing jurisdiction. If the spammer can find out your home jurisdiction, then it's much harder for them to claim that they aren't bound by laws in your jurisdiction.
Prince also suggested using some provisions of the DMCA to nail address harvesters. Courts have ruled that people visiting a web site are bound by its terms of service, so he suggests putting something in the meta tags that says "do not harvest email addresses from this site" (he has boilerplate license text at http://www.unspam.com), asserting that the email addresses on the site are trade secrets, and nailing harvesters with the DMCA. (The audience was VERY uncomfortable about using the DMCA for anything....)
(Personal note: Presumably it's hard to prove who harvested your email address. I suggested to a few representatives of major ISPs that they put up a page with the unspam licensing terms and a link that says "If you follow this link and harvest this address, we will sue you", then on the next page, have an automatically-generated email address that encodes the IP address of the requestor (or simply keeps track of the address-IP pair). Then, if (when) that address starts getting spam, you can go after the IP address that harvested it.)
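A minimal sketch of how such a trap address might work -- entirely hypothetical (the domain, secret, and address format are all made up), but it shows that encoding the harvester's IP needs no server-side database if you tag the address with a keyed hash:

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # hypothetical; known only to the mail server

def trap_address(ip, domain="example.com"):
    """Generate a per-visitor address that embeds the requesting IP,
    plus an HMAC tag so a later claim can be verified."""
    tag = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:8]
    local = "trap-" + ip.replace(".", "-") + "-" + tag
    return local + "@" + domain

def verify_trap(address):
    """Recover the harvester's IP from a trap address, or None if forged."""
    local = address.split("@")[0]
    _, a, b, c, d, tag = local.split("-")
    ip = ".".join([a, b, c, d])
    expect = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:8]
    return ip if hmac.compare_digest(tag, expect) else None
```

When spam arrives at one of these addresses, the address itself tells you which IP fetched the page, and the HMAC tag stops a spammer from framing an arbitrary IP.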
Geoff Hulten of Microsoft talked a bit about Microsoft's anti-spam fight with some statistics:
Hotmail has gotten great value out of their users -- there are a set of users that have signed up to give feedback about what is spam and what isn't.
Hulten said that it's hard to get really high accuracy on spam because there's a very big (3-5%) category where users disagree on what spam is. (I heard several other people echo that during the course of the day, both in talks and in hallway conversations.) For example, users disagreed on legitimate commercial messages and even on off-topic mailing list messages.
Bill Yerazunis talked about training methods. In one test, he found these error counts:
Train On Errors (TOE): 149
Train Everything (TEFT): 69
Train Until No Errors (TUNE): 54
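To make the difference between the first two regimes concrete, here is a minimal sketch of TOE versus TEFT around a generic train/classify interface -- the interface is my own placeholder, not CRM114's actual API:

```python
def train_everything(filt, messages):
    """TEFT: train on every message, whether the filter got it right or not."""
    for text, is_spam in messages:
        filt.train(text, is_spam)

def train_on_errors(filt, messages):
    """TOE: train only when the filter misclassifies a message."""
    for text, is_spam in messages:
        if filt.classify(text) != is_spam:
            filt.train(text, is_spam)
```

(TUNE then repeats TOE-style passes over the corpus until no errors remain.)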
He noted that "forgetting" is good (which was echoed a few times throughout the day) because features can change polarity. However, you should forget as little as possible. He found that randomly deleting a few words gave a 3x improvement in accuracy.
He noted that using a collaborative filtering scheme (which he called "inoculation") didn't always work if the check was done at the time a message was received -- because the filter might not have had time to hear about that particular message. He argued that one should filter at the SMTP level for the easy stuff, then a second time right before the user looks at the message. (Personal note: while it's conceptually nice that the MTA and the MUA are distinct and separate things, the two have access to different information useful for spam filtering: the MUA has the address book and message history, but the MTA has envelope information and knows how many people got that message.)
There was more stuff that was technical enough that I can't recreate it adequately from my sparse notes -- I would need to go look at his full paper.
Marty Lamb talked about TarProxy -- a program that slows down spammers.
Using an external spam filter to decide at SMTP time whether or not a message is spam, an MTA can flat-out reject the message with a 5xx message:
554 I don't need any Viagra. Go away.

or a 4xx "tempfail" message:

451 I'm tired of this. Spam me later.

or "tar pit" it by giving repeated 4xx- messages (the dash means "more information coming"):

451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.
451-Your spam is important to us. Please stay on the line.

(This is the moral equivalent of putting a telemarketer on hold.)
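A toy sketch of the tar-pit trick (this is just the SMTP multiline-reply format the technique exploits, not TarProxy's actual code): emit a stream of "451-" continuation lines, each after a delay, and only then the final "451 " line that ends the reply.

```python
import time

def tarpit_lines(n, delay=0.0):
    """Produce n slow 451- continuation lines, then the terminating 451 line.

    In a real tar pit, delay would be several seconds per line, tying up
    the spammer's sending connection the whole time.
    """
    lines = []
    for _ in range(n):
        lines.append("451-Your spam is important to us. Please stay on the line.")
        time.sleep(delay)
    lines.append("451 Try again later.")  # no dash: reply is finally complete
    return lines
```

The sender's SMTP client must keep the connection open until the dash-less final line arrives, which is exactly what makes this expensive for spammers.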
Lamb shared what he'd learned:
He noted that this does increase network overhead, and even non-spam is slowed down a little bit. The internal SMTP server must disallow relaying from (I can't read my notes here).
Lamb said that he doesn't handle STARTTLS and other SMTP extensions, nor POP-before-SMTP. (He's not too worried about some of those because spammers don't use them.)
TarProxy is open source (GPL'd), and he'd love help on code/docs/testing. See http://www.martiansoftware.com/tarproxy
Ken Schneider of Brightmail was up next.
Schneider noted that over 98% of spam has a URL to click on. (Otherwise, how will they get their money out of you?) He showed a lot of ways to obfuscate URLs. While many of those are easy for a computer to decipher and look up on a blacklist, using redirects like rd.yahoo.com makes it difficult to tell where the redirect is going. (Miles Libby said later that Yahoo is aware of the problem and working on fixing it, but there are other open redirects out there.)
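Two of the simpler obfuscations are easy to undo mechanically; here is a hedged sketch (my own illustration, not Brightmail's method) that unwinds percent-encoding and the user@host trick to expose the real hostname. Note it does nothing about redirectors like rd.yahoo.com, where the true target hides in the path or query string -- which is exactly why redirects are the hard case:

```python
from urllib.parse import unquote, urlsplit

def real_host(url):
    """Return the hostname a browser would actually connect to,
    after undoing percent-encoding and the user@host trick."""
    parts = urlsplit(unquote(url))
    # In "http://www.paypal.com@evil.test/", everything before the @
    # is userinfo; the browser actually goes to evil.test.
    return parts.hostname

real_host("http://www%2Eexample%2Ecom/buy")    # 'www.example.com'
real_host("http://www.paypal.com@evil.test/")  # 'evil.test'
```

The recovered hostname can then be looked up on a URL blacklist.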
Schneider mentioned that porn was about 16-20% of spam, up from about 5% a few years ago.
Jonathan Zdziarski talked about using pairs of words instead of single words. He found two-word couplets (e.g. "unsubscribe from") to be better indications of spam than single words (e.g. "unsubscribe" and "from"). (Personal note: IIRC, this contradicts what the spambayes folks have found.)
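The single-word versus word-pair split is just a tokenization choice; a minimal sketch (generic tokenization, not his actual implementation):

```python
def unigrams(text):
    """Single-word features."""
    return text.lower().split()

def bigrams(text):
    """Two-word couplet features: each adjacent pair of words."""
    words = unigrams(text)
    return [a + " " + b for a, b in zip(words, words[1:])]

bigrams("Unsubscribe from this list")
# ['unsubscribe from', 'from this', 'this list']
```

A filter then counts couplet frequencies in spam versus ham exactly as it would count single words; the couplets just carry more context per feature.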
He also talked about a proposed standard for message inoculation (collaborative filtering) and said that in a small group, they were able to catch 20 extra spams in a month (though he said that they'd probably sent ten "inoculation" messages in that month). (Personal note: this didn't seem like a very high payoff:effort ratio to me.)
Miles Libby of Yahoo talked about Yahoo's spam problem and approach, which wasn't that different from what had been said earlier so I won't recap it. (Sorry, Miles.) He did mention "DomainKeys", a Yahoo proposal to authenticate using PKI. (When asked about revocation, he replied that each key would be given a time-to-live, and that domains could specify multiple keys.)
Eric Kidd posited basically that whitelisting was easy and cheap to do, and buys you an enormous amount. He thinks you should give high hamminess scores to
Then do the same for domains, where the "goodness" of a domain is a function of the (# of good messages from a domain) / (total # messages from that domain)
If you still can't figure out whether something is spam or ham, then run it through a regular spam filter.
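The domain-goodness ratio above is simple enough to sketch directly (the data shape is my own invention):

```python
def domain_goodness(counts, domain):
    """Goodness of a domain = (# good messages from it) / (total # from it).

    counts maps domain -> (num_good, num_total).
    Returns None for a never-seen domain, since there is nothing to go on.
    """
    good, total = counts.get(domain, (0, 0))
    return good / total if total else None

# Hypothetical history: one all-ham domain, one almost-all-spam domain.
counts = {"webfoot.com": (40, 40), "spamco.example": (1, 200)}
```

A near-1.0 score justifies whitelisting the domain outright; a near-0.0 score or an unknown domain falls through to the regular spam filter, as Kidd suggests.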
He notes that many of the Bayesian filters end up with To/From/CC addresses as high-value features, and wonders if these complex machine learning filters might be discovering what are in fact very simple rules.
Victor Mendez talked about putting CRM114 on a server. I had looked at CRM114's source at one point last year and dismissed it as unintelligible, so I wasn't terribly interested in this talk and didn't take good notes.
John Graham-Cumming probably took the prize for wittiest presentation again, despite giving it via videotape. (He opened with what was apparently a parody of the 'all your base are belong to us' videogame meme.)
He gave stats on the 254 spams that made it through POPFile last year:
The RTF ones were listed as Content-Type of text/plain. The challenges were fake challenges, which just cements his opinion more firmly that challenge/response is a bad idea.
For the spam masquerading as non-deliverable mail and the anti-spam products, once he trained POPFile on them, they went away.
He also talked about using adaptive learning techniques to learn what his "hammie" words were. By doing iterating with different "word salad" messages (with words chosen from a dictionary) and giving feedback about which messages made it through, he was able to learn pretty quickly his "kryptonite" -- the words which had the highest "good" ratings. This means that it is vitally important to prevent spammers from getting feedback on which messages got through and which didn't. This means no automatic display of external images (web bugs), for example.
(Personal note: And this means that places like Yahoo, where spammers can get exact feedback, are screwed.)
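A deterministic toy version of the probing attack he described (my own illustration; he used random word-salad messages, while this small-scale sketch enumerates salads exhaustively so the result is reproducible). The "oracle" stands in for any feedback channel -- web bugs, a test account at an ISP -- that tells the spammer which messages got through:

```python
from collections import Counter
from itertools import combinations

def probe(oracle, vocabulary, salad_size=3):
    """Estimate each word's 'hamminess' by sending word salads and
    watching which ones the oracle says were delivered."""
    delivered, sent = Counter(), Counter()
    for words in combinations(vocabulary, salad_size):
        for w in words:
            sent[w] += 1
        if oracle(" ".join(words)):  # True means the message got through
            for w in words:
                delivered[w] += 1
    # Words with delivery rates near 1.0 are the filter's "kryptonite".
    return {w: delivered[w] / sent[w] for w in vocabulary}
```

Against a filter that (say) always passes messages containing one particular hammy word, that word's delivery rate comes out at exactly 1.0 while every other word's rate is diluted -- which is the feedback loop Graham-Cumming warns must be denied to spammers.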
Thede Loder et al. from the University of Michigan talked about an economic model for spam. They showed a bunch of graphs and charts about the utility of spam, but it all came down to "if the receiver doesn't want your message, the sender pays".
There was a little handwaving about how these micropayments would actually work. They seemed to have the attitude that it was a small matter of programming. (Personal note: I think that getting the infrastructure right for micropayments is going to be extremely difficult, and that there are some Nigerians who are going to get fabulously wealthy while the bugs are getting worked out.)
Eric Johansson talked about Camram, a variant of sender-pays where the sender pays in computational time instead of cold hard cash: the sender performs some computationally expensive (a few seconds) operation whose result is computationally cheap to verify. He didn't talk in detail about a particular algorithm to use (or if he did, I zoned out during that portion), but clearly had one in mind.
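A hashcash-style sketch of the expensive-to-make, cheap-to-check asymmetry (simplified; I don't know the exact algorithm Camram uses): find a counter such that the hash of stamp-plus-counter starts with a given number of zero bits. Minting takes on the order of 2^bits hash operations; checking takes exactly one.

```python
import hashlib

def mint(stamp, bits=16):
    """Expensive: search for a counter whose hash has `bits` leading zero bits."""
    counter = 0
    while True:
        digest = hashlib.sha1(f"{stamp}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") >> (160 - bits) == 0:
            return counter
        counter += 1

def check(stamp, counter, bits=16):
    """Cheap: one hash verifies the sender did the work."""
    digest = hashlib.sha1(f"{stamp}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") >> (160 - bits) == 0
```

Raising `bits` by one doubles the minting cost without changing the checking cost, which is how the "crank up the complexity over time" knob he mentions would work.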
One of the advantages he listed over paying in cash is that there is no central authority, making it difficult for spammers, corporations, or governments to corrupt it. He also pointed out that it would take only minor extensions to some of the current infrastructure to make it useful, has protection against double-spending, is tamper-resistant, and preserves anonymity.
He saw Camram as being used in conjunction with whitelisting: that you would only create a stamp when the message was to someone you hadn't corresponded with before. "Strangers cost, friends fly free."
He recognized that Moore's law means that for a stamp to keep representing a meaningful amount of computational time, you'd have to slowly crank up the computational complexity of the work. He had an idea for a peer-to-peer rate-setting system where you'd take into account what everyone else was paying.
(Personal notes on Camram: I didn't think he made it terribly clear that a camram stamp could be just one more feature that fed into a statistical anti-spam filter. I presume that there's a way to find out just how much a stamp "cost", and the hamminess improvement that the stamp would buy you ought to be proportional to its cost. One can imagine that then people would crank up the cost by hand if they found that their messages weren't getting through, and so you wouldn't need a peer-to-peer network.
I also presume that this would need to be built in to the client (otherwise people won't use it) and that it would be nice to be able to set by hand how much work you wanted to spend on a stamp. I can imagine that if I wanted to get the attention of a stranger who I *really* wanted to get through to, I'd like to be able to send proof of hours worth of work.
Oh, and I personally think that "proof-of-work" is a much more workable solution than "proof-of-payment".)
Peter Kay of Titan Key groused about the incredibly cold weather, saying that he had worked hard to keep warm when he lived in Chicago, then finally realized he could stop spending so much effort by moving to Hawai'i. He felt similarly about spam: instead of spending lots of effort filtering spam once it gets into our systems, we shouldn't let it into our systems in the first place.
He has a variant of disposable email addresses/single-use email addresses he calls "KeyMail". To simplify a little bit, you use an address until it starts getting spam. As soon as that account gets its first piece of spam, you "lock" it so that anybody who has sent you mail to that address in the past can keep using that address; anyone else gets rejected at the server.
There are some extra frills on that: he lets you set various "policies" on email addresses, including shelf life, quantity, and scope (first N users).
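A minimal sketch of the lock-on-first-spam rule as I understood it (my own reconstruction, not Titan Key's actual code, and it skips the shelf-life/quantity/scope policies):

```python
class KeyAddress:
    """One disposable address: open until its first spam, then locked
    to the senders already seen on it."""

    def __init__(self):
        self.seen = set()
        self.locked = False

    def accept(self, sender, is_spam=False):
        """Decide at the server whether to accept a message to this address."""
        if self.locked and sender not in self.seen:
            return False              # stranger + locked address: reject
        if is_spam:
            self.locked = True        # first spam locks the address...
            return False              # ...and the spammer isn't grandfathered in
        self.seen.add(sender)
        return True
```

Past correspondents keep working addresses forever; the harvester who leaked the address gets nothing.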
Other advantages include
He said that what we needed now were:
(Personal comments: I loved this idea at first glance, but upon further inspection I got a little nervous about a few things:
Eric Raymond talked about the IETF's Anti-Spam Research Group (ASRG), CAN-SPAM, and SPF (an authentication scheme).
The ASRG is an anti-spam organization that he described as a clearinghouse for anti-spam ideas. (Personal note: I'm on the ASRG mailing list, and it has a very low signal-to-noise ratio. Someone described it as being a "write-only" mailing list. However, it *does* mean that there is something definable as a community, it does mean that people are talking. It -- along with the Boston spam conference, JamSpam, and the spambayes project -- is a part of why anti-spam is in so much better shape than it was two years ago.)
CAN-SPAM, the US federal legislation, has language buried in section 11.2 that says that the FCC (or was it the FTC?) is supposed to ask the IETF what standards exist. There being none written down, Eric decided to write some down. His spec must reflect existing practice, so there won't be any surprises. (Look for ADV: and Adult and Bulk, for example.) The spec is not useful if it has provisions that would be found to conflict with the U.S. Constitution, so it has to have weak labeling, not strong labeling.
SPF -- Sender Permitted From -- is a gentle authentication scheme that ASRG has come up with. If you are a domain, you advertise in your DNS record what IP addresses (or blocks of addresses) you use. When a message arrives at someone else's SMTP server allegedly from your domain, that server can go check and see if the IP address is on your "good" list.
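A much-simplified sketch of that check (real SPF records live in DNS TXT records and have a richer syntax, "v=spf1 ip4:... -all"; this toy skips DNS entirely and the domain/IP values are made up):

```python
import ipaddress

def spf_pass(published, domain, connecting_ip):
    """Does the connecting IP appear in the domain's published sender list?

    published maps domain -> list of CIDR blocks the domain sends from.
    """
    addr = ipaddress.ip_address(connecting_ip)
    return any(addr in ipaddress.ip_network(block)
               for block in published.get(domain, []))

# Hypothetical: webfoot.com advertises that it only sends from 192.0.2.0/24.
published = {"webfoot.com": ["192.0.2.0/24"]}
```

A receiving SMTP server would do this lookup when a message claims to be from webfoot.com, and treat a miss as evidence of forgery.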
There are some problems with this.
Tony Yu of Mailshell had a presentation that probably would have been a lot more interesting at 10 AM than in the late afternoon -- I was getting tired by this point. He talked about all the components of Mailshell's anti-spam product, which had mostly very familiar components. The one thing they did that was a bit interesting was that they had an algorithm for determining "sameness" of messages; this helped in collaborative filtering.
Richard Jowsey from Death2spam.net had the bad luck to be last on the schedule. He was a very engaging speaker and had lots of very interesting-looking charts with bimodal probability distributions. He talked about various ways that you could manipulate the graphs to spread the humps, to model overlap, to normalize curves, etc, but he didn't do a very good job of explaining _why_ he was doing these manipulations or what results he got when he did these manipulations.
More general personal comments: