| Subject: Re: Now, what is this about? |
Author:
fghgfh
|
Date Posted: 05:30:28 01/01/08 Tue
In reply to:
ghjghj
's message, "Re: Now, what is this about?" on 05:28:57 01/01/08 Tue
>>>>>August 2002
>>>>>
>>>>>
>>>>>I think it's possible to stop spam, and that
>>>>>content-based filters are the way to do it. The
>>>>>Achilles heel of the spammers is their message. They
>>>>>can circumvent any other barrier you set up. They have
>>>>>so far, at least. But they have to deliver their
>>>>>message, whatever it is. If we can write software that
>>>>>recognizes their messages, there is no way they can
>>>>>get around that.
>>>>>
>>>>>To the recipient, spam is easily recognizable. If you
>>>>>hired someone to read your mail and discard the spam,
>>>>>they would have little trouble doing it. How much do
>>>>>we have to do, short of AI, to automate this process?
>>>>>
>>>>>I think we will be able to solve the problem with
>>>>>fairly simple algorithms. In fact, I've found that you
>>>>>can filter present-day spam acceptably well using
>>>>>nothing more than a Bayesian combination of the spam
>>>>>probabilities of individual words. Using a slightly
>>>>>tweaked (as described below) Bayesian filter, we now
>>>>>miss less than 5 per 1000 spams, with 0 false
>>>>>positives.
>>>>>
>>>>>The statistical approach is not usually the first one
>>>>>people try when they write spam filters. Most hackers'
>>>>>first instinct is to try to write software that
>>>>>recognizes individual properties of spam. You look at
>>>>>spams and you think, the gall of these guys to try
>>>>>sending me mail that begins "Dear Friend" or has a
>>>>>subject line that's all uppercase and ends in eight
>>>>>exclamation points. I can filter out that stuff with
>>>>>about one line of code.
>>>>>
>>>>>And so you do, and in the beginning it works. A few
>>>>>simple rules will take a big bite out of your incoming
>>>>>spam. Merely looking for the word "click" will catch
>>>>>79.7% of the emails in my spam corpus, with only 1.2%
>>>>>false positives.
>>>>>
>>>>>I spent about six months writing software that looked
>>>>>for individual spam features before I tried the
>>>>>statistical approach. What I found was that
>>>>>recognizing that last few percent of spams got very
>>>>>hard, and that as I made the filters stricter I got
>>>>>more false positives.
>>>>>
>>>>>False positives are innocent emails that get
>>>>>mistakenly identified as spams. For most users,
>>>>>missing legitimate email is an order of magnitude
>>>>>worse than receiving spam, so a filter that yields
>>>>>false positives is like an acne cure that carries a
>>>>>risk of death to the patient.
>>>>>
>>>>>The more spam a user gets, the less likely he'll be to
>>>>>notice one innocent mail sitting in his spam folder.
>>>>>And strangely enough, the better your spam filters
>>>>>get, the more dangerous false positives become,
>>>>>because when the filters are really good, users will
>>>>>be more likely to ignore everything they catch.
>>>>>
>>>>>I don't know why I avoided trying the statistical
>>>>>approach for so long. I think it was because I got
>>>>>addicted to trying to identify spam features myself,
>>>>>as if I were playing some kind of competitive game
>>>>>with the spammers. (Nonhackers don't often realize
>>>>>this, but most hackers are very competitive.) When I
>>>>>did try statistical analysis, I found immediately that
>>>>>it was much cleverer than I had been. It discovered,
>>>>>of course, that terms like "virtumundo" and "teens"
>>>>>were good indicators of spam. But it also discovered
>>>>>that "per" and "FL" and "ff0000" are good indicators
>>>>>of spam. In fact, "ff0000" (html for bright red) turns
>>>>>out to be as good an indicator of spam as any
>>>>>pornographic term.
>>>>>
>>>>>
>>>>>_ _ _
>>>>>
>>>>>
>>>>>Here's a sketch of how I do statistical filtering. I
>>>>>start with one corpus of spam and one of nonspam mail.
>>>>>At the moment each one has about 4000 messages in it.
>>>>>I scan the entire text, including headers and embedded
>>>>>html and javascript, of each message in each corpus. I
>>>>>currently consider alphanumeric characters, dashes,
>>>>>apostrophes, and dollar signs to be part of tokens,
>>>>>and everything else to be a token separator. (There is
>>>>>probably room for improvement here.) I ignore tokens
>>>>>that are all digits, and I also ignore html comments,
>>>>>not even considering them as token separators.
>>>>>
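For anyone who wants to try the tokenizing step, here is a rough sketch in Common Lisp. The function names (token-char-p, all-digits-p, tokenize) are mine, not the essay's, and the sketch leaves out the html-comment handling the essay mentions, so treat it as an illustration of the token rules only.

```lisp
;; Sketch only: names are hypothetical, html comments are NOT stripped.
(defun token-char-p (ch)
  "True for characters the essay counts as part of a token:
alphanumerics, dashes, apostrophes, and dollar signs."
  (or (alphanumericp ch) (member ch '(#\- #\' #\$))))

(defun all-digits-p (token)
  "True when every character of TOKEN is a decimal digit."
  (every #'digit-char-p token))

(defun tokenize (text)
  "Split TEXT into lowercase tokens, dropping tokens that are all digits."
  (let ((tokens '())
        (start nil))
    (flet ((flush (end)
             ;; Close the token that began at START, if one is open.
             (when start
               (let ((token (string-downcase (subseq text start end))))
                 (unless (all-digits-p token)
                   (push token tokens)))
               (setf start nil))))
      (dotimes (i (length text))
        (if (token-char-p (char text i))
            (unless start (setf start i))
            (flush i)))
      (flush (length text)))
    (nreverse tokens)))

;; Example call on a made-up subject line:
(tokenize "Click HERE for $7500, free!! 12345 don't")
```

Lowercasing here reflects the "ignoring case, currently" remark in the next paragraph of the essay.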
>>>>>I count the number of times each token (ignoring case,
>>>>>currently) occurs in each corpus. At this stage I end
>>>>>up with two large hash tables, one for each corpus,
>>>>>mapping tokens to number of occurrences.
>>>>>
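The counting step might look something like the sketch below. It assumes each message has already been turned into a list of token strings; count-tokens is a name I made up, and the two tiny corpora are invented purely for illustration.

```lisp
;; Sketch only: COUNT-TOKENS and the sample corpora are hypothetical.
(defun count-tokens (messages)
  "Return a hash table mapping each token to its total number of
occurrences across MESSAGES, where each message is a list of strings.
Case is ignored by lowercasing every token."
  (let ((counts (make-hash-table :test #'equal)))
    (dolist (tokens messages counts)
      (dolist (token tokens)
        (incf (gethash (string-downcase token) counts 0))))))

;; One table per corpus, as the essay describes.
(defparameter *good* (count-tokens '(("lisp" "macro" "tonight")
                                     ("lisp" "though"))))
(defparameter *bad* (count-tokens '(("click" "free" "click")
                                    ("free" "offer"))))
```

The tables use an EQUAL test because the keys are strings.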
>>>>>Next I create a third hash table, this time mapping
>>>>>each token to the probability that an email containing
>>>>>it is a spam, which I calculate as follows [1]:
>>>>>(let ((g (* 2 (or (gethash word good) 0)))
>>>>>      (b (or (gethash word bad) 0)))
>>>>>   (unless (< (+ g b) 5)
>>>>>     (max .01
>>>>>          (min .99 (float (/ (min 1 (/ b nbad))
>>>>>                             (+ (min 1 (/ g ngood))
>>>>>                                (min 1 (/ b nbad)))))))))
>>>>>
>>>>>where word is the token whose probability we're
>>>>>calculating, good and bad are the hash tables I
>>>>>created in the first step, and ngood and nbad are the
>>>>>number of nonspam and spam messages respectively.
>>>>>
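To see what the expression above produces, here is the same code wrapped in a function and run on some invented counts. The function name token-spam-probability and every number in the example tables are my own assumptions; only the body of the calculation comes from the essay.

```lisp
;; Sketch only: the wrapper name and the sample counts are hypothetical.
(defun token-spam-probability (word good bad ngood nbad)
  "Return the spam probability of WORD, or NIL when WORD has too few
occurrences to be considered. GOOD and BAD map tokens to counts;
NGOOD and NBAD are the numbers of nonspam and spam messages."
  (let ((g (* 2 (or (gethash word good) 0)))   ; nonspam count, doubled
        (b (or (gethash word bad) 0)))         ; spam count
    (unless (< (+ g b) 5)
      (max .01
           (min .99 (float (/ (min 1 (/ b nbad))
                              (+ (min 1 (/ g ngood))
                                 (min 1 (/ b nbad))))))))))

(defparameter *good-counts* (make-hash-table :test #'equal))
(defparameter *bad-counts* (make-hash-table :test #'equal))
(setf (gethash "click" *good-counts*) 1
      (gethash "click" *bad-counts*) 30
      (gethash "madam" *bad-counts*) 10
      (gethash "rare" *bad-counts*) 2)

;; "click": 30 spam vs 1 (doubled to 2) nonspam occurrences.
(token-spam-probability "click" *good-counts* *bad-counts* 4000 4000)
;; "madam": seen only in spam, so the result is clamped to .99.
(token-spam-probability "madam" *good-counts* *bad-counts* 4000 4000)
;; "rare": fewer than five occurrences in total, so NIL.
(token-spam-probability "rare" *good-counts* *bad-counts* 4000 4000)
```

With these made-up counts the "click" case works out to (30/4000) / (2/4000 + 30/4000) = 30/32.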
>>>>>I explained this as code to show a couple of important
>>>>>details. I want to bias the probabilities slightly to
>>>>>avoid false positives, and by trial and error I've
>>>>>found that a good way to do it is to double all the
>>>>>numbers in good. This helps to distinguish between
>>>>>words that occasionally do occur in legitimate email
>>>>>and words that almost never do. I only consider words
>>>>>that occur more than five times in total (actually,
>>>>>because of the doubling, occurring three times in
>>>>>nonspam mail would be enough). And then there is the
>>>>>question of what probability to assign to words that
>>>>>occur in one corpus but not the other. Again by trial
>>>>>and error I chose .01 and .99. There may be room for
>>>>>tuning here, but as the corpus grows such tuning will
>>>>>happen automatically anyway.
>>>>>
>>>>>The especially observant will notice that while I
>>>>>consider each corpus to be a single long stream of
>>>>>text for purposes of counting occurrences, I use the
>>>>>number of emails in each, rather than their combined
>>>>>length, as the divisor in calculating spam
>>>>>probabilities. This adds another slight bias to
>>>>>protect against false positives.
>>>>>
>>>>>When new mail arrives, it is scanned into tokens, and
>>>>>the most interesting fifteen tokens, where interesting
>>>>>is measured by how far their spam probability is from
>>>>>a neutral .5, are used to calculate the probability
>>>>>that the mail is spam. If probs is a list of the
>>>>>fifteen individual probabilities, you calculate the
>>>>>combined probability thus:
>>>>>(let ((prod (apply #'* probs)))
>>>>>  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
>>>>>                                     probs)))))
>>>>>
>>>>>One question that arises in practice is what
>>>>>probability to assign to a word you've never seen,
>>>>>i.e. one that doesn't occur in the hash table of word
>>>>>probabilities. I've found, again by trial and error,
>>>>>that .4 is a good number to use. If you've never seen
>>>>>a word before, it is probably fairly innocent; spam
>>>>>words tend to be all too familiar.
>>>>>
>>>>>There are examples of this algorithm being applied to
>>>>>actual emails in an appendix at the end.
>>>>>
>>>>>I treat mail as spam if the algorithm above gives it a
>>>>>probability of more than .9 of being spam. But in
>>>>>practice it would not matter much where I put this
>>>>>threshold, because few probabilities end up in the
>>>>>middle of the range.
>>>>>
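Putting the scoring steps together (pick the fifteen most interesting tokens, default unknown words to .4, combine, compare with .9) could look roughly like this. All function names are mine, and treating each distinct token only once is my assumption, since the essay does not say how repeated tokens are handled.

```lisp
;; Sketch only: names are hypothetical; de-duplicating tokens is an assumption.
(defun interestingness (p)
  "Distance of probability P from a neutral .5."
  (abs (- p .5)))

(defun most-interesting (tokens probabilities &key (n 15) (unknown .4))
  "Return up to N probabilities farthest from .5 for the distinct TOKENS.
Tokens missing from the PROBABILITIES hash table get UNKNOWN."
  (let* ((distinct (remove-duplicates tokens :test #'string=))
         (probs (mapcar (lambda (token)
                          (gethash token probabilities unknown))
                        distinct))
         (sorted (sort probs #'> :key #'interestingness)))
    (subseq sorted 0 (min n (length sorted)))))

(defun combine-probabilities (probs)
  "Combine PROBS with the formula given in the essay."
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar (lambda (x) (- 1 x)) probs))))))

(defun spam-p (tokens probabilities &key (threshold .9))
  "True when the combined probability for TOKENS exceeds THRESHOLD."
  (> (combine-probabilities (most-interesting tokens probabilities))
     threshold))

;; Tiny table using the two per-word figures quoted later in the essay.
(defparameter *probabilities* (make-hash-table :test #'equal))
(setf (gethash "sex" *probabilities*) .97
      (gethash "sexy" *probabilities*) .99)

(spam-p '("sex" "sexy") *probabilities*)    ; both words are known and spammy
(spam-p '("hello" "there") *probabilities*) ; both unknown, each scored .4
```

In the second call both tokens fall back to .4, so the combined value stays well under the .9 threshold.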
>>>>>
>>>>>_ _ _
>>>>>
>>>>>
>>>>>One great advantage of the statistical approach is
>>>>>that you don't have to read so many spams. Over the
>>>>>past six months, I've read literally thousands of
>>>>>spams, and it is really kind of demoralizing. Norbert
>>>>>Wiener said if you compete with slaves you become a
>>>>>slave, and there is something similarly degrading
>>>>>about competing with spammers. To recognize individual
>>>>>spam features you have to try to get into the mind of
>>>>>the spammer, and frankly I want to spend as little
>>>>>time inside the minds of spammers as possible.
>>>>>
>>>>>But the real advantage of the Bayesian approach, of
>>>>>course, is that you know what you're measuring.
>>>>>Feature-recognizing filters like SpamAssassin assign a
>>>>>spam "score" to email. The Bayesian approach assigns
>>>>>an actual probability. The problem with a "score" is
>>>>>that no one knows what it means. The user doesn't know
>>>>>what it means, but worse still, neither does the
>>>>>developer of the filter. How many points should an
>>>>>email get for having the word "sex" in it? A
>>>>>probability can of course be mistaken, but there is
>>>>>little ambiguity about what it means, or how evidence
>>>>>should be combined to calculate it. Based on my
>>>>>corpus, "sex" indicates a .97 probability of the
>>>>>containing email being a spam, whereas "sexy"
>>>>>indicates .99 probability. And Bayes' Rule, equally
>>>>>unambiguous, says that an email containing both words
>>>>>would, in the (unlikely) absence of any other
>>>>>evidence, have a 99.97% chance of being a spam.
>>>>>
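The 99.97% figure can be checked by hand. Assuming the two word probabilities are combined with the same formula the essay gives for the fifteen most interesting tokens, the two-word case works out as:

```latex
\[
P = \frac{0.97 \times 0.99}{0.97 \times 0.99 + (1 - 0.97)(1 - 0.99)}
  = \frac{0.9603}{0.9603 + 0.0003}
  \approx 0.9997
\]
```

So the two pieces of evidence reinforce each other: the combined value is higher than either .97 or .99 alone.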
>>>>>Because it is measuring probabilities, the Bayesian
>>>>>approach considers all the evidence in the email, both
>>>>>good and bad. Words that occur disproportionately
>>>>>rarely in spam (like "though" or "tonight" or
>>>>>"apparently") contribute as much to decreasing the
>>>>>probability as bad words like "unsubscribe" and
>>>>>"opt-in" do to increasing it. So an otherwise innocent
>>>>>email that happens to include the word "sex" is not
>>>>>going to get tagged as spam.
>>>>>
>>>>>Ideally, of course, the probabilities should be
>>>>>calculated individually for each user. I get a lot of
>>>>>email containing the word "Lisp", and (so far) no spam
>>>>>that does. So a word like that is effectively a kind
>>>>>of password for sending mail to me. In my earlier
>>>>>spam-filtering software, the user could set up a list
>>>>>of such words and mail containing them would
>>>>>automatically get past the filters. On my list I put
>>>>>words like "Lisp" and also my zipcode, so that
>>>>>(otherwise rather spammy-sounding) receipts from
>>>>>online orders would get through. I thought I was being
>>>>>very clever, but I found that the Bayesian filter did
>>>>>the same thing for me, and moreover discovered a
>>>>>lot of words I hadn't thought of.
>>>>>
>>>>>When I said at the start that our filters let through
>>>>>less than 5 spams per 1000 with 0 false positives, I'm
>>>>>talking about filtering my mail based on a corpus of
>>>>>my mail. But these numbers are not misleading, because
>>>>>that is the approach I'm advocating: filter each
>>>>>user's mail based on the spam and nonspam mail he
>>>>>receives. Essentially, each user should have two
>>>>>delete buttons, ordinary delete and delete-as-spam.
>>>>>Anything deleted as spam goes into the spam corpus,
>>>>>and everything else goes into the nonspam corpus.
>>>>>
>>>>>You could start users with a seed filter, but
>>>>>ultimately each user should have his own per-word
>>>>>probabilities based on the actual mail he receives.
>>>>>This (a) makes the filters more effective, (b) lets
>>>>>each user decide their own precise definition of spam,
>>>>>and (c) perhaps best of all makes it hard for spammers
>>>>>to tune mails to get through the filters. If a lot of
>>>>>the brain of the filter is in the individual
>>>>>databases, then merely tuning spams to get through the
>>>>>seed filters won't guarantee anything about how well
>>>>>they'll get through individual users' varying and much
>>>>>more trained filters.
>>>>>
>>>>>Content-based spam filtering is often combined with a
>>>>>whitelist, a list of senders whose mail can be
>>>>>accepted with no filtering. One easy way to build such
>>>>>a whitelist is to keep a list of every address the
>>>>>user has ever sent mail to. If a mail reader has a
>>>>>delete-as-spam button then you could also add the from
>>>>>address of every email the user has deleted as
>>>>>ordinary trash.
>>>>>
>>>>>I'm an advocate of whitelists, but more as a way to
>>>>>save computation than as a way to improve filtering. I
>>>>>used to think that whitelists would make filtering
>>>>>easier, because you'd only have to filter email from
>>>>>people you'd never heard from, and someone sending you
>>>>>mail for the first time is constrained by convention
>>>>>in what they can say to you. Someone you already know
>>>>>might send you an email talking about sex, but someone
>>>>>sending you mail for the first time would not be
>>>>>likely to. The problem is, people can have more than
>>>>>one email address, so a new from-address doesn't
>>>>>guarantee that the sender is writing to you for the
>>>>>first time. It is not unusual for an old friend
>>>>>(especially if he is a hacker) to suddenly send you an
>>>>>email with a new from-address, so you can't risk false
>>>>>positives by filtering mail from unknown addresses
>>>>>especially stringently.
>>>>>
>>>>>In a sense, though, my filters do themselves embody a
>>>>>kind of whitelist (and blacklist) because they are
>>>>>based on entire messages, including the headers. So to
>>>>>that extent they "know" the email addresses of trusted
>>>>>senders and even the routes by which mail gets from
>>>>>them to me. And they know the same about spam,
>>>>>including the server names, mailer versions, and
>>>>>protocols.
>>>>>
>>>>>
>>>>>_ _ _
>>>>>
>>>>>
>>>>>If I thought that I could keep up current rates of
>>>>>spam filtering, I would consider this problem solved.
>>>>>But it doesn't mean much to be able to filter out most
>>>>>present-day spam, because spam evolves. Indeed, most
>>>>>antispam techniques so far have been like pesticides
>>>>>that do nothing more than create a new, resistant
>>>>>strain of bugs.
>>>>>
>>>>>I'm more hopeful about Bayesian filters, because they
>>>>>evolve with the spam. So as spammers start using
>>>>>"c0ck" instead of "cock" to evade simple-minded spam
>>>>>filters based on individual words, Bayesian filters
>>>>>automatically notice. Indeed, "c0ck" is far more
>>>>>damning evidence than "cock", and Bayesian filters
>>>>>know precisely how much more.
>>>>>
>>>>>Still, anyone who proposes a plan for spam filtering
>>>>>has to be able to answer the question: if the spammers
>>>>>knew exactly what you were doing, how well could they
>>>>>get past you? For example, I think that if
>>>>>checksum-based spam filtering becomes a serious
>>>>>obstacle, the spammers will just switch to mad-lib
>>>>>techniques for generating message bodies.
>>>>>
>>>>>To beat Bayesian filters, it would not be enough for
>>>>>spammers to make their emails unique or to stop using
>>>>>individual naughty words. They'd have to make their
>>>>>mails indistinguishable from your ordinary mail. And
>>>>>this I think would severely constrain them. Spam is
>>>>>mostly sales pitches, so unless your regular mail is
>>>>>all sales pitches, spams will inevitably have a
>>>>>different character. And the spammers would also, of
>>>>>course, have to change (and keep changing) their whole
>>>>>infrastructure, because otherwise the headers would
>>>>>look as bad to the Bayesian filters as ever, no matter
>>>>>what they did to the message body. I don't know enough
>>>>>about the infrastructure that spammers use to know how
>>>>>hard it would be to make the headers look innocent,
>>>>>but my guess is that it would be even harder than
>>>>>making the message look innocent.
>>>>>
>>>>>Assuming they could solve the problem of the headers,
>>>>>the spam of the future will probably look something
>>>>>like this:
>>>>>Hey there. Thought you should check out the following:
>>>>>
>>>>>http://www.27meg.com/foo
>>>>>
>>>>>because that is about as much sales pitch as
>>>>>content-based filtering will leave the spammer room to
>>>>>make. (Indeed, it will be hard even to get this past
>>>>>filters, because if everything else in the email is
>>>>>neutral, the spam probability will hinge on the url,
>>>>>and it will take some effort to make that look
>>>>>neutral.)
>>>>>
>>>>>Spammers range from businesses running so-called
>>>>>opt-in lists who don't even try to conceal their
>>>>>identities, to guys who hijack mail servers to send
>>>>>out spams promoting porn sites. If we use filtering to
>>>>>whittle their options down to mails like the one
>>>>>above, that should pretty much put the spammers on the
>>>>>"legitimate" end of the spectrum out of business; they
>>>>>feel obliged by various state laws to include
>>>>>boilerplate about why their spam is not spam, and how
>>>>>to cancel your "subscription," and that kind of text
>>>>>is easy to recognize.
>>>>>
>>>>>(I used to think it was naive to believe that stricter
>>>>>laws would decrease spam. Now I think that while
>>>>>stricter laws may not decrease the amount of spam that
>>>>>spammers send, they can certainly help filters to
>>>>>decrease the amount of spam that recipients actually
>>>>>see.)
>>>>>
>>>>>All along the spectrum, if you restrict the sales
>>>>>pitches spammers can make, you will inevitably tend to
>>>>>put them out of business. That word business is an
>>>>>important one to remember. The spammers are
>>>>>businessmen. They send spam because it works. It works
>>>>>because although the response rate is abominably low
>>>>>(at best 15 per million, vs 3000 per million for a
>>>>>catalog mailing), the cost, to them, is practically
>>>>>nothing. The cost is enormous for the recipients,
>>>>>about 5 man-weeks for each million recipients who
>>>>>spend a second to delete the spam, but the spammer
>>>>>doesn't have to pay that.
>>>>>
>>>>>Sending spam does cost the spammer something, though.
>>>>>[2] So the lower we can get the response rate--
>>>>>whether by filtering, or by using filters to force
>>>>>spammers to dilute their pitches-- the fewer
>>>>>businesses will find it worth their while to send spam.
>>>>>
>>>>>The reason the spammers use the kinds of sales pitches
>>>>>that they do is to increase response rates. This is
>>>>>possibly even more disgusting than getting inside the
>>>>>mind of a spammer, but let's take a quick look inside
>>>>>the mind of someone who responds to a spam. This
>>>>>person is either astonishingly credulous or deeply in
>>>>>denial about their sexual interests. In either case,
>>>>>repulsive or idiotic as the spam seems to us, it is
>>>>>exciting to them. The spammers wouldn't say these
>>>>>things if they didn't sound exciting. And "thought you
>>>>>should check out the following" is just not going to
>>>>>have nearly the pull with the spam recipient as the
>>>>>kinds of things that spammers say now. Result: if it
>>>>>can't contain exciting sales pitches, spam becomes
>>>>>less effective as a marketing vehicle, and fewer
>>>>>businesses want to use it.
>>>>>
>>>>>That is the big win in the end. I started writing spam
>>>>>filtering software because I didn't want to have to look
>>>>>at the stuff anymore. But if we get good enough at
>>>>>filtering out spam, it will stop working, and the
>>>>>spammers will actually stop sending it.
>>>>>
>>>>>
>>>>>_ _ _
>>>>>
>>>>>
>>>>>Of all the approaches to fighting spam, from software
>>>>>to laws, I believe Bayesian filtering will be the
>>>>>single most effective. But I also think that the more
>>>>>different kinds of antispam efforts we undertake, the
>>>>>better, because any measure that constrains spammers
>>>>>will tend to make filtering easier. And even within
>>>>>the world of content-based filtering, I think it will
>>>>>be a good thing if there are many different kinds of
>>>>>software being used simultaneously. The more different
>>>>>filters there are, the harder it will be for spammers
>>>>>to tune spams to get through them.
>>>>>
>>>>>
>>>>>
>>>>>Appendix: Examples of Filtering
>>>>>
>>>>>Here is an example of a spam that arrived while I was
>>>>>writing this article. The fifteen most interesting
>>>>>words in this spam are:
>>>>>qvp0045
>>>>>indira
>>>>>mx-05
>>>>>intimail
>>>>>$7500
>>>>>freeyankeedom
>>>>>cdo
>>>>>bluefoxmedia
>>>>>jpg
>>>>>unsecured
>>>>>platinum
>>>>>3d0
>>>>>qves
>>>>>7c5
>>>>>7c266675
>>>>>
>>>>>The words are a mix of stuff from the headers and from
>>>>>the message body, which is typical of spam. Also
>>>>>typical of spam is that every one of these words has a
>>>>>spam probability, in my database, of .99. In fact
>>>>>there are more than fifteen words with probabilities
>>>>>of .99, and these are just the first fifteen seen.
>>>>>
>>>>>Unfortunately that makes this email a boring example
>>>>>of the use of Bayes' Rule. To see an interesting
>>>>>variety of probabilities we have to look at this
>>>>>actually quite atypical spam.
>>>>>
>>>>>The fifteen most interesting words in this spam, with
>>>>>their probabilities, are:
>>>>>madam 0.99
>>>>>promotion 0.99
>>>>>republic 0.99
>>>>>shortest 0.047225013
>>>>>mandatory 0.047225013
>>>>>standardization 0.07347802
>>>>>sorry 0.08221981
>>>>>supported 0.09019077
>>>>>people's 0.09019077
>>>>>enter 0.9075001
>>>>>quality 0.8921298
>>>>>organization 0.12454646
>>>>>investment 0.8568143
>>>>>very 0.14758544
>>>>>valuable 0.82347786
>>>>>
>>>>>This time the evidence is a mix of good and bad. A
>>>>>word like "shortest" is almost as much evidence for
>>>>>innocence as a word like "madam" or "promotion" is for
>>>>>guilt. But still the case for guilt is stronger. If
>>>>>you combine these numbers according to Bayes' Rule,
>>>>>the resulting probability is .9027.
>>>>>
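The .9027 figure can be reproduced from the table above. The sketch below feeds the fifteen listed probabilities into the combining formula from earlier in the essay; the function and variable names are mine.

```lisp
;; Sketch only: names are hypothetical; the numbers are copied from the
;; table of fifteen words above.
(defun combine-probabilities (probs)
  "Combine PROBS with the formula given earlier in the essay."
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar (lambda (x) (- 1 x)) probs))))))

(defparameter *example-probs*
  '(0.99 0.99 0.99                      ; madam, promotion, republic
    0.047225013 0.047225013             ; shortest, mandatory
    0.07347802 0.08221981               ; standardization, sorry
    0.09019077 0.09019077               ; supported, people's
    0.9075001 0.8921298                 ; enter, quality
    0.12454646 0.8568143                ; organization, investment
    0.14758544 0.82347786))             ; very, valuable

;; Should come out close to the .9027 quoted in the essay.
(combine-probabilities *example-probs*)
```

Three words at .99 carry a lot of weight here: it takes all of the innocent-looking words together to pull the result down to just above the .9 threshold.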
>>>>>"Madam" is obviously from spams beginning "Dear Sir or
>>>>>Madam." They're not very common, but the word "madam"
>>>>>never occurs in my legitimate email, and it's all
>>>>>about the ratio.
>>>>>
>>>>>"Republic" scores high because it often shows up in
>>>>>Nigerian scam emails, and also occurs once or twice in
>>>>>spams referring to Korea and South Africa. You might
>>>>>say that it's an accident that it thus helps identify
>>>>>this spam. But I've found when examining spam
>>>>>probabilities that there are a lot of these accidents,
>>>>>and they have an uncanny tendency to push things in
>>>>>the right direction rather than the wrong one. In this
>>>>>case, it is not entirely a coincidence that the word
>>>>>"Republic" occurs in Nigerian scam emails and this
>>>>>spam. There is a whole class of dubious business
>>>>>propositions involving less developed countries, and
>>>>>these in turn are more likely to have names that
>>>>>specify explicitly (because they aren't) that they are
>>>>>republics.[3]
>>>>>
>>>>>On the other hand, "enter" is a genuine miss. It
>>>>>occurs mostly in unsubscribe instructions, but here is
>>>>>used in a completely innocent way. Fortunately the
>>>>>statistical approach is fairly robust, and can
>>>>>tolerate quite a lot of misses before the results
>>>>>start to be thrown off.
>>>>>
>>>>>For comparison, here is an example of that rare bird,
>>>>>a spam that gets through the filters. Why? Because by
>>>>>sheer chance it happens to be loaded with words that
>>>>>occur in my actual email:
>>>>>perl 0.01
>>>>>python 0.01
>>>>>tcl 0.01
>>>>>scripting 0.01
>>>>>morris 0.01
>>>>>graham 0.01491078
>>>>>guarantee 0.9762507
>>>>>cgi 0.9734398
>>>>>paul 0.027040077
>>>>>quite 0.030676773
>>>>>pop3 0.042199217
>>>>>various 0.06080265
>>>>>prices 0.9359873
>>>>>managed 0.06451222
>>>>>difficult 0.071706355
>>>>>
>>>>>There are a couple pieces of good news here. First,
>>>>>this mail probably wouldn't get through the filters of
>>>>>someone who didn't happen to specialize in programming
>>>>>languages and have a good friend called Morris. For
>>>>>the average user, all the top five words here would be
>>>>>neutral and would not contribute to the spam
>>>>>probability.
>>>>>
>>>>>Second, I think filtering based on word pairs (see
>>>>>below) might well catch this one: "cost effective",
>>>>>"setup fee", "money back" -- pretty incriminating
>>>>>stuff. And of course if they continued to spam me (or
>>>>>a network I was part of), "Hostex" itself would be
>>>>>recognized as a spam term.
>>>>>
>>>>>Finally, here is an innocent email. Its fifteen most
>>>>>interesting words are as follows:
>>>>>continuation 0.01
>>>>>describe 0.01
>>>>>continuations 0.01
>>>>>example 0.033600237
>>>>>programming 0.05214485
>>>>>i'm 0.055427782
>>>>>examples 0.07972858
>>>>>color 0.9189189
>>>>>localhost 0.09883721
>>>>>hi 0.116539136
>>>>>california 0.84421706
>>>>>same 0.15981844
>>>>>spot 0.1654587
>>>>>us-ascii 0.16804294
>>>>>what 0.19212411
>>>>>
>>>>>Most of the words here indicate the mail is an
>>>>>innocent one. There are two bad smelling words,
>>>>>"color" (spammers love colored fonts) and "California"
>>>>>(which occurs in testimonials and also in menus in
>>>>>forms), but they are not enough to outweigh obviously
>>>>>innocent words like "continuation" and "example".
>>>>>
>>>>>It's interesting that "describe" rates as so
>>>>>thoroughly innocent. It hasn't occurred in a single
>>>>>one of my 4000 spams. The data turns out to be full of
>>>>>such surprises. One of the things you learn when you
>>>>>analyze spam texts is how narrow a subset of the
>>>>>language spammers operate in. It's that fact, together
>>>>>with the equally characteristic vocabulary of any
>>>>>individual user's mail, that makes Bayesian filtering
>>>>>a good bet.
>>>>>
>>>>>Appendix: More Ideas
>>>>>
>>>>>One idea that I haven't tried yet is to filter based
>>>>>on word pairs, or even triples, rather than individual
>>>>>words. This should yield a much sharper estimate of
>>>>>the probability. For example, in my current database,
>>>>>the word "offers" has a probability of .96. If you
>>>>>based the probabilities on word pairs, you'd end up
>>>>>with "special offers" and "valuable offers" having
>>>>>probabilities of .99 and, say, "approach offers" (as
>>>>>in "this approach offers") having a probability of .1
>>>>>or less.
>>>>>
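Turning a token list into word pairs is a small change. Here is one possible sketch, assuming the tokens arrive as a list of strings in message order; word-pairs is a made-up name, and the essay itself says this idea was not tried.

```lisp
;; Sketch only: WORD-PAIRS is a hypothetical helper, not the essay's code.
(defun word-pairs (tokens)
  "Return each adjacent pair of TOKENS joined by a single space."
  (loop for (a b) on tokens
        while b
        collect (concatenate 'string a " " b)))

;; The phrase used as an example in the essay:
(word-pairs '("this" "approach" "offers"))
```

The resulting pair strings could then be counted and scored exactly like single tokens.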
>>>>>The reason I haven't done this is that filtering based
>>>>>on individual words already works so well. But it does
>>>>>mean that there is room to tighten the filters if spam
>>>>>gets harder to detect. (Curiously, a filter based on
>>>>>word pairs would be in effect a Markov-chaining text
>>>>>generator running in reverse.)
>>>>>
>>>>>Specific spam features (e.g. not seeing the
>>>>>recipient's address in the to: field) do of course
>>>>>have value in recognizing spam. They can be considered
>>>>>in this algorithm by treating them as virtual words.
>>>>>I'll probably do this in future versions, at least for
>>>>>a handful of the most egregious spam indicators.
>>>>>Feature-recognizing spam filters are right in many
>>>>>details; what they lack is an overall discipline for
>>>>>combining evidence.
>>>>>
>>>>>Recognizing nonspam features may be more important
>>>>>than recognizing spam features. False positives are
>>>>>such a worry that they demand extraordinary measures.
>>>>>I will probably in future versions add a second level
>>>>>of testing designed specifically to avoid false
>>>>>positives. If a mail triggers this second level of
>>>>>filters it will be accepted even if its spam
>>>>>probability is above the threshold.
>>>>>
>>>>>I don't expect this second level of filtering to be
>>>>>Bayesian. It will inevitably be not only ad hoc, but
>>>>>based on guesses, because the number of false
>>>>>positives will not tend to be large enough to notice
>>>>>patterns. (It is just as well, anyway, if a backup
>>>>>system doesn't rely on the same technology as the
>>>>>primary system.)
>>>>>
>>>>>Another thing I may try in the future is to focus
>>>>>extra attention on specific parts of the email. For
>>>>>example, about 95% of current spam includes the url of
>>>>>a site they want you to visit. (The remaining 5% want
>>>>>you to call a phone number, reply by email or to a US
>>>>>mail address, or in a few cases to buy a certain
>>>>>stock.) The url is in such cases practically enough by
>>>>>itself to determine whether the email is spam.
>>>>>
>>>>>Domain names differ from the rest of the text in a
>>>>>(non-German) email in that they often consist of
>>>>>several words stuck together. Though computationally
>>>>>expensive in the general case, it might be worth
>>>>>trying to decompose them. If a filter has never seen
>>>>>the token "xxxporn" before it will have an individual
>>>>>spam probability of .4, whereas "xxx" and "porn"
>>>>>individually have probabilities (in my corpus) of
>>>>>.9889 and .99 respectively, and a combined probability
>>>>>of .9998.
>>>>>
>>>>>I expect decomposing domain names to become more
>>>>>important as spammers are gradually forced to stop
>>>>>using incriminating words in the text of their
>>>>>messages. (A url with an ip address is of course an
>>>>>extremely incriminating sign, except in the mail of a
>>>>>few sysadmins.)
>>>>>
>>>>>It might be a good idea to have a cooperatively
>>>>>maintained list of urls promoted by spammers. We'd
>>>>>need a trust metric of the type studied by Raph Levien
>>>>>to prevent malicious or incompetent submissions, but
>>>>>if we had such a thing it would provide a boost to any
>>>>>filtering software. It would also be a convenient
>>>>>basis for boycotts.
>>>>>
>>>>>Another way to test dubious urls would be to send out
>>>>>a crawler to look at the site before the user looked
>>>>>at the email mentioning it. You could use a Bayesian
>>>>>filter to rate the site just as you would an email,
>>>>>and whatever was found on the site could be included
>>>>>in calculating the probability of the email being a
>>>>>spam. A url that led to a redirect would of course be
>>>>>especially suspicious.
>>>>>
>>>>>One cooperative project that I think really would be a
>>>>>good idea would be to accumulate a giant corpus of
>>>>>spam. A large, clean corpus is the key to making
>>>>>Bayesian filtering work well. Bayesian filters could
>>>>>actually use the corpus as input. But such a corpus
>>>>>would be useful for other kinds of filters too,
>>>>>because it could be used to test them.
>>>>>
>>>>>Creating such a corpus poses some technical problems.
>>>>>We'd need trust metrics to prevent malicious or
>>>>>incompetent submissions, of course. We'd also need
>>>>>ways of erasing personal information (not just
>>>>>to-addresses and ccs, but also e.g. the arguments to
>>>>>unsubscribe urls, which often encode the to-address)
>>>>>from mails in the corpus. If anyone wants to take on
>>>>>this project, it would be a good thing for the world.
>>>>>
>>>>>Appendix: Defining Spam
>>>>>
>>>>>I think there is a rough consensus on what spam is,
>>>>>but it would be useful to have an explicit definition.
>>>>>We'll need to do this if we want to establish a
>>>>>central corpus of spam, or even to compare spam
>>>>>filtering rates meaningfully.
>>>>>
>>>>>To start with, spam is not unsolicited commercial
>>>>>email. If someone in my neighborhood heard that I was
>>>>>looking for an old Raleigh three-speed in good
>>>>>condition, and sent me an email offering to sell me
>>>>>one, I'd be delighted, and yet this email would be
>>>>>both commercial and unsolicited. The defining feature
>>>>>of spam (in fact, its raison d'etre) is not that it is
>>>>>unsolicited, but that it is automated.
>>>>>
>>>>>It is merely incidental, too, that spam is usually
>>>>>commercial. If someone started sending mass email to
>>>>>support some political cause, for example, it would be
>>>>>just as much spam as email promoting a porn site.
>>>>>
>>>>>I propose we define spam as unsolicited automated
>>>>>email. This definition thus includes some email that
>>>>>many legal definitions of spam don't. Legal
>>>>>definitions of spam, influenced presumably by
>>>>>lobbyists, tend to exclude mail sent by companies that
>>>>>have an "existing relationship" with the recipient.
>>>>>But buying something from a company, for example, does
>>>>>not imply that you have solicited ongoing email from
>>>>>them. If I order something from an online store, and
>>>>>they then send me a stream of spam, it's still spam.
>>>>>
>>>>>Companies sending spam often give you a way to
>>>>>"unsubscribe," or ask you to go to their site and
>>>>>change your "account preferences" if you want to stop
>>>>>getting spam. This is not enough to stop the mail from
>>>>>being spam. Not opting out is not the same as opting
>>>>>in. Unless the recipient explicitly checked a clearly
>>>>>labelled box (whose default was no) asking to receive
>>>>>the email, then it is spam.
>>>>>
>>>>>In some business relationships, you do implicitly
>>>>>solicit certain kinds of mail. When you order online,
>>>>>I think you implicitly solicit a receipt, and
>>>>>notification when the order ships.