Bayesian spam filtering


Bayesian spam filtering (pronounced BAYS-ee-ən, IPA: [ˈbeɪz.i.ən], after Rev. Thomas Bayes), a form of e-mail filtering, is the process of using a naive Bayes classifier to identify spam e-mail.

The first known mail-filtering program to use a Bayes classifier was Jason Rennie's ifile program, released in 1996. The program was used to sort mail into folders. [Jason Rennie, "ifile", 1996, http://people.csail.mit.edu/jrennie/ifile/old/README-0.1A] The first scholarly publication on Bayesian spam filtering was by Sahami et al. (1998). [M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, "A Bayesian approach to filtering junk e-mail", AAAI'98 Workshop on Learning for Text Categorization, 1998, http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf] Variants of the basic technique have been implemented in a number of research works and commercial software products. In 2002, the principles of Bayesian filtering were publicized to a more general audience in an essay by Paul Graham. [Paul Graham, "A Plan for Spam", 2002, http://www.paulgraham.com/spam.html]

Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email (sometimes called "ham" or "bacn"). [Word Spy, "Ham", http://www.wordspy.com/words/ham.asp] Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself.

Mathematical foundation

Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:

$$\Pr(\mathrm{spam} \mid \mathrm{words}) = \frac{\Pr(\mathrm{words} \mid \mathrm{spam})\,\Pr(\mathrm{spam})}{\Pr(\mathrm{words})}$$
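
As a worked illustration of the formula, the sketch below plugs in made-up numbers. The denominator Pr(words) is expanded with the law of total probability over the two classes (spam and legitimate mail). The function name `posterior_spam` and all three input values are assumptions chosen for the example, not measurements.

```python
def posterior_spam(p_words_given_spam: float,
                   p_spam: float,
                   p_words_given_ham: float) -> float:
    """Return Pr(spam | words) using Bayes' theorem.

    Pr(words) is computed as
    Pr(words | spam) * Pr(spam) + Pr(words | ham) * Pr(ham).
    """
    p_ham = 1.0 - p_spam
    p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham
    return p_words_given_spam * p_spam / p_words


# Hypothetical inputs: the words appear in 20% of spam and 1% of
# legitimate mail, and half of all incoming mail is spam.
p = posterior_spam(p_words_given_spam=0.20, p_spam=0.50, p_words_given_ham=0.01)
print(f"{p:.3f}")  # 0.952
```

With these assumed inputs, seeing the words raises the probability of spam from the prior of 50% to roughly 95%.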

Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word Viagra in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For every word in each training email, the filter adjusts its stored estimates of how likely that word is to appear in spam and in legitimate email. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
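
The training step can be sketched as simple counting. The following is a minimal illustration, not the method of any particular filter. The class name `WordStats`, the tokenizer, the add-one smoothing, and the assumption of equal prior probability for spam and legitimate mail are all choices made for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> set:
    """Lower-case the text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class WordStats:
    """Per-user counts of how many spam and ham messages contain each word."""

    def __init__(self) -> None:
        self.spam_docs = 0
        self.ham_docs = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        """Update the counts from one message labelled by the user."""
        words = tokenize(text)
        if is_spam:
            self.spam_docs += 1
            self.spam_counts.update(words)
        else:
            self.ham_docs += 1
            self.ham_counts.update(words)

    def spam_probability(self, word: str) -> float:
        """Estimate Pr(spam | word), assuming equal priors for both classes.

        Add-one smoothing keeps the estimate away from exactly 0 or 1,
        so a word never seen in training scores a neutral 0.5.
        """
        p_word_spam = (self.spam_counts[word] + 1) / (self.spam_docs + 2)
        p_word_ham = (self.ham_counts[word] + 1) / (self.ham_docs + 2)
        return p_word_spam / (p_word_spam + p_word_ham)


stats = WordStats()
stats.train("Cheap Viagra now", is_spam=True)
stats.train("Refinance your mortgage, Viagra included", is_spam=True)
stats.train("Lunch with Alice tomorrow", is_spam=False)
stats.train("Alice sent the report", is_spam=False)

print(stats.spam_probability("viagra"))  # 0.75
print(stats.spam_probability("alice"))   # 0.25
```

Corrective training, where the user relabels a misclassified message, amounts to calling `train` again with the right label.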

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email containing a particular set of words belongs to either category. Each word in the email contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem. The email's spam probability is then computed over all the words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam. Email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright.
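
One common way to combine the per-word probabilities into a single score is sketched below. It assumes the words are independent of one another (the "naive" assumption) and that spam and legitimate mail are equally likely beforehand. The function names `combine` and `is_spam` are hypothetical, and the 0.95 default simply mirrors the 95% threshold mentioned above. Sums of logarithms are used instead of products so that long messages do not underflow.

```python
import math


def combine(word_probs) -> float:
    """Combine per-word spam probabilities into one message score.

    Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    Every probability must lie strictly between 0 and 1.
    """
    log_spam = sum(math.log(p) for p in word_probs)
    log_ham = sum(math.log(1.0 - p) for p in word_probs)
    diff = log_ham - log_spam
    if diff > 700.0:  # math.exp would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def is_spam(word_probs, threshold: float = 0.95) -> bool:
    """Mark the message as spam when its combined score exceeds the threshold."""
    return combine(word_probs) > threshold


# Hypothetical per-word values: two strongly spammy words, one innocent one.
score = combine([0.99, 0.90, 0.20])
print(f"{score:.3f}", is_spam([0.99, 0.90, 0.20]))  # 0.996 True
```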

Advantages

The advantage of Bayesian spam filtering is that it can be trained on a per-user basis.

The spam that a user receives is often related to the user's online activities. For example, a user may have been subscribed to an online newsletter that the user considers to be spam. This newsletter is likely to contain words that are common to all of its issues, such as the name of the newsletter and its originating email address. A Bayesian spam filter will eventually assign a higher spam probability to those words, based on the user's specific patterns.

The legitimate e-mails a user receives will tend to be different. For example, in a corporate environment, the company name and the names of clients or customers will be mentioned often. The filter will assign a lower spam probability to emails containing those names.

The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is often superior to pre-defined rules.

It can perform particularly well in avoiding false positives, where legitimate email is incorrectly classified as spam. For example, if the email contains the word "Nigeria", which is frequently used in advance-fee fraud spam, a filter based on pre-defined rules might reject it outright. A Bayesian filter would mark the word "Nigeria" as a probable spam word, but would also take into account other important words that usually indicate legitimate e-mail. For example, the name of a spouse may strongly indicate that the e-mail is not spam, which could outweigh the use of the word "Nigeria". Some spam filters combine the results of both Bayesian spam filtering and pre-defined rules, resulting in even higher filtering accuracy.
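
The contrast between the two approaches can be made concrete with a small sketch. Everything in it is hypothetical: the blocked-word rule, the learned probabilities (0.95 for "nigeria", 0.02 for the spouse's name "alice"), the neutral 0.5 for unknown words, and the assumption that the words are independent with equal prior probability for spam and legitimate mail.

```python
# Hypothetical fixed rule: reject any message containing a blocked word.
BLOCKED_WORDS = {"nigeria"}

# Hypothetical values a trained Bayesian filter might hold for one user.
WORD_SPAM_PROB = {"nigeria": 0.95, "alice": 0.02}


def rule_filter(words) -> bool:
    """Return True (reject) if any blocked word is present."""
    return any(w in BLOCKED_WORDS for w in words)


def bayes_score(words) -> float:
    """Combine per-word spam probabilities; unknown words count as 0.5."""
    spam, ham = 1.0, 1.0
    for w in words:
        p = WORD_SPAM_PROB.get(w, 0.5)
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


email = ["alice", "nigeria", "trip"]
print(rule_filter(email))            # True: the rule rejects the message
print(f"{bayes_score(email):.3f}")   # 0.279: far below a 0.95 threshold
```

Under these assumed numbers the spouse's name pulls the score down to about 0.28, so the Bayesian filter lets the message through while the fixed rule produces a false positive.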

Disadvantages

Bayesian spam filtering is susceptible to Bayesian poisoning, a technique used by spammers to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning sends out emails containing large amounts of legitimate text (gathered from legitimate news or literary sources). Tactics include inserting random innocuous words that are not normally associated with spam, which lowers the email's spam score and makes it more likely to slip past a Bayesian spam filter. A similar tactic is used by advertising-oriented web pages, which place "random word" pages on their sites to alter the behavior of search engine spiders (scripts that automatically add sites to a search engine).
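
The effect of poisoning on a score can be illustrated with a toy calculation. The word list, the probabilities, and the scoring function `spam_score` below are assumptions made for the example. The padding words only help the spammer if the recipient's filter has learned a low spam probability for them.

```python
# Hypothetical learned spam probabilities for one user's filter.
SPAM_PROB = {
    "viagra": 0.99,
    "refinance": 0.90,
    "weather": 0.20,
    "garden": 0.20,
    "chapter": 0.20,
    "river": 0.20,
}


def spam_score(words) -> float:
    """Combine per-word probabilities, assuming independence and equal priors."""
    spam, ham = 1.0, 1.0
    for w in words:
        p = SPAM_PROB.get(w, 0.5)  # unknown words are treated as neutral
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


original = ["viagra", "refinance"]
poisoned = original + ["weather", "garden", "chapter", "river"]

print(f"{spam_score(original):.3f}")  # 0.999
print(f"{spam_score(poisoned):.3f}")  # 0.777
```

With a 95% threshold, the unpadded message is caught and the padded one is not, which is the outcome the spammer is aiming for.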

General applications of Bayesian filtering

While Bayesian filtering is widely used to identify spam email, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. One example is a general-purpose classification program called AutoClass (http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/), which was originally used to classify stars according to spectral characteristics that were otherwise too subtle to notice. There is recent speculation that even the brain uses Bayesian methods to classify sensory stimuli and decide on behavioural responses. [Trends in Neurosciences, 27(12):712-9, 2004, http://www.bcs.rochester.edu/people/alex/pub/articles/KnillPougetTINS04.pdf (pdf)]

See also

* Bayesian poisoning
* Bayesian inference
* Bayes' theorem
* Email filtering
* Markovian discrimination
* Naive Bayes classifier
* Recursive Bayesian estimation
* Stopping e-mail abuse

External links

* Guide to Bayesian spam filters: part 1 (http://lwn.net/Articles/172491/), part 2 (http://lwn.net/Articles/173910/)
* Detailed explanation of Paul Graham's formulas (http://mail.python.org/pipermail/python-dev/2002-August/028216.html)
* Gary Robinson's Linux Journal article (http://www.linuxjournal.com/article/6467)
* Gary Robinson's spam blog (http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html)
* Why Bayesian filtering is the most effective anti-spam technology (http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf)


Wikimedia Foundation. 2010.
