Emmesmail takes a multi-faceted approach to junk mail that, in recent years, has achieved a spam rejection rate of about 99% with a false-positive rate (each rejection of a valid email counts as a false positive) of just a few percent. Emmesmail initially used a whitelist, a blacklist, a Bayesian filter and an "appropriateness" filter. The whitelist and blacklist were user-specific, locally generated files. In 2016, this scheme was simplified so that Emmesmail used only a locally generated whitelist, a Bayesian filter, and a filter that looked at the percentage of unrecognized words.

Here is an outline of how the filtering works:

1) The first thing Emmesmail does is to check if the sender is included in the local whitelist, a list of senders previously determined not to be spammers. If so, the email is immediately delivered.

2) Emmesmail used to check next if the sender was included in the local blacklist, a list of senders previously found to be spammers. If included, the email was re-directed from the recipient's mailbox to the spam mailbox. This step was eliminated in 2016 since virtually every email whose sender was on the blacklist was also caught by the Bayesian filter.

3) If the sender is not included in the whitelist, the entire email, including the header, is next examined by a Bayesian filter modeled after that of Paul Graham.

4) If the Bayesian filter reports that the email is likely spam, it is re-directed to the spam mailbox, appended to the database of spam emails (see information on Bayesian filtering below), and the sender is added to the blacklist. If the email is thought not to be spam based upon Bayesian analysis, it is then examined by additional filters. Prior to 2016, the next filter examined the ratio of characters to actual tokens (tokens are defined below), after which an "appropriateness" filter examined the appropriateness of the words used in the email. Starting in 2016, the token-paucity filter was eliminated and the "appropriateness" filter simply labeled as spam those emails where more than 40% of the included tokens were not already in the Bayesian filter's corpora.

5) If an email passes all the filtering, it is forwarded to the recipient's mailbox. Just having an email passed by the filtering process is not sufficient to add that email's sender to the whitelist. This only occurs once the email is saved by the recipient.
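The outline above, in its post-2016 form, can be sketched roughly as follows. This is a minimal illustration, not Emmesmail's actual code: the function name `classify`, its parameters, and the two callables standing in for the Bayesian and unrecognized-words filters are all hypothetical.

```python
from typing import Callable, Set


def classify(sender: str,
             message: str,
             whitelist: Set[str],
             bayes_is_spam: Callable[[str], bool],
             unrecognized_fraction: Callable[[str], float],
             max_unrecognized: float = 0.40) -> str:
    """Return "deliver" or "spam" (hypothetical labels) for one email."""
    # Step 1: a whitelisted sender bypasses every other filter.
    if sender in whitelist:
        return "deliver"
    # Steps 3-4: the Bayesian filter sees the whole email, header included.
    if bayes_is_spam(message):
        return "spam"
    # Step 4, continued: too many tokens unknown to the corpora.
    if unrecognized_fraction(message) > max_unrecognized:
        return "spam"
    # Step 5: the email passed all the filtering.
    return "deliver"
```

Note that the sketch does not add the sender to the whitelist on delivery; per step 5, that happens only when the recipient saves the email.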

If, upon checking the spam mailbox, the user finds that a mistake has been made and an innocent email has been diverted there, a single click will correct the mistake, deliver the mail to the intended recipient, and correct the databases.

Initially, Emmesmail rejects spam based upon Emmes Technologies' databases that come with the software, until such time as the user's databases become large enough to use.

When Emmesmail has determined that an email is spam, it can, if configured to do so, return the email to the spammer with a customizable "failure-to-deliver" message. Most authorities recommend that this feature not be used.

We found that in implementing the Bayesian filter described by Paul Graham, the following parameters needed to be defined.

| Parameter | Definition | Value chosen |
|---|---|---|
| MAXW | Maximum number of tokens allowed in the hash table | 250000 |
| MWDS | Maximum number of words considered when calculating weights | 9000 |
| WMIN | Minimum length of a hash table token | 2 |
| WMAX | Maximum length of a hash table token | 40 |
| PMIN | Minimum probability of a token | 0.0001 |
| PMAX | Maximum probability of a token | 0.9999 |
| PUNK | Probability given a token not seen previously | 0.5 |
| MINO | Minimum number of times a token must appear in corpora to count | 4 |
| MNUM | Maximum number of emails in each corpus before thinning | 350 |
| RNUM | Number of emails remaining after thinning | 250 |
| CUT | Likelihood above which an email is considered spam | 0.5 |
| NTW | Number of words to weigh in likelihood calculation | 15 |
| AFPB | Anti false-positive bias factor | 1.0 |
| - | Characters which act as token separators | `\040`, `\011`, `\012`, `@`, `?` |
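To make the separator and length parameters concrete, here is a small tokenizer sketch. It assumes that the octal codes in the table denote space, tab and newline, and that the WMIN and WMAX limits are inclusive; the function name `tokenize` is hypothetical and Emmesmail's own tokenizer may differ in detail.

```python
import re
from typing import List

WMIN = 2    # minimum token length (value from the table)
WMAX = 40   # maximum token length (value from the table)

# Separators from the table: space (\040), tab (\011), newline (\012), '@', '?'
_SEPARATORS = re.compile(r"[ \t\n@?]+")


def tokenize(text: str) -> List[str]:
    """Split text on the separator characters and drop tokens that are
    shorter than WMIN or longer than WMAX (inclusive limits assumed)."""
    return [t for t in _SEPARATORS.split(text) if WMIN <= len(t) <= WMAX]
```

For example, `tokenize("hi bob@example.com a?\tok")` keeps `hi`, `bob`, `example.com` and `ok`, and drops the single letter `a`.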

**WMIN**: Was set to 2 to avoid examining single letters.

**WMAX**: This eliminates long, undecipherable tokens such as those that occur in PDF documents.

**PMIN, PMAX**: Not 0 or 1, in order to avoid division by zero in the calculations. Also, if PMIN is too small, a single word can carry too much weight.

**MINO**: A word must occur at least four times in our corpora to be significant with regard to determining whether an email is spam. Graham used five, but we felt four might allow slightly less spam to pass during the filter's training period.

**MNUM, RNUM**: When one of our corpora grows to 350 emails, we reduce it to include only the 250 most recent and then add new ones until the total number is again 350.
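The thinning rule can be sketched as below. This assumes a corpus is held as a simple list ordered oldest first; the function name `add_email` is hypothetical, and the actual storage format is not described here.

```python
from typing import List

MNUM = 350  # corpus size that triggers thinning (value from the table)
RNUM = 250  # emails kept after thinning (value from the table)


def add_email(corpus: List[str], email: str) -> None:
    """Append an email to a corpus ordered oldest first; once the corpus
    reaches MNUM emails, keep only the RNUM most recent ones."""
    corpus.append(email)
    if len(corpus) >= MNUM:
        del corpus[:-RNUM]
```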

**CUT, NTW**: Like the original Paul Graham filter, we calculate the likelihood of an email being spam according to the formula

$$\text{Likelihood} = \frac{p_{\text{spam}}}{p_{\text{spam}} + p_{\text{nspam}}}$$

where $p_{\text{spam}} = w_1 w_2 w_3 \cdots w_n$, $p_{\text{nspam}} = (1-w_1)(1-w_2)\cdots(1-w_n)$, and the $w_n$ are the weights of the tokens in the email. Like the original Graham protocol, we arbitrarily consider only the NTW (15) most significant weights (those closest to 0 or 1) in the calculation of likelihood, and we reject emails whose likelihood of spam is greater than CUT. We set CUT to 0.5, a logical choice. Setting CUT to 0.9, as in Graham's formulation, gives the same results since, as he points out, the probabilities tend to be close to 0 or 1, with hardly any falling between 0.5 and 0.9.
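As an illustration of the likelihood formula, the following sketch picks the NTW weights farthest from 0.5 and combines them. The function name `spam_likelihood` is hypothetical, and the weights are assumed to have been clamped to the PMIN and PMAX range already, so the denominator cannot be zero.

```python
import math
from typing import Iterable

NTW = 15    # number of most significant weights used (value from the table)
CUT = 0.5   # likelihood above which an email is considered spam


def spam_likelihood(weights: Iterable[float], ntw: int = NTW) -> float:
    """Combine token weights into pspam / (pspam + pnspam)."""
    # "Most significant" means closest to 0 or 1, i.e. farthest from 0.5.
    top = sorted(weights, key=lambda w: abs(w - 0.5), reverse=True)[:ntw]
    pspam = math.prod(top)
    pnspam = math.prod(1.0 - w for w in top)
    return pspam / (pspam + pnspam)


def is_spam(weights: Iterable[float]) -> bool:
    """Apply the CUT threshold to the combined likelihood."""
    return spam_likelihood(weights) > CUT
```

With no weights at all, both products are 1 and the likelihood is exactly 0.5, so such an email is not rejected by this sketch.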

**AFPB**: The anti false-positive bias factor. Strictly, the weights $w_n$ should be calculated according to the formula

$$w_n = \frac{a}{a + b}$$

where $a$ and $b$ are the frequencies of the word in the spam and non-spam corpora, respectively. The description of the original Graham filter recommended counting the words in the non-spam corpus twice in order to reduce the incidence of false positives. In our implementation this amounts to using the formula

$$w_n = \frac{a}{a + b \cdot \text{AFPB}}$$

where AFPB is 2.0. We tried values for AFPB ranging from 0.4 to 3.0 before setting AFPB to 1.0, essentially eliminating it as a variable.
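A minimal sketch of the weight calculation follows. Two points are assumptions rather than something stated above: that MINO applies to the combined count in both corpora, and that a token falling below MINO is treated like an unseen token and given PUNK. The function name `token_weight` is hypothetical.

```python
PMIN, PMAX = 0.0001, 0.9999  # clamp limits (values from the table)
PUNK = 0.5                   # weight of a token not seen previously
MINO = 4                     # minimum occurrences for a token to count
AFPB = 1.0                   # anti false-positive bias factor


def token_weight(a: int, b: int, afpb: float = AFPB) -> float:
    """Weight of a token seen a times in the spam corpus and b times in
    the non-spam corpus, computed as a / (a + b * AFPB) and clamped."""
    if a + b < MINO:
        # Assumption: too-rare tokens are handled like unseen ones.
        return PUNK
    w = a / (a + b * afpb)
    return min(PMAX, max(PMIN, w))
```

With `afpb=2.0`, a token seen three times in each corpus gets a weight of 3/9 rather than 0.5, which is how the bias factor leans the filter away from false positives.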

Our attempt to implement Graham's formulation exactly did not, initially, achieve as high a spam rejection rate as he reported, so we made a number of changes to our spam filtering, introducing what we refer to as hierarchical filtering. With this system, the Bayesian filter is just one of a number of filters, applied in sequence.

Currently, the first filter we apply is sender-filtering, which uses a user-specific whitelist and blacklist. Then the Bayesian filter is applied.

Next, those emails passing the sender-filtering and Bayesian filtering are challenged by a "token-paucity" filter which examines the ratio of characters to actual tokens and traps those spam emails that avoid detection by Bayesian filters by not containing very many words in ASCII or UTF-8 format.
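The idea behind the token-paucity test can be sketched as follows. The threshold `MAX_CHARS_PER_TOKEN` is a purely hypothetical value, since no cut-off is given here, and the inline tokenization (separators, length limits of 2 and 40) simply mirrors the parameter table; the function names are likewise illustrative.

```python
import re

# Hypothetical threshold: the text does not state the actual cut-off.
MAX_CHARS_PER_TOKEN = 50.0


def chars_per_token(message: str) -> float:
    """Ratio of total characters to usable tokens in a message."""
    tokens = [t for t in re.split(r"[ \t\n@?]+", message)
              if 2 <= len(t) <= 40]
    if not tokens:
        # No usable tokens at all: treat the ratio as unbounded.
        return float("inf")
    return len(message) / len(tokens)


def fails_token_paucity(message: str,
                        limit: float = MAX_CHARS_PER_TOKEN) -> bool:
    """True when the message has too few tokens for its size."""
    return chars_per_token(message) > limit
```

A message consisting mostly of an encoded attachment, for instance, yields very long strings that are discarded as tokens, so its characters-per-token ratio is high.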

Currently, the final filter, one which we have been using since 2006, is an "appropriateness" filter. The logic behind this is as follows. Standard emails, both spam and non-spam, contain a relatively narrow range of vocabulary, so that once the spam and non-spam corpora are reasonably sized, the majority of the words in all emails are already in the stored corpora. Some spammers choose to put random words in their emails, and sometimes these help them pass the Bayesian filter. Non-spam senders almost never include large numbers of unrecognized words in their emails. In order to trap the tiny fraction of spam emails with unrecognized words that might otherwise not get caught, the "appropriateness" filter examines whether those emails passing all previous filters contain a majority of "appropriate" words or not.
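A sketch of such an unrecognized-words test is shown below. It uses the 40% limit quoted in the outline for the post-2016 filter; the function names, the representation of the corpora as a set of known tokens, and the handling of an email with no tokens are assumptions made for illustration.

```python
from typing import Iterable, Set

MAX_UNRECOGNIZED = 0.40  # post-2016 limit quoted in the outline


def unrecognized_fraction(tokens: Iterable[str], known: Set[str]) -> float:
    """Fraction of an email's tokens that are absent from the corpora."""
    token_list = list(tokens)
    if not token_list:
        # Assumption: an email with no tokens has nothing unrecognized.
        return 0.0
    unknown = sum(1 for t in token_list if t not in known)
    return unknown / len(token_list)


def fails_appropriateness(tokens: Iterable[str], known: Set[str],
                          limit: float = MAX_UNRECOGNIZED) -> bool:
    """True when more than `limit` of the tokens are unrecognized."""
    return unrecognized_fraction(tokens, known) > limit
```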

We currently are achieving results as good as or better than those of Graham. This is in part because of our modifications, but it is likely that the initial failure to duplicate Graham's excellent results was due to programming bugs, which have since been eliminated.

Before using sender-filtering, we make certain to prevent our own email address from appearing on either the whitelist or blacklist, thus frustrating spammers who send spam that appears to come from the intended recipient.

| Year | Spam emails received | Spam emails rejected | Rejection rate (%) | Valid emails received | Valid emails rejected | False positives (%) |
|---|---|---|---|---|---|---|
| 2003 | 276 | 256 | 92.8 | 682 | 28 | 4.1 |
| 2004 | 1173 | 1099 | 93.7 | 834 | 15 | 1.8 |
| 2005 | 2749 | 2624 | 95.5 | 1008 | 10 | 1.0 |
| 2006 | 11677 | 11401 | 97.6 | 804 | 16 | 2.0 |
| 2007 | 11622 | 11433 | 98.4 | 642 | 9 | 1.4 |
| 2008 | 11879 | 11579 | 97.5 | 1060 | 11 | 1.0 |
| 2009 | 1523 | 1504 | 98.8 | 607 | 4 | 0.7 |
| 2010 | 805 | 785 | 97.5 | 678 | 8 | 1.2 |
| 2011 | 784 | 773 | 98.6 | 528 | 5 | 0.9 |
| 2012 | 874 | 863 | 98.7 | 568 | 9 | 1.6 |
| 2013 | 1905 | 1882 | 98.8 | 639 | 9 | 1.4 |
| 2014 | 1982 | 1970 | 99.4 | 658 | 4 | 0.6 |
| 2015 | 2020 | 2001 | 99.1 | 611 | 15 | 2.5 |
| 2016 | 2264 | 2234 | 98.7 | 647 | 13 | 2.0 |

Use of a whitelist and blacklist to augment the Bayesian filter was introduced in mid-2004. A filter looking at "unrecognized" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection. In early 2014, token-paucity filtering was introduced, but it was applied only when the user had specified the use of sender filtering. The logic behind this was that many spammers were reducing the number of words in an email to hinder Bayesian filtering. Whereas a friend might send you an email containing only a few words like "my pics", a legitimate sender you don't know would have to explain with something like "these are pics I am sending to persuade you to go on a date with me". Emmesmail assumes that if someone you don't know sends you an email with just "My pics", they are likely a spammer.

Prior to 2011, Emmesmail ran only under Microsoft Windows. Since then, it has run only under Linux.

In early 2015, we refined our email classification scheme in order to allow better assessment of the efficiency of each filter. Each email was classified according to one of the following:

ok-whitelist (sender is in the whitelist)

ok-passed-all (all filters used thought the email was not spam)

spam-blacklist (sender is in the blacklist)

ok-fp-blacklist (sender appeared to be in blacklist, probably as a wildcard entry, but the email was not spam)

spam-bayes (the Bayesian filter thought the email was spam)

ok-fp-bayes (the Bayesian filter thought the email was spam, but it was not)

spam-token-paucity (too few tokens for the number of characters in email)

ok-fp-token-paucity (the token-paucity filter misdiagnosed this valid email)

spam-unrecognized-words

ok-fp-unrecognized-words (high fraction of words not previously seen in emails, but not spam)

spam-missed (this spam email was missed by all the filters)

spam-user-defined (someone or someone posing as someone in the whitelist sent this spam email)

This allowed a more detailed description of the filter results.
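One way such labels could be tallied is sketched below. It relies only on the naming pattern visible in the list: labels beginning with `ok-` mark valid emails, those beginning with `ok-fp-` mark valid emails that a filter wrongly flagged, and those beginning with `spam-` mark spam. The function name `summarize` and the output keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(labels: Iterable[str]) -> Dict[str, int]:
    """Count valid emails, false positives and spam from class labels."""
    counts = Counter(labels)
    valid = sum(n for lab, n in counts.items() if lab.startswith("ok-"))
    false_pos = sum(n for lab, n in counts.items()
                    if lab.startswith("ok-fp-"))
    spam = sum(n for lab, n in counts.items() if lab.startswith("spam-"))
    return {"valid": valid, "false_positives": false_pos, "spam": spam}
```

Keeping the filter name in each label also makes it possible to attribute every catch and every false positive to the filter responsible.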

In the tables below, the filters are listed in the order applied. The number of emails tested by each spam filter goes down sequentially because the filtering is hierarchical and if the email is declared spam by one test, or valid by the whitelist, no more tests are done. The "No. tested" column lists the number of spam emails tested by that filter and is the denominator for the "% spam rejected" entry. The denominator for "% false positives" is the total number of valid emails received (including the false positives).

2015:

| Spam filter type | No. tested | No. caught | False positives | % spam rejected | % false positives |
|---|---|---|---|---|---|
| Sender-filtering | 2020 | 1404 | 0 | 70 | 0 |
| Bayesian-filtering | 616 | 520 | 7 | 84 | 1.1 |
| Token-paucity | 96 | 67 | 9 | 70 | 1.4 |
| Unrecognized-words | 29 | 10 | 0 | 34 | 0 |
| All filters combined | 2020 | 2001 | 15 | 99.1 | 2.5 |

| Total valid emails | Whitelisted | Passed all filters | False positives | % false positives |
|---|---|---|---|---|
| 611 | 590 | 6 | 15 | 2.5 |

2016:

| Spam filter type | No. tested | No. caught | False positives | % spam rejected | % false positives |
|---|---|---|---|---|---|
| Bayesian-filtering | 2264 | 2193 | 7 | 97 | 1.1 |
| Unrecognized-words | 69 | 39 | 6 | 57 | 0.9 |
| All filters combined | 2264 | 2234 | 13 | 98.7 | 2.0 |

| Total valid emails | Whitelisted | Passed all filters | False positives | % false positives |
|---|---|---|---|---|
| 647 | 579 | 55 | 13 | 2.0 |
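The hierarchical bookkeeping behind these tables can be reproduced with a few lines of arithmetic. The sketch below uses the counts from the first pair of tables above (the one that includes sender-filtering and token-paucity, whose totals match the 2015 row of the yearly table); the function name `stage_table` and the tuple layout are illustrative choices.

```python
from typing import List, Tuple

# (filter name, spam caught), in the order the filters were applied.
STAGES = [
    ("Sender-filtering", 1404),
    ("Bayesian-filtering", 520),
    ("Token-paucity", 67),
    ("Unrecognized-words", 10),
]


def stage_table(spam_received: int, stages: List[Tuple[str, int]]
                ) -> Tuple[List[Tuple[str, int, int, float]], int]:
    """For each filter return (name, tested, caught, % rejected), where
    "tested" is the spam left over from the previous filters. Also
    return the number of spam emails missed by every filter."""
    rows = []
    remaining = spam_received
    for name, caught in stages:
        rows.append((name, remaining, caught, 100.0 * caught / remaining))
        remaining -= caught
    return rows, remaining


rows, missed = stage_table(2020, STAGES)
for name, tested, caught, pct in rows:
    print(f"{name}: tested {tested}, caught {caught}, {pct:.1f}% rejected")
print(f"overall: {100.0 * (2020 - missed) / 2020:.1f}% rejected")
print(f"false positives: {100.0 * 15 / 611:.1f}% of valid emails")
```

Each filter's "No. tested" is the previous filter's "No. tested" minus its "No. caught", which is why the column shrinks from 2020 to 616, 96 and 29.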

In 2016, sender-filtering was separated into filtering against a whitelist and filtering against a blacklist. Black-sender-filtering was eliminated after it was noted that nearly all emails from senders on the blacklist were caught by the Bayesian filter as well. White-sender-filtering was continued in order to reduce false positives when people on the whitelist occasionally sent unusual, but nonetheless valid, emails that failed Bayesian filtering. Token-paucity filtering also was eliminated because it caused too many false positives relative to the number of additional spam emails it caught.

Emmesmail initially used a whitelist, a blacklist, a Bayesian filter and an "appropriateness" or "unrecognized words" filter. As noted above, in 2016 we eliminated blacklist-filtering. This caused no decrease in our spam rejection rate, but had the pleasantly unexpected side effect that the occasional interesting email from a known spammer was passed through. A classic example is that the US State Department Consulate had been on my list of spammers because of its dozens of unnecessary emails telling me to stay away from Arab murderers. After eliminating the blacklist, the State Department emails warning of Arab murderers continued to be classified as spam, but when they sent an email telling me how to register for an absentee election ballot, which was of interest to me, that passed the Bayesian filter.

The success of eliminating the blacklist has encouraged us to try also to eliminate the "unrecognized" filter. In 2016, the "unrecognized" filter increased our overall spam rejection rate and also increased our false positive rate. In 2017, our intention is to experiment with modifying our Bayesian filter to operate without any additional filters (aside from whitelist filtering). For many years prior to 2017, an unrecognized word in the Bayesian filter was assigned a spam-probability of PUNK = 0.5, essentially eliminating it from the calculation of the overall probability of spam. This was done even though the assumption of our unrecognized filter was that the higher the fraction of unrecognized words, the higher the likelihood an email is spam. Our plan for 2017, after the unrecognized filter is eliminated, is to slowly increase the value of PUNK and examine the effect this has on the spam rejection and false positive rates. Over the course of the year, we will fill in the table below and then decide the best value for PUNK to use in 2018.

| Period | PUNK | Spam received | Spam caught | Valid, whitelisted | Valid, passed all | False positives | % spam rejected | % false positives |
|---|---|---|---|---|---|---|---|---|
| All 2016 | 0.5 | 2264 | 2193 | 579 | 55 | 7 | 97.0 | 1.1 |
| Jan-Mar 2017 | 0.6 | 576 | 558 | 222 | 20 | 1 | 96.9 | 0.5 |
| Apr-Jun 2017 | 0.7 | 281 | 275 | 98 | 7 | 2 | 97.9 | 1.9 |
| Jul-Sep 2017 | 0.8 | | | | | | | |
| Oct-Dec 2017 | 0.9 | | | | | | | |
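To see why PUNK = 0.5 removes unrecognized words from the calculation while larger values do not, consider the toy calculation below. It is a simplified sketch: it ignores the NTW selection step, and the two "known" weights and the count of three unrecognized tokens are made-up illustrative numbers, not measured data.

```python
import math
from typing import List


def likelihood(weights: List[float]) -> float:
    """pspam / (pspam + pnspam) over all supplied weights."""
    pspam = math.prod(weights)
    pnspam = math.prod(1.0 - w for w in weights)
    return pspam / (pspam + pnspam)


known = [0.2, 0.3]   # hypothetical weights of two recognized tokens
n_unknown = 3        # hypothetical number of unrecognized tokens

for punk in (0.5, 0.6, 0.7, 0.8, 0.9):
    value = likelihood(known + [punk] * n_unknown)
    print(f"PUNK={punk}: likelihood={value:.3f}")
```

At PUNK = 0.5 each unrecognized token multiplies both products by the same factor, so the likelihood is unchanged; any value above 0.5 pushes the likelihood toward spam, and more strongly the more unrecognized tokens an email contains.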

Updated 22 May, 2017