Bryce's Radio Experiments
Musings on general technology.

Thursday, October 31, 2002

POPFile, Part II

I got POPFile working. At another user's suggestion, I exported some mail folders as .csv text files instead of individual .msg binary files. So far I've put 5,500 messages into the corpus; about 5% are known spam, and half of the remainder comes from 16 mailing lists.

I'm keeping all of my Outlook rules in place until I am confident in POPFile's classifications. I've added two rules for POPFile: one for spam and one for a mailing list that I just joined. The new list is a good test of how quickly POPFile can be taught. The initial corpus was just 15 messages, and so far it has correctly classified 3 out of 5 new messages.

One problem I see with teaching POPFile is that the web interface only allows for negative reinforcement, i.e., "this message is classified wrong; it should be this." For a small corpus, my gut feeling is that positive reinforcement would be more beneficial. There's probably a tipping point where that sort of feedback loop would have a negative effect on accuracy, but that is something for a math genius to figure out.
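To make the two feedback styles concrete, here is a minimal Python sketch of a toy naive Bayes classifier. It is not POPFile's code and says nothing about how POPFile works internally; the class `TinyBayes`, the helper `feed`, and the `reinforce_correct` flag are all hypothetical names made up for illustration. With `reinforce_correct=False` the classifier only learns from its mistakes (the negative reinforcement described above); with `reinforce_correct=True` it also learns from messages it already got right.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """Toy multinomial naive Bayes. Illustrative only, not POPFile's implementation."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.msg_counts = Counter()              # bucket -> number of training messages

    def train(self, bucket, text):
        """Add one message to a bucket's corpus."""
        self.word_counts[bucket].update(text.lower().split())
        self.msg_counts[bucket] += 1

    def classify(self, text):
        """Return the most likely bucket, or None if nothing has been trained."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_msgs = sum(self.msg_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Log prior from how many messages each bucket has seen.
            score = math.log(self.msg_counts[bucket] / total_msgs)
            # Add-one smoothing, with one extra slot for unseen words.
            denom = sum(counts.values()) + len(vocab) + 1
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


def feed(clf, bucket, text, reinforce_correct):
    """Classify a message whose true bucket is known, then train.

    Always trains on a wrong guess; trains on a right guess only
    when reinforce_correct is True. Returns whether the guess was right.
    """
    correct = clf.classify(text) == bucket
    if not correct or reinforce_correct:
        clf.train(bucket, text)
    return correct
```

Running the same stream of labeled messages through two `TinyBayes` instances, one with each setting, and comparing how often `feed` returns `True` would be one rough way to look for the tipping point.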

POPFile's author will be on TechTV today at 19:00 Eastern.

I wish I'd known about these types of filtering programs years ago. From 1998 until 2001, I would receive 10,000 messages on a normal day and several times that on really bad days. I needed over 100 Outlook rules to manage the chaos and focus my attention on the 5% that mattered to me. ifile, one such program, was first released in late 1996.

1:16:30 PM | Comments: | Topics: bayesian spam 


© Copyright 2003 T Bryce Yehl
Last update: 6/29/2003; 10:00:23 PM.