June 2003
This site is no longer maintained.
My current weblog.
I'm finally taking the plunge and switching to Movable Type. I'm not going to bother importing this weblog, at least not initially. Too much work for too little benefit. My archives can stay here indefinitely.
My new home page and weblog. Feeds are available in RSS 0.91, RSS 1.0, and RSS 2.0 flavors. I'm not going to set up RSS redirects. I don't like Userland's solution because any aggregator that doesn't understand the format will barf on it. HTTP 301 redirects are better supported, but I don't feel like reconfiguring Apache to allow .htaccess files.
For the couple of people who subscribe to my category feeds, I'll get around to re-creating those eventually. Stay subscribed to the current feeds and wait for an update.
"With the newest version of the TiVo software (Version 3.2), TiVo has once again changed the secret password to enter "backdoor" mode, which lets advanced users enable hidden features. Unlike last time, people were not able to quickly find the new code, so a distributed computing project was started to find the backdoor codes. You can read about it Here, grab the Linux or Windows clients and pitch in some CPU time for a good cause." [Slashdot]
The mail parser has been updated to handle Outlook .MSG files.
There's a thread on corpus drifting that covers my thoughts on using positive reinforcement to help POPFile to learn. On the mailing list I am training POPFile on, it has missed 3 of 22 messages today. I'm thinking that POPFile needs about 100 messages in the corpus to get accuracy into the high 90s for mailing lists.
On the spam front, I seem to be in the middle of a drought. POPFile has missed 1 of 5 messages since yesterday.
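To put the miss counts above on the same scale as the "high 90s" target, here is a quick sketch that converts them into accuracy percentages. The `accuracy` helper is my own name for illustration, not anything POPFile provides.

```python
def accuracy(missed: int, total: int) -> float:
    """Return the fraction of messages that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - missed) / total


# Figures from the posts above: 3 of 22 missed on the mailing list,
# 1 of 5 missed on spam.
print(f"mailing list: {accuracy(missed=3, total=22):.1%}")  # 86.4%
print(f"spam:         {accuracy(missed=1, total=5):.1%}")   # 80.0%
```

With samples this small, a single miss moves the number by several points, so these are rough readings at best.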
I've found another bug: POPFile seems to top out at 8 simultaneous connections. I have 10 POP accounts in three of Outlook's "Send/Receive Groups." They have staggered times for checking mail but every so often they all overlap...
I got POPFile working. At another user's suggestion, I exported some mail folders as .csv text files instead of individual .msg binary files. So far I've put 5500 messages into the corpus, with about 5% being known spam and half of the remainder coming from 16 mailing lists.
I'm keeping all of my Outlook rules in place until I am confident in POPFile's classifications. I've added two rules for POPFile, for spam and a mailing list that I just joined. The new list is a good test of how quickly POPFile can be taught. The initial corpus was just 15 messages, and so far it has correctly classified 3 out of 5 new messages.
One problem I see with teaching POPFile is that the web interface only allows for negative reinforcement, i.e., this message is classified wrong, it should be this. For a small corpus, my gut feeling is that positive reinforcement would be more beneficial. There's probably a tipping point where that sort of feedback loop would have a negative effect on accuracy, but that is something for a math genius to figure out.
POPFile's author will be on TechTV today at 19:00 Eastern.
I wish I'd known about these types of filtering programs years ago. From 1998 until 2001, I would receive 10,000 messages on a normal day and several times that on really bad days. I needed over 100 Outlook rules to manage the chaos and focus my attention on the 5% that mattered to me. ifile was first released in late 1996.
A few months ago I wrote about the pains of backing up large drives. I use a 60GB drive for backups of important files from my main 120 gigger, but I think that I'll outgrow this solution in 6 months. Fortunately I have a pair of 30 giggers lying around...
Looking at the files I am backing up, well over 90% of the space used is static -- changes are rare, additions are infrequent. I need a long-term archiving solution. Burning those files to CD isn't very appealing: I would need about 100 of them (I'd want two copies of everything because I have little faith in CDRs for long-term storage). DVDs would be more practical, and I could probably find a Firewire burner to borrow...
What I'd really like is a hybrid online backup service. My upstream bandwidth is about 8KB/s on a good day, so doing an initial backup of this data over the Internet would take an insane amount of time. NetFlix has the right idea for moving large quantities of data around: the US Postal Service. Send me a Firewire/USB drive for that initial backup, use the Internet for incrementals. Archive my static data to tape and warehouse it somewhere -- if my system crashes I won't mind it taking some time to retrieve that data, so long as I get it back eventually. Keep my last incremental online and recent ones near-line; that's the stuff that I'll want back quickly.
I've got no idea if such a service could be made affordable for consumers, but it would certainly be more useful than a purely Internet-based backup service.
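For a sense of what "an insane amount of time" means at 8KB/s, here is a back-of-the-envelope sketch. The 35 GB archive size is an assumption, not a measured figure: it comes from reading "about 100 CDs, two copies of everything" as roughly 50 distinct discs at about 700 MB each. The sketch also treats 1 KB as 1000 bytes.

```python
SECONDS_PER_DAY = 86_400


def upload_days(size_bytes: float, rate_bytes_per_sec: float) -> float:
    """Days needed to push size_bytes through a link running flat out."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


# Assumption: ~50 distinct CDs x ~700 MB each = ~35 GB of static data.
ASSUMED_ARCHIVE_BYTES = 50 * 700e6
# From the post: about 8 KB/s upstream on a good day (1 KB taken as 1000 bytes).
UPSTREAM_BYTES_PER_SEC = 8 * 1000

days = upload_days(ASSUMED_ARCHIVE_BYTES, UPSTREAM_BYTES_PER_SEC)
print(f"initial backup: about {days:.1f} days of continuous uploading")
```

Under those assumptions the first upload would saturate the link for around seven weeks, which is why mailing a drive for the initial backup looks attractive.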
I've always wondered why client-side spam filters for Windows are designed to work only with certain mail clients. SpamNet and Spam Assassin Pro only work with Outlook 2000+, SpamNix with Eudora 3+, etc. These tools could reach a wider audience if they were built as generic POP/IMAP proxies.
Open Source to the rescue. POPFile is a POP3 proxy that uses "Naive Bayes" for classification, written in Perl but geared for Windows users. Pop3proxy and IMAPAssassin use the Spam Assassin engine.
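For anyone curious what "Naive Bayes" classification looks like in practice, here is a minimal sketch in Python. It is not POPFile's code (POPFile is written in Perl, and I haven't reproduced its tokenizer or scoring); the class name, method names, and sample messages are all made up for illustration. It uses the textbook multinomial form with add-one smoothing.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages trained
        self.vocabulary = set()                  # every word seen in training

    @staticmethod
    def tokenize(text):
        # Lowercase and split into runs of letters, digits, and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, bucket):
        """Add one message to the corpus for the given bucket."""
        words = self.tokenize(text)
        self.word_counts[bucket].update(words)
        self.message_counts[bucket] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the bucket with the highest log-probability for text."""
        if not self.message_counts:
            raise ValueError("train on at least one message first")
        total_messages = sum(self.message_counts.values())
        # Add-one smoothing; the extra +1 reserves a slot for unseen words.
        vocab_size = len(self.vocabulary) + 1
        words = self.tokenize(text)
        best_bucket, best_score = None, -math.inf
        for bucket, seen in self.message_counts.items():
            score = math.log(seen / total_messages)  # prior for the bucket
            bucket_total = sum(self.word_counts[bucket].values())
            for word in words:
                count = self.word_counts[bucket][word]
                score += math.log((count + 1) / (bucket_total + vocab_size))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("limited offer buy cheap", "spam")
nb.train("meeting notes for the perl list", "list")
nb.train("patch for the perl parser", "list")
print(nb.classify("buy cheap pills"))   # spam
print(nb.classify("perl patch notes"))  # list
```

The "naive" part is the assumption that each word counts as independent evidence, which is why a per-bucket word tally is all the model has to store.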