Building an email archive
As part of creating a life archive structure, I've started to work on how to structure my archival email. I have a few sources of historical email right now:
- mbox-formatted email archive from college: I have these due to having a backup of a backup of a backup of a computer
- Thunderbird email files: I have these from a CD backup I made a long time ago
- Gmail
My goal is a relatively timeless archive: one that is easy to use and that will be easy to access and preserve over time. Here's what I have so far.
Archive structure
In my mnemosyne folder I have a subfolder mail.
This folder is controlled by dovecot using maildir-style folders. The key line in the dovecot configuration file for doing this is:
mail_location=maildir:/path/to/mnemosyne/mail:LAYOUT=fs
(:LAYOUT=fs tells dovecot to store mailboxes as ordinary nested directories instead of Maildir++-style folder names that begin with a dot.)
With folders, the mailbox structure is going to look like this:
received
  2010
  2011
  2012
  2013
  …
sent
  2010
  2011
  2012
  2013
  …
That is, I'm going to reorganize my mail into folders based only on whether it was sent or received, and then by year.
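If the year mailboxes need to exist before mail is moved into them, a small loop can generate the commands. This is only a sketch: it prints doveadm mailbox create commands instead of running them, and the function name and the year range are made up for illustration.

```shell
#!/bin/sh
# Print (rather than run) one "doveadm mailbox create" command per mailbox.
# print_create_commands, first_year and last_year are hypothetical names.
print_create_commands() {
    first_year=$1
    last_year=$2
    for kind in received sent; do
        year=$first_year
        while [ "$year" -le "$last_year" ]; do
            printf 'doveadm mailbox create %s/%s\n' "$kind" "$year"
            year=$((year + 1))
        done
    done
}

# Review the output first; to actually run it, pipe it to sh:
#   print_create_commands 2010 2013 | sh
print_create_commands 2010 2013
```

Printing first makes it easy to check the mailbox names before anything touches the archive.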
Pulling Gmail
In the case of Gmail I'm using the excellent lieer tool to download all my email. This is my "working copy," not the archive: I use notmuch and lieer to move emails around in this copy (e.g. when I change their tags). This is a two-way synchronization with Gmail. However, it is also a reasonable starting place for creating an archive.
Importing
I am creating a temporary mailbox folder for each import, e.g. import-mbox.
For each set of mail, I get the mail set up in a directory and then I use doveadm import to import it into the temporary folder. For example, for the mbox export, the mail was in a folder ~/tmp/email as a bunch of files, e.g. ~/tmp/email/sent. I could import this with
doveadm import ~/tmp/email import-mbox ALL
The ALL means import all the mail.
Restructuring the imported mail
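I then used doveadm to search through this mail and restructure it, using commands like this:
doveadm move sent/2010 mailbox 'import-mbox/sent*' SENTSINCE 2010-01-01 SENTBEFORE 2011-01-01
doveadm move received/2010 mailbox 'import-mbox/*' SENTSINCE 2010-01-01 SENTBEFORE 2011-01-01
The sent command has to run first, because import-mbox/* also matches the sent mailboxes. SENTSINCE is inclusive, so the range starts on 1 January, and the patterns are quoted so that the shell does not expand them.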
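Repeating this by hand for every year gets tedious, so here is a sketch that prints the same pair of commands for a range of years. The function name and the year range are assumptions for illustration. It also assumes SENTSINCE is inclusive (as in IMAP search), so each range runs from 1 January up to, but not including, 1 January of the next year. Nothing is executed until the output is piped to sh.

```shell
#!/bin/sh
# Print the doveadm move commands for each year in a range.
# print_move_commands, first_year and last_year are hypothetical names.
print_move_commands() {
    first_year=$1
    last_year=$2
    year=$first_year
    while [ "$year" -le "$last_year" ]; do
        next=$((year + 1))
        query="SENTSINCE $year-01-01 SENTBEFORE $next-01-01"
        # Sent mail goes first: the broader import-mbox/* pattern also matches sent*.
        printf "doveadm move sent/%s mailbox 'import-mbox/sent*' %s\n" "$year" "$query"
        printf "doveadm move received/%s mailbox 'import-mbox/*' %s\n" "$year" "$query"
        year=$next
    done
}

# Review the output first; to actually run it, pipe it to sh:
#   print_move_commands 2010 2013 | sh
print_move_commands 2010 2013
```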
Assuming that worked, dovecot still separates "unread" email from "read" email, since after all dovecot is designed to be an IMAP server, not a mail archive manager. On the filesystem, "unread" email lives in folders called new and "read" email lives in folders called cur. I therefore went into the folders and moved the messages manually with mv, using a command similar to:
# Move every message from each mailbox's new folder into its cur folder.
cd /path/to/mnemosyne/mail
for i in */*/new
do
    for f in "$i"/*
    do
        [ -e "$f" ] || continue   # nothing to move in an empty new folder
        mv -i "$f" "${i%/new}/cur/"
    done
done
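Moving a file into cur does not by itself change its flags. Under the Maildir convention, flags are stored in the filename after a ":2," suffix, and S marks a message as seen, so a variant that also marks everything as read might look like the sketch below. The function name and its argument are made up for illustration, and it is worth trying on a copy of the mail folder first, with dovecot not running.

```shell
#!/bin/sh
# Move every message from new/ to cur/ under a mail root and append the
# Maildir "seen" flag. mark_all_seen and mail_root are hypothetical names.
mark_all_seen() {
    mail_root=$1
    find "$mail_root" -type d -name new | while IFS= read -r new_dir; do
        cur_dir="${new_dir%/new}/cur"
        mkdir -p "$cur_dir"
        for f in "$new_dir"/*; do
            [ -f "$f" ] || continue          # skip empty new/ directories
            base=${f##*/}
            case "$base" in
                *:2,*) target=$base ;;       # already has a flag suffix: keep it
                *)     target="$base:2,S" ;; # add the suffix with S (seen)
            esac
            mv -i "$f" "$cur_dir/$target"
        done
    done
}

# Example (path as in the text above):
#   mark_all_seen /path/to/mnemosyne/mail
```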
I can use doveadm mailbox list and doveadm mailbox status all '*' to see what's happening within each mailbox directory.
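A filesystem-level cross-check is also possible without dovecot. The sketch below counts the files in new and cur for every mailbox under a mail root. The function name and the output format are assumptions for illustration, and the mail root should be given without a trailing slash.

```shell
#!/bin/sh
# For every mailbox (any directory containing cur/), print how many files
# sit in new/ and in cur/. count_messages is a hypothetical name.
count_messages() {
    mail_root=$1
    find "$mail_root" -type d -name cur | sort | while IFS= read -r cur_dir; do
        mailbox=${cur_dir%/cur}
        new_count=$(find "$mailbox/new" -type f 2>/dev/null | wc -l | tr -d ' ')
        cur_count=$(find "$cur_dir" -type f | wc -l | tr -d ' ')
        # Show the mailbox name relative to the mail root.
        printf '%s new=%s cur=%s\n' "${mailbox#"$mail_root"/}" "$new_count" "$cur_count"
    done
}

# Example (path as in the text above):
#   count_messages /path/to/mnemosyne/mail
```

After the move from new to cur, every mailbox should report new=0.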
Deduplicating email
I am trying to use doveadm deduplicate ALL to remove duplicate email, but I don't know how well this command works. I'm not sure how it decides that two messages are duplicates, so it may not do what I want it to do.
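According to the doveadm-deduplicate man page, messages are compared by GUID by default and by Message-Id header when the -m option is given. Before deleting anything, a rough check like the sketch below can list Message-ID values that occur in more than one file anywhere under the mail root. It is an approximation: it takes the first line starting with Message-ID: in each file, so folded headers and messages without that header will be missed or misread. The function name is made up for illustration.

```shell
#!/bin/sh
# List Message-ID values that appear in more than one message file.
# list_duplicate_message_ids is a hypothetical name.
list_duplicate_message_ids() {
    mail_root=$1
    # Take the first Message-ID line of every file in a cur/ or new/ folder,
    # strip the header name, and keep only values seen more than once.
    find "$mail_root" -type f \( -path '*/cur/*' -o -path '*/new/*' \) \
        -exec grep -i -h -m 1 '^Message-ID:' {} + |
        tr -d '\r' |
        sed 's/^[^:]*:[[:space:]]*//' |
        sort | uniq -d
}

# Example (path as in the text above):
#   list_duplicate_message_ids /path/to/mnemosyne/mail
```

An empty result suggests there is nothing for a Message-Id based deduplication to remove.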
Using the archive
To use the archive I created a second notmuch-config file in /path/to/mnemosyne/mail/notmuch-config and pointed notmuch at it:
export NOTMUCH_CONFIG=/path/to/mnemosyne/mail/notmuch-config
notmuch new
This indexes all the email. The best thing about notmuch in this case is that notmuch creates its own index and does not modify the source email. I can therefore throw away and recreate notmuch's indexes if needed.
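A config of this kind can be quite small. The sketch below writes one that sets only the database path and turns off maildir.synchronize_flags, the notmuch setting that otherwise lets notmuch rename message files when tags change. The function name and its parameters are made up for illustration, and the exact set of keys a given notmuch version expects is an assumption here.

```shell
#!/bin/sh
# Write a minimal notmuch config file.
# write_notmuch_config, config_path and mail_root are hypothetical names.
write_notmuch_config() {
    config_path=$1
    mail_root=$2
    cat > "$config_path" <<EOF
[database]
path=$mail_root

[maildir]
# Ask notmuch not to rename message files when tags change.
synchronize_flags=false
EOF
}

# Example (paths as in the text above):
#   write_notmuch_config /path/to/mnemosyne/mail/notmuch-config /path/to/mnemosyne/mail
#   NOTMUCH_CONFIG=/path/to/mnemosyne/mail/notmuch-config notmuch new
```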
I can then use notmuch's search and show commands to find and export email into other reading programs.