Aug 8
Converting a mailman archive to work with mod_mbox
Recently I was working with a friend to get mod_mbox up and running with some of the Wikimedia mailing list archives, which are hosted on Mailman. These archives don’t work with mod_mbox as they are because they’re not in the right format; however, it’s relatively easy to convert them to work with mod_mbox on Debian.
To get started, let’s get mod_mbox sorted. If you haven’t already, you’re going to need to grab apache2-dev and scons:
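On Debian both of these are ordinary packages, so a minimal sketch of the install, assuming apt-get and sudo are available, looks like this. The subversion package is my own addition, on the assumption that the source is fetched with svn in the next step.

```shell
# Install the Apache development files (which provide apxs) and the scons build tool.
# subversion is an assumption: it is only needed if the source is fetched with svn.
sudo apt-get update
sudo apt-get install apache2-dev scons subversion
```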
The next step is to check out mod_mbox from Apache:
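The mod_mbox install page listed under Related Resources is the authoritative source for the checkout. As a sketch, assuming the code still lives in the Apache Subversion repository at the URL below and that a local directory called mod_mbox suits you:

```shell
# Assumed repository URL; check the mod_mbox install page if it has moved.
svn checkout https://svn.apache.org/repos/asf/httpd/mod_mbox/trunk mod_mbox
# The build in the next step is run from inside the checkout.
cd mod_mbox
```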
Once that checks out, we’re going to need to build the module:
scons APXS=$(which apxs)
That’ll build a mod_mbox.so file that you will need to copy into your Apache modules directory:
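One way to do the copy is to ask apxs where the modules directory is rather than hard-coding it. This is a sketch that assumes mod_mbox.so ended up in the current directory; if scons put it somewhere else in the build tree, adjust the source path.

```shell
# Ask apxs where Apache keeps its modules (on Debian this is usually /usr/lib/apache2/modules).
MODULES_DIR=$(apxs -q LIBEXECDIR)
# Assumes mod_mbox.so is in the current directory; adjust if the build placed it elsewhere.
sudo cp mod_mbox.so "$MODULES_DIR/"
```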
Last but not least, we need to give Apache the configuration for the module. Create a new file called /etc/apache2/conf-available/mbox.conf:
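# Load the module built above. The path assumes Debian's default module directory; adjust it if you copied mod_mbox.so elsewhere.
LoadModule mbox_module /usr/lib/apache2/modules/mod_mbox.so
AddHandler mbox-handler .mbox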
<LocationMatch /archives/([^/]+)>
MboxIndex On
MboxRootPath "/archives/"
MboxStyle "/archives/style.css"
MboxScript "/archives/archives.js"
MboxHideEmpty On
MboxAntispam On
</LocationMatch>
Once you’ve created that file, we’re going to need to enable that configuration and reload Apache:
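On Debian, files in conf-available are switched on with a2enconf, using the file name without the .conf extension:

```shell
# Symlinks conf-available/mbox.conf into conf-enabled/ so Apache picks it up on reload.
sudo a2enconf mbox
```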
sudo service apache2 reload
That’s going to match /archives on our Apache install and have mbox handle it for us. We’re going to need to create /var/www/html/archives and populate it for the next steps.
In this example I’m going to borrow the wikimedia-l list; you can check out its archive here: https://lists.wikimedia.org/pipermail/wikimedia-l/. You’ll note that each month has a “Gzip’d Text” option, which contains the archive for that month. We’re going to use that to grab the mail archives and do some transformations.
The first step is to create a /var/www/html/archives/wikimedia-l folder. I made this owned by myself to make the next steps easier (replace pasamio with your own username):
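Creating the folder could look like this; mkdir -p also creates /var/www/html/archives if it doesn’t exist yet:

```shell
# Create the list directory (and any missing parent directories) as root.
sudo mkdir -p /var/www/html/archives/wikimedia-l
```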
sudo chown pasamio /var/www/html/archives/wikimedia-l
The next step is to download an archive from Wikimedia:
wget https://lists.wikimedia.org/pipermail/wikimedia-l/2017-July.txt.gz
This should download and create a 2017-July.txt.gz file; however, mod_mbox doesn’t understand that format, so we need to process it. mod_mbox expects the archives to be plain text with file names in the format “YYYYMM.mbox”. Mailman also has a couple of quirks: it doesn’t include a List-Post header, and it replaces the ‘@’ in email addresses with ‘ at ’.
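The sed command reads 2017-July.txt, so the download has to be unzipped first. Assuming gunzip is installed (it is on a stock Debian system), that is:

```shell
# Replaces 2017-July.txt.gz with the plain-text 2017-July.txt.
gunzip 2017-July.txt.gz
```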
sed 's/^\(From:\? .*\) \(at\|en\) /\1@/' 2017-July.txt | sed -e's/Subject:/List-Post: <mailto:wikimedia-l@lists.wikimedia.org>\nSubject:/g' > 201707.mbox
There are a few things happening there: we’re unzipping the archive, using sed to turn the ‘ at ’ back into ‘@’, adding a “List-Post” header before the subject of every message, and then writing the result to a new file with the filename that mod_mbox expects.
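To see what the two sed passes do without touching a real archive, here is a small sketch that runs them over two made-up header lines. The address and subject are invented for illustration, and the expressions rely on GNU sed (the default on Debian) for \?, \| and \n.

```shell
# Two made-up header lines in the style of a pipermail text archive.
SAMPLE='From: alice at example.org (Alice Example)
Subject: [Wikimedia-l] Hello'

# The same two sed passes as above, reading from stdin instead of a file.
RESULT=$(printf '%s\n' "$SAMPLE" \
  | sed 's/^\(From:\? .*\) \(at\|en\) /\1@/' \
  | sed -e 's/Subject:/List-Post: <mailto:wikimedia-l@lists.wikimedia.org>\nSubject:/g')

printf '%s\n' "$RESULT"
# From: alice@example.org (Alice Example)
# List-Post: <mailto:wikimedia-l@lists.wikimedia.org>
# Subject: [Wikimedia-l] Hello
```

Note that the Subject: pattern isn’t anchored to the start of a line, so a message body that happens to contain “Subject:” gets a List-Post line as well.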
The final step is to update the directory to include the new mbox file. mod-mbox-util is in the same directory that we checked out mod_mbox to before:
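The invocation is the same one used in the full script further down, pointed at the list directory. Here /path/to/mod_mbox is a placeholder for wherever you checked out and built mod_mbox.

```shell
# Same flags as the full script below; /path/to/mod_mbox is a placeholder for your checkout.
/path/to/mod_mbox/mod-mbox-util -v -c /var/www/html/archives/wikimedia-l/
```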
When we run that it should output something like the following:
Base Path: /var/www/html/archives/wikimedia-l/
Found 1 mbox files to process
Scaning 201707.mbox for Mailing List info
Building Cache for wikimedia-l@lists.wikimedia.org
Last Update: Thu, 01 Jan 1970 00:00:00 GMT
Current Time: Mon, 07 Aug 2017 04:23:26 GMT
Processing '201707.mbox'
scanned 236 messages
And if we navigate to the host at http://hostname/archives/wikimedia-l/, we should see the mailing list archive and be able to navigate through the threads. Success!
Next, I was curious to grab the list’s entire archive and see what that would look like:
cd /var/www/html/archives/wikimedia-l/
# Array of months for later use
MONTHS=(January February March April May June July August September October November December)
# Loop over the available years
for YEAR in {2004..2017}
do
    # Loop over the months, zero offset for the month names above.
    for MONTH in {0..11}
    do
        echo "$YEAR ${MONTHS[$MONTH]}"
        # Get the archive file
        wget "https://lists.wikimedia.org/pipermail/wikimedia-l/$YEAR-${MONTHS[$MONTH]}.txt.gz"
        # And unzip it
        gunzip "$YEAR-${MONTHS[$MONTH]}.txt.gz"
    done
done
# Work through the list of txt files; $(ls *.txt) because apparently *.txt broke some systems
# Note: One could refactor this into the above loop as well to simplify things.
for FILENAME in $(ls *.txt)
do
    # Ok! BASENAME takes the FILENAME (e.g. 2017-January.txt) and turns it into 201701
    # - We do a sed to get rid of the extension
    # - We do a sed to get the year and the first three letters of the month (e.g. 2017-Jan)
    # - We use awk to turn this into a number (e.g. Jan == 01) and format it as YYYYMM
    BASENAME=$(echo "$FILENAME" | sed -e 's/\.txt$//' | sed -e 's/\([0-9]*-[A-Za-z]\{3\}\).*/\1/' | awk -F'-' '{printf "%s%02d", $1, (match("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3}')
    echo "$BASENAME"
    # Mailman uses "at" instead of "@" so let's fix that.
    # Mailman also doesn't include List-Post so let's fake that as well.
    # Finally we output this to the new $BASENAME.mbox (e.g. 201701.mbox)
    sed 's/^\(From:\? .*\) \(at\|en\) /\1@/' "$FILENAME" | sed -e 's/Subject:/List-Post: <mailto:wikimedia-l@lists.wikimedia.org>\nSubject:/g' > "$BASENAME.mbox"
done
# Last but not least we need to rebuild the caches.
/path/to/mod_mbox/mod-mbox-util -v -c .
This will grab a copy of the mailing list archives, fix up the From field, add the List-Post and then recreate the caches.
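The filename conversion is the least obvious part of that script, so here it is pulled out on its own. The to_basename function name is mine, purely for illustration; the pipeline inside it is the one from the loop. The awk trick works because each three-letter month abbreviation starts at position 1, 4, 7 and so on in the long string, so (position + 2) / 3 gives the month number.

```shell
# Hypothetical helper wrapping the pipeline from the loop above.
to_basename() {
    echo "$1" \
      | sed -e 's/\.txt$//' \
      | sed -e 's/\([0-9]*-[A-Za-z]\{3\}\).*/\1/' \
      | awk -F'-' '{printf "%s%02d", $1, (match("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3}'
}

echo "$(to_basename 2017-July.txt)"      # 201707
echo "$(to_basename 2004-December.txt)"  # 200412
```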
Similarly here’s a quick one liner to grab the current month’s file:
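A sketch of how that could look, building the URL from today’s date. The CURRENT and URL variable names are mine, and LC_ALL=C is there on the assumption that your system locale might not be English, since the archive uses English month names.

```shell
# Year and full English month name, e.g. 2017-August.
CURRENT=$(LC_ALL=C date +%Y-%B)
URL="https://lists.wikimedia.org/pipermail/wikimedia-l/$CURRENT.txt.gz"
echo "$URL"
```

Handing that to wget (wget "$URL") fetches the file.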
And something to grab last month’s archive:
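For the previous month, a sketch assuming GNU date (the default on Debian). Counting back from the first day of the current month avoids end-of-month surprises, where a plain “last month” can land in the wrong month. The LAST_MONTH and URL variable names are again mine.

```shell
# First day of this month minus one month, formatted like 2017-July (GNU date syntax).
LAST_MONTH=$(LC_ALL=C date -d "$(date +%Y-%m-01) -1 month" +%Y-%B)
URL="https://lists.wikimedia.org/pipermail/wikimedia-l/$LAST_MONTH.txt.gz"
echo "$URL"
```

As before, wget "$URL" does the actual download.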
That can be combined with the code above to periodically download the archives and regenerate your own local archive.
Related Resources
Here are some links I found useful:
- http://httpd.apache.org/mod_mbox/install.html
- http://httpd.apache.org/mod_mbox/ref.html
- https://github.com/mpercy/mod_mbox/blob/master/README