Aug 8

Converting a mailman archive to work with mod_mbox

Recently I was working with a friend to get mod_mbox up and running with some of the Wikimedia mailing list archives, which are on Mailman. These archives don’t immediately work because they’re not in the right format; however, it’s relatively easy to convert them to work with mod_mbox on Debian.

To get started, let’s get mod_mbox sorted. If you haven’t already, you’re going to need to grab apache2-dev and scons:

sudo apt-get install apache2-dev scons

The next step is to check out mod_mbox from Apache:

svn checkout https://svn.apache.org/repos/asf/httpd/mod_mbox/trunk/ mod_mbox

Once that checks out, we’re going to need to build the module:

cd mod_mbox
scons APXS=$(which apxs)

That’ll build a mod_mbox.so file that you will need to copy into your Apache modules directory:

sudo cp mod_mbox.so /usr/lib/apache2/modules/

Last but not least, we’re going to need to tell Apache to load our new module. Create a new file called /etc/apache2/conf-available/mbox.conf:

LoadModule mbox_module /usr/lib/apache2/modules/mod_mbox.so

AddHandler mbox-handler .mbox
<LocationMatch /archives/([^/]+)>
MboxIndex On
MboxRootPath "/archives/"
MboxStyle "/archives/style.css"
MboxScript "/archives/archives.js"
MboxHideEmpty On
MboxAntispam On
</LocationMatch>

Once you’ve created that file, we’re going to need to enable that configuration and reload Apache:

sudo a2enconf mbox
sudo service apache2 reload

That’s going to match /archives on our Apache install and have mbox handle it for us. We’re going to need to create /var/www/html/archives and populate it for the next steps.

In this example I’m going to borrow the wikimedia-l list; you can check out their archive here: https://lists.wikimedia.org/pipermail/wikimedia-l/. You’ll note that accompanying each month is a “Gzip’d Text” option, which contains the archive for that month. We’re going to use that to grab the mail archives and do some transformations.

The first step is to create a /var/www/html/archives/wikimedia-l folder. I made this owned by myself to make the next steps easier (replace pasamio with your own username):

sudo mkdir -p /var/www/html/archives/wikimedia-l
sudo chown pasamio /var/www/html/archives/wikimedia-l

The next step is to download an archive from Wikimedia:

cd /var/www/html/archives/wikimedia-l
wget https://lists.wikimedia.org/pipermail/wikimedia-l/2017-July.txt.gz

This should download and create a 2017-July.txt.gz file; however, mod_mbox doesn’t understand that format, so we need to process it. mod_mbox expects the archives to be plain text with file names in the format “YYYYMM.mbox”. Mailman also has a couple of quirks: it doesn’t include a List-Post header, and it replaces ‘@’ with ‘at’.

gunzip 2017-July.txt.gz
sed 's/^\(From:\? .*\) \(at\|en\) /\1@/' 2017-July.txt | sed -e's/Subject:/List-Post: <mailto:wikimedia-l@lists.wikimedia.org>\nSubject:/g' > 201707.mbox

There are a few things happening there: we’re unzipping the archive, using sed to change the ‘at’ back to ‘@’, adding a “List-Post” header before the subject of each message, and then writing the result to a new file with the filename that mod_mbox expects.
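To see what those two sed passes do, here is a minimal sketch run against a made-up message header. The fix_mailman_headers and sample_message names and the example address are hypothetical; the sed expressions are the ones used above and assume GNU sed (for \?, \| and the \n in the replacement).

```shell
# Hypothetical helper wrapping the two sed passes from the post (GNU sed assumed).
fix_mailman_headers() {
  sed 's/^\(From:\? .*\) \(at\|en\) /\1@/' \
    | sed -e 's/Subject:/List-Post: <mailto:wikimedia-l@lists.wikimedia.org>\nSubject:/g'
}

# A made-up message in the shape Mailman's text archives use.
sample_message() {
  printf '%s\n' \
    'From jane at example.org  Sat Jul  1 10:00:00 2017' \
    'From: jane at example.org (Jane Doe)' \
    'Date: Sat, 1 Jul 2017 10:00:00 +0000' \
    'Subject: [Wikimedia-l] Hello' \
    '' \
    'Body text.'
}

FIXED=$(sample_message | fix_mailman_headers)
printf '%s\n' "$FIXED"
```

Both From lines come out with jane@example.org, and a List-Post line appears directly above Subject. Note that the Subject: pattern is not anchored to the start of a line, so a body line containing “Subject:” would pick up a List-Post line as well.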

The final step is to update the directory to include the new mbox file. mod-mbox-util is in the same directory that we checked out mod_mbox to before:

/path/to/mod_mbox/mod-mbox-util -v -c .

When we run that it should output something like the following:

$ ~/research/mod_mbox/mod-mbox-util -v -c .
Base Path: /var/www/html/archives/wikimedia-l/
Found 1 mbox files to process
Scaning 201707.mbox for Mailing List info
Building Cache for wikimedia-l@lists.wikimedia.org
Last Update: Thu, 01 Jan 1970 00:00:00 GMT
Current Time: Mon, 07 Aug 2017 04:23:26 GMT
Processing '201707.mbox'
    scanned 236 messages

And if we navigate to the host at http://hostname/archives/wikimedia-l/, we should see the mailing list archive and be able to navigate through the threads. Success!

Now I was curious to grab all of the list’s archives and see what that would look like:

#!/bin/bash

cd /var/www/html/archives/wikimedia-l/

# Array of months for later use
MONTHS=(January February March April May June July August September October November December)

# Loop over the available years
for YEAR in {2004..2017}
do
  # Loop over the months, zero offset for the month names above.
  for MONTH in {0..11}
  do
    echo $YEAR ${MONTHS[$MONTH]}
    # Get the archive file
    wget https://lists.wikimedia.org/pipermail/wikimedia-l/$YEAR-${MONTHS[$MONTH]}.txt.gz
    # And unzip it
    gunzip $YEAR-${MONTHS[$MONTH]}.txt.gz
  done
done

# Work through the list of txt files; $(ls *.txt) because apparently *.txt broke some systems
# Note: One could refactor this into the above loop as well to simplify things.
for FILENAME in $(ls *.txt)
do
  # Ok! BASENAME takes the FILENAME (e.g. 2017-January.txt) and turns it into 201701
  # - We do a sed to get rid of the extension
  # - We do a sed to get the year and the first three letters of the month (e.g. 2017-Jan)
  # - We use awk to turn this into a number (e.g. Jan == 01) and format it as YYYYMM
  BASENAME=$(echo $FILENAME | sed -e's/.txt$//' | sed -e's/\([0-9]*-[A-Za-z]\{3\}\).*/\1/' | awk -F'-' '{printf "%s%02d", $1, (match("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3}')
  echo $BASENAME
  # Mailman uses "at" instead of "@" so let's fix that.
  # Mailman also doesn't include List-Post so let's fake that as well.
  # Finally we output this to the new $BASENAME.mbox (e.g. 201701.mbox)
  sed 's/^\(From:\? .*\) \(at\|en\) /\1@/' $FILENAME | sed -e's/Subject:/List-Post: <mailto:wikimedia-l@lists.wikimedia.org>\nSubject:/g' > $BASENAME.mbox
done

# Last but not least we need to rebuild the caches.
/path/to/mod_mbox/mod-mbox-util -v -c .

This will grab a copy of the mailing list archives, fix up the From field, add the List-Post and then recreate the caches.
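The densest line in that script is the BASENAME conversion, so here it is pulled out on its own as a small sketch. The to_mbox_name function name is hypothetical; the pipeline is the same sed and awk chain as above, with the dot in .txt escaped. The awk trick relies on each three-letter month abbreviation starting at positions 1, 4, 7 and so on in the lookup string, so adding 2 and dividing by 3 gives the month number.

```shell
# Hypothetical helper: turn a Mailman archive name such as 2017-July.txt
# into the YYYYMM form mod_mbox expects (e.g. 201707).
to_mbox_name() {
  echo "$1" \
    | sed -e 's/\.txt$//' \
    | sed -e 's/\([0-9]*-[A-Za-z]\{3\}\).*/\1/' \
    | awk -F'-' '{printf "%s%02d", $1, (match("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3}'
}

# Try it on a few archive names without touching any files.
for NAME in 2004-January.txt 2017-July.txt 2017-December.txt
do
  echo "$NAME -> $(to_mbox_name "$NAME").mbox"
done
```

This is handy for checking the naming before running the full loop over a directory of downloaded archives.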

Similarly, here’s a quick one-liner to grab the current month’s file:

wget  https://lists.wikimedia.org/pipermail/wikimedia-l/$(date +%Y-%B).txt.gz

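And something to grab last month’s archive. This subtracts one day from the current time, so it only lands in the previous month when it is run on the first day of a new month: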

wget  https://lists.wikimedia.org/pipermail/wikimedia-l/$(date +%Y-%B -d "@$(expr $(date +%s) - 86400)").txt.gz

That can be combined with the code above to periodically download the archives and regenerate your own local archive.
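If the job might run on a day other than the first of the month, one option is to step back from the first day of the current month instead. This is a rough sketch rather than part of the original setup: the previous_month function name is hypothetical, and it assumes GNU date (for -d) with LC_ALL=C so the month name stays in English and matches the file names used above.

```shell
# Hypothetical helper: print YYYY-Month for the month before the given date
# (defaults to today). Assumes GNU date.
previous_month() {
  local FIRST_OF_MONTH
  # Snap the reference date to the first day of its month...
  FIRST_OF_MONTH=$(date -d "${1:-today}" +%Y-%m-01)
  # ...then step back one day, which always lands in the previous month.
  LC_ALL=C date -d "$FIRST_OF_MONTH -1 day" +%Y-%B
}

previous_month 2017-08-07   # 2017-July
previous_month 2017-01-15   # 2016-December

# Build (but don't fetch) the URL for last month's archive.
echo "https://lists.wikimedia.org/pipermail/wikimedia-l/$(previous_month).txt.gz"
```

The echoed URL can be handed to wget in the same way as the one-liners above.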
