Mar 5

Importing receipts into TapForms with CouchDB and VueScan

Category: couchdb,tapforms

For the longest time I’ve collected receipts from when I’ve refuelled my car however I’ve never had a good way of figuring out how to import them and retain the images. In 2016 I picked up TapForms to replace Bento as my personal database tool that sync’d across Mac and iOS. With TapForms 5 the data storage layer moved over to Couchbase Lite with sync options being Peer to Peer and IBM Cloudant. Then TapForms later added support for syncing to CouchDB which made it easy to quickly within my own network set up sync and then add data. This brought me to the next challenge: how do I quickly add my backlog of receipts?

I started out creating my fuel database using receipts as I created them, slowly fine tuning the database and adding various fields. I have a timestamp field for when I refuelled, the gallons I pumped, the cost per gallon, total paid and the trip reading. I thought that would be sufficient and my old receipts only had the trip reading. Then I realised I should probably track the odometer reading as a secondary data source to validate the trip counter reading. Then I looked at the trip computer in my car and thought it might be interesting to keep a copy of it’s internal range estimates, average fuel usage estimation (MPG) and average speed (MPH). There was also a trip timer that I’d not tracked either that I now use to track time between refuels. Finally I decided to use a location field to track where I refuel. The last piece is of course an image of the receipt.

Initially what I was doing was using my phone to enter the data and also then take a photo of the receipt. This worked ok but wasn’t the greatest experience. It meant that the photos were never well cropped and I ended up having to take them on dark backgrounds to make it a little clearer. This was also a little time consuming to get a copy of the image and then also make sure it was fully legible. I did some amount of back entry this way but decided that this just wasn’t going to fly.

I then grabbed my old Canon imageFormula P-215 document scanner that I’ve had for years. I bought it back in 2013 during some other inspired document scanning situation (I think that one was back when I was a heavy Evernote user) and only used it from time to time. I figured that it would be handy for this use case so hooked it up to my computer and tried to see if I could configure it to send images to TapForms using the CaptureOnTouch app that the scanner came with. Now even though it had been updated and working for Sierra, on High Sierra the app was no longer fully functional. There were custom UI controls that completely bugged out on High Sierra which made reconfiguring the app to do what I wanted impossible. I filed a support request with Canon who said that High Sierra wasn’t supported any more for this scanner and to buy some other scanner software.

That lead me on a search where I quickly found VueScan which said it had support for the scanner. I downloaded it and gave it a spin with the scanner on a sample receipt. It managed to handle it fine and had more than enough options to control the image afterwards. I ended up paying for the Professional Edition to get all the options which cost me roughly $100. In retrospect the scanner itself cost $300, so the scanning software to make it work half a decade was a third the price.

VueScan let me quickly scan the receipts, it handled some amount of automatic deskewing but also let me override it as well. I had to play a bit with some of the options (black and white text versus gray) on some of the scans as well as fine tuning the colour to get clear images where possible or to recover as much of the image as possible. Eventually I got comfortable using it to scan in the images but now how to get that into TapForms?

I’ve had a CouchDB instance set up for a while now to support syncing TapForms so it was easy to add this new database. TapForms will attempt to create the database within CouchDB if it doesn’t exist already and will then sync it with it’s internal data store. I had TapForms push this database to CouchDB and then started looking around at the data structure using the CouchDB admin console. Looking at the records there were a bunch of items that seemed grouped by prefix: fld, frm and rec. When I first setup CouchDB, I sent the ever helpful Brenden from TapForms a support email who confirmed to me that these were the field, form and record types respectively and that he used the prefix to help disambiguate the records from each other.

Screen Shot 2018 03 04 at 1 10 02 am

The basic data structure is that fields are defined as their own record. They flag which form they belong to which has some metadata about the form such as the name. The finally the records themselves use the rec prefix to store their data. To figure out what that would look like, I used TapForms to create a simple test record and then looked at it’s data to see how everything comes together. The record format is composed of the basic metadata (created date, modified date and the type of form to use for the record) and a “values” object. This object uses the field key as it’s keys and then stores the individual items. This means you end up with something that looks like the following:

"values": {
    "fld-dc8a71d49f1e4f10836166b61e431116": [
      {
        "mimetype": "image/jpeg",
        "dimensions": "{1159, 3332}",
        "filename": "2018-02-26-0006.jpg"
      }
    ],
    "fld-39643a2e20c740e4b178a4a8804bed28": 3.798999999999999,
    "fld-226af22c51264504a9cebdc27693c77f": 28.72922578029996,
    "fld-3e55053abb7d41dda065423e460befc1": {
      "date": "2015-05-27T16:08:54.000Z"
    },
    "fld-51db8bacdefd454e823d01d04bbcd4a0": {
      "lat": "37.407086",
      "lon": "-122.078839",
      "title": "808 N Shoreline"
    },
    "fld-c29d3fdc2566466f8dbc7a8f819cb551": 37.48999999999999,
    "fld-6661aa37ecb64729895cd12122d79a7b": 9.867999999999999,
    "fld-dbe3bd2f4959456fa1bf7d17fb73add8": 283.5
  }

The field ID’s are used as the primary key for the document and then the value is specific to the field. For the image fields you can see a complicated structure that refers to a CouchDB document attachment. There’s also the location field that is composed of lat, lon and title. Finally the normal number values are stored as floating point approximations. The structure itself is really nicely thought out and works well so how do I create a new record?

I ended up using a very simple PHP script to take a file saved locally on disk and then upload it to the CouchDB instance. I did this a couple of times before I was happy with the result and confirmed it’d made it from the CouchDB instance to the TapForms database. I was originally thinking I needed to track everything but then I managed to trim down what I was doing to quickly import everything. The PHP script ends up being quite simple because it’s uploading a mostly templated document except for the created/modified date/time and the image attachment.

To get started we need to create a document ID, I grabbed a PHP UUID library and put it in uuid.php and use that to create a document ID:

<?php

require_once('uuid.php');

function generateDocumentId()
{
    return 'rec-' . str_replace('-', '', UUID::mint()->string);
}

The next step is to create a basic skeleton of the object. I grabbed this by looking at a created document in CouchDB and copying the bits that were interesting. The _id field will be our generated document ID, the dateCreated and dateModified will be passed in a little later, the dbID is the TapForms DB ID and the form/type fields are the TapForms internal form ID (also picked up from looking at the CouchDB record). The values contains the place holder for the image attachment metadata that we’re going to upload, with the fld being the TapForms internal key for the field. The last part is the internal _attachments object where we’ll put the actual attachment data in.

function getTemplate($id, $createdDate) {
    return  array(
        '_id' => $id,
        'dateCreated' => $createdDate,
        'dateModified' => $createdDate,
        'dbID' => 'db-c1da6869786341efbd8dbe2021a8df72',
        'form' => 'frm-900c80d629cd41ffb9bf45f091381753',
        'type' => 'frm-900c80d629cd41ffb9bf45f091381753',
        'values' => array(
            'fld-dc8a71d49f1e4f10836166b61e431116' => array(array(
                'dimensions' => '{}',
                'filename' => '',
                'mimetype' => ''
            ))
        ),
        '_attachments' => array()
    );
}

The next step is that I’m going to accept filenames from standard input so that I can hook this up to fswatch or just paste file names in when they’re ready. There’s a bit of a delay to make sure we skip files that aren’t there any more. In practice this delay isn’t noticed due to the time taken to make a single scan.

while(!feof(STDIN))
{
    $filename = rtrim(fgets(STDIN));
    sleep(5);
    if (!file_exists($filename))
    {
        continue;
    }

I want to only upload JPG files, so a quick test for that because at one point it tried to upload a .DS_Store file. Oops!

    echo $filename . "\n";
    $base = basename($filename);
    if (strpos($base, '.jpg') === FALSE)
    {
        echo "Not a .jpg file: $filename\n";
        continue;
    }

At this point it’s time to build out the template getting the current date in ISO 8601 format and generating a new document ID.

    $createdDate = date('c');
    $template = getTemplate(generateDocumentId(), $createdDate);

Now we need to get the image details and splice that metadata in place:

    $imageDetails = getimagesize($filename);
    print_r($imageDetails);
    $template['values']['fld-dc8a71d49f1e4f10836166b61e431116'][0]['dimensions'] = sprintf("{%d, %d}", $imageDetails[0], $imageDetails[1]);
    $template['values']['fld-dc8a71d49f1e4f10836166b61e431116'][0]['filename'] = $base;
    $template['values']['fld-dc8a71d49f1e4f10836166b61e431116'][0]['mimetype'] = $imageDetails['mime'];

The last part we need to do is base 64 encode the image into the attachments object. There are two ways of uploading attachments to CouchDB, one is doing it inline here and another is with an attachment stub in the same request.

    $template['_attachments'][$base] = array(
        'content_type' => $imageDetails['mime'],
        'data' => base64_encode(file_get_contents($filename))
    );

Last but not least we need to convert the data into JSON and then upload it to the CouchDB server. The details for the URL (hostname and DB name), username and password are hard coded and would need to be changed to match your hostname, DB name, username and password.

    $data_json = json_encode($template);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://192.168.0.100:5984/db-c1da6869786341efbd8dbe2021a8df72/' . $template['_id']);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json','Content-Length: ' . strlen($data_json)));
    curl_setopt($ch, CURLOPT_USERPWD, 'username:password');
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data_json);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response  = curl_exec($ch);
    curl_close($ch);
    print_r($response);
}

With the PHP script in place, the last step was to hook it up to fswatch to automatically upload new files when VueScan creates them in the target directory. I installed fswatch with Homebrew:

brew install fswatch

And then hook it all together:

fswatch ~/Pictures/Fuel\ Receipts/ | php auto_upload.php

Finally as I scan I can see it posting new documents to CouchDB and then when I look in TapForms, they show up there:

/Users/pasamio/Pictures/Fuel Receipts/2018-03-02-0049-0001.jpg
Array
(
    [0] => 2926
    [1] => 5554
    [2] => 2
    [3] => width="2926" height="5554"
    [bits] => 8
    [channels] => 1
    [mime] => image/jpeg
)
{"ok":true,"id":"rec-992555001df611e894c12753d9932cf2","rev":"1-baea737f2c8cef8525b200df7cc5fa1d"}

Screen Shot 2018 03 04 at 1 19 02 pm

Using this pipeline, I imported 117 new receipts into the system. Now I just need to do the data entry part! You can download the full source code from this gist. Special thanks to Brendan the developer of TapForms for the helpful responses to the support requests, especially about the structure of the TapForms data in CouchDB.

No comments

No Comments

Leave a comment

%d bloggers like this: