I recently uploaded over a million rows of data to my live app on Heroku.

The topic of how to upload large quantities of data to Heroku is a topic I've seen come up repeatedly, with no officially endorsed method in existence.

In a simplified version of my case, LibraryMixer maintains a table Media, each row of which is the wiki entry for a given piece of media. I wanted to import a list of around 1.5 million books into this table, but without overwriting the entries already present in the Media table.

Taps is pretty much completely unworkable for large data transfers, as it is extremely slow, and also tends to fail in unpleasant ways in my experience.

The way I ended up doing this was to first import my list into my development machine's PostgreSQL database in a new table, ImportedMedia. Then, I used the `pg_dump` utility to dump my table into a file containing a series of SQL commands:

pg_dump --no-acl --no-owner -t ImportedMedia Website_development > books.sql

Then I logged into my heroku database using the psql console:

heroku pg:psql

Finally, I uploaded the SQL command-equivalent of my data to my Heroku database using this command in the psql console:

\i /path/to/dump/books.sql

Now I had the ImportedMedia table in my Heroku database as well. I now just had to convert this raw data into actual rows in the Media table.

Note that it wouldn't have been a good idea for me to create the rows in the Media table in advance locally and upload those, as that would be fragile to any difference in the tables between my development machine and live site, such as by a media wiki entry created on the website by a user during this process. This was exacerbated by the fact that my actual situation was more complicated, and involved an Authors table, and a join table between them as well.

It wasn't possible to simply create a rake task that would process the data and save entries in the Media table, as Heroku terminates long-running rake tasks. It also wasn't possible to simply create a model method and call it from the console, as Heroku also terminates long-running console commands.

Therefore, the solution was to create a model method and spawn a Delayed Job worker to call the model method to read in rows from the ImportedMedia table and save new entries in the Media table. It took more than a day for the Delayed Job worker to process all of the rows, but that was fine as the site was fully operational the whole time.

Somehow, my relevant Delayed Job worker was killed at some point during this process, and I had to manually go in and restart the process. However, I don't believe this was Heroku killing a long-running process again, as it then ran for something like 15 hours straight without issue.

In my particular case, the ImportedMedia table was keyed off of a non-numerical primary key, and splitting up the work would have introduced unnecessary room for programming mistakes. However, a more robust solution for a different dataset would have been to insert multiple Delayed Job tasks for the rows to be processed, each with different rows to handle, enabling the work to be scaled to multiple workers.

Be careful with `puts` when running rake commands on Heroku

Just a quick note, go easy on `puts` commands for rake tasks on Heroku.

I needed to run some operations against the database the other day. The same rake task running on my development machine (which is a kinda old Dell laptop that could really use an upgrade) was processing at about 1/10 the speed on my Heroku dedicated database (non-shared).

After much hair-pulling trying to figure out why performance was so slow, it turns out something about calling `puts` to print out status updates as the operation runs places an incredible burden when called through Heroku, while it's almost negligible locally. I would have expected sending the display output over the Internet to me would be buffered, and wouldn't slow down the operation which was database-constrained locally, but apparently not.

Stripped out the `puts` commands, and Heroku was able to run the task at about the same speed as my local machine. Still not amazing, but good enough for what I was doing.

Facebook iframe Dialogs Distorted

This one's a pretty trivial little update, but it was so annoying to hunt down I figured I'd post this here in case anyone runs into the same problem to save them the headache.

I was running into an odd problem where every Facebook Dialog that I opened up in an iframe was distorted. Their width and height were both completely messed up, and they were floating little boxes in the middle of their background spanning my whole screen.

Turns out the Facebook Dialogs use a css class called "loading". I'd had a class of the same name in my stylesheet, and the properties in my stylesheet were messing up their loading element.

If I may say so, pretty ungentlemanly behavior of Facebook to use such a common class name in their API dialogs.

Serving up thumbnails from S3 efficiently

Amazon S3 is a great solution for storing a website's static assets. One thing I do on LibraryMixer is store all of the users' uploaded profile pictures on S3.

Amazon S3 also provides the option to store the pictures in a bucket with public access denied. Then when I want to display the pictures to a user, I can generate a time-limited link to the image.

This is nice because the users' profile pictures are protected against scrapers, while simultaneously able to be downloaded directly by the users' browsers without having to pass through my server on Heroku again.

Here's my original User model code for returning a link to a user's profile picture (using the AWS gem for S3):

def avatar_url(style)
  if avatar_file_name
    AWS::S3::S3Object.url_for self.avatar.path(style), avatar.bucket_name end
  else
    #return default picture
  end
end

It worked pretty well, but it had a disadvantage that didn't occur to me when I first wrote it. Every time you wanted to show a picture, it would have a different link. This was fine for most things, but user thumbnails often showed up on a page multiple times, and this meant both the number of download requests and the amount of data to be downloaded was unnecessarily bloated.

While obvious once you realize the problem, here is the correct way to do things, by caching the generated urls in an instance variable and returning those on future requests:

def avatar_url(style)
  if avatar_file_name
    @avatar_url ||= {}
    if @avatar_url[style] then @avatar_url[style]
    else @avatar_url[style] = AWS::S3::S3Object.url_for self.avatar.path(style), avatar.bucket_name end
  else
    #return default picture
  end
end

As an added benefit, it saves on S3 hosting costs, which are billed by the request received and by the byte sent.

Detecting Partial Search Results with Solr Search

For LibraryMixer, I'm using the Websolr add-on through Heroku to handle searching of the media database. While integrating Websolr into my site, I came across a useful feature of Solr that doesn't get nearly as much attention as it should: the partialResults flag.

Unless you have a dedicated server for your Solr search, and maybe even if you do have one, you're going to have a search time limit. When this time limit is hit, Solr will return whatever it has so far, which means partial results, or even nothing at all.

For some applications, that would be fine, but it's trouble when someone wants to add something to their LibraryMixer library and they search only to find that nothing is returned and that apparently what they're looking for isn't in the database yet. Most users will probably give up right there, or at best, will add a duplicate entry to the database that will later have to be cleaned up.

However, when Solr bails out of a search, it marks the header of its response with the partialResults flag. By reading this flag, you can let the user know that their search is incomplete, or better yet, add some javascript to let the user know his page is loading, and then requesting the search again from your Solr server. The second time the same query hits the server, the results will be cached from the first request, and you should have enough time to load the full results for all except the biggest searches.

For LibraryMixer, I've taken a hybrid of these two approaches. The first time a query is made that returns partial results, I show a javascript loading page to the user while I try the query again. I've found that this is enough for 99% of the queries that were giving problems before. For that 1% case where even after pre-caching not all the results can be collected in time, I put up a notice at the top of the search results letting the user know that their search results may not be complete rather than forcing the user to wait even longer.

I'm using the Sunspot gem to handle Solr on LibraryMixer, and for the most part it is excellent, but unfortunately, it does not have any direct access to the partialResults flag or even the results header. However, it's still easy enough to pull the flag out if you know where to look! It turns out that Sunspot stores in Solr results the full Solr response as a hash in an instance variable called "@solr_result". That hash in turn contains under the key "responseHeader" the full header as a hash, with our flag for partial results stored under the key "partialResults" (if it is present). So to check if you have only partial results on a query, all you need to do is:

partial_results = solr_results.instance_variable_get("@solr_result")["responseHeader"]["partialResults"]

Automatic database backups to Amazon S3 on Heroku

Heroku recently changed over its database backups from using its own bundles system to using the standard pg_dump utility with the PG Backups add-on.

This makes it a snap to download a backup and quickly load it onto your development machine, so you can work off of the same data as your production server. Better yet, you can upload a pg_dump to Amazon S3 (or anywhere else) and then instruct Heroku to restore it, which is a great change, and takes care of a lot of the difficulties in the past with reanimating your backups in the event of trouble.

One thing that Heroku has never made dead easy, though, is automatic backups, which should be a basic part of any well-managed application.

Fortunately, Heroku has a Cron add-on for automated jobs, which combined with the PG Backups add-on, can be used to put together to automatically upload a copy of your database to Amazon S3 daily. Best of all, both the daily Cron add-on and the PG Backups basic add-on are free!

The Heroku Cron add-on will automatically call the cron task in lib/tasks/cron.rake. In addition to the following code, this also requires that you have the Heroku and AWS-S3 gems installed and setup on your application. Your S3 credentials are automatically drawn from AWS-S3, but your Heroku credentials should be added manually to the file. Here is the code that I have in my cron task to automate my backups:

require "heroku"
require "heroku/command"

task :cron => :environment do
  Rake::Task['backups:backup'].invoke
end

namespace :backups do
  desc "create a pg_dump and send to S3"
  task :backup => :environment do
 
    HEROKU_USERNAME = ''
    HEROKU_PASSWORD = ''
    APP_NAME = ''
    BACKUP_BUCKET = ''
    PATH_INSIDE_BUCKET = ''
  
    puts "Backup started @ #{Time.now}"

    heroku = Heroku::Client.new HEROKU_USERNAME, HEROKU_PASSWORD

    puts "Capturing new pg_dump"
    Heroku::Command.run_internal 'pgbackups:capture', ['--expire', '--app', APP_NAME], heroku
   
    puts "Opening S3 connection"
    config = YAML.load(File.open("#{RAILS_ROOT}/config/s3.yml"))[RAILS_ENV]
    AWS::S3::Base.establish_connection!(
      :access_key_id     => config['access_key_id'],
      :secret_access_key => config['secret_access_key']
    )

    begin
      bucket = AWS::S3::Bucket.find(BACKUP_BUCKET)
    rescue AWS::S3::NoSuchBucket
      AWS::S3::Bucket.create(BACKUP_BUCKET)
      bucket = AWS::S3::Bucket.find(BACKUP_BUCKET)
    end

    puts "Opening new pg_dump"
    pg_backup = Heroku::Command::Pgbackups.new(['--app', APP_NAME], heroku)
    local_pg_dump = open(pg_backup.pgbackup_client.get_latest_backup['public_url'])
    puts "Finished opening new pg_dump"

    puts "Uploading to S3 bucket"
    AWS::S3::S3Object.store(Time.now.to_s(:number), local_pg_dump, bucket.name + PATH_INSIDE_BUCKET)

    puts "Backup completed @ #{Time.now}"
  end
 
end

If you're not a fan of AWS-S3 and want to use Fog instead, definitely checkout bakkuappu:

https://github.com/mrich54907/bakkuappu

In writing this, I consulted both bakkuappu as well as this page:

http://metaskills.net/2011/01/03/automating-heroku-pg-backups/

This whole thing is based on another script I had running backing up bundles, which was based heavily on:

http://gist.github.com/451597/6c1945765e4091b73df70835ee8e3be6e963bd77