I recently uploaded over a million rows of data to my live app on Heroku.
The topic of how to upload large quantities of data to Heroku is a topic I've seen come up repeatedly, with no officially endorsed method in existence.
In a simplified version of my case, LibraryMixer maintains a table Media, each row of which is the wiki entry for a given piece of media. I wanted to import a list of around 1.5 million books into this table, but without overwriting the entries already present in the Media table.
Taps is pretty much completely unworkable for large data transfers, as it is extremely slow, and also tends to fail in unpleasant ways in my experience.
The way I ended up doing this was to first import my list into my development machine's PostgreSQL database in a new table, ImportedMedia. Then, I used the `pg_dump` utility to dump my table into a file containing a series of SQL commands:
pg_dump --no-acl --no-owner -t ImportedMedia Website_development > books.sql
Then I logged into my heroku database using the psql console:
heroku pg:psql
Finally, I uploaded the SQL command-equivalent of my data to my Heroku database using this command in the psql console:
\i /path/to/dump/books.sql
Now I had the ImportedMedia table in my Heroku database as well. I now just had to convert this raw data into actual rows in the Media table.
Note that it wouldn't have been a good idea for me to create the rows in the Media table in advance locally and upload those, as that would be fragile to any difference in the tables between my development machine and live site, such as by a media wiki entry created on the website by a user during this process. This was exacerbated by the fact that my actual situation was more complicated, and involved an Authors table, and a join table between them as well.
It wasn't possible to simply create a rake task that would process the data and save entries in the Media table, as Heroku terminates long-running rake tasks. It also wasn't possible to simply create a model method and call it from the console, as Heroku also terminates long-running console commands.
Therefore, the solution was to create a model method and spawn a Delayed Job worker to call the model method to read in rows from the ImportedMedia table and save new entries in the Media table. It took more than a day for the Delayed Job worker to process all of the rows, but that was fine as the site was fully operational the whole time.
Somehow, my relevant Delayed Job worker was killed at some point during this process, and I had to manually go in and restart the process. However, I don't believe this was Heroku killing a long-running process again, as it then ran for something like 15 hours straight without issue.
In my particular case, the ImportedMedia table was keyed off of a non-numerical primary key, and splitting up the work would have introduced unnecessary room for programming mistakes. However, a more robust solution for a different dataset would have been to insert multiple Delayed Job tasks for the rows to be processed, each with different rows to handle, enabling the work to be scaled to multiple workers.