Saturday, February 20, 2010

GAE Developer Tip: Updating Database Schema

The backend of Google App Engine is not relational. One of the implications is different records in the same table can have different number of columns. Sometimes it's just fine, other times you want it the conventional way - all rows having the same columns.

The practical consequence of such a design is that a field added to a domain object will only appear in objects of this type created after the structure of the object was modified. All the objects created before the change will happily live with the old structure.

Before you know it you'll want to filter or sort on the newly added attributes. What happens next is you realize your queries totally ignore your good ol' buddies without the new attributes. Which makes you wanting desperately to bring your schema to order and add the missing columns to all the records of your table so to speak.

Here comes a surprise. You can't really do that easily. What you need to do is to fetch ALL the objects and re-save them. You might have thought that the new columns are there and you can query for something like "myNewColumn == null". No such luck. GAE makes a stark distinction between null and missing columns. Until you explicitly set your newly added columns to null they are non-existent in the eyes of our beloved App Engine.

Back to the show. You need to fetch ALL the objects and re-save them. While not a problem per se, the approach faces a number of technical challenges due to another design feature of App Engine - the limitation on the number of results the query can return and on the duration of a single request. Currently a query is restricted to 1000 results and a single request can't take longer than a certain number of milliseconds. Meaning you have to batch up your work.

Batching up the to-be updated objects can become a non-trivial task and will likely be a one-off solution for every schema update. This is not optimal as what you want is to think once and run everywhere.

If you made it to this point we are on the same page. Now bear with me.

For our batching effort to succeed we need something that we can query reliably to figure out what needs to be updated. To this end we create a special attribute called, say, schemaVersion.

We created schemaVersion attribute for ALL our entities and assign it a default value. From now on all our objects are versioned in a database sense.

Fast forward into the future. Having been in production for a few months we realize we need another attribute on that entity. What do we do?
  1. Add the new attribute to the entity
  2. Increase the default value of schemaVersion
  3. Create a finder method in your DAO tier to retrieve a fixed number of entities whose version number is below current
  4. Create a cron job to retrieve the outdated objects and update them as needed based on their version and the current schema version for the entity in question
Rejoice - you are in control of your database schema again.

The code snippets below illustrate the approach.

In your domain object:

 public class Action implements Serializable { 
@Persistent
private Integer deleted = 0;
@Persistent
private Integer schemaVersion = 1;
}

Somewhere in your DAO layer:

    PersistenceManager pm = PMF.get().getPersistenceManager(); 
try {
Query query = pm.newQuery(Action.class);
query.setFilter("schemaVersion < :schemaVersion");
query.setRange(0, 100);
actions.addAll(pm.detachCopyAll((Collection<Action>) query.execute(currentVersion)));
} finally {
DAOUtil.closePM(pm, logger);
}

In your batch job:

    List<Action> actions = actionDAO.findForSchemaUpdate(1); 
for (Action action : actions) {
int currentVersion = action.getSchemaVersion();
switch (currentVersion) {
case 0:
action.setDeleted(0);
default:
break;
}
action.setSchemaVersion(1);
actionDAO.save(action);
}

0 comments: