Launch Disasters, SimCity Style

By now you’ve probably already heard about SimCity’s epically “smooth” launch.

SimCity Meteor Disaster

SimCity Launch Disaster

By all accounts, it’s quite tragic.  Obviously from a consumer standpoint, but also from a developer point of view.

Here’s what most customers are greeted with at startup.

SimCity Login Server

SimCity “Queue”

The wonderful thing is, after those 18:55 minutes pass, your “login” will likely fail and you’ll be re-queued for another 20 minutes.  Does this sound like a queue, or like an API back-off?  That’s the first clue as to what’s going on.  It’s also where the sadness comes in.

Naming aside, a back-off is the standard, correct thing to do.
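For the curious, a client-side back-off usually looks something like this sketch (the `attempt_login` callable and the retry parameters are invented for illustration): double the delay on each failure, cap it, and add jitter so thousands of clients don’t retry in lockstep.

```python
import random
import time

def login_with_backoff(attempt_login, max_retries=6, base_delay=1.0, cap=1200.0):
    """Retry a flaky call with capped exponential backoff plus jitter.

    `attempt_login` is a placeholder for whatever call hits the login
    servers; it should return a session on success or raise on failure.
    """
    for attempt in range(max_retries):
        try:
            return attempt_login()
        except ConnectionError:
            # Double the wait each time, capped (SimCity's cap appears to
            # be ~20 minutes), with jitter so clients spread themselves out.
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
    raise ConnectionError("login servers unavailable after retries")
```

The jitter matters as much as the doubling: without it, every client that failed at the same moment comes back at the same moment, and the thundering herd repeats.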

As a developer, I’m quite interested in why SimCity would even get to the point of having to present a back-off.  To be clear, this isn’t a condemnation or finger pointing at the EA devs.  I think they did things by the book.  The back-off is a good sign of that.  But at the same time, their very public crash has led me to a retrospection of my own architecture designs.

House of Errors

I’m not going to be privy to EA’s postmortems, so I have to piece together evidence from other people’s reports of the errors they’re encountering:

  • Validation failure
  • Failure to properly save game server-side
  • Failure to load games server-side
  • Server ‘queues’
  • Per-server profiles (Endless Tutorial of Death)


There should be at least two different databases in operation here.  First, there are the EA Origin account servers.  These are distinct from the game-login servers and handle the primary login/password authentication and key-code validation.  The second set of databases seems to live on each location server (I’m going to call these location clusters from now on).  These contain the city data, along with a stub account tied to each Origin user.

Here’s how I envision the straight-up login process going.  Successfully logging in to Origin gets you through the first gate and shows you the games you own.  You can launch SimCity from the Origin client.  From the game’s main menu, choosing a server starts the second gate: under normal circumstances, you connect to the location cluster, which in turn calls the Origin account backend to verify your authorization.  This should be relatively quick.
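To make that two-gate flow concrete, here’s a toy sketch in Python. Everything in it (class names, the stub account dictionary, the token scheme) is invented; it only illustrates the shape of the handshake, including the cluster’s own call back into Origin.

```python
class OriginStub:
    """Toy stand-in for the Origin account backend (gate one)."""
    def __init__(self):
        self.accounts = {"mayor": "hunter2"}   # username -> password
        self.sessions = set()

    def login(self, user, password):
        if self.accounts.get(user) != password:
            raise PermissionError("bad credentials")
        token = "tok-" + user
        self.sessions.add(token)
        return token

    def verify(self, token):
        return token in self.sessions


class LocationClusterStub:
    """Toy stand-in for one location server's database (gate two)."""
    def __init__(self, origin):
        self.origin = origin
        self.profiles = {}   # per-server stub accounts + city data

    def join(self, token):
        # The cluster's own call back into Origin: if Origin is slow,
        # every join on this cluster queues up behind this check.
        if not self.origin.verify(token):
            raise PermissionError("Origin rejected session")
        return self.profiles.setdefault(token, {"cities": []})
```

Note where the per-server profile lives: on the cluster, not on Origin, which is consistent with the “Endless Tutorial of Death” players see when they hop servers.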

However, there are a couple of ways this can fail.  The two major ones: the Origin backend could be unresponsive due to unusually high database load, or the location cluster could be blocked under its own load.  Between the two, I’m leaning towards the latter.  Aside from the initial validation errors, Origin as a service doesn’t appear to be blocking.  So the location server is throwing up its hands and telling each client to try back in 20 minutes.

But what could cause that?  An internal API call to the Origin backend shouldn’t be that computationally expensive, even with thousands of users per location…

Database ACID

To ensure that data going into a database is processed and properly stored, there’s a set of properties called ACID: atomicity, consistency, isolation, and durability.  Ideally, all four should be met to ensure something is completely saved.  But that comes at a cost, as parts of the database need to be locked across multiple connections to ensure that only one connection is ever updating a critical section.

A new user being created on the server can potentially lock all reads from the database while it updates an index (e.g., sorted usernames).  At one or two users a second, that should be okay.  But on launch day?  The same goes for anything that’s indexed, whether it be cities or new regions.
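Here’s a tiny SQLite illustration of that index cost (the schema is invented): every INSERT into the indexed table must also update the index’s B-tree, and under a coarse locking scheme a write like that can hold off other connections for its duration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
# The sorted-username index: every new-user INSERT now also rewrites
# this B-tree. In SQLite's default journal mode, a writer locks out
# other connections to the same database while it does so.
db.execute("CREATE UNIQUE INDEX idx_username ON users(username)")

with db:  # one atomic write: table row plus index entry together
    db.execute("INSERT INTO users (username) VALUES (?)", ("mayor_42",))

# The index is what makes lookups by name fast in the first place,
# so it's a trade: cheap reads, more expensive (and lock-prone) writes.
row = db.execute(
    "SELECT id FROM users WHERE username = ?", ("mayor_42",)
).fetchone()
```

One signup a second barely registers; a launch-day flood of them turns that index maintenance into a serialization point.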

And then there are transactions.  Heaven forbid a transaction was updating user records while someone new was joining.  With conflicting transactions, there can be only one.  Everyone else is rolled back and has to try again.

There can be only one!
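That rollback-and-retry dance can be sketched as a loop.  This uses SQLite as a stand-in for whatever EA actually runs, and the schema and function are invented:

```python
import sqlite3

def transfer_simoleons(db, src, dst, amount, retries=3):
    """Move currency between accounts in one transaction.

    If another writer holds the lock, our transaction is rolled back
    and we are the "everyone else" who has to try again.
    """
    for _ in range(retries):
        try:
            with db:  # opens a transaction; commits, or rolls back on error
                db.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE user = ?",
                    (amount, src))
                db.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE user = ?",
                    (amount, dst))
            return True
        except sqlite3.OperationalError:
            continue  # lost the lock fight; retry from the top
    return False
```

Atomicity is what makes the retry safe: a rolled-back transfer leaves both balances untouched, so trying again never double-spends.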

And… what about cheetah speed?  Each tick of the trade import/export routines hits the region tables to determine what goods are available and at what cost.  Each tick also means writing to those tables so that other players can read your exports.

And… given that city save-files are stored server-side, I presume the team has a hard requirement that data loss is not an option (durability).  As such, I expect the databases to be replicated as master and slave.  Replication logging typically cuts throughput, sometimes by as much as 50%, due to the extra locking and writing required.

I feel sorry for the database servers.

I think the dev team is making some strides in removing some of these database contention areas now that they know they exist.  One thing EA definitely isn’t short of is DBA experts armed with debuggers.  I’m going to have to revisit some of my table designs as well to minimize these effects.

Relax, the Zynga Way

Bringing in beefier database servers or new location shards isn’t the only way to tackle this problem, however.  If you’re willing to relax that durability property and push some additional logic into your game servers, then something along the lines of Zynga’s Memcache-MySQL hybrid can work.

There, they interpose memcache between the game servers and the database.  Memcache stores everything in memory as simple key-value pairs, in effect trading memory for performance.  Because of its design, there’s no concept of column indexes (each record is accessible only by its unique key), and it can be spread across multiple machines so that reads and writes are distributed and parallel.  However, its lack of durability has to be made up by pairing it with a MySQL backing store that periodically persists the data.
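A toy version of that pairing, with a plain dict standing in for both memcache and the MySQL backing store (all names here are invented): writes land in memory and are marked dirty, a periodic flush persists them in bulk, and anything written since the last flush is lost on a crash.  That’s the durability trade in miniature.

```python
import time

class WriteBehindCache:
    """Sketch of a memcache-plus-backing-store, write-behind pairing."""

    def __init__(self, backing_store, flush_interval=60.0):
        self.cache = {}                      # "memcache": memory only
        self.dirty = set()                   # keys not yet persisted
        self.backing_store = backing_store   # "MySQL": durable store
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()

    def get(self, key):
        if key not in self.cache:            # miss: fall through to store
            self.cache[key] = self.backing_store.get(key)
        return self.cache[key]

    def put(self, key, value):
        self.cache[key] = value              # fast, memory-only write
        self.dirty.add(key)
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        for key in self.dirty:               # persist dirty keys in bulk
            self.backing_store[key] = self.cache[key]
        self.dirty.clear()
        self.last_flush = time.monotonic()
```

Reads and writes never block on the database; the price is the window of un-flushed data, which is acceptable for crops on a farm and much less so for money.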

That works fine for something like Farmville, and they’ve carried that work further into Membase/Couchbase Server.

Services, Services, Services

The more I think about this, the more I am convinced that the varying ACID requirements for login, transactions, location, chat logging, and markets point toward splitting these things into their own databases.  The data vulnerability in the Zynga cache method would be totally unacceptable for a trade transaction or the in-game store, while chat logging and location services wouldn’t be picky about being flushed only once a minute.
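One way to make that split explicit is a per-service durability policy.  This configuration sketch is entirely invented, but it captures the idea: services with money on the line commit synchronously to a relational store, and loss-tolerant services get the relaxed, cache-backed path.

```python
# Hypothetical mapping from service to storage policy. "sql" means a
# durable, synchronously committed relational store; "memcache" means
# the relaxed write-behind path with a periodic flush.
DURABILITY_POLICY = {
    "login":      {"store": "sql",      "commit": "synchronous"},
    "market":     {"store": "sql",      "commit": "synchronous"},  # real money
    "city_saves": {"store": "sql",      "commit": "synchronous",
                   "replicated": True},                            # no data loss
    "chat_log":   {"store": "memcache", "flush_every_s": 60},      # loss OK
    "location":   {"store": "memcache", "flush_every_s": 60},
}

def backend_for(service):
    """Look up which storage path a service should use."""
    return DURABILITY_POLICY[service]["store"]
```

Each database then only pays for the guarantees its service actually needs, instead of every write inheriting the strictest requirement in the system.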

A new problem then arises from such a loosely coupled distributed system.  The CAP theorem states that we can have consistency, availability, or partition tolerance, but only two of the three.  Ah… it never ends.


One comment

  1. Dude, I wish I knew all the awesome things you know. Great post! Since day one of the launch I’ve wanted to be on the SimCity team to see what happened and be part of the drama. There’s something weird about that.

    Anyway! I’ve heard some rumors that would give credence to the theory that Origin was the initial chokepoint. Also, given the news about the way pathfinding is broken on roads, it wouldn’t surprise me if there are more fundamental issues in other areas. Until this post I thought it may have been bloated data, like XML or JSON instead of binary, or a poor balance of client vs. server processing, especially when someone stated that the game is simulated server-side. Either would increase load and reduce the bandwidth for new logins, as well as decreasing stability in running games.

    Unfortunate that these are the kind of things you only find out about years later, when you’re on the same team as one of the devs.


