Tuesday, June 27, 2017

The Ugly of Event Sourcing - Projection Schema Changes

Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. Last year, I already blogged about the things I would do differently next time. But after attending another introductory presentation about Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugliness to prepare you for the road ahead. After having dedicated the last posts on the pains of wrongly designed aggregates, it is time to talk about the ugliness of dealing with projection schema changes.

rebuilding_walls_large1

As I explained in the beginning of this series, projections in event sourcing are a very powerful concept that provide ample opportunities to optimize the performance of your system. However, as far as I'm concerned, they also offer you the most painful challenges. Projections are great if their structure or the way they interpret event streams don't change. But as soon as any of these change, you'll be faced with the problem of increasing rebuild times. The bigger your database becomes, the longer rebuilding will take. And considering the nature of databases, this problem tends to grow non-linearly. Over the years we've experimented and implemented various solutions to keep this process to a minimum.

Side-by-side projections

The first step we made was by exploiting the fact that the event store is an append-only database. By rebuilding a new set of projections next to the original ones, while the system is still being used, we could reduce the amount of down-time to a minimum. We simply tracked the checkpoint of the latest change to the event store when that rebuild process started and continued until all projections were rebuild up to that point. Only then did we need to bring down the system to project the remainder of the changes that were added to the event store in the mean time. By repeating the first stage a couple of times, this solution could reduce the down time to a couple of seconds. However, it did mean somebody needed to monitor the upgrade process in case something failed and it had to be restarted. So we still had to find a way to reduce that time even more.

Archivability

The situation may be different in your domain, but in ours, a lot of the data had a pretty short lifecycle, typically between 7 and 30 days. And the only reason why people would still look for that old data, is to use it as a template for further work. To benefit from that, we started to track graphs of aggregates that are used together and introduced a job that would update that graph whenever an aggregate reached its functional end-of-life. Then, whenever the graph was 'closed', it would mark the corresponding event stream as archivable. This would eventually be used by another job to mark all events of the involved streams with an archivability date. With that, we essentially enriched the event store with metadata that individual projections could use to make smart decisions about the work that needed to be done. By allowing some of the more expensive projections to run asynchronously and keeping track of their own progress, we could exclude them from the normal migration process. This caused a tremendous reduction of the total migration time, especially by those projections that exploited the archivable state of the event store. And as a nice bonus, it allows you to rebuild individual projections in production in case some kind of high-priority temporary fix is needed that requires schema changes or a repair of a corrupted projection.

Projections-in-flight

But this autonomy introduces a new challenge. The data projected by those projections would not become available up until a while after the system started. Worse, because the events are still being processed by the projection, it might be possible that queries would return data that is half-way projected and in the wrong state. Whether the first is a real problem is a functional discussion. Maybe adding the date of the last event projected or an ETA telling the end-user how long it will take to complete the projection work is sufficient. Being able to to do that does require some infrastructure in your projection code that allows you to get a decent ETA calculation. Showing data in the wrong state could cause some pretty serious problems to end-users. But even that can sometimes be handled in a more functional way. If that's not possible, you might be able to exploit the specific purpose and attributes of that projection to filter out half-projected data. For instance, maybe that projection is supposed to only show documents in the closed state. So as long as the projection data doesn't represent that state, you can exclude those from the results.

Not all projections are equal

With the introduction of autonomous projections that provide tracking information and ETA calculation, you can do one more thing to speed up the migration process; prioritization of projections. If you have many asynchronous projections (which you should), it is very likely that some of them are more crucial for the end-users than others. So why would you have them run all at the same time. Maybe it makes sense to hold off some of them until the critical ones have completed, or maybe the projection gets rebuild in-memory every time the system restarts. Another option you now have is that an individual projection is rebuild by processing the event store more than once, thereby focusing on the most recent or relevant data first. This does require the right metadata associated with the events, but most event stores have you covered on this. And if you have associated your events with a (natural) partition key, you could spin up multiple asynchronous projection processes in parallel, each focusing on a particular partition.

To OR/M or not to OR/M

Now, what about the actual technology that you use to write to your underlying projections database? Some have argued that using raw SQL is the fasted method for updating RDBMS-backed projections. Others would say that using an OR/M still has merits, in particular because it has a unit-of-work that allows you to process multiple related events before hitting the database. We've seen teams that use both, but we haven't identified the definitive ultimate solution.

One thing we're planning to see how we can exploit the OR/M solution to break the projection work into large chunks where the projection work happens in memory and is then flushed back to the database. Some first spikes showed a tremendous performance improvement that would be very difficult to do with raw SQL (unless you're building your own implementation of the Unit of Work pattern).

True Blue/Green

Even with all these improvements, rebuilding projections can still take a while to complete. However, if your system is HTTP based (e.g. a web application, a microservice or HTTP API), you can exploit load balancers and HTTP response codes in a pretty neat way to completely automate the migration process. Here's what this process can look like:

  1. Deploy the new application side-by-side with the original version. The website will return HTTP 503 (Service Unavailable) until it has been fully provisioned.
  2. Allow the load balancer to serve both the old and new sites from the same URL
  3. Stage 1 of the out-of-place migration process runs to copy over all events up to the checkpoint that the source database was when the stage started.
  4. Repeat stage 1 two times more to copy over the remainder of the data.
  5. Stage 2 is started to complete the migration, but not before the source application returns HTTP 503 as well. This is the real downtime.
  6. Stage 2 completes, after which the new application becomes responsive again and everybody is happy again.
  7. If stage 2 would fail, it would simply reset the source application's state so that it would no longer return HTTP 503.

Notice how during the migration there's no manual intervention needed to switch DNS entries or fiddle with the load balancer? That's what I would call true blue-green deployments. Even if you use immutable infrastructure where the new application is deployed as a pre-baked cloud machine this will work.

What about you?

So what do you think? Do these solutions make sense to you? Do you even recognize these problems? And if so, what other solutions did you employ to resolve the long rebuilding times? I'd love to know what you think about this by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your Event Sourced systems in an agile world.