Thursday, March 23, 2017

The Bad of Event Sourcing - The Pains of Wrongly Designed Aggregates

Event Sourcing is a brilliant solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. In June, I already blogged about the things I would do differently next time. But after attending another introduction to Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugliness to prepare you for the road ahead. After having dedicated the last two posts to the good of event sourcing, let’s talk about some of the pains we went through.

Designing your domain based on ownership

When I started practicing Domain-Driven Design almost 9 years ago, I thought I knew all about aggregates, value objects, repositories, domain services and bounded contexts. I had read Eric Evans' blue book, Jimmy Nilsson's white book and various papers such as InfoQ's DDD Quickly. Our main driver for designing aggregates was ownership: who owns what property or relationship. We designed for optimistic concurrency, which meant that we needed the version of the aggregate to detect concurrent updates. The same applied to relationships (or association properties on a technical level): if it was important to protect the number of children a parent has, you needed to add the child through the parent. Since we were using an OR/M, we could rely on its built-in versioning mechanism to detect such violations. Of course, I had not heard of eventual consistency or consistency boundaries yet, and Vaughn Vernon had not published his brilliant three-part series yet. In short, I was approaching DDD as a bunch of technical design patterns rather than the business-driven design experience it is supposed to be.

Relying on eventually consistent projections

Because we didn't consider the business rules (or invariants) enough while designing our aggregates, we often needed to use projections to verify certain functional scenarios. We knew that those projections were not transactionally consistent with the domain, and that other web requests could affect those projections while we were using them. But the functional requirements allowed us to work around this for the most part. Until the point that we wanted to make a projection run asynchronously, of course. That's the point where we either had to stick to an (expensive) synchronous projector, or accept the fact that we couldn't entirely protect a business rule. Next time, I'll make sure to consider the consistency requirements of each business rule. In other words, if the system must protect a rule at all costs, design the aggregates for it. If not, assume the rule is not consistent and provide functional compensation for it.

Bad choice in aggregate keys

As many information management systems do, we had an aggregate to represent users. Each user was uniquely identified by his or her username and all was fine and dandy. All the other aggregates would refer to those users by their username. Then, at a later point in time, we introduced support for importing users from Active Directory. That sounded pretty trivial, until we discovered that Active Directory allows you to change somebody's username. So we had based our aggregate key on something that can change (and may not even be unique), including the domain events that such an aggregate emits. And since a big part of the system uses users to determine authorization policies, this affected the system in many painful ways. We managed to apply some magic to convert the usernames to a deterministic Guid (ping me for the algorithm), but it still was a lot of work. Next time, I will just accept that no functional key is stable enough to be the aggregate key and start from a Guid instead.
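For illustration, here's a minimal sketch of one common way to derive a stable Guid from a string by hashing it. This is not necessarily the algorithm we used.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DeterministicGuid
{
    // Hash the input and use the 16-byte digest as the Guid. MD5 is
    // fine here because this is about identity, not security.
    public static Guid FromString(string input)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            return new Guid(hash);
        }
    }
}
```

Since usernames are typically case-insensitive, you'd normalize the input first, e.g. DeterministicGuid.FromString(username.ToUpperInvariant()).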

Using domain events as a way to externalize business rules

The system that I worked on is highly customizable and has a domain with many extension points to influence the functional behavior. At that time, before we converted the system to Event Sourcing, we used Udi Dahan's domain event implementation to have the domain raise events from inside the aggregates. We could then introduce domain event handlers that hook into these events and provide the customized behavior without altering the core domain. This worked pretty well for a while, in particular because those events were essentially well-defined contracts. With some specialized plumbing we made sure all that code would run under the same unit of work and therefore behaved transactionally.
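For those who don't know that pattern, a stripped-down sketch of the idea looks something like this; the real implementation deals with scoping and container integration as well.

```csharp
using System;
using System.Collections.Generic;

// A stripped-down take on Udi Dahan's domain event pattern: aggregates
// raise events through a static dispatcher, and handlers registered at
// startup provide the pluggable behavior.
public static class DomainEvents
{
    [ThreadStatic]
    private static List<Delegate> handlers;

    public static void Register<T>(Action<T> handler)
    {
        if (handlers == null)
            handlers = new List<Delegate>();

        handlers.Add(handler);
    }

    public static void Raise<T>(T @event)
    {
        if (handlers == null)
            return;

        foreach (Delegate handler in handlers)
        {
            if (handler is Action<T> action)
                action(@event);
        }
    }
}
```

An aggregate would then call DomainEvents.Raise(...) from inside its methods, and a customization package would call DomainEvents.Register&lt;T&gt;(...) at startup, without the core domain knowing anything about it.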

But when we switched to event sourcing, this mechanism became a whole lot less useful. We had to make decisions on many aspects. Are the events the aggregates emit the same as domain events? Should we still raise them from inside the aggregate? Or wait until the aggregate changes have been persisted to the event store? It took a while until we completely embraced the notion that an event is something that has already happened and should never be used to protect other invariants. The cases that did misuse them were converted into domain services or resolved by redesigning the aggregate boundaries. You can still use the events as a way to communicate from one aggregate to another, but then you either need to keep the changes within the same database transaction, or use sagas or process managers to handle compensation or retries, as the sketch below illustrates.
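A hypothetical example of that last option, borrowing the permit-to-work example from the next section; all shapes are made up for illustration.

```csharp
using System;

// Illustrative messages; not from the actual system.
public record RiskAssessmentLevelDemoted(Guid PermitId);
public record DisbandRiskAssessmentTeam(Guid PermitId);

public interface IDispatchCommands
{
    void Dispatch(object command);
}

// A process manager reacts to an event that has already happened and
// drives the next aggregate through a new command. If dispatching
// fails, the messaging infrastructure can retry or compensate; the
// original event is never rolled back.
public class RiskAssessmentProcessManager
{
    private readonly IDispatchCommands dispatcher;

    public RiskAssessmentProcessManager(IDispatchCommands dispatcher)
    {
        this.dispatcher = dispatcher;
    }

    public void Handle(RiskAssessmentLevelDemoted @event)
    {
        dispatcher.Dispatch(new DisbandRiskAssessmentTeam(@event.PermitId));
    }
}
```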

Domain-specific value types in events

Remember the story about how we chose the wrong functional key for a user and had to touch a large part of the code base to fix that? As with any bad situation, people will try to come up with measures that prevent this from happening in the first place. Consequently, we decided to no longer use primitive types directly in our code-base, and to introduce domain-specific types for almost everything. For instance, a user was now identified by a UserId object with value semantics. So whether it contained a Guid or a simple string was no longer of concern for anything but that type itself.
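A minimal sketch of what such a type could look like; the validation rule is just an example.

```csharp
using System;

// A domain-specific identifier with value semantics: two UserIds are
// equal when their underlying values are equal. What it wraps is an
// implementation detail hidden from the rest of the code-base.
public sealed class UserId : IEquatable<UserId>
{
    private readonly Guid value;

    public UserId(Guid value)
    {
        if (value == Guid.Empty)
            throw new ArgumentException("A user id cannot be empty.", nameof(value));

        this.value = value;
    }

    public bool Equals(UserId other) => other != null && value == other.value;

    public override bool Equals(object obj) => Equals(obj as UserId);

    public override int GetHashCode() => value.GetHashCode();

    public override string ToString() => value.ToString();
}
```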

But as often happens with new 'practices', we applied it way too dogmatically. We used those types everywhere: in commands, aggregates, projections and even events. Naïve as we were, we didn't realize that this would introduce a tremendous amount of coupling throughout the entire code-base. And I didn't even mention the horror of somebody changing the internal constraints of a value type, causing crashes because an old event couldn't be deserialized since its value no longer met the new constraints. Having learned our lessons, nowadays we make sure we consider the boundaries of such value types. You may still use them in the aggregates, domain services and value objects within the same bounded context, but never in commands, projections and events.

Event Granularity

Consider a permit-to-work process in which the risk assessment level can be determined to be 1 or 2. If the level is 2, the process requires a risk assessment team to be formed that will identify the real-world risks involved in the work. However, if the risk level is reduced to 1, the team can be disbanded. To model the intent of this change, we have two options. Either we capture this situation by first emitting a couple of fine-grained MemberRemovedFromRiskAssessmentTeam domain events, followed by a RiskAssessmentLevelChanged domain event, or we capture it as a single RiskAssessmentLevelDemoted event. So which is better?
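To make the two options concrete, here are illustrative definitions of those events; the payloads are assumptions.

```csharp
using System;

// Option 1: fine-grained events that spell out every consequence.
public record MemberRemovedFromRiskAssessmentTeam(Guid PermitId, Guid MemberId);
public record RiskAssessmentLevelChanged(Guid PermitId, int NewLevel);

// Option 2: a single coarse-grained event that captures the business
// intent. Every consumer must know that a demotion implies both the
// level change and the disbandment of the team.
public record RiskAssessmentLevelDemoted(Guid PermitId);
```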

Considering the fact that we're practicing Domain-Driven Design, I guess most people will go for the coarse-grained RiskAssessmentLevelDemoted event. And indeed, it does properly capture the actual thing that happened. But it has a drawback as well: both the domain and the projection logic must know to interpret that event as a demotion of the actual level and the disbandment of the risk assessment team. But what happens if the expected behavior changes in a later version of the software? Maybe the rules change in such a way that the team needs to be kept intact, even if the level changes. If you take the coarse-grained event path, you will need to duplicate that interpretation logic; we don't share code between the command and query sides in a CQRS architecture style. And what happens when you rebuild domain aggregates from events that were stored before the software update was completed? There's no ultimate answer here, but considering the relatively high rate of change and configurability in our system's business rules, we chose fine-grained events.

Event Versioning

Much has been written about event versioning and there are plenty of examples of how to convert one or more older events into a couple of new events. We use NEventStore, which provides a simple event upconversion hook out of the box. That library uses Newtonsoft’s Json.NET to serialize the events into its underlying storage, which, by default, includes the .NET runtime type of the event in the JSON payload. This has caused us some pain. What if the .NET namespace of the event type changes because somebody refactored some of the code? Or what if the event is moved from one DLL to another because we decided to split one project in two, or merge two projects into one? We had to tweak the JSON serializer considerably to ensure it would ignore most of the run-time type info and find a reasonably stable mechanism to match a JSON-serialized event with its run-time .NET type. Maybe there's an event store implementation that solves this for you, but we have not come across one yet.
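One way to get that stability, sketched below, is to give Json.NET a custom binder that maps a short, namespace-free name to the current CLR type. This is an illustration, not the exact tweak we applied.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json.Serialization;

// Maps stable, namespace-free names to the current CLR types, so
// renaming a namespace or moving an event to another assembly does
// not break deserialization of old payloads.
public class StableEventTypeBinder : ISerializationBinder
{
    private readonly Dictionary<string, Type> nameToType = new();
    private readonly Dictionary<Type, string> typeToName = new();

    public StableEventTypeBinder(IEnumerable<Type> eventTypes)
    {
        foreach (Type type in eventTypes)
        {
            // Only the short type name ends up in the JSON payload.
            nameToType[type.Name] = type;
            typeToName[type] = type.Name;
        }
    }

    public Type BindToType(string assemblyName, string typeName) => nameToType[typeName];

    public void BindToName(Type serializedType, out string assemblyName, out string typeName)
    {
        assemblyName = null;
        typeName = typeToName[serializedType];
    }
}
```

You'd plug this into JsonSerializerSettings.SerializationBinder together with TypeNameHandling, so the payload still records a type name, but one you control.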

Bugs

"Great developers don't write bugs", I often hear some of my colleagues say, but somehow I keep running into apparently less-than-great developers. So bugs are inevitable. We didn't have too many problems with bugs affecting the domain. Either we could handle them by changing the way the existing events were used to rebuild the domain, or by repairing flawed events using smart upconverters. However, I do remember a particularly painful bug that was really hard to fix. One of our background processes was responsible for importing information from a back-office system into the domain. It would collect the changes from that system and translate them into a couple of commands that were handled by the domain. All was well for a while, until we got some reports about some weird timeouts in the system. After some investigation we discovered that a single aggregate had more than a million events associated with it. Considering the average of a handful of events per aggregate instance, this was a serious problem. Apparently the aggregate root contained a bug that caused it to be not as idempotent as it should be, emitting new events for things that didn't even change. We finally managed to fix this by marking the stream as archivable, a concept we built ourselves. But it most definitely wasn't fun….
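The guard that would have prevented this is trivial in hindsight. A sketch with made-up names:

```csharp
using System;

public record PhoneNumberChanged(string NewNumber);

public class User
{
    private string phoneNumber;

    public void ChangePhoneNumber(string newNumber)
    {
        // Without this check, every import run appends another event,
        // even when nothing actually changed.
        if (string.Equals(phoneNumber, newNumber, StringComparison.Ordinal))
            return;

        Apply(new PhoneNumberChanged(newNumber));
    }

    private void Apply(PhoneNumberChanged @event)
    {
        // A real aggregate would also add the event to its list of
        // uncommitted changes so it gets persisted to the store.
        phoneNumber = @event.NewNumber;
    }
}
```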

What about you?

So what do you think? Do you recognize the challenges of designing your aggregates correctly? If not, what is your primary source for guidance on this complicated topic? I'd love to know what you think about this by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for better aggregates.

Sunday, February 19, 2017

The Good of Event Sourcing - Conflict Handling, Replication and Domain Evolution

Event Sourcing is a brilliant solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. In June, I already blogged about the things I would do differently next time. But after attending another introductory presentation about Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugliness to prepare you for the road ahead. After talking about the awesomeness of projections last time, let’s talk about some more goodness.

Conflict handling becomes a business concern

Many modern systems are built with optimistic concurrency as the underlying concurrency principle. As long as two or more people are not doing things on the same domain object, everything is fine. But if they do, the first one that manages to get its changes handled by the domain wins. When the number of concurrent users is low this is usually fine, but in a highly concurrent website such as an online bookstore like Amazon, this is going to fall apart very quickly. At Domain-Driven Design Europe, Udi Dahan mentioned a strategy in which completing an order for a book would never fail, even if you're out of books. Instead, he would hold the order for a certain period of time to see if new stock could be used to fulfil the order after all. Only after this time expired would he send an email to the customer and reimburse the money. Most systems don't need this kind of sophistication, but event sourcing does offer a strategy that gives you more control over how conflicts should be handled: event merging. Consider the diagram below.

[Diagram: two people concurrently appending events to the same user's stream, which diverges after revision 3]

The red line denotes the state of the domain entity (the user with ID User-Dedo) as seen by two people concurrently working on this system. First, the user was created, then a role was granted and finally his or her phone number was added. Considering there were three events at the time this all started, the revision was 3. Now consider those two people doing administrative work and thereby causing changes without knowing about each other. The left side of the diagram depicts one of them changing the password and revoking a role, whereas the right side shows the other person granting a role and changing the password as well.

When the time comes to commit the changes of the right side into the left side, the system should be able to detect that the left side already has two events since revision 3 and is now at revision 5. It then needs to cross-reference each of these with the two events from the right side to see if there are conflicts. Is the GrantedRoleEvent from the right side in conflict with the PasswordChangedEvent on the left? Probably not, so we append that event to the left side. But who are we to decide? Can we take decisions like these? No, and that's a good thing. It’s the job of the business to decide that. Who else is better at understanding the domain?

Continuing with our event merging process, let's compare that GrantedRoleEvent with the RoleRevokedEvent on the left. If these events were acting on the same role, we would have to consult the business again. But since we know that in this case they dealt with different roles, we can safely merge the event into the left side and give it revision 6. Now what about those attempts to change the passwords at almost the same time? Well, after talking to our product owner, we learned that taking the last password was fine. The only caveat is that the clocks of different machines can vary up to five minutes, but the business decided to ignore that for now. Just imagine what would happen if you let the developers make that decision. They would probably introduce a very complicated protocol to ensure consistency whatever the cost….
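Mechanically, such a merge can be as simple as the sketch below; the interesting part is the predicate, because that is where the business decisions live. All names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Checks the events we are about to commit against the events that
// were committed since we loaded the aggregate. The predicate encodes
// the business decisions, e.g. "granting a role doesn't conflict with
// a password change, but revoking that same role does".
public class ConflictResolver
{
    private readonly Func<object, object, bool> conflictsWith;

    public ConflictResolver(Func<object, object, bool> conflictsWith)
    {
        this.conflictsWith = conflictsWith;
    }

    public bool CanMerge(IReadOnlyList<object> committedSinceLoad, IReadOnlyList<object> attempted)
    {
        return !attempted.Any(candidate =>
            committedSinceLoad.Any(committed => conflictsWith(committed, candidate)));
    }
}
```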

Distributed Event Sourcing

As you know, an event store contains an immutable history of everything that ever happened in the domain. Also, most (if not all) event store implementations allow you to enumerate the store in order of occurrence, starting at a specific point in time. Usually, those events are grouped in functional transactions that represent the events that were applied on the same entity at the same point in time. NEventStore, for example, calls these commits and uses a checkpoint number to identify such a transaction. With that knowledge, you can imagine that an event store is a great vehicle for application-level replication. You replicate one transaction at a time and use the checkpoint of that transaction as a way to continue in case the replication process gets interrupted somehow. The diagram below illustrates this process.

[Diagram: transactions being replicated from the event store on one node to another node, checkpoint by checkpoint]

As you can see, the receiving side only has to append the incoming transactions to its local event store and ask the projection code to update its projections. Pretty neat, isn't it? If you want, you can even optimize the replication process by amending the transactions in the event store with metadata such as a partitioning key. The replication code can use that information to limit the amount of data that needs to be replicated. And what if your event store has the notion of archivability? In that case, you can even exclude certain transactions entirely from the replication process. The options you have are countless.
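The core of such a replicator could look like the sketch below. The abstractions are assumptions, loosely modeled after NEventStore's commits and checkpoints, not its actual API.

```csharp
using System.Collections.Generic;

// Illustrative abstractions; not NEventStore's real interfaces.
public record Commit(long Checkpoint, string StreamId, IReadOnlyList<object> Events);

public interface IEventStoreNode
{
    IEnumerable<Commit> GetCommitsSince(long checkpoint);
    void Append(Commit commit);
}

public class Replicator
{
    private long lastCheckpoint;

    public void ReplicateOnce(IEventStoreNode source, IEventStoreNode target)
    {
        foreach (Commit commit in source.GetCommitsSince(lastCheckpoint))
        {
            target.Append(commit);

            // Remembering the checkpoint after each append lets an
            // interrupted run resume exactly where it left off.
            lastCheckpoint = commit.Checkpoint;
        }
    }
}
```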

But if the replication process involves many nodes, a common problem is that you need to make sure that the code used by all these nodes is in sync. However, using the event store as the vehicle for replication enables another trick: staggered roll-outs. Since events are immutable, any change in the event schema requires the code to keep supporting every possible past version of a certain event. Most implementations use some kind of automatic event up-conversion mechanism so that the domain only needs to support the latest version. Because of that, it becomes completely feasible for a node that is running a newer version of the code to keep receiving events from an older node. It will just convert those older events into whatever it expects and continue the projection process. It will be a bit more challenging, but with some extra work you could even support the opposite: the older node would still store the newer events, but hold off the projection work until it has been upgraded to the correct version. Nonetheless, upgrading individual nodes probably already provides sufficient flexibility.
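As an illustration, here's what an upconverter could look like against NEventStore's up-conversion hook; the event shapes are made up.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using NEventStore.Conversion;

// Hypothetical versions of the same event: V1 identified the user by
// username, V2 by a Guid.
public class UserCreatedV1
{
    public string Username { get; set; }
}

public class UserCreatedV2
{
    public Guid UserId { get; set; }
    public string Username { get; set; }
}

// The store applies this while reading, so the domain and the
// projections only ever see the latest version of the event.
public class UserCreatedUpconverter : IUpconvertEvents<UserCreatedV1, UserCreatedV2>
{
    public UserCreatedV2 Convert(UserCreatedV1 sourceEvent)
    {
        return new UserCreatedV2
        {
            UserId = DeterministicGuid(sourceEvent.Username.ToUpperInvariant()),
            Username = sourceEvent.Username
        };
    }

    // Derive a stable Guid from the old functional key by hashing it.
    private static Guid DeterministicGuid(string input)
    {
        using (var md5 = MD5.Create())
        {
            return new Guid(md5.ComputeHash(Encoding.UTF8.GetBytes(input)));
        }
    }
}
```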

Keeping your domain and the business aligned

So you've been doing Event Storming, domain modeling, brainstorming and all the other techniques that promise you a chance to peek into the brains of those business people. Then you and your team go off to build an awesome domain model from one of the identified bounded contexts and everybody is happy. Then, after a certain period of production happiness, a new release is being built and you and the developers discover that you got the aggregate boundaries all wrong. You simply can't accommodate the new invariants in your existing aggregate roots. Now what? Build a new system? Make everything eventually consistent and introduce process managers and sagas to handle those edge cases? You could, but remember, your aggregates are rehydrated from the event store using the events. So why not just redesign or refactor your domain instead? Just split that big aggregate into two or more, convert that entity into a value object, or merge that weird value object into an existing entity. Sure, you would need more sophisticated event converters that know how to combine or split one or more event streams, but that is surely cheaper than completely rebuilding your system…..

[Diagram: refactoring the domain by splitting an existing event stream into new aggregate streams]
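A sketch of what splitting one stream into two could look like during such an upgrade; the routing predicate and all names are made up.

```csharp
using System;
using System.Collections.Generic;

// Routes each historical event to the stream of the aggregate that
// owns it after the redesign. The predicate decides which events move
// to the newly carved-out aggregate.
public class StreamSplitter
{
    public (List<object> Original, List<object> CarvedOut) Split(
        IEnumerable<object> events,
        Func<object, bool> belongsToNewAggregate)
    {
        var original = new List<object>();
        var carvedOut = new List<object>();

        foreach (object @event in events)
        {
            (belongsToNewAggregate(@event) ? carvedOut : original).Add(@event);
        }

        return (original, carvedOut);
    }
}
```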

Now if you were not convinced yet after reading my last post, are you now? Do you see other cool things you could do? Or do you have any concerns to share? I'd love to hear about that by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for a better architecture style.

Saturday, February 11, 2017

The Good of Event Sourcing - Projections

It was in 2009 in Utrecht, The Netherlands, that I first learned about the Event Sourcing and Command Query Responsibility Segregation (CQRS) patterns at a training Greg Young gave there. I remember being awed by the scalability and architectural simplicity those styles provided. However, I also remember the technical complexity that comes with them. In 2012, I led the effort to migrate a CQRS-based system to Event Sourcing. I knew it would be non-trivial, but I still underestimated the number of new challenges I would run into over the course of four years. During that time, I've experienced first-hand how a large group of developers had to deal with the transition. Since then I've talked about this many times, both in the Netherlands and abroad.

Event Sourcing is a brilliant solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. In June, I already blogged about the things I would do differently next time. But after attending another introduction to Event Sourcing recently, I realized it is time to talk about some real experiences. In this multi-part series, I will share the good, the bad and the ugly of Event Sourcing to prepare you for the road ahead. Let's start with the good.

The power of projections

Event Sourcing requires you to store the domain changes as a series of historical, intention-revealing events. Because of this simple structure, you can't run the kind of queries you may have been used to when working with relational databases. Instead, you'll need to build and maintain queryable representations of those events. However, these projections can be optimized for the purpose they serve. So if your user interface requires the data to be grouped in a certain way, you can store the data pre-grouped in your persistent storage. By the time the query is executed, the data no longer needs any grouping, thereby off-loading your database. You can do the same with aggregated calculations, like a count per grouping. What's important to realize is that you'll end up with multiple autonomous projections that are built from the same events and have a single purpose. Consequently, duplication of data is a very normal thing in Event Sourcing.
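To give you an idea, here's a minimal sketch of such a single-purpose projection, keeping a pre-aggregated count per role; the event types are illustrative.

```csharp
using System.Collections.Generic;

public record RoleGranted(string UserId, string Role);
public record RoleRevoked(string UserId, string Role);

// Maintains exactly the answer the UI needs, so querying becomes a
// simple lookup instead of a GROUP BY over the raw data.
public class UsersPerRoleProjection
{
    private readonly Dictionary<string, int> usersPerRole = new();

    public void Handle(RoleGranted @event)
    {
        usersPerRole.TryGetValue(@event.Role, out int count);
        usersPerRole[@event.Role] = count + 1;
    }

    public void Handle(RoleRevoked @event)
    {
        if (usersPerRole.TryGetValue(@event.Role, out int count) && count > 0)
            usersPerRole[@event.Role] = count - 1;
    }

    public int GetUserCount(string role) =>
        usersPerRole.TryGetValue(role, out int count) ? count : 0;
}
```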

An added benefit of using projections is that there's no technical dependency between the events in the event store and the projection code that uses them. Storing the projections in a completely different database (or cluster) from the event store is perfectly fine. In fact, since each projection is completely independent of the others in terms of the data that it uses, you can easily partition the underlying tables without introducing any side-effects. But that is not all. Considering that same autonomy, why would you use the same storage mechanism for all projections? Maybe you have a projection that doesn't contain that much data and can be rebuilt in-memory at start-up. Or what about a projection that is written to a local embedded version of RavenDB instead of a relatively slow relational database? Especially in load-balanced scenarios a shared database can be a bottleneck. Having the option to keep the projection on the (load-balanced) front-end machines increases scalability and avoids network overhead.

Independence of place and time

Having discussed that, you might wonder whether these projections need to be in sync with the domain at all. In other words, do you need to update the projection in the same call or transaction that triggered the event in the first place? The answer to that (and many other design challenges) is: it depends. I generally prefer to run the projection code asynchronously from the command handling. It gives you the most flexibility and allows you to reason about a projection without the need to consider anything else. It also enables you to decide how and when that projection is rebuilt. You can even have projections that represent the domain at a certain point in time, simply by projecting the events up to that point. How cool is that? However, if the accuracy of a particular projection is important for handling a command, you may decide to treat it differently. Be aware though, if you decide to update your domain and projection in the same database transaction, it will hurt performance and scalability.
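That point-in-time trick is worth a sketch. Assuming every stored event carries its timestamp, a rebuild simply stops replaying at the chosen moment:

```csharp
using System;
using System.Collections.Generic;

public record StoredEvent(DateTime OccurredAt, object Body);

public static class ProjectionRebuilder
{
    // Replays an ordered event stream into a fresh projection,
    // ignoring everything that happened after the requested moment.
    public static TProjection RebuildUntil<TProjection>(
        IEnumerable<StoredEvent> orderedEvents,
        DateTime pointInTime,
        TProjection projection,
        Action<TProjection, object> apply)
    {
        foreach (StoredEvent stored in orderedEvents)
        {
            if (stored.OccurredAt > pointInTime)
                break;

            apply(projection, stored.Body);
        }

        return projection;
    }
}
```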

Now, one more thing. Given how autonomous each projection is and the way it is optimized to give you the aggregated data in a format that suits your needs, you can imagine how it resolves the friction between the object-oriented world and the relational database world. In fact, you don't need Object-Relational Mappers like NHibernate or Entity Framework at all. A solution that uses raw SQL, a micro-ORM like Dapper, or a NoSQL solution like RavenDB will work perfectly fine.
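For example, reading a projection with Dapper is a single flat query, since the table already has the shape the screen needs; the table and column names below are made up.

```csharp
using System.Collections.Generic;
using System.Data;
using Dapper;

public class DocumentCountRow
{
    public string Department { get; set; }
    public int DocumentCount { get; set; }
}

public class DocumentCountReader
{
    private readonly IDbConnection connection;

    public DocumentCountReader(IDbConnection connection)
    {
        this.connection = connection;
    }

    public IEnumerable<DocumentCountRow> GetCountsPerDepartment()
    {
        // No joins, no grouping; the projection did that work already.
        return connection.Query<DocumentCountRow>(
            "SELECT Department, DocumentCount FROM DocumentCountsPerDepartment");
    }
}
```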

Building a reporting node

Quite often, the first question you get when you tell somebody about event sourcing is how you build reports from that. Since all your events are persisted in a single database table (or whatever storage mechanism you use), it is non-trivial to connect your ETL product of choice to it. But you do have two options here. First, you could build an ATOM feed on top of your event store so that more sophisticated ETL products can subscribe to what happens in the event store; Greg Young's Event Store does just that. Second, if you really need a traditional relational model to dig through, you're not out of luck yet: just build another set of projections that look like a relational database model, running completely asynchronously from the other projections and your domain. You could even build some kind of replication process that allows you to run those projections on a completely different machine.

So what are your experiences with Event Sourcing? Any interesting usages of projections that you would not have been able to do without Event Sourcing? I'd love to know what you think by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for better projections.