Wednesday, November 01, 2017

The Ugly of Event Sourcing - Real-world Production Issues

Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. After having dedicated a post to the challenges of dealing with projection migrations and how to optimize that, it is time to talk about some of the problems that can happen in production.


So you've managed to design your aggregate boundaries properly, optimized your projection rebuilding logic so that migrations from one version to another complete painlessly, but then you face a run-time issue in production that you never saw before. I generally divide these kinds of problems into two categories: those that you run into quite quickly and those that keep you awake outside business hours.

Issues that usually reveal themselves pretty quickly

Something we ran into a couple of times is a change in the maximum length of some aggregate root property. Maybe the title of a product was constrained to 50 characters, which seemed to be a very sensible limit for a long time. But then somebody changes the domain and increases that length. If your projection store isn't prepared for that, you'll end up with truncated data at best or a database error at worst. You could just define that column as being the database's max length, but I know for a fact that this has some serious performance implications on SQL Server. That's why we have the projector explicitly truncate the event data. Something similar but less likely can happen with columns that were supposed to hold a 32-bit integer, but then change into 64-bit longs.
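To make the truncation idea a bit more concrete, here is a minimal Python sketch. The event shape, the `ProductRenamed` name, the 50-character limit and the helper names are all assumptions for illustration; our actual projectors are .NET code and look different.

```python
import logging

logger = logging.getLogger("projections")

# Hypothetical width of the title column in the projection table.
TITLE_MAX_LENGTH = 50


def fit_to_column(value: str, max_length: int, field: str) -> str:
    """Return value cut to max_length, logging a warning when data is lost."""
    if len(value) <= max_length:
        return value
    logger.warning(
        "Truncating %s from %d to %d characters", field, len(value), max_length
    )
    return value[:max_length]


def project_product_renamed(event: dict, row: dict) -> dict:
    """Apply a hypothetical ProductRenamed event to a projection row."""
    row["title"] = fit_to_column(event["title"], TITLE_MAX_LENGTH, "title")
    return row
```

The point of the helper is that the projector, not the database, decides what happens to oversized data, and that the loss is at least visible in the logs.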

Another interesting problem we've run into is an event with a property that everybody expects to hold a valid value, while almost nobody remembers that older versions of that event didn't even have that property. You won't spot that problem during day-to-day development, unless you happen to be running a build against older production databases like we do. The more versions you have of an event (we have one that is post-fixed with V5), the more of this knowledge dissipates into history. Unless you test your projectors against every earlier incarnation of an event (instead of relying on upconverters to do their thing), the only thing you can do is to document your events properly.
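As an illustration, a projector can make the "this property may be absent" case explicit instead of assuming it is always there. This is only a sketch: the `OrderPlaced`-style event, its `currency` property and the default value are hypothetical.

```python
# Assumption: what the first version of the event implicitly meant.
DEFAULT_CURRENCY = "EUR"


def project_order_placed(event: dict) -> dict:
    """Project a hypothetical OrderPlaced event from any of its versions."""
    # The first version of this event had no "currency" property at all,
    # so event["currency"] would raise KeyError on old streams.
    currency = event.get("currency") or DEFAULT_CURRENCY
    return {"order_id": event["order_id"], "currency": currency}
```

Running a handler like this against a recorded event from every historical version is the kind of test that would catch the problem before production does.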

So in most cases the developers that change the domain are also the ones that work on the projection code. It's not entirely inconceivable that such a developer makes assumptions about the order in which the projector gets to see the events. We have a guideline that states that you shouldn’t extend events for the purpose of improving the projector, and assuming an order may not feel like a violation of that. Just imagine what happens when somebody alters the domain in such a way that the order changes. And yes, this happened to us as well.

Issues that won't show up until the most inconvenient time

A very common problem in a CQRS architecture is the separation between the domain side (the write side) and the query side (the read side). Somehow those two worlds need to be kept in sync. With Event Sourcing, this is done using events. And though in most cases, the same developer deals with both sides of the system, both sides may evolve independently, especially in bigger code bases. This introduces the risk that the projection code doesn't entirely handle the events the way the domain intended them to be used. At some point somebody will replace, split or merge one or more events in the domain and forget to update the corresponding projections. And this is exactly what happened to us, more than once.

Another class of pain-in-the-butt problems are projectors that have misbehaving or unexpected dependencies. You may remember from one of my earlier posts that we started with a CQRS architecture and a traditional database-backed domain model. We didn’t move to Event Sourcing until much later. To keep that migration as smooth as possible, we introduced synchronous projectors that would maintain the immediately consistent query data as projections. If those synchronous projectors were completely autonomous (as they should be), everything would be fine and we could all go on with our lives.

However, over the years, some unexpected dependencies sneaked into the codebase. Apparently some developer decided it was a good idea to reuse the data that was maintained by another projector. This surfaced in two separate incidents. The first happened when we were rebuilding a projection after a schema upgrade. The projector ended up reading from another projection that was at a state much further in the event store's history. As this didn't cause any crashes, it took us quite some time to figure out why the rebuilt projection contained some unexpected data. The other one was quite similar and was caused by an asynchronous projector relying on the data persisted by a synchronous projector. Again, the autonomy of projectors is a key requirement.

In that respect, lookups can have similar problems even though they must be owned by the projector that maintains the main projection. Reuse of lookups is not that common, but not entirely exceptional either. I've seen lookups that can be used to find recurring things such as looking up the user's full name based on the identity. Since this is quite a common requirement, I can imagine such a lookup being reused. However, how up to date that lookup is must be considered carefully. First, who maintains the lookup and how does the state of the lookup reflect on the projector that relies on it? What happens if they get updated at a different rate? And what if the lookup uses some kind of in-memory LRU cache? How will that work in a web farm? All questions that need to be answered on a case-by-case basis. Although there's no generic guideline here, we tend to ensure a lookup is used and owned by a single projector only. This simplifies the situation a bit and allows us to make more localized decisions on cacheability, exception handling and how that affects the lookups, as well as the accuracy of it.

Those who have been using NEventStore as their storage engine are kind of forced into a model where the event store actively pushes events into the projectors. In other words, the event store tracks whether an event was handled by all projectors or not. So unless your solution wraps the projectors' work in one large database transaction, your projectors need to be idempotent. A common solution is to use the version of the aggregate that is often included in the event to see if that event was already handled. Although that is a pretty naïve solution, it gets worse if you need to aggregate events from multiple aggregates. Do you track two separate versions per projection? Or do you create some separate administration per projection? These kinds of problems led us to believe that we shouldn't use NEventStore anymore.
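The version-based check could look roughly like the Python sketch below. The class, the event fields and the in-memory dictionary standing in for a projection table are assumptions; this is not NEventStore's API.

```python
class ProductProjector:
    """Naive idempotent projector keyed on the aggregate version."""

    def __init__(self):
        # aggregate id -> {"version": int, "title": str}
        self.rows = {}

    def handle(self, event: dict) -> bool:
        """Project the event; return False if it was already handled."""
        row = self.rows.setdefault(
            event["aggregate_id"], {"version": 0, "title": ""}
        )
        if event["version"] <= row["version"]:
            return False  # redelivery of an event we already projected
        row["title"] = event["title"]
        row["version"] = event["version"]
        return True
```

Note that this only works because each row is fed by exactly one aggregate. As soon as a projection combines events from several aggregates, a single version number no longer tells you what has been handled.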

Hey, didn’t I say we only had two categories of problems? I did, but it just happens there is a third undocumented category of problems.

Things you would never expect to happen

To speed up the projection work, at some point we started to experiment with batching a large number of projection operations into a single unit-of-work (we were and are still using NHibernate). But because we didn't want to maintain a database transaction of that size, we relied on the idempotency of the projectors to be able to replay multiple events when any of the projection work failed. This all worked fine for a while, until we got reports about projection exceptions referring to non-null database constraints. After some in-depth analysis, extended logging and painstakingly long debug sessions, we found the following events (no pun intended) happened:

  1. Event number 20 required a projection to be deleted, which it did.
  2. Some more unrelated events were handled, after which the application stopped or crashed for some reason.
  3. After restarting, the process resumed with event 10, which expected this projection to be still there.
  4. Since our code just creates a projection the first time it is referred to, we created a new instance of this projection with all its properties set to default values, except those related to event 10.
  5. This projection got flushed into the database where it ran into a couple of non-null constraints and…boom!

This made us decide to abandon the idea of batching until we managed to reduce the scope of those transactions.

Another interesting problem happened when we got a production report about a unique key violation happening in one of the projection tables. Since that projector maintained a single projection per aggregate and the violation involved the functional key of that aggregate, we were at a loss initially. After requesting a copy of the production database and studying the event streams we discovered two aggregates whose identities were exactly the same except for their casing. Our event store does not treat those identities as equivalent because we started our project with an earlier version of NEventStore that required GUIDs as the stream identities. We convert natural keys to GUIDs by using an algorithm written by Bradley Grainger to generate deterministic GUIDs from strings. However, SQL Server, which serves as our projections store, does not care about casing differences. So even though our event store treated those identities as separate streams, the projection code ran into the database's unique key violation. Fortunately most event store implementations use strings for identifying streams. For our legacy scenario, we decided to generate those GUIDs from the lowercase version of the identity.
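The fix boils down to normalizing the natural key before hashing it into a GUID. The sketch below uses Python's standard `uuid.uuid5` as a stand-in for a name-based GUID generator; it is not claimed to produce the same values as the algorithm we actually use, and the namespace and function name are assumptions.

```python
import uuid

# Assumption: any fixed namespace will do, as long as it never changes.
STREAM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/streams")


def stream_id(natural_key: str) -> uuid.UUID:
    """Derive a deterministic stream id that ignores casing differences."""
    return uuid.uuid5(STREAM_NAMESPACE, natural_key.lower())
```

With this in place, `stream_id("ABC-1")` and `stream_id("abc-1")` map to the same stream, which matches what a case-insensitive projection store will assume anyway.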

In another mystery case we received some complaints that editing a particular document got slower and slower. Reading the data didn't show any issues, but writing definitely did. We quickly concluded that the involved projection was perfectly fine and started to look for bugs in the event store code, transaction management and the way we hydrate aggregates from events. We couldn't find anything out of the ordinary, until we requested a dump of that specific aggregate's event history. We have a special diagnostics page to dump the event stream in JSON format, but somehow that page timed out. We needed to get the actual production database before we discovered a single event stream with over 100K events! Some kind of background job that ran regularly was updating the aggregate pretty often. But since the aggregate method involved didn't check for idempotency, a new event was emitted for each update. After a couple of months this definitely added up. We had to delete the entire event stream from the event store and rebuild the involved projections to resolve the issue.
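The missing idempotency check is tiny once you know you need it. A minimal Python sketch, with a hypothetical `Document` aggregate and `DocumentRenamed` event:

```python
class Document:
    """Hypothetical aggregate that only emits events for real changes."""

    def __init__(self):
        self.title = ""
        self.uncommitted_events = []

    def rename(self, new_title: str) -> None:
        # Idempotency guard: a background job calling this repeatedly with
        # the same value must not grow the event stream.
        if new_title == self.title:
            return
        self._apply({"type": "DocumentRenamed", "title": new_title})

    def _apply(self, event: dict) -> None:
        self.title = event["title"]
        self.uncommitted_events.append(event)
```

Calling `rename("Report")` three times in a row leaves exactly one event in `uncommitted_events` instead of three.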

However, the most painful problem we encountered did not surface until after months of regular load testing. It appeared as if a projector missed some events for some reason. We first assumed the projector itself had a bug, but then we discovered similar problems with other projectors. We also learned that it only happened under high load, so we suspected that the projection plumbing didn't properly roll back the transactions that wrap the projections. We blamed kind of every part of the code base and even looked at the implementation of NEventStore itself. But we never considered the fact that a SQL Server identity column (which we use to identify the order in which we should project events) could result in inserts that complete out of order. So if the second insert completes before the first completes, it is possible that the projector will process that second event before it even had a chance to see the first one. We had to use exclusive locks during event store inserts to prevent this. And since our read-write ratio is 100:1, this doesn't affect our performance in any way. Other event stores have used an alternative solution by just reloading a page of events if a gap is detected.
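The gap-detection alternative can be sketched as follows. This is a simplified Python illustration with assumed names and an assumed `checkpoint` field, not the code of any particular event store: the projector only accepts events that directly follow the last checkpoint it processed, and leaves the rest for the next load.

```python
def take_contiguous(page: list, last_checkpoint: int) -> list:
    """Return the events that follow last_checkpoint without any gap."""
    ready = []
    expected = last_checkpoint + 1
    for event in sorted(page, key=lambda e: e["checkpoint"]):
        if event["checkpoint"] <= last_checkpoint:
            continue  # already processed earlier
        if event["checkpoint"] != expected:
            break  # gap: an earlier insert may still be in flight
        ready.append(event)
        expected += 1
    return ready
```

A caller would project whatever comes back and reload the page a moment later for the rest. Because identity values can also be skipped permanently (for example when an insert is rolled back), a real implementation needs some timeout after which a gap is accepted as final.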

What does that mean for the future?

Well, we did learn from all of this and identified a couple of guidelines that might be useful to you too.

  • Projections should never crash. Always truncate textual data, but log a warning if that happens.
  • If a projector throws and retrying doesn't help (transient exception et al), mark the projection as corrupt so that the UI can handle this.
  • Projectors should be autonomous. In other words, they run independent of other projectors, track their own progress and decide themselves when to rebuild. The consequence of this is that they need to run asynchronously.
  • Build infrastructure to extract individual streams or related streams for diagnostic purposes.
  • Account for case sensitivity of aggregate identities. However, how you handle them depends on the event store implementation and underlying persistency store.
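As an example of the second guideline, retrying and then flagging a projection as corrupt could be sketched like this in Python. The function name, the `corrupt` flag and the retry count are assumptions for illustration.

```python
def project_with_retry(handler, projection: dict, event: dict, attempts: int = 3) -> bool:
    """Try to project an event; mark the projection corrupt if all attempts fail."""
    for _ in range(attempts):
        try:
            handler(projection, event)
            return True
        except Exception:
            # Sketch only: a real projector would retry transient errors
            # and fail fast on everything else.
            continue
    # Retrying didn't help, so flag the projection instead of crashing.
    projection["corrupt"] = True
    return False
```

The UI can then check the flag and show a warning for that one projection while the rest of the system keeps running.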

A lot of the problems described in this post have been the main driving force for us to invest in LiquidProjections, a set of light-weight libraries for building various types of autonomous projectors. But that's a topic for another blog post….

What about you?

Hopefully this will be my last post on the dark side of Event Sourcing, which means I'd love to know whether you recognize any of these problems. Did you run into any other noticeable issues? Or did you find alternative or better solutions? If so, let me know by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your projections in an Event Sourced world.

Tuesday, July 18, 2017

Key takeaways from QCon New York 2017: The Soft Stuff

This year, for the third time since I joined Aviva Solutions, I attended the New York edition of the famous QCon conference organized by InfoQ. As always, this was a very inspiring week with topics on large-scale distributed architecture, microservices, security, APIs, organizational culture and personal development. It also allowed me the mental breathing room to form a holistic view on the way I drive software development myself. After having discussed the tech stuff in the last post, it's time to talk about the softer aspects of our profession.

"Here's my plan. What's wrong with it?"

This year's keynote was one of the most inspiring keynotes ever. Former police officer Matt Sakaguchi, now a senior Google manager who is terminally ill, talked about how important it is to be completely yourself at the office. He called it "bring your whole self to work" and explained how he never could while working in the macho culture that existed at the police force. According to Matt, the prerequisites for that are the feeling of having impact and meaning, and structure and clarity about what is expected from you. He particularly emphasized the need for people to have psychological safety. Nothing is more restraining to someone's potential than feeling shame. It's the responsibility of a manager or team lead to provide such an environment and to make sure that no employee puts another down for asking a "dumb" question. It should be perfectly fine to ask them. In fact, somebody else might be wondering the same, but might be afraid to ask because of that same shame. A nice example of a manager who did that quite well was his former SWAT sergeant. Whenever they had an assignment, the sergeant would start the briefing with the statement "Here's my plan. What's wrong with it?".

Staying relevant when getting older

Something you wouldn't expect at a conference like QCon was a talk about what it takes to stay up-to-date and relevant when getting older. I didn't expect it, but the room was packed, both with young and older people. In fact, after a short inquiry by Don Denoncourt, it appeared only three people were older than 50. That does say something about our profession. Don shared some of the great stories of people who loved what they did until they died of old age, and emphasized that failure makes you stronger. So keep challenging yourself and keep exercising your brain. If you don't, you'll lose the ability to learn.

Don had a couple of suggestions on how to do that. First of all, he wanted us to look at what the best in our profession do and follow their lead. For example, they read code for fun, they might join or start one or more open-source projects, they write blog posts and speak at user groups and conferences. They also discover technologies that other team members are adept at. And finally, they mentor other people. Next to that, Don wants us to understand our learning style. If you’re an auditory person, classroom sessions, audio books or YouTube videos might be the thing for you. But if you're more of a visual person, books and articles might be better suited. However, if you're a kinesthetic type, doing some tech projects in the evenings is probably the most efficient method to gain new knowledge.

He also suggests doing short bursts of learning while waiting for our unit tests to complete, in between Pomodoros, between projects and while job hunting. He even wants us to do some learning during our commute, during workouts, or, when you are a (grand)parent like Don, while taking your (grand)kids out in the stroller. And if you're into a certain topic, be sure to apply multi-pass learning by reading more than one article or book on the same subject. And finally, to make sure you don't run out of learning material, stockpile books, blogs, online courses and videos. And don't forget to accumulate posts from web magazines, newsletters, conferences and seminar videos. A great tool to collect all this is Pocket. Apparently Don and I have more in common than I thought.

Communication is a skill

One third of all projects fail because of poor communications and ineffective listening. At least, this is what Andrea Goulet and Scott Ford told us. And to be clear, failure includes missed deadlines and overrunning budgets, and seems to be pretty traumatic. They also told us that the outcomes we experience, both individually and socially, come from conversations we had, didn't have, did well and didn't do so well. So being able to become more effective in communication is a pretty important skill to learn.

Scott and Andrea shared a couple of techniques and tips to help you with that. First of all, you should try to see each other's potential before engaging in a fruitful discussion. Just thinking about that person's possibilities rather than focusing on their bad habits can change the way you approach a person. Next to that, it's useful to understand the speaking types. According to the model Andrea uses, people can be transactional, where the interaction is based on asking questions and telling people about what needs to be done. People can also be positional, where they advocate their opinions from a particular role or position. And finally, some people are transformational, in which case they share a vision and try to steer the discussion in a direction that aligns with that.

Emotions are part of a face-to-face interaction as well and can negatively influence your ability to communicate effectively, so it's imperative to transform an agitated response into a state of deep listening. If you do feel agitated, Andrea strongly suggested that we pause, feel our feet, watch our breath and remember what we care about. To help us understand how your body, your emotions and what you're saying work together, she shared a model where each layer contributes to the next. Our body is the first layer and informs us about threats to our safety. It helps us to identify our friends or foes and lets us know how to fit in. It provides us with a sense of reality and helps us to make judgement calls. Our emotions form the 2nd layer and provide us with biological reactions to circumstances that we perceive as fear, anger, sadness or happiness. Our speaking represents the 3rd and last layer and allows us to express our opinions and say what we think is and is not factual in our world. It also allows us to ask people to do things for us and gives us the ability to commit ourselves to do things for others.

Pretty abstract ideas, right? Fortunately they had some simple and actionable tips as well. For instance, instead of stating your frustration as a fact, they advised us to use phrases like "I feel….when you…..because", or to respond to proposals you don't (entirely) agree with using an affirmative statement followed by "and" instead of "but". And apparently we also need to be humble, helpful and immediate (instead of sugar coating things). And when we do have criticism, keep praising people publicly and save that criticism for a private setting.

Opinions please

So what do you think? Does any of this resonate well with you? I know this is pretty soft stuff and might not appeal to you techies. But I do believe that being successful in the software development profession requires great communication skills. So let me know what you think by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you and I communicate in a technological world.

Monday, July 10, 2017

Key takeaways from QCon New York 2017: The Tech Stuff

This year, for the third time since I joined Aviva Solutions, I attended the New York edition of the famous QCon conference organized by InfoQ. As always, this was a very inspiring week with topics on large-scale distributed architecture, microservices, security, APIs, organizational culture and personal development. It also allowed me the mental breathing room to form a holistic view on the way I drive software development myself. So let me first share the key takeaways on tech stuff.

The state of affairs on microservices

QCon is not QCon without a decent coverage of microservices, and this year was no different. In 2014, the conference was all about the introduction of microservices and the challenge around deployment and versioning. Since then, numerous tools and products emerged that should make this all a no-brainer. I don't believe in silver bullets though, especially if a vendor tries to convince people they've built visual tools that allow you to design and connect your microservices without coding (they really did). Fortunately the common agreement is that microservices should never be a first-class architecture, but are a way to break down the monolith. Randy Shoup summarized this perfectly: "If you don't end up regretting early technology decisions, you probably overengineered".

Interestingly enough, a lot of the big players are moving away from frameworks and products that impose too much structure on microservice teams. Instead, I've noticed an increasing trend to use code generators to generate most of the client and service code based on some formal specification. Those handle the serialization and deserialization concerns, but also integrate reliability measures such as the circuit breaker pattern. And this is where the new kid in town joins the conversation: gRPC. Almost every company that talked about microservices seems to be switching to gRPC and Protobuf as their main communication framework. In particular, the efficiency of the wire format, its reliance on HTTP/2, the versioning flexibility and gRPC's Interface Definition Language (IDL) are its main arguments. But even with code generators and custom libraries, teams are completely free to adopt whatever they want. No company, not even Netflix, imposes any restrictions on its teams. Cross-functional "service" teams, often aligned with business domains, are given a lot of autonomy.

About removing developer friction

Quite a lot of the talks and open space sessions I attended talked about the development experience, and more specifically about removing friction. Although some of it should be common sense, they all tried to minimize the distance between a good idea and having it run in production.

  • Don't try to predict the future, but don't take a shortcut. Do it right (enough) the first time. But don't forget that right is not perfect. And don't build stuff that already exists as a decent and well-supported open-source project.
  • Don't bother developers with infrastructure work. Build dedicated tools that abstract the infrastructure in a way that helps the developers get their work done quickly. Especially Spotify seems to be moving away from the true DevOps culture. They noticed that there was too much overlap and it resulted in too many disparate solutions.
  • Bugs should not be tracked as a separate thing. Just fix them right away or decide to not fix them at all. Tracking all of these bugs is just going to create a huge list of bugs that no one will look at again….ever.
  • Closely tied to that is to keep distractions away from the team by assigning a Red Hot Engineer on rotation. This person handles all incoming requests, is the first responder when builds fail, and keeps anybody else from disturbing the team.
  • To track the happiness of the team, introduce and update a visual dashboard that shows the team's sentiment on various factors using traffic lights. Adrian Trenaman showed a nice example of this. This should also allow you to track or prove whether any actions helped or not.
  • Don't run your code locally anymore. If you're unsure if something works, write a unit test and learn to trust your tests. Just don't forget how to make those tests maintainable and self-explanatory.


Drop your OTA environment. Just deploy!

Another interesting trend at QCon was the increased focus on reducing overhead by dropping separate development, testing and acceptance environments while trying to bring something into production. Many companies have found that those staging environments don't really make their product better and have a lot of drawbacks. They are often perceived as a fragile and expensive bottleneck. And when something fails, it is difficult to understand the failure. The problems companies find in those environments are not that critical at all and never as interesting as the ones that happen in production. In fact, they might even give the wrong incentive, the one where developers rely on some QA engineer to do the real testing work on the test environment, rather than invest in automated testing.

According to their talks, both the Gilt Group and Netflix seem to wholeheartedly support this mindset by working according to a couple of common principles. For starters, teams have end-to-end ownership of the quality and performance of the features they build. In other words, you build it, you run it. Teams have unfettered control over their own infrastructure. They assume continuous delivery in the design process, for instance by heavily investing in automated testing, employing multi-tenant Canary Testing and making sure there's one way to do something. A nice example that Michael Bryzek of Gilt gave was a bot that would place a real order every few minutes and then cancel it automatically. Teams also act like little start-ups by providing services to other dev teams. This gives them the mentality to try to provide reliable services that are designed to allow delay instead of outage. They may even decide to ship MVPs of their services to quickly help out other teams to conquer a new business opportunity, and then mature their service in successive releases.

You should be afraid of hackers

The second day's keynote was hosted by the CTO of CrowdStrike, a security firm often involved in investigating hacking attempts by nation states such as China. It was a pretty in-depth discussion on how they and similar government agencies map the behavior of hacking groups. I never really realized this, but it's amazing to see how persistent some of these groups are. I kind of assumed that hackers would find the path of least resistance, but the patience with which they inject malware and lure people to webpages or into downloading .LNK files that will install the initial implant is truly scary. I was particularly awed by how hackers manage to embed an entire web shell into a page, which allows them to run arbitrary Windows commands on a host system with elevated administrator rights. My takeaway from this session was that if you're targeted by any of these groups, there's nothing you can do. Unless you have the money to hire a company like CrowdStrike of course….

The Zen of Architecture

Juval Lowy, once awarded the prestigious title of Software Legend, has been in our business for a long time. Over the years, I've heard many rumors about his character but it is safe to say….all of them are true. Nonetheless, his one-day workshop was one of the best, the most hilarious and intriguing workshops ever. After ridiculing the status quo of the software development and agile community, he enlightened us on the problems of functional decomposition. According to Juval, this results in a design that focusses on breaking down the functionality into smaller functional components that don’t take any of the non-functional requirements into account. He showed us many examples of real-world failures to reinforce that notion.

Instead, he wants us to decompose based on volatility. He wants us to identify the areas of the future system that will potentially see the most change, and encapsulate those into components and services. The objective is to keep thinking about what would change and encapsulate accordingly. That this is not always self-evident, and may take longer than management is expecting, is something that we as architects should be prepared for. However, this mindset does allow you to stop fighting changes, as many architects still do. Just encapsulate the change so that it doesn't touch the entire system. Even as I'm writing this, I'm still not sure how I architect my systems. What Juval says makes sense, but also sounds very logical. Regardless, his workshop was a great reminder for us architects that we should keep sharing our thought processes, trade-offs, insights and use-case analysis with the developers we work with.

Juval also had some opinions about Agile (obviously). First of all, unlike many agilists, he believes the agile mindset and architecture are not a contradiction at all. He sees architecture as an activity that happens within the agile development process. But he does hold a strong opinion on how sprints are organized. Using some nice real-world stories, he explained and convinced us that sprints should not go back-to-back. Any great endeavor starts with some proper planning, so you need some room between the sprints to consider the status quo, make adjustments and work on the plan for the next sprint. That doesn't necessarily mean that everything is put on hold while the architect decomposes the next set of requirements based on volatility. It's perfectly fine for the developers to work on client apps, the user interface, utilities and infrastructure.

Opinions please

So what do you think? Does any of this resonate well with you? If not, what concerns do you have? Let me know by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your systems in an agile world.

Tuesday, June 27, 2017

The Ugly of Event Sourcing - Projection Schema Changes

Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. Last year, I already blogged about the things I would do differently next time. But after attending another introductory presentation about Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugliness to prepare you for the road ahead. After having dedicated the last posts to the pains of wrongly designed aggregates, it is time to talk about the ugliness of dealing with projection schema changes.


As I explained in the beginning of this series, projections in event sourcing are a very powerful concept that provides ample opportunities to optimize the performance of your system. However, as far as I'm concerned, they also offer you the most painful challenges. Projections are great as long as their structure and the way they interpret event streams don't change. But as soon as either of these changes, you'll be faced with the problem of increasing rebuild times. The bigger your database becomes, the longer rebuilding will take. And considering the nature of databases, this problem tends to grow non-linearly. Over the years we've experimented with and implemented various solutions to keep the rebuild time to a minimum.

Side-by-side projections

The first step we took was to exploit the fact that the event store is an append-only database. By rebuilding a new set of projections next to the original ones, while the system is still being used, we could reduce the amount of downtime to a minimum. We simply tracked the checkpoint of the latest change to the event store when that rebuild process started and continued until all projections were rebuilt up to that point. Only then did we need to bring down the system to project the remainder of the changes that were added to the event store in the meantime. By repeating the first stage a couple of times, this solution could reduce the downtime to a couple of seconds. However, it did mean somebody needed to monitor the upgrade process in case something failed and it had to be restarted. So we still had to find a way to reduce that time even more.
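To make the staged approach a bit more tangible, here is a minimal sketch in Python. It is not our actual .NET code; the `EventStore` class, the `catch_up` helper and the `on_pass` hook (which only exists to simulate users writing while the rebuild runs) are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """Append-only store; the checkpoint of an event is its 1-based position."""
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)

    def head(self):
        return len(self.events)

    def read(self, after, up_to):
        # Events with a checkpoint in the range (after, up_to].
        return self.events[after:up_to]


def catch_up(store, projection, checkpoint):
    """One pass: project everything up to the head seen when the pass started."""
    target = store.head()
    for event in store.read(checkpoint, target):
        projection.append(event.upper())  # stand-in for real projection logic
    return target


def rebuild_side_by_side(store, online_passes=3, on_pass=lambda n: None):
    projection, checkpoint = [], 0
    for n in range(online_passes):
        # The system is still in use during these passes.
        checkpoint = catch_up(store, projection, checkpoint)
        on_pass(n)  # hypothetical hook: simulates writes arriving meanwhile
    # Downtime starts here: no new writes, so only a small remainder is left.
    remaining = store.head() - checkpoint
    checkpoint = catch_up(store, projection, checkpoint)
    return projection, checkpoint, remaining
```

Each online pass shrinks the gap between the new projections and the head of the event store, so the final pass that requires downtime only has to process whatever arrived during the last online pass.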


The situation may be different in your domain, but in ours, a lot of the data had a pretty short lifecycle, typically between 7 and 30 days. And the only reason why people would still look for that old data is to use it as a template for further work. To benefit from that, we started to track graphs of aggregates that are used together and introduced a job that would update that graph whenever an aggregate reached its functional end-of-life. Then, whenever the graph was 'closed', it would mark the corresponding event stream as archivable. This would eventually be used by another job to mark all events of the involved streams with an archivability date. With that, we essentially enriched the event store with metadata that individual projections could use to make smart decisions about the work that needed to be done. By allowing some of the more expensive projections to run asynchronously and keep track of their own progress, we could exclude them from the normal migration process. This caused a tremendous reduction of the total migration time, especially for those projections that exploited the archivable state of the event store. And as a nice bonus, it allows you to rebuild individual projections in production in case some kind of high-priority temporary fix is needed that requires schema changes or a repair of a corrupted projection.
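As a rough illustration of how a projection could use that metadata, consider the Python sketch below. The `StoredEvent` shape and the `archivable_since` field are assumptions for this example; they are not the names we used, nor something your event store will give you out of the box.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StoredEvent:
    stream_id: str
    payload: str
    # Hypothetical metadata, set by the job that archives closed graphs.
    archivable_since: Optional[date] = None


def events_to_project(events, rebuild_date, skip_archivable=True):
    """Yield only the events a projection of active data needs to rebuild."""
    for event in events:
        archived = (event.archivable_since is not None
                    and event.archivable_since <= rebuild_date)
        if skip_archivable and archived:
            continue  # functionally dead data: no need to project it again
        yield event
```

A projection that only serves active data would pass `skip_archivable=True`, whereas a projection that backs the "use old data as a template" scenario would still ask for everything.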


But this autonomy introduces a new challenge. The data projected by those projections would not become available until a while after the system started. Worse, because the events are still being processed by the projection, it might be possible that queries would return data that is half-way projected and in the wrong state. Whether the first is a real problem is a functional discussion. Maybe adding the date of the last event projected or an ETA telling the end-user how long it will take to complete the projection work is sufficient. Being able to do that does require some infrastructure in your projection code that allows you to get a decent ETA calculation. Showing data in the wrong state could cause some pretty serious problems to end-users. But even that can sometimes be handled in a more functional way. If that's not possible, you might be able to exploit the specific purpose and attributes of that projection to filter out half-projected data. For instance, maybe that projection is supposed to only show documents in the closed state. So as long as the projection data doesn't represent that state, you can exclude those from the results.
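The simplest ETA calculation I can think of just extrapolates the average throughput so far. The Python sketch below assumes the projection tracks the checkpoint it started at, the checkpoint it has reached and the head of the event store; the function name and parameters are made up for this example.

```python
def estimate_remaining_seconds(start_checkpoint, current_checkpoint,
                               head_checkpoint, elapsed_seconds):
    """Extrapolate the remaining time from the average throughput so far."""
    processed = current_checkpoint - start_checkpoint
    if processed <= 0 or elapsed_seconds <= 0:
        return None  # not enough information yet to say anything useful
    rate = processed / elapsed_seconds  # checkpoints per second
    remaining = max(head_checkpoint - current_checkpoint, 0)
    return remaining / rate
```

A real implementation would probably use a moving window rather than the overall average, since the cost per event tends to vary over the life of the store.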

Not all projections are equal

With the introduction of autonomous projections that provide tracking information and ETA calculation, you can do one more thing to speed up the migration process: prioritization of projections. If you have many asynchronous projections (which you should), it is very likely that some of them are more crucial for the end-users than others. So why would you have them all run at the same time? Maybe it makes sense to hold off some of them until the critical ones have completed, or maybe the projection gets rebuilt in-memory every time the system restarts. Another option you now have is to rebuild an individual projection by processing the event store more than once, thereby focusing on the most recent or relevant data first. This does require the right metadata associated with the events, but most event stores have you covered on this. And if you have associated your events with a (natural) partition key, you could spin up multiple asynchronous projection processes in parallel, each focusing on a particular partition.
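Both ideas are easy to sketch. In the Python example below, `ProjectionJob`, its `priority` and `deferred` fields, and the `partition_events` helper are all hypothetical names; how you actually schedule the work depends entirely on your hosting model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProjectionJob:
    name: str
    priority: int           # lower number = more important to end-users
    deferred: bool = False  # hold off until the critical ones have completed


def rebuild_order(jobs):
    """Critical projections first, deferred ones last, by priority within each."""
    ordered = sorted(jobs, key=lambda job: (job.deferred, job.priority))
    return [job.name for job in ordered]


def partition_events(events, key):
    """Group events by partition key so each group can be projected separately."""
    partitions = defaultdict(list)
    for event in events:
        partitions[key(event)].append(event)  # order within a partition is kept
    return dict(partitions)
```

Because the order of events is preserved within each partition, every partition can be handed to its own worker without breaking the projection logic, provided the projection never needs data from another partition.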

To OR/M or not to OR/M

Now, what about the actual technology that you use to write to your underlying projections database? Some have argued that using raw SQL is the fastest method for updating RDBMS-backed projections. Others would say that using an OR/M still has merits, in particular because it has a unit-of-work that allows you to process multiple related events before hitting the database. We've seen teams that use both, but we haven't identified the definitive ultimate solution.

One thing we're planning is to see how we can exploit the OR/M solution to break the projection work into large chunks where the projection work happens in memory and is then flushed back to the database. Some first spikes showed a tremendous performance improvement that would be very difficult to achieve with raw SQL (unless you're building your own implementation of the Unit of Work pattern).
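To show what I mean by doing the work in memory and flushing in chunks, here is a small hand-rolled sketch in Python using the standard `sqlite3` module. It is not the OR/M-based spike we did; the `BatchingProjector` class, the `documents` table and the event shape are assumptions for this example.

```python
import sqlite3


class BatchingProjector:
    """Applies events to in-memory rows and flushes them one batch at a time."""

    def __init__(self, connection, batch_size=1000):
        self.connection = connection
        self.batch_size = batch_size
        self.dirty = {}    # document id -> latest title, not yet written
        self.pending = 0   # number of events handled since the last flush
        self.flushes = 0
        connection.execute(
            "CREATE TABLE IF NOT EXISTS documents "
            "(id TEXT PRIMARY KEY, title TEXT)")

    def handle(self, event):
        # Several events for the same document collapse into one row write.
        self.dirty[event["id"]] = event["title"]
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.dirty:
            return
        with self.connection:  # commits on success, rolls back on error
            self.connection.executemany(
                "INSERT OR REPLACE INTO documents (id, title) VALUES (?, ?)",
                list(self.dirty.items()))
        self.dirty.clear()
        self.pending = 0
        self.flushes += 1
```

The gain comes from two things: multiple events touching the same row result in a single write, and the database only sees one transaction per batch instead of one per event. Don't forget to call `flush()` once more after the last event.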

True Blue/Green

Even with all these improvements, rebuilding projections can still take a while to complete. However, if your system is HTTP based (e.g. a web application, a microservice or HTTP API), you can exploit load balancers and HTTP response codes in a pretty neat way to completely automate the migration process. Here's what this process can look like:

  1. Deploy the new application side-by-side with the original version. The website will return HTTP 503 (Service Unavailable) until it has been fully provisioned.
  2. Allow the load balancer to serve both the old and new sites from the same URL
  3. Stage 1 of the out-of-place migration process runs to copy over all events up to the checkpoint that the source database was at when the stage started.
  4. Repeat stage 1 two more times to copy over the remainder of the data.
  5. Stage 2 is started to complete the migration, but not before the source application returns HTTP 503 as well. This is the real downtime.
  6. Stage 2 completes, after which the new application becomes responsive again and everybody is happy again.
  7. If stage 2 fails, it simply resets the source application's state so that it no longer returns HTTP 503.

Notice how during the migration there's no manual intervention needed to switch DNS entries or fiddle with the load balancer? That's what I would call true blue-green deployments. This will work even if you use immutable infrastructure where the new application is deployed as a pre-baked cloud machine.
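The interplay between the two applications and the load balancer can be modeled in a few lines. The Python sketch below is only a toy model of the steps above: the `App` class, the `route` function and `run_stage_2` are hypothetical names, and a real load balancer would of course probe an actual HTTP endpoint.

```python
from enum import Enum


class State(Enum):
    SERVING = "serving"
    MIGRATING = "migrating"


class App:
    def __init__(self, name, state):
        self.name, self.state = name, state

    def status_code(self):
        # 503 tells the load balancer to take this instance out of rotation.
        return 200 if self.state is State.SERVING else 503


def route(apps):
    """Return the names of the apps the load balancer would send traffic to."""
    return [app.name for app in apps if app.status_code() == 200]


def run_stage_2(old, new, migrate):
    old.state = State.MIGRATING      # real downtime starts: both return 503
    try:
        migrate()
    except Exception:
        old.state = State.SERVING    # step 7: the old app serves again
        return False
    new.state = State.SERVING        # step 6: the new app takes over
    return True
```

The important property is that the load balancer never needs to be reconfigured. It only reacts to the status codes, so a failed migration automatically falls back to the old application.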

What about you?

So what do you think? Do these solutions make sense to you? Do you even recognize these problems? And if so, what other solutions did you employ to resolve the long rebuilding times? I'd love to know what you think about this by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your Event Sourced systems in an agile world.

Thursday, March 23, 2017

The Bad of Event Sourcing - The Pains of Wrongly Designed Aggregates

Event Sourcing is a brilliant solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. In June, I already blogged about the things I would do differently next time. But after attending another introduction to Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugliness to prepare you for the road ahead. After having dedicated the last two posts to the good of event sourcing, let’s talk about some of the pains we went through.

Designing your domain based on ownership

When I started practicing domain driven design almost 9 years ago, I thought I knew all about aggregates, value objects, repositories, domain services and bounded contexts. I read Eric Evans' blue book, Jimmy Nilsson's white book and various papers such as InfoQ's DDD Quickly. Our main driver for designing our aggregates was based on who owns what property or relationship. We designed for optimistic concurrency, which meant that we needed to use the version of the aggregate to detect concurrent updates. The same applied to relationships (or association properties on a technical level). If it was important to protect the number of children a parent has, you needed to add the child through the parent. Since we were using an OR/M, we could use its built-in versioning mechanism to detect such violations. Surely I had not heard of eventual consistency or consistency boundaries yet, and Vaughn Vernon had not published his brilliant three-part series yet. In short, I was approaching DDD as a bunch of technical design patterns rather than the business development experience it is supposed to be.

Relying on eventually consistent projections

Because we didn't consider the business rules (or invariants) enough while designing our aggregates, we often needed to use projections to verify certain functional scenarios. We knew that those projections were not transactionally consistent with the domain transactions, and that other web requests could affect those projections while we were using them. But the functional requirements allowed us to work around this for the most part. Until the point that we wanted to make a projection run asynchronously of course. That's the point where we either had to stick to an (expensive) synchronous projector, or accept the fact that we couldn't entirely protect a business rule. Next time, I'll make sure to consider the consistency of a business rule. In other words, if the system must protect a rule at all costs, design the aggregates for it. If not, assume the rule is not consistent and provide functional compensation for it.

Bad choice in aggregate keys

As many information management systems do, we had an aggregate to represent users. Each user was uniquely identified by his or her username and all was fine and dandy. All the other aggregates would refer to those users by their username. Then, at a later point in time, we introduced support for importing users from Active Directory. That sounded pretty trivial, until we discovered that Active Directory allows you to change somebody's username. So we had based our aggregate key, and with it the domain events that such an aggregate emits, on something that can change (and may not even be unique). And since a big part of the system is using users to determine authorization policies, this affected the system in many painful ways. We managed to apply some magic to convert the usernames to a deterministic Guid (ping me for the algorithm), but it still was a lot of work. Next time, I will just need to accept that no functional key is stable enough to be the aggregate key and start from a Guid instead.
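I won't reproduce our own algorithm here, but to give an idea of what a deterministic Guid can look like, the Python sketch below uses the standard name-based UUID (version 5) from the `uuid` module. The namespace value and the normalization rule are assumptions for this example.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID works, as long as
# it never changes after the first conversion.
USER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "users.example.com")


def user_id_from_username(username: str) -> uuid.UUID:
    """Derive a stable, name-based UUID (version 5) from a legacy username."""
    # Assumption: usernames are case-insensitive and may carry stray spaces.
    normalized = username.strip().lower()
    return uuid.uuid5(USER_NAMESPACE, normalized)
```

Because the same username always yields the same identifier, existing events and projections can be converted independently of each other and still end up referring to the same user.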

Using domain events as a way to externalize business rules

The system that I worked on is highly customizable and has a domain with many extension points to influence the functional behavior. At that time, before we converted the system to Event Sourcing, we used Udi Dahan's domain event implementation to have the domain raise events from inside the aggregates. We could then introduce domain event handlers that hook into these and which provide the customized behavior without altering the core domain. This worked pretty well for a while, in particular because those events were essentially well-defined contracts. With some specialized plumbing we made sure all that code would run under the same unit of work and therefore behaved transactionally.

But when we switched to event sourcing, this mechanism became a whole lot less useful. We had to make decisions on many aspects. Are the events the aggregates emit the same as domain events? Should we still raise them from inside the aggregate? Or wait until the aggregate changes have been persisted to the event store? It took a while until we completely embraced the notion that an event is something that has already happened and should never be used to protect other invariants. Those cases that did misuse them have been resolved by converting them into domain services or by redesigning the aggregate boundaries. You can still use the events as a way to communicate from one aggregate to another, but then you either need to keep the changes in the same database transaction, or use sagas or process managers to handle compensation or retries.

Domain-specific value types in events

Remember the story about how we chose the wrong functional key for a user and had to touch a large part of the code base to fix that? As with any bad situation, people will try to come up with measures that will prevent this in the first place. Consequently, we decided to not directly use primitive types in our code-base anymore, and introduce domain-specific types for almost everything. For instance, a user was now identified by a UserId object with value semantics. So whether it contained a Guid or a simple string was no longer of concern for anything but that type itself.

But as often happens with a lot of new 'practices', we applied it way too dogmatically. We used them everywhere: in commands, aggregates, projections and even events. Naïve as we were, we didn't realize that this would cause a tremendous amount of coupling across the entire code-base. And I didn't even mention the horror of somebody changing the internal constraints of a value type, which caused crashes because an old event couldn't be deserialized anymore when its old value didn't meet the new constraints. Having learned our lessons, nowadays we make sure we consider the boundaries of such value types. You may still use them in the aggregates, domain services and value objects within the same bounded context, but never in commands, projections and events.
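Here's a small Python sketch of that boundary. The `UserName` value type, its length constraint and the `UserRenamed` event are made up for illustration; the point is only that the event carries a primitive and is deserialized without any domain validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserName:
    """Domain value type; its constraint may change between releases."""
    value: str
    MAX_LENGTH = 20  # class-level constant, not a dataclass field

    def __post_init__(self):
        if len(self.value) > self.MAX_LENGTH:
            raise ValueError("user name too long")


@dataclass(frozen=True)
class UserRenamed:
    """Event; carries primitives only, so old events always deserialize."""
    user_id: str
    name: str


def rename(user_id: str, name: UserName) -> UserRenamed:
    # The aggregate enforces today's constraint, then unwraps the value type.
    return UserRenamed(user_id, name.value)


def deserialize(data: dict) -> UserRenamed:
    # No domain validation here: history is a fact, whatever today's rules say.
    return UserRenamed(data["user_id"], data["name"])
```

If somebody tightens `MAX_LENGTH` in a later release, new commands get rejected by the value type, but an event that was stored under the old rules can still be read back.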

Event Granularity

Consider a permit-to-work process in which the risk assessment level can be determined to be 1 or 2. If the level is 2, the process requires a risk assessment team to be formed that will identify the real-world risks involved in the work. However, if the risk level is reduced to 1, then the team can be disbanded. To model the intent of this event, we have two options. Either we capture this situation by first emitting a couple of fine-grained MemberRemovedFromRiskAssessmentTeam domain events, followed by a RiskAssessmentLevelChanged domain event. Or we decide to capture this as a single RiskAssessmentLevelDemoted event. So which is better?

Considering the fact that we're practicing Domain-Driven Design, I guess most people will go for the coarse-grained RiskAssessmentLevelDemoted event. And indeed, it does properly capture the actual thing that happened. But it has a drawback as well. Both the domain and the projection logic must know to interpret that event as a demotion in the actual level and the disbandment of the risk assessment team. But what happens if the expected behavior changes in a later version of the software? Maybe the rules change in such a way that the team will need to be kept intact, even if the level changes. If you take the coarse-grained event path, you will need to duplicate that logic, because we don't share code between the command and query sides in a CQRS architecture style. And what happens when you rebuild domain aggregates from the event store that existed before the software update was completed? There's no ultimate answer here, but considering the relatively high rate of change and configurability in our system's business rules, we chose fine-grained events.
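A sketch of the fine-grained option in Python shows where the rule ends up. The event names are the ones from the example above; the `demote_to_level_1` and `project` functions and the `keep_team` switch (standing in for a future rule change) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberRemovedFromRiskAssessmentTeam:
    member: str


@dataclass(frozen=True)
class RiskAssessmentLevelChanged:
    level: int


def demote_to_level_1(team, keep_team=False):
    """Domain decision: the rule about disbanding the team lives only here."""
    events = [] if keep_team else [
        MemberRemovedFromRiskAssessmentTeam(member) for member in team]
    events.append(RiskAssessmentLevelChanged(1))
    return events


def project(state, event):
    """Projection: applies facts and knows nothing about the disbanding rule."""
    if isinstance(event, MemberRemovedFromRiskAssessmentTeam):
        state["team"].remove(event.member)
    elif isinstance(event, RiskAssessmentLevelChanged):
        state["level"] = event.level
    return state
```

When the rule changes, only the domain side changes. The projection keeps working for both old and new event streams, because every stream spells out exactly what happened to the team.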

Event Versioning

Much has been written about event versioning and there are plenty of examples of how to convert one or more older events into a couple of new events. We use NEventStore, which provides a simple event upconversion hook out of the box. That library uses Newtonsoft’s Json.NET to serialize the events into its underlying storage, which, by default, includes the .NET runtime type of the event in the JSON payload. This has caused us some pain. What if the .NET namespace of the event type changes because somebody refactored some of the code? Or what if the event is moved from one DLL to another because we decide to split one project into two or merge two projects into one? We had to tweak the JSON serializer considerably to ensure it would ignore most of the run-time type info and find a reasonably stable mechanism to match a JSON-serialized event with its run-time .NET type. Maybe there's an event store implementation that solves this for you, but we have not come across one yet.
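The general idea of such a stable mechanism, independent of NEventStore or Json.NET, is to store a contract name instead of a runtime type name and to resolve that name through a registry. The Python sketch below illustrates this; the envelope format, the `EVENT_TYPES` registry and the `UserRegisteredV2` event are assumptions for this example and not what our serializer actually does.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRegisteredV2:
    user_id: str
    display_name: str


def upconvert_user_registered_v1(body):
    # Version 1 had no display name; fall back to the user id.
    return {"user_id": body["user_id"], "display_name": body["user_id"]}


# Stable contract name -> (converter to the latest shape, latest class).
# Renaming or moving the class only requires updating this registry.
EVENT_TYPES = {
    "user-registered/1": (upconvert_user_registered_v1, UserRegisteredV2),
    "user-registered/2": (lambda body: body, UserRegisteredV2),
}


def deserialize(raw: str):
    envelope = json.loads(raw)
    upconvert, event_class = EVENT_TYPES[envelope["type"]]
    return event_class(**upconvert(envelope["body"]))
```

Since the stored payload never mentions a namespace or assembly, refactoring the code base no longer affects the events that are already in the store.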


"Great developers don't write bugs", I often hear some of my colleagues say, but somehow I keep running into apparently less-than-great developers. So bugs are inevitable. We didn't have too many problems with bugs affecting the domain. Either we could handle them by changing the way the existing events were used to rebuild the domain, or by repairing flawed events using smart upconverters. However, I do remember a particularly painful bug that was really hard to fix. One of our background processes was responsible for importing information from a back office system into the domain. It would collect the changes from that system and translate them into a couple of commands that were handled by the domain. All was well for a while, until we got some reports about some weird timeouts in the system. After some investigation we discovered that a particular single aggregate had more than a million events associated with it. Considering the average of a handful of events per aggregate instance, this was a serious problem. Apparently the aggregate root contained a bug that caused it to be less idempotent than it should be, injecting new events for things that didn't even change. We finally managed to fix this by marking the stream as archivable, a concept we built ourselves. But it most definitely wasn't fun….
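The guard that was missing is trivial once you see it. This Python sketch uses a hypothetical `BackOfficeItem` aggregate with a single `import_status` method; our real aggregate was obviously more involved.

```python
class BackOfficeItem:
    """Aggregate that only emits an event when something actually changed."""

    def __init__(self):
        self.status = None
        self.uncommitted = []  # events waiting to be appended to the stream

    def import_status(self, status):
        if status == self.status:
            return  # nothing changed, so nothing happened: emit no event
        self.uncommitted.append(("StatusImported", status))
        self.status = status
```

With a background process that replays the same information over and over again, that one comparison is the difference between a handful of events and a million.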

What about you?

So what do you think? Do you recognize the challenges of designing your aggregates correctly? If not, what is your primary source for guidance on this complicated topic? I'd love to know what you think about this by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for better aggregates.

Sunday, February 19, 2017

The Good of Event Sourcing - Conflict Handling, Replication and Domain Evolution

Event Sourcing is a brilliant solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. In June, I already blogged about the things I would do differently next time. But after attending another introductory presentation about Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugliness to prepare you for the road ahead. After talking about the awesomeness of projections last time, let’s talk about some more goodness.

Conflict handling becomes a business concern

Many modern systems are built with optimistic concurrency as the underlying concurrency principle. So as long as two or more people are not doing things on the same domain object, everything is fine. But if they do, then the first one that manages to get its changes handled by the domain wins. When the number of concurrent users is low this is usually fine, but in a highly concurrent website such as an online bookstore like Amazon, this is going to fall apart very quickly. At Domain-Driven Design Europe, Udi Dahan mentioned a strategy in which completing an order for a book would never fail, even if you're out of books. Instead, he would hold the order for a certain period of time to see if new stock could be used to fulfil the order after all. Only after this time expires, he would send an email to the customer and reimburse the money. Most systems don't need this kind of sophistication, but event sourcing does offer a strategy that allows you to have more control over how conflicts should be handled: event merging. Consider the below diagram.


The red line denotes the state of the domain entity user with ID User-Dedo as seen by two concurrent people working on this system. First, the user was created, then a role was granted and finally his or her phone number was added. Considering there were three events at the time this all started, the revision was 3. Now consider those two people doing administrative work and thereby causing changes without knowing about each other. The left side of the diagram depicts one of them changing the password and revoking a role, whereas the right side shows another person granting a role and changing the password as well.

When the time comes to commit the changes of the right side into the left side, the system should be able to detect that the left side already has two events since revision 3 and is now at revision 5. It then needs to cross-reference each of these with the two events from the right side to see if there are conflicts. Is the GrantedRoleEvent from the right side in conflict with the PasswordChangedEvent on the left? Probably not, so we append that event to the left side. But who are we to decide? Can we take decisions like these? No, and that's a good thing. It’s the job of the business to decide that. Who else is better at understanding the domain?

Continuing with our event merging process, let's compare that GrantedRoleEvent with the RoleRevokedEvent on the left. If these events were acting on the same role, we would have to consult the business again. But since we know that in this case they dealt with different roles, we can safely merge the event into the left side and give it revision 6. Now what about those attempts to change the passwords at almost the same time? Well, after talking to our product owner, we learned that taking the last password was fine. The only caveat is that the clock of different machines can vary up to five minutes, but the business decided to ignore that for now. Just imagine if you would let the developers make that decision. They would probably introduce a very complicated protocol to ensure consistency whatever the cost….
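The merging process described above can be sketched in a few lines of Python. The event names come from the example; the dictionary-based event shape and the rules inside `conflicts` are assumptions that merely mirror what our product owner decided in this story.

```python
def conflicts(ours, theirs):
    """Business rules for this sketch: when do two concurrent events collide?"""
    kinds = {ours["type"], theirs["type"]}
    if kinds == {"GrantedRoleEvent", "RoleRevokedEvent"}:
        return ours["role"] == theirs["role"]  # same role: ask the business
    return False  # e.g. two password changes: the last one simply wins


def merge(stream, expected_revision, incoming):
    """Append incoming events unless one conflicts with a concurrent event."""
    concurrent = stream[expected_revision:]  # added since the writer last read
    for theirs in incoming:
        for ours in concurrent:
            if conflicts(ours, theirs):
                raise ValueError(
                    f"conflict: {theirs['type']} vs {ours['type']}")
    return stream + incoming
```

The `conflicts` function is where the conversation with the business ends up. Everything else is plumbing that never needs to change when a new rule is agreed upon.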

Distributed Event Sourcing

As you know, an event store contains an immutable history of everything that ever happened in the domain. Also most (if not all) event store implementations allow you to enumerate the store in order of occurrence starting at a specific point in time. Usually, those events are grouped in functional transactions that represent the events that were applied on the same entity at the same point in time. NEventStore, for example, calls these commits and uses a checkpoint number to identify such a transaction. With that knowledge, you could imagine that an event store is a great vehicle for application-level replication. You replicate one transaction at a time and use the checkpoint of that transaction as a way to continue in case the replication process gets interrupted somehow. The diagram below illustrates this process.


As you can see, the receiving side only has to append the incoming transactions to its local event store and ask the projection code to update its projections. Pretty neat, isn't it? If you want, you can even optimize the replication process by amending the transactions in the event store with metadata such as a partitioning key. The replication code can use that information to limit the amount of data that needs to be replicated. And what if your event store has the notion of archivability? In that case, you can even exclude certain transactions entirely from the replication process. The options you have are countless.
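In its simplest form, the receiving side looks something like the Python sketch below. The `Replica` class, the dictionary-based commits and the `wanted` filter are hypothetical; the sketch also assumes the source hands out its commits ordered by checkpoint.

```python
class Replica:
    """Receiving node that pulls commits from a source in checkpoint order."""

    def __init__(self):
        self.commits = []
        # Persisted together with the commits in a real system.
        self.last_checkpoint = 0

    def pull(self, source_commits, batch_size=100, wanted=lambda commit: True):
        """Copy the next batch after the last checkpoint; safe to call again."""
        batch = [commit for commit in source_commits
                 if commit["checkpoint"] > self.last_checkpoint][:batch_size]
        for commit in batch:
            if wanted(commit):  # e.g. filter on partition key or archivability
                self.commits.append(commit)
            # Advance even for skipped commits so we never look at them again.
            self.last_checkpoint = commit["checkpoint"]
        return len(batch)
```

If the process gets interrupted, calling `pull` again simply continues from the last checkpoint that was stored. Note that the checkpoint advances for skipped commits as well, otherwise a long run of filtered-out commits would be re-evaluated forever.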

But if the replication process involves many nodes, a common problem is that you need to make sure that the code used by all these nodes is in sync. However, using the event store as the vehicle for this enables another trick: staggered roll-outs. Since events are immutable, any change in the event schema requires the code to keep supporting every possible past version of a certain event. Most implementations use some kind of automatic event up-conversion mechanism that enables the domain to only need to support the latest version. But because of that, it becomes completely feasible for a node that is running a newer version of the code to keep receiving events from an older node. It will just convert those older events into whatever it expects and continue the projection process. It will be a bit more challenging, but with some extra work you could even support the opposite. The older node would still store the newer events, but hold off the projection work until it has been upgraded to the correct version. Nonetheless, upgrading individual nodes probably already provides sufficient flexibility.

Keeping your domain and the business aligned

So you've been doing Event Storming, domain modeling, brain storming and all other techniques that promise you a chance to peek into the brains of those business people. Then you and your team go off to build an awesome domain model from one of the identified bounded contexts and everybody is happy. Then, after a certain period of production happiness, a new release is being built and you and the developers discover that you got the aggregate boundaries all wrong. You simply can't accommodate the new invariants in your existing aggregate roots. Now what? Build a new system? Make everything eventually consistent and introduce process managers and sagas to handle those edge cases? You could, but remember, your aggregates are rehydrated from the event store using the events. So why not just redesign or refactor your domain instead? Just split that big aggregate into two or more, convert that entity into a value object, or merge that weird value object into an existing entity. Sure, you would need more sophisticated event converters that know how to combine or split one or more event streams, but it's surely cheaper than completely rebuilding your system…


Now if you were not convinced yet after reading my last post, are you now? Do you see other cool things you could do? Or do you have any concerns to share? I'd love to hear about that by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for a better architecture style.

Saturday, February 11, 2017

The Good of Event Sourcing - Projections

It was in 2009 in Utrecht, The Netherlands, when I first learned about Event Sourcing and the Command Query Responsibility Segregation (CQRS) patterns at a training Greg Young gave there. I remember being awed by the scalability and architectural simplicity those styles provided. However, I also remember the technical complexity that comes with it. In 2012, I led the transitioning steps to migrate a CQRS-based system to Event Sourcing. I knew it would be non-trivial, but I still underestimated the number of new challenges I would run into over the course of four years. During that time, I've experienced first-hand how a large group of developers had to deal with the transition. Since then I've talked about this many times, both in the Netherlands and outside.

Event Sourcing is a brilliant solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. In June, I already blogged about the things I would do differently next time. But after attending another introduction to Event Sourcing recently, I realized it is time to talk about some real experiences. In this multi-part series, I will share the good, the bad and the ugly of Event Sourcing to prepare you for the road ahead. Let's start with the good.

The power of projections

Event Sourcing requires you to store the domain changes as a series of historical intention-revealing events. Because of this simple structure, you can't run those queries you may have been used to when working with relational databases. Instead, you'll need to build and maintain queryable representations of those events. However, these projections can be optimized for the purpose they serve. So if your user interface requires the data to be grouped in a certain way, you can store the data pre-grouped in your persistent storage. By the time the query is executed, the data no longer needs any grouping, thereby off-loading your database. You can do the same with aggregated calculations, like a count per grouping. What's important to realize is that you'll end up with multiple autonomous projections that are built from the same events and have a single purpose. Consequently, duplication of data is a very normal thing in ES.
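To illustrate what a pre-aggregated projection looks like, here is a tiny Python sketch. The `OrdersPerCustomer` projection and the `OrderPlaced` and `OrderCancelled` events are made up for this example and kept in memory for brevity.

```python
from collections import Counter


class OrdersPerCustomer:
    """Projection that maintains a count per customer while events come in."""

    def __init__(self):
        self.counts = Counter()

    def handle(self, event):
        if event["type"] == "OrderPlaced":
            self.counts[event["customer"]] += 1
        elif event["type"] == "OrderCancelled":
            self.counts[event["customer"]] -= 1

    def query(self, customer):
        # No grouping or counting at query time; the work is already done.
        return self.counts[customer]
```

A second projection that lists the orders themselves would be built from the very same events, which is exactly the kind of duplication that is perfectly normal here.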

An added benefit of using projections is that there's no technical dependency between the events in the event store and the projection code that uses them. Storing the projections in a completely different database (or cluster) from the event store is perfectly fine. In fact, since each projection is completely independent from the others in terms of the data that it uses, you can easily partition the underlying tables without introducing any side-effects. But that is not all. Considering that same autonomy, why would you use the same storage mechanism for all projections? Maybe you have a projection that doesn't contain that much data and can be rebuilt in-memory at start-up. Or what about a projection that is written to a local embedded version of RavenDB instead of a relatively slow relational database? Especially in load-balanced scenarios a shared database can be a bottleneck. Having the option to keep the projection on the (load-balanced) front-end machines increases scalability and avoids network overhead.

Independence of place and time

Having discussed that, you might wonder whether these projections need to be in sync with the domain at all. In other words, do you need to update the projection in the same call or transaction that triggered the event in the first place? The answer to that (and many other design challenges) is: it depends. I generally prefer to run the projection code asynchronously from the command handling. It gives you the most flexibility and allows you to reason about a projection without the need to consider anything else. It also enables you to decide how and when that projection is rebuilt. You can even have projections that represent the domain at a certain point in time, simply by projecting the events up to that point. How cool is that? However, if the accuracy of a particular projection is important for handling a command, you may decide to treat it differently. Be aware though, if you decide to update your domain and projection in the same database transaction, it will hurt performance and scalability.
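A point-in-time projection really is as simple as it sounds. The Python sketch below assumes events that carry an `at` timestamp and are ordered by it; the key/value event shape is made up for illustration.

```python
from datetime import date


def project_as_of(events, moment):
    """Build the state as it was at `moment` by ignoring all later events."""
    state = {}
    for event in events:
        if event["at"] > moment:
            break  # events are ordered, so everything after this is the future
        state[event["key"]] = event["value"]
    return state


history = [
    {"at": date(2017, 1, 5), "key": "status", "value": "draft"},
    {"at": date(2017, 1, 9), "key": "owner", "value": "dennis"},
    {"at": date(2017, 2, 1), "key": "status", "value": "closed"},
]
state_in_january = project_as_of(history, date(2017, 1, 31))
```

In this example `state_in_january` still holds the draft status, because the event that closed the item is dated after the requested moment.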

Now, one more thing. Given how autonomous each projection is and the way it is optimized to give you the aggregated data in a format that suits your needs, you can imagine how it resolves the friction between the object-oriented world and the relational database world. In fact, you don't need any Object-Relational Mappers like NHibernate or Entity Framework at all. A solution that uses raw SQL, something like Dapper or a NoSQL solution like RavenDB will work perfectly fine.

Building a reporting node

Quite often, the first question you get when you tell somebody about event sourcing is how you build reports from that. Since all your events are persisted in a single database table (or whatever storage mechanism you use), it will be non-trivial to connect your ETL product of choice to it. But you do have two options here. First, you could build an ATOM feed on top of your event store so that more sophisticated ETL products can subscribe to what happens in the event store. Greg Young's Event Store does just that. But if you really need a traditional relational model to dig through, you're not out of luck yet. Just build another set of projections that look like a relational database model, completely asynchronous from the other projections or your domain. You could even build some kind of replication process that allows you to run those projections on a completely different machine.

So what are your experiences with Event Sourcing? Any interesting usages of projections that you would not have been able to do without Event Sourcing? I'd love to know what you think by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for better projections.