Wednesday, November 01, 2017

The Ugly of Event Sourcing–Real-world Production Issues

Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that it also introduces challenges most people don't tell you about. After having dedicated a post to the challenges of dealing with projection migrations and how to optimize them, it is time to talk about some of the problems that can happen in production.


So you've managed to design your aggregate boundaries properly and optimized your projection rebuilding logic so that migrations from one version to another complete painlessly, but then you face a run-time issue in production that you have never seen before. I generally divide these kinds of problems into two categories: those that you run into quite quickly, and those that keep you awake outside business hours.

Issues that usually reveal themselves pretty quickly

Something we ran into a couple of times is a change in the maximum length of some aggregate root property. Maybe the title of a product was constrained to 50 characters, which seemed like a very sensible limit for a long time. But then somebody changes the domain and increases that length. If your projection store isn't prepared for that, you'll end up with truncated data at best or a database error at worst. You could just define that column using the database's maximum length, but I know for a fact that this has some serious performance implications on SQL Server. That's why we have the projector explicitly truncate the event data. Something similar, but less likely, can happen with columns that were supposed to hold a 32-bit integer but then change into 64-bit longs.
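To make that concrete, here's a minimal sketch of such a defensive projector. All type and property names are hypothetical; only the truncate-and-warn pattern is what we actually do:

```csharp
using System;

// Hypothetical event and projection types; the real ones obviously differ.
public class ProductTitleChanged { public Guid ProductId; public string Title; }
public class ProductProjection { public Guid Id; public string Title; }

public class ProductProjector
{
    // Must match the column length in the projection schema.
    private const int MaxTitleLength = 50;

    public void Handle(ProductTitleChanged e, ProductProjection projection)
    {
        string title = e.Title ?? string.Empty;
        if (title.Length > MaxTitleLength)
        {
            // Truncate rather than crash, but leave a trace so the mismatch
            // between the domain and the projection schema gets noticed.
            Console.WriteLine(
                $"WARN: truncating title of product {e.ProductId} " +
                $"from {title.Length} to {MaxTitleLength} characters");

            title = title.Substring(0, MaxTitleLength);
        }

        projection.Title = title;
    }
}
```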

Another interesting problem we've run into is an event with a property that everybody expects to hold a valid value, while almost nobody remembers that older versions of that event didn't even have that property. You won't spot that problem during day-to-day development, unless you happen to be running a build against older production databases like we do. The more versions you have of an event (we have one that is suffixed with V5), the more of this knowledge dissipates into history. Unless you test your projectors against every earlier incarnation of an event (instead of relying on upconverters to do their thing), the only thing you can do is document your events properly.
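For context, an upconverter lifts an old event version to the newest one before the projector ever sees it. A hypothetical sketch, assuming a V1 event that lacked a Status property:

```csharp
// Hypothetical event versions; V1 predates the Status property.
public class ProductAddedV1 { public string Id; public string Title; }
public class ProductAddedV2 { public string Id; public string Title; public string Status; }

public static class ProductAddedUpConverter
{
    public static ProductAddedV2 Convert(ProductAddedV1 e) => new ProductAddedV2
    {
        Id = e.Id,
        Title = e.Title,

        // The default everybody implicitly assumed. If nobody ever tests the
        // projector against a real V1 event, this line is the only thing that
        // keeps old production streams working.
        Status = "Active"
    };
}
```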

In most cases the developers that change the domain are also the ones that work on the projection code. So it's not entirely inconceivable that such a developer makes assumptions about the order in which the projector gets to see the events. We have a guideline that states that you shouldn't extend events for the purpose of improving the projector, and assuming an order may not feel like a violation of that guideline. But just imagine what happens when somebody alters the domain in such a way that the order changes. And yes, this happened to us as well.

Issues that won't show up until the most inconvenient time

A very common problem in a CQRS architecture is the separation between the domain side (the write side) and the query side (the read side). Somehow those two worlds need to be kept in sync. With Event Sourcing, this is done using events. And although in most cases the same developer deals with both sides of the system, both sides may evolve independently, especially in bigger code bases. This introduces the risk that the projection code doesn't handle the events entirely the way the domain intended them to be used. At some point somebody will replace, split or merge one or more events in the domain and forget to update the corresponding projections. And this is exactly what happened to us, more than once.

Another class of pain-in-the-butt problems is projectors that have misbehaving or unexpected dependencies. You may remember from one of my earlier posts that we started with a CQRS architecture and a traditional database-backed domain model, and didn't move to Event Sourcing until much later. To keep that migration as smooth as possible, we introduced synchronous projectors that would maintain the immediately consistent query data as projections. If those synchronous projectors had been completely autonomous (as they should be), everything would have been fine and we could all have gone on with our lives.

However, over the years, some unexpected dependencies sneaked into the codebase. Apparently some developer decided it was a good idea to reuse the data that was maintained by another projector. This surfaced in two separate incidents. The first happened when we were rebuilding a projection after a schema upgrade. The projector ended up reading from another projection that reflected a point much further along in the event store's history. As this didn't cause any crashes, it took us quite some time to figure out why the rebuilt projection contained some unexpected data. The other incident was quite similar and was caused by an asynchronous projector relying on the data persisted by a synchronous projector. Again, the autonomy of projectors is a key requirement.

In that respect, lookups can suffer from similar problems, even though they must be owned by the projector that maintains the main projection. Reuse of lookups is not that common, but not entirely exceptional either. I've seen lookups used for recurring needs such as resolving a user's full name from an identity. Since this is quite a common requirement, I can imagine such a lookup being reused. However, the freshness of that lookup must be considered carefully. First, who maintains the lookup, and how does the state of the lookup reflect on the projector that relies on it? What happens if they get updated at a different rate? And what if the lookup uses some kind of in-memory LRU cache? How will that work in a web farm? These are all questions that need to be answered on a case-by-case basis. Although there's no generic guideline here, we tend to ensure a lookup is used and owned by a single projector only. This simplifies the situation a bit and allows us to make more localized decisions on cacheability, exception handling and how that affects the lookup, as well as its accuracy.
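To illustrate that ownership guideline, here's a hypothetical sketch of a lookup that is owned by the one projector using it, so both are fed from the same sequence of events and always advance in lockstep:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical events and projection.
public class UserRegistered { public Guid UserId; public string FullName; }
public class DocumentCreated { public Guid DocumentId; public Guid AuthorId; }
public class DocumentProjection { public Guid Id; public string AuthorName; }

public class DocumentProjector
{
    // Owned by this projector alone; no other projector reads or writes it.
    private readonly Dictionary<Guid, string> userNamesById = new Dictionary<Guid, string>();

    public void Handle(UserRegistered e) => userNamesById[e.UserId] = e.FullName;

    public void Handle(DocumentCreated e, DocumentProjection projection)
    {
        // Because the lookup is maintained from the same event sequence, it
        // can never be ahead of or behind the projection that consumes it.
        projection.AuthorName = userNamesById.TryGetValue(e.AuthorId, out var name)
            ? name
            : "(unknown user)";
    }
}
```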

Those who have been using NEventStore as their storage engine are kind of forced into a model where the event store actively pushes events into the projectors. In other words, the event store tracks whether or not an event was handled by all projectors. So unless your solution wraps the projectors' work in one large database transaction, your projectors need to be idempotent. A common solution is to use the version of the aggregate, which is often included in the event, to see whether that event was already handled. Although that is a pretty naïve solution, it gets worse if you need to aggregate events from multiple aggregates. Do you track two separate versions per projection? Or do you create some separate administration per projection? These kinds of problems led us to believe that we shouldn't use NEventStore anymore.
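That naïve version check looks roughly like this (hypothetical names, assuming the projection row also stores the last applied version, and deliberately limited to a projection that maps to a single aggregate):

```csharp
using System;

// Hypothetical event carrying the aggregate's version, as NEventStore-style
// stores typically make available, and a projection that records it.
public class ProductRenamed { public Guid ProductId; public long Version; public string Title; }
public class VersionedProjection { public Guid Id; public long Version; public string Title; }

public class VersionCheckingProjector
{
    public void Handle(ProductRenamed e, VersionedProjection projection)
    {
        // Skip events this projection has already applied during a redelivery.
        // This only works while the projection maps one-to-one to an aggregate;
        // as soon as it combines multiple streams, a single version no longer
        // identifies what has been handled.
        if (e.Version <= projection.Version)
        {
            return;
        }

        projection.Title = e.Title;
        projection.Version = e.Version;
    }
}
```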

Hey, didn't I say we only had two categories of problems? I did, but it just so happens there is a third, undocumented category.

Things you would never expect could happen

To speed up the projection work, at some point we started to experiment with batching a large number of projection operations into a single unit of work (we were, and still are, using NHibernate). But because we didn't want to maintain a database transaction of that size, we relied on the idempotency of the projectors to be able to replay multiple events when any of the projection work failed. This all worked fine for a while, until we got reports about projection exceptions referring to non-null database constraints. After some in-depth analysis, extended logging and painstakingly long debug sessions, we found that the following events (no pun intended) happened:

  1. Event number 20 required a projection to be deleted, which it did.
  2. Some more unrelated events were handled, after which the application stopped or crashed for some reason.
  3. After restarting, the process resumed with event 10, which expected this projection to still be there.
  4. Since our code just creates a projection the first time it is referred to, we created a new instance of this projection with all its properties set to default values, except those related to event 10.
  5. This projection got flushed into the database where it ran into a couple of non-null constraints and…boom!

This made us decide to abandon the idea of batching until we managed to reduce the scope of those transactions.
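With hindsight, a guard in that create-on-first-reference logic would have surfaced the problem immediately. A sketch of the idea, assuming an NHibernate ISession and reusing the hypothetical types from the earlier sketches, that distinguishes events that may create a projection from events that must find an existing one:

```csharp
using System;
using NHibernate;

public class GuardedProductProjector
{
    private readonly ISession session;

    public GuardedProductProjector(ISession session) => this.session = session;

    public void Handle(ProductRenamed e)
    {
        var projection = session.Get<VersionedProjection>(e.ProductId);
        if (projection == null)
        {
            // Don't silently fabricate a projection with default values;
            // during a replay that only hides the real problem until a
            // non-null constraint fires on flush.
            throw new InvalidOperationException(
                $"Projection {e.ProductId} does not exist; " +
                "only creation events are allowed to create it.");
        }

        projection.Title = e.Title;
    }
}
```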

Another interesting problem happened when we got a production report about a unique key violation in one of the projection tables. Since that projector maintained a single projection per aggregate, and the violation involved the functional key of that aggregate, we were initially at a loss. After requesting a copy of the production database and studying the event streams, we discovered two aggregates whose identities were exactly the same except for their casing. Our event store does not treat those identities as equivalent, because we started our project with an earlier version of NEventStore that required GUIDs as the stream identities. We convert natural keys to GUIDs using an algorithm written by Bradley Grainger that generates deterministic GUIDs from strings. However, SQL Server, which serves as our projection store, uses case-insensitive comparisons by default. So even though our event store treated those identities as separate streams, the projection code ran into the database's unique key violation. Fortunately, most event store implementations use strings for identifying streams. For our legacy scenario, we decided to generate those GUIDs from the lowercase version of the identity.
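The essence of the fix is to normalize casing before deriving the GUID. Here's a sketch in the spirit of name-based GUIDs, though not necessarily Bradley Grainger's exact implementation:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DeterministicGuid
{
    public static Guid FromNaturalKey(string naturalKey)
    {
        // Normalize casing first, so "ORDER-1" and "order-1" end up in the
        // same stream and can't collide later in a case-insensitive
        // projection store such as SQL Server.
        string normalized = naturalKey.ToLowerInvariant();

        using (var md5 = MD5.Create())
        {
            // MD5 conveniently yields exactly the 16 bytes a GUID needs.
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(normalized));
            return new Guid(hash);
        }
    }
}
```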

In another mystery case we received complaints that editing a particular document got slower and slower. Reading the data didn't show any issues, but writing definitely did. We quickly concluded that the involved projection was perfectly fine and started to look for bugs in the event store code, transaction management and the way we hydrate aggregates from events. We couldn't find anything out of the ordinary, until we requested a dump of that specific aggregate's event history. We have a special diagnostics page to dump an event stream in JSON format, but somehow that page timed out. Not until we got the actual production database did we discover a single event stream with over 100K events! Some kind of background job that ran regularly was updating the aggregate pretty often. But since the aggregate method involved didn't check for idempotency, a new event was emitted for each update, even when nothing actually changed. After a couple of months this definitely added up. We had to delete the entire event stream from the event store and rebuild the involved projections to resolve the issue.
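The fix itself was embarrassingly simple: make the aggregate method idempotent so that a no-op update doesn't emit an event. A hypothetical sketch:

```csharp
public class StatusChanged { public string NewStatus; }

public class Document
{
    private string status;

    public void UpdateStatus(string newStatus)
    {
        // A background job calling this every few minutes no longer floods
        // the stream: no change, no event.
        if (status == newStatus)
        {
            return;
        }

        Apply(new StatusChanged { NewStatus = newStatus });
    }

    // In a real aggregate this would also append the event to the list of
    // uncommitted changes; omitted here for brevity.
    private void Apply(StatusChanged e) => status = e.NewStatus;
}
```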

However, the most painful problem we encountered did not surface until after months of regular load testing. It appeared as if a projector missed some events for some reason. We first assumed the projector itself had a bug, but then we discovered similar problems with other projectors. We also learned that it only happened under high load, so we suspected that the projection plumbing didn't properly roll back the transactions that wrap the projections. We blamed just about every part of the code base and even looked at the implementation of NEventStore itself. But we never considered the fact that a SQL Server identity column (which we use to determine the order in which we should project events) is assigned at insert time, while the surrounding transactions can commit out of order. So if the second insert commits before the first one does, a projector polling for new events may process that second event before it ever had a chance to see the first one. We had to use exclusive locks during event store inserts to prevent this. And since our read-write ratio is 100:1, this doesn't noticeably affect our performance. Other event stores solve this differently, by simply reloading a page of events whenever a gap is detected in the sequence.
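For completeness, here's a sketch of that gap-detection alternative. The store API (ReadAfter) and types are hypothetical; the point is the re-read loop:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public class EventRecord { public long Id; /* payload omitted */ }

public interface IEventReader
{
    // Returns events with Id > lastProcessedId, ordered by Id.
    IReadOnlyList<EventRecord> ReadAfter(long lastProcessedId, int pageSize);
}

public class GapTolerantReader
{
    private readonly IEventReader store;

    public GapTolerantReader(IEventReader store) => this.store = store;

    public IReadOnlyList<EventRecord> LoadNextPage(long lastProcessedId)
    {
        while (true)
        {
            IReadOnlyList<EventRecord> page = store.ReadAfter(lastProcessedId, 100);

            // A hole in the identity sequence may mean another insert is still
            // committing, so wait briefly and re-read. A real implementation
            // must eventually give up, because rolled-back transactions leave
            // permanent gaps in an identity column.
            bool hasGap = page.Where((e, i) => e.Id != lastProcessedId + i + 1).Any();
            if (!hasGap)
            {
                return page;
            }

            Thread.Sleep(50);
        }
    }
}
```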

What does that mean for the future?

Well, we did learn from all of this and identified a couple of guidelines that might be useful to you too.

  • Projections should never crash. Always truncate textual data that doesn't fit the projection schema, but log a warning when that happens.
  • If a projector throws and retrying doesn't help (transient exceptions and the like), mark the projection as corrupt so that the UI can handle this.
  • Projectors should be autonomous. In other words, they run independently of other projectors, track their own progress and decide for themselves when to rebuild (see the sketch after this list). The consequence of this is that they need to run asynchronously.
  • Build infrastructure to extract individual streams or related streams for diagnostic purposes.
  • Account for case sensitivity of aggregate identities. How you handle them depends on the event store implementation and the underlying persistence store.
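To illustrate the autonomy guideline, here's a hypothetical sketch of a projector that tracks its own progress in a checkpoint row nobody else touches, again assuming an NHibernate session and the IEventReader from the earlier sketch:

```csharp
using NHibernate;

public class ProjectorCheckpoint
{
    public virtual string Id { get; set; }
    public virtual long LastEventId { get; set; }
}

public class OrderProjector
{
    private readonly ISession session;
    private readonly IEventReader store;

    public OrderProjector(ISession session, IEventReader store)
    {
        this.session = session;
        this.store = store;
    }

    public void ProcessNextBatch()
    {
        // Progress is tracked per projector, not globally by the event store,
        // so this projector can rebuild or catch up on its own schedule.
        var checkpoint = session.Get<ProjectorCheckpoint>("OrderProjector")
            ?? new ProjectorCheckpoint { Id = "OrderProjector", LastEventId = 0 };

        foreach (EventRecord e in store.ReadAfter(checkpoint.LastEventId, 100))
        {
            Project(e);
            checkpoint.LastEventId = e.Id;
        }

        session.SaveOrUpdate(checkpoint);
        session.Flush();
    }

    private void Project(EventRecord e) { /* apply the event to the projection */ }
}
```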

A lot of the problems described in this post have been the main driving force for us to invest in LiquidProjections, a set of lightweight libraries for building various types of autonomous projectors. But that's a topic for another blog post….

What about you?

Hopefully this will be my last post on the dark side of Event Sourcing, which means I'd love to know whether you recognize any of these problems. Did you run into any other notable issues? Or did you find alternative or better solutions? If so, let me know by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your projections in an Event Sourced world.

Tuesday, July 18, 2017

Key takeaways from QCon New York 2017: The Soft Stuff

This year, for the third time since I joined Aviva Solutions, I attended the New York edition of the famous QCon conference organized by InfoQ. As always, this was a very inspiring week, with topics on large-scale distributed architecture, microservices, security, APIs, organizational culture and personal development. It also gave me the mental breathing room to form a holistic view on the way I drive software development myself. After having discussed the tech stuff in the last post, it's time to talk about the softer aspects of our profession.

"Here's my plan. What's wrong with it?"

This year's keynote was one of the most inspiring keynotes ever. Former police officer Matt Sakaguchi, now a senior Google manager who is terminally ill, talked about how important it is to be able to be completely yourself at the office. He calls it "bringing your whole self to work", something he never could do while working in the macho culture that existed at the police force. According to Matt, the prerequisites for that are the feeling of having impact and meaning, and structure and clarity about what is expected of you. He particularly emphasized the need for people to have psychological safety; nothing is more restraining to someone's potential than feeling shame. It's the responsibility of a manager or team lead to provide such an environment and to make sure that no employee puts down another for asking a "dumb" question. It should be perfectly fine to ask them. In fact, somebody else might be wondering the same thing, but might be afraid to ask because of that same shame. A nice example of a manager that did this quite well was his former SWAT sergeant. Whenever they had an assignment, the sergeant would start the briefing with the statement "Here's my plan. What's wrong with it?"

Staying relevant when getting older

Something you wouldn't expect at a conference like QCon was a talk about what it takes to stay up-to-date and relevant when getting older. I didn't expect it, but the room was packed, with both young and older people. In fact, after a short inquiry by Don Denoncourt, it appeared only three people were older than 50. That does say something about our profession. Don shared some great stories about people who loved what they did until they died of old age, and emphasized that failure makes you stronger. So keep challenging yourself and keep exercising your brain. If you don't, you'll lose the ability to learn.

Don had a couple of suggestions on how to do that. First of all, he wanted us to look at what the best in our profession do and follow their lead. For example, they read code for fun, they might join or start one or more open-source projects, they write blog posts and speak at user groups and conferences. They also explore technologies that other team members are adept at. And finally, they mentor other people. Next to that, Don wants us to understand our learning style. If you're an auditory person, classroom sessions, audio books or YouTube videos might be the thing for you. But if you're more of a visual person, books and articles might be better suited. However, if you're a kinesthetic type, doing some tech projects in the evenings is probably the most efficient way to gain new knowledge.

He also suggests doing short bursts of learning while waiting for our unit tests to complete, in between Pomodoros, between projects and while job hunting. He even wants us to do some learning during our commute, during workouts, or, when you are a (grand)parent like Don, while pushing the (grand)kids in the stroller. And if you're into a certain topic, be sure to apply multi-pass learning by reading more than one article or book on the same subject. Finally, to make sure you don't run out of learning material, stockpile books, blogs, online courses and videos, and don't forget to accumulate posts from web magazines, newsletters, conferences and seminar videos. A great tool to collect all this is Pocket. Apparently Don and I have more in common than I thought.

Communication is a skill

One third of all projects fail because of poor communication and ineffective listening. At least, that is what Andrea Goulet and Scott Ford told us. And to be clear, failure includes missed deadlines and overrun budgets, and tends to be pretty traumatic. They also told us that the outcomes we experience, both individually and socially, come from conversations we had, didn't have, did well and didn't do so well. So becoming more effective at communication is a pretty important skill to learn.

Scott and Andrea shared a couple of techniques and tips to help with that. First of all, you should try to see the other person's potential before engaging in what should be a fruitful discussion. Just thinking about that person's possibilities rather than focusing on their bad habits can change the way you approach them. Next to that, it's useful to understand the speaking types. According to the model Andrea uses, people can be transactional, where the interaction is based on asking questions and telling people what needs to be done. People can also be positional, where they advocate their opinions from a particular role or position. And finally, some people are transformational, in that they share a vision and try to steer the discussion in a direction that aligns with it.

Emotions are part of any face-to-face interaction as well and can negatively influence your ability to communicate effectively, so it's imperative to transform an agitated response into a state of deep listening. If you do feel agitated, Andrea strongly suggested we pause, feel our feet, watch our breath and remember what we care about. To help us understand how our body, our emotions and what we're saying work together, she shared a model in which each layer contributes to the next. Our body is the first layer and informs us about threats to our safety. It helps us identify friends and foes, and lets us know how to fit in. It provides us with a sense of reality and helps us make judgement calls. Our emotions form the second layer and provide us with biological reactions to circumstances, which we perceive as fear, anger, sadness or happiness. Our speaking represents the third and last layer and allows us to express our opinions and say what we think is and is not factual in our world. It also allows us to ask people to do things for us and gives us the ability to commit ourselves to do things for others.

Pretty abstract ideas, right? Fortunately they had some simple and actionable tips as well. For instance, instead of stating your frustration as a fact, they advised us to use phrases like "I feel….when you…..because", or to respond to proposals you don't (entirely) agree with using an affirmative statement followed by "and" instead of "but". We also need to be humble, helpful and immediate (instead of sugarcoating things). And when we do have criticism, keep praising people publicly and save the criticism for a private setting.

Opinions please

So what do you think? Does any of this resonate with you? I know this is pretty soft stuff and might not appeal to the techies among you. But I do believe that being successful in the software development profession requires great communication skills. So let me know what you think by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you and I communicate in a technological world.