Tuesday, July 18, 2017

Key takeaways from QCon New York 2017: The Soft Stuff

This year, for the third time since I joined Aviva Solutions, I attended the New York edition of the famous QCon conference organized by InfoQ. As always, this was a very inspiring week with topics on large-scale distributed architecture, microservices, security, APIs, organizational culture and personal development. It also allowed me the mental breathing room to form a holistic view on the way I drive software development myself. After having discussed the tech stuff in the last post, it's time to talk about the softer aspects of our profession.

"Here's my plan. What's wrong with it?"

This year's keynote was one of the most inspiring keynotes ever. Former police officer Matt Sakaguchi, now a senior Google manager who is terminally ill, talked about how important it is to be completely yourself at the office. He called it "bring your whole self to work", something he never could do while working in the macho culture that existed at the police force. According to Matt, the prerequisites for that are the feeling that your work has impact and meaning, and structure and clarity about what is expected from you. He particularly emphasized the need for people to have psychological safety. Nothing is more restraining to someone's potential than feeling shame. It's the responsibility of a manager or team lead to provide such an environment and to make sure that no employee talks down to another for asking a "dumb" question. It should be perfectly fine to ask them. In fact, somebody else might be wondering the same, but might be afraid to ask because of that same shame. A nice example of a manager who did that quite well was his former SWAT sergeant. Whenever they had an assignment, the sergeant would start the briefing with the statement "Here's my plan. What's wrong with it?"

Staying relevant when getting older

Something you wouldn't expect at a conference like QCon was a talk about what it takes to stay up-to-date and relevant when getting older. I didn't expect it, but the room was packed, with both young and older people. In fact, after a short inquiry by Don Denoncourt, it appeared only three people were older than 50. That does say something about our profession. Don shared some of the great stories of people who loved what they did until they died of old age, and emphasized that failure makes you stronger. So keep challenging yourself and keep exercising your brain. If you don't, you'll lose the ability to learn.

Don had a couple of suggestions on how to do that. First of all, he wanted us to look at what the best in our profession do and follow their lead. For example, they read code for fun, they might join or start one or more open-source projects, they write blog posts and speak at user groups and conferences. They also discover technologies that other team members are adept at. And finally, they mentor other people. Next to that, Don wants us to understand our learning style. If you’re an auditory person, classroom sessions, audio books or YouTube videos might be the thing for you. But if you're more of a visual person, books and articles might be better suited. However, if you're a kinesthetic type, doing some tech projects in the evenings is probably the most efficient method to gain new knowledge.

He also suggests doing short bursts of learning while waiting for our unit tests to complete, in between Pomodoros, between projects and while job hunting. He even wants us to do some learning during our commute, during workouts, or, when you are a (grand)parent like Don, while taking your (grand)kids out in the stroller. And if you're into a certain topic, be sure to apply multi-pass learning by reading more than one article or book on the same subject. And finally, to make sure you don't run out of learning material, stockpile books, blogs, online courses and videos. And don't forget to accumulate posts from web magazines, newsletters, conferences and seminar videos. A great tool to collect all this is Pocket. Apparently Don and I have more in common than I thought.

Communication is a skill

One third of all projects fail because of poor communication and ineffective listening. At least, this is what Andrea Goulet and Scott Ford told us. And to be clear, failure includes missed deadlines and overrun budgets, and seems to be pretty traumatic. They also told us that the outcomes we experience, both individually and socially, come from the conversations we had, didn't have, did well and didn't do so well. So becoming more effective in communication is a pretty important skill to learn.

Scott and Andrea shared a couple of techniques and tips to help you with that. First of all, to have a fruitful discussion, you should try to see each other's potential before you start. Just thinking about that person's possibilities rather than focusing on their bad habits can change the way you approach a person. Next to that, it's useful to understand the speaking types. According to the model Andrea uses, people can be transactional, where the interaction is based on asking questions and telling people what needs to be done. People can also be positional, where they advocate their opinions from a particular role or position. And finally, some people are transformational, in which case they share a vision and try to steer the discussion in a direction that aligns with that.

Emotions are part of a face-to-face interaction as well and can negatively influence your ability to communicate effectively, so it's imperative to transform an agitated response into a state of deep listening. If you do feel agitated, Andrea strongly suggested that you pause, feel your feet, watch your breath and remember what you care about. To help us understand how your body, your emotions and what you're saying work together, she shared a model where each layer contributes to the next. Our body is the first layer and informs us about threats to our safety. It helps us to identify our friends or foes and lets us know how to fit in. It provides us with a sense of reality and helps us to make judgement calls. Our emotions form the 2nd layer and provide us with biological reactions to circumstances that we perceive as fear, anger, sadness or happiness. Our speaking represents the 3rd and last layer and allows us to express our opinions and say what we think is and is not factual in our world. It also allows us to ask people to do things for us and gives us the ability to commit ourselves to do things for others.

Pretty abstract ideas, right? Fortunately they had some simple and actionable tips as well. For instance, instead of stating your frustration as a fact, they advised us to use phrases like "I feel… when you… because…", or to respond to proposals you don't (entirely) agree with using an affirmative statement followed by "and" instead of "but". And apparently we also need to be humble, helpful and immediate (instead of sugarcoating things). And when we do have criticism, keep praising people publicly and save that criticism for a private setting.

Opinions please

So what do you think? Does any of this resonate well with you? I know this is pretty soft stuff and might not appeal to you techies. But I do believe that being successful in the software development profession requires great communication skills. So let me know what you think by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you and I communicate in a technological world.

Monday, July 10, 2017

Key takeaways from QCon New York 2017: The Tech Stuff

This year, for the third time since I joined Aviva Solutions, I attended the New York edition of the famous QCon conference organized by InfoQ. As always, this was a very inspiring week with topics on large-scale distributed architecture, microservices, security, APIs, organizational culture and personal development. It also allowed me the mental breathing room to form a holistic view on the way I drive software development myself. So let me first share the key takeaways on tech stuff.

The state of affairs on microservices

QCon is not QCon without decent coverage of microservices, and this year was no different. In 2014, the conference was all about the introduction of microservices and the challenges around deployment and versioning. Since then, numerous tools and products emerged that should make this all a no-brainer. I don't believe in silver bullets though, especially if a vendor tries to convince people they've built visual tools that allow you to design and connect your microservices without coding (they really did). Fortunately the common agreement is that microservices should never be a first-class architecture, but are a way to break down the monolith. Randy Shoup summarized this perfectly: "If you don't end up regretting early technology decisions, you probably overengineered."

Interestingly enough, a lot of the big players are moving away from frameworks and products that impose too much structure on microservice teams. Instead, I've noticed an increasing trend to use code generators to generate most of the client and service code based on some formal specification. Those handle the serialization and deserialization concerns, but also integrate reliability measures such as the circuit breaker pattern. And this is where the new kid in town joins the conversation: gRPC. Almost every company that talked about microservices seems to be switching to gRPC and Protobuf as their main communication framework. In particular, the efficiency of the wire format, its reliance on HTTP/2, the versioning flexibility and gRPC's Interface Definition Language (IDL) are the main arguments in its favor. But even with code generators and custom libraries, teams are completely free to adopt whatever they want. No company, not even Netflix, imposes any restrictions on its teams. Cross-functional "service" teams, often aligned with business domains, are given a lot of autonomy.
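To make the circuit breaker idea a bit more tangible, here is a minimal sketch in Python. It is not what any of these companies generate; the class name, the thresholds and the injectable `clock` are assumptions made purely for illustration. After a number of consecutive failures the breaker "opens" and fails fast, and after a cool-down it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the remote service."""


class CircuitBreaker:
    # Hypothetical helper: names and defaults are illustrative only.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A generated client would typically wrap every remote call in something like `breaker.call(stub_method, request)`, so that a struggling downstream service is given room to recover instead of being hammered with retries.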

About removing developer friction

Quite a lot of the talks and open space sessions I attended talked about the development experience, and more specifically about removing friction. Although some of it should be common sense, they all tried to minimize the distance between a good idea and having it run in production.

  • Don't try to predict the future, but don't take a shortcut. Do it right (enough) the first time. But don't forget that right is not perfect. And don't build stuff that already exists as a decent and well-supported open-source project.
  • Don't bother developers with infrastructure work. Build dedicated tools that abstract the infrastructure in a way that helps the developers get their work done quickly. Spotify in particular seems to be moving away from the true DevOps culture. They noticed that there was too much overlap and that it resulted in too many disparate solutions.
  • Bugs should not be tracked as a separate thing. Just fix them right away or decide to not fix them at all. Tracking all of these bugs is just going to create a huge list of bugs that no one will look at again….ever.
  • Closely tied to that is to keep distractions away from the team by assigning a Red Hot Engineer on rotation. This person handles all incoming requests, is the first responder when builds fail, and keeps anybody else from disturbing the team.
  • To track the happiness of the team, introduce and update a visual dashboard that shows the team's sentiment on various factors using traffic lights. Adrian Trenaman showed a nice example of this. This should also allow you to track or prove whether any actions helped or not.
  • Don't run your code locally anymore. If you're unsure if something works, write a unit test and learn to trust your tests. Just don't forget how to make those tests maintainable and self-explanatory.


Drop your DTA (development, testing, acceptance) environments. Just deploy!

Another interesting trend at QCon was the increased focus on reducing overhead by dropping separate development, testing and acceptance environments while trying to bring something into production. Many companies have found that those staging environments don't really make their product better and have a lot of drawbacks. They are often perceived as a fragile and expensive bottleneck. And when something fails, it is difficult to understand the failure. The problems companies find in those environments are not that critical at all and never as interesting as the ones that happen in production. In fact, they might even give the wrong incentive, the one where developers rely on some QA engineer to do the real testing work on the test environment, rather than invest in automated testing.

According to their talks, both the Gilt Group and Netflix seem to wholeheartedly support this mindset by working according to a couple of common principles. For starters, teams have end-to-end ownership of the quality and performance of the features they build. In other words, you build it, you run it. Teams have unfettered control over their own infrastructure. They assume continuous delivery in the design process, for instance by heavily investing in automated testing, employing multi-tenant canary testing and making sure there's only one way to do something. A nice example that Michael Bryzek of Gilt gave was a bot that would place a real order every few minutes and then cancel it automatically. Teams also act like little start-ups by providing services to other dev teams. This gives them the mentality to try to provide reliable services that are designed to allow delay instead of outage. They may even decide to ship MVPs of their services to quickly help other teams seize a new business opportunity, and then mature their service in successive releases.

You should be afraid of hackers

The second day's keynote was hosted by the CTO of CrowdStrike, a security firm often involved in investigating hacking attempts by nation states such as China. It was a pretty in-depth discussion on how they and similar government agencies map the behavior of hacking groups. I never really realized this, but it's amazing to see how persistent some of these groups are. I kind of assumed that hackers would find the path of least resistance, but the patience with which they inject malware and lure people to web pages or into downloading .LNK files that will install the initial implant is truly scary. I was particularly awed by how hackers manage to embed an entire web shell into a page, which allows them to run arbitrary Windows commands on a host system with elevated administrator rights. My takeaway from this session was that if you're targeted by any of these groups, there's nothing you can do. Unless you have the money to hire a company like CrowdStrike of course….

The Zen of Architecture

Juval Lowy, once awarded the prestigious title of Software Legend, has been in our business for a long time. Over the years, I've heard many rumors about his character, but it is safe to say….all of them are true. Nonetheless, his one-day workshop was one of the best, the most hilarious and intriguing workshops ever. After ridiculing the status quo of the software development and agile community, he enlightened us on the problems of functional decomposition. According to Juval, this results in a design that focusses on breaking down the functionality into smaller functional components that don’t take any of the non-functional requirements into account. He showed us many examples of real-world failures to reinforce that notion.

Instead, he wants us to decompose based on volatility. He wants us to identify the areas of the future system that will potentially see the most change, and encapsulate those into components and services. The objective is to keep thinking about what would change and encapsulate accordingly. That this is not always self-evident, and may take longer than management is expecting, is something that we as architects should be prepared for. However, this mindset does allow you to stop fighting changes, as many architects still do. Just encapsulate the change so that it doesn't touch the entire system. Even as I'm writing this, I'm still not sure how I architect my own systems. What Juval says makes sense, and it sounds very logical. Regardless, his workshop was a great reminder for us architects that we should keep sharing our thought processes, trade-offs, insights and use-case analysis with the developers we work with.
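As a small illustration of what encapsulating volatility could look like in code, here is a sketch of my own and not something from the workshop. It assumes, purely for the sake of the example, that the way customers get notified is the volatile area, and hides it behind a narrow interface so the order workflow never has to change when a new channel is introduced. All names are hypothetical.

```python
from typing import Protocol


class NotificationChannel(Protocol):
    """The volatile area: how a message reaches the customer."""

    def send(self, recipient: str, message: str) -> str: ...


class EmailChannel:
    def send(self, recipient: str, message: str) -> str:
        return f"email to {recipient}: {message}"


class SmsChannel:
    def send(self, recipient: str, message: str) -> str:
        return f"sms to {recipient}: {message}"


class OrderManager:
    """The stable workflow; the volatile delivery mechanism is injected."""

    def __init__(self, channel: NotificationChannel):
        self._channel = channel

    def confirm(self, customer: str, order_id: int) -> str:
        return self._channel.send(customer, f"order {order_id} confirmed")
```

Swapping `EmailChannel` for `SmsChannel` touches one constructor argument rather than the entire system, which is the effect the volatility-based decomposition is after.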

Juval also had some opinions about Agile (obviously). First of all, unlike many agilists, he believes the agile mindset and architecture are not a contradiction at all. He sees architecture as an activity that happens within the agile development process. But he does hold a strong opinion on how sprints are organized. Using some nice real-world stories, he explained and convinced us that sprints should not go back-to-back. Any great endeavor starts with some proper planning, so you need some room between the sprints to consider the status quo, make adjustments and work on the plan for the next sprint. That doesn't necessarily mean that everything is put on hold while the architect decomposes the next set of requirements based on volatility. It's perfectly fine for the developers to work on client apps, the user interface, utilities and infrastructure.

Opinions please

So what do you think? Does any of this resonate well with you? If not, what concerns do you have? Let me know by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your systems in an agile world.

Tuesday, June 27, 2017

The Ugly of Event Sourcing - Projection Schema Changes

Event Sourcing is a beautiful solution for high-performance or complex business systems, but you need to be aware that this also introduces challenges most people don't tell you about. Last year, I already blogged about the things I would do differently next time. But after attending another introductory presentation about Event Sourcing recently, I realized it is time to talk about some real experiences. So in this multi-part post, I will share the good, the bad and the ugly to prepare you for the road ahead. After having dedicated the last posts to the pains of wrongly designed aggregates, it is time to talk about the ugliness of dealing with projection schema changes.


As I explained in the beginning of this series, projections in event sourcing are a very powerful concept that provides ample opportunities to optimize the performance of your system. However, as far as I'm concerned, they also offer you the most painful challenges. Projections are great if their structure or the way they interpret event streams doesn't change. But as soon as either of these changes, you'll be faced with the problem of increasing rebuild times. The bigger your database becomes, the longer rebuilding will take. And considering the nature of databases, this problem tends to grow non-linearly. Over the years we've experimented with and implemented various solutions to keep this process to a minimum.

Side-by-side projections

The first step we took was to exploit the fact that the event store is an append-only database. By rebuilding a new set of projections next to the original ones, while the system is still being used, we could reduce the amount of downtime to a minimum. We simply tracked the checkpoint of the latest change to the event store when that rebuild process started and continued until all projections were rebuilt up to that point. Only then did we need to bring down the system to project the remainder of the changes that were added to the event store in the meantime. By repeating the first stage a couple of times, this solution could reduce the downtime to a couple of seconds. However, it did mean somebody needed to monitor the upgrade process in case something failed and it had to be restarted. So we still had to find a way to reduce that time even more.
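The two-stage approach described above can be sketched roughly as follows. This is a simplified Python illustration, not our actual implementation: the `EventStore` class, the use of the event count as checkpoint and the upper-casing "projection logic" are all stand-ins I made up for the example.

```python
class EventStore:
    """Hypothetical append-only store; the checkpoint is the event count."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def head(self):
        return len(self._events)

    def read(self, after, up_to):
        return self._events[after:up_to]


def catch_up(store, projection, position):
    """Project everything between `position` and the current head."""
    target = store.head()
    for event in store.read(position, target):
        projection.append(event.upper())  # stand-in for real projection logic
    return target


def rebuild_side_by_side(store, online_passes=3, take_offline=lambda: None):
    new_projection, position = [], 0
    # Online passes: the system keeps accepting writes, so each pass
    # leaves a smaller backlog than the previous one.
    for _ in range(online_passes):
        position = catch_up(store, new_projection, position)
    take_offline()  # the real downtime starts here
    # Final pass: only the events that arrived during the last online pass.
    catch_up(store, new_projection, position)
    return new_projection
```

Because every pass starts from the checkpoint where the previous one stopped, the final offline pass only has to deal with whatever was written during the last online pass.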


The situation may be different in your domain, but in ours, a lot of the data had a pretty short lifecycle, typically between 7 and 30 days. And the only reason why people would still look for that old data is to use it as a template for further work. To benefit from that, we started to track graphs of aggregates that are used together and introduced a job that would update that graph whenever an aggregate reached its functional end-of-life. Then, whenever the graph was 'closed', it would mark the corresponding event streams as archivable. This would eventually be used by another job to mark all events of the involved streams with an archivability date. With that, we essentially enriched the event store with metadata that individual projections could use to make smart decisions about the work that needed to be done. By allowing some of the more expensive projections to run asynchronously and keep track of their own progress, we could exclude them from the normal migration process. This caused a tremendous reduction of the total migration time, especially for those projections that exploited the archivable state of the event store. And as a nice bonus, it allows you to rebuild individual projections in production in case some kind of high-priority temporary fix is needed that requires schema changes or a repair of a corrupted projection.
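To give an idea of how such archivability metadata can be used, here is a minimal Python sketch. The `StoredEvent` shape, the field name `archivable_since` and both functions are hypothetical; a real event store would keep this in event headers or a separate table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StoredEvent:
    stream_id: str
    body: str
    archivable_since: Optional[date] = None  # set by the archiving job


def mark_archivable(events, closed_streams, today):
    """Stamp every event of a closed stream graph with an archivability date."""
    for event in events:
        if event.stream_id in closed_streams and event.archivable_since is None:
            event.archivable_since = today


def events_to_project(events, include_archived=False):
    """Let a projection skip archived data it does not need to rebuild."""
    return [e for e in events if include_archived or e.archivable_since is None]
```

A projection that only serves active data would call `events_to_project(events)` during a rebuild, while one that backs the "use as template" feature would pass `include_archived=True`.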


But this autonomy introduces a new challenge. The data projected by those projections would not become available until a while after the system started. Worse, because the events are still being processed by the projection, it might be possible that queries would return data that is half-way projected and in the wrong state. Whether the first is a real problem is a functional discussion. Maybe adding the date of the last event projected or an ETA telling the end-user how long it will take to complete the projection work is sufficient. Being able to do that does require some infrastructure in your projection code that allows you to get a decent ETA calculation. Showing data in the wrong state could cause some pretty serious problems for end-users. But even that can sometimes be handled in a more functional way. If that's not possible, you might be able to exploit the specific purpose and attributes of that projection to filter out half-projected data. For instance, maybe that projection is supposed to only show documents in the closed state. So as long as the projection data doesn't represent that state, you can exclude those from the results.
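Both ideas are simple enough to sketch in a few lines of Python. The ETA below is a naive linear estimate that assumes the remaining events project at the average rate observed so far, and the filter assumes each projected row carries a `state` field; both function names and the row shape are made up for the example.

```python
def estimate_seconds_remaining(processed, total, elapsed_seconds):
    """Naive linear ETA based on the average projection rate so far."""
    if processed <= 0 or elapsed_seconds <= 0:
        return None  # nothing to extrapolate from yet
    rate = processed / elapsed_seconds  # events per second
    return (total - processed) / rate


def visible_documents(rows):
    """Hide rows the projection has not yet brought to the 'closed' state."""
    return [row for row in rows if row["state"] == "closed"]
```

In practice the projection rate is rarely constant, so a moving average over the last few minutes would probably give the end-user a more believable number.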

Not all projections are equal

With the introduction of autonomous projections that provide tracking information and ETA calculation, you can do one more thing to speed up the migration process: prioritization of projections. If you have many asynchronous projections (which you should), it is very likely that some of them are more crucial for the end-users than others. So why would you have them all run at the same time? Maybe it makes sense to hold off some of them until the critical ones have completed, or maybe the projection gets rebuilt in-memory every time the system restarts. Another option you now have is to rebuild an individual projection by processing the event store more than once, thereby focusing on the most recent or relevant data first. This does require the right metadata associated with the events, but most event stores have you covered on this. And if you have associated your events with a (natural) partition key, you could spin up multiple asynchronous projection processes in parallel, each focusing on a particular partition.
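Here is a rough Python sketch of two of these options: rebuilding projections in priority order and projecting partitions in parallel. It assumes events are plain dictionaries with a `partition` key and that a projection is just a callable; a real system would use separate processes or workers instead of a thread pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def rebuild_by_priority(projections, events):
    """Rebuild critical projections first; a lower number is more important."""
    order = []
    for _priority, name, handle in sorted(projections, key=lambda p: p[0]):
        for event in events:
            handle(event)
        order.append(name)
    return order


def rebuild_by_partition(events, project_partition, workers=4):
    """Project each partition independently and in parallel."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event["partition"]].append(event)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {key: pool.submit(project_partition, key, batch)
                   for key, batch in partitions.items()}
        return {key: future.result() for key, future in futures.items()}
```

Partitioning only works if events in different partitions never affect the same projection rows, which is why a natural partition key such as a tenant or country is such a good fit.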

To OR/M or not to OR/M

Now, what about the actual technology that you use to write to your underlying projections database? Some have argued that using raw SQL is the fastest method for updating RDBMS-backed projections. Others would say that using an OR/M still has merits, in particular because it has a unit-of-work that allows you to process multiple related events before hitting the database. We've seen teams that use both, but we haven't identified the definitive ultimate solution.

One thing we're planning is to see how we can exploit the OR/M solution to break the projection work into large chunks, where the projection work happens in memory and is then flushed back to the database. Some first spikes showed a tremendous performance improvement that would be very difficult to achieve with raw SQL (unless you're building your own implementation of the Unit of Work pattern).
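To show the principle without dragging in an actual OR/M, here is a hand-rolled Python sketch using the standard `sqlite3` module. It is not the spike mentioned above; the `totals` table, the event shape `(doc_id, amount)` and the chunk size are assumptions for illustration. Events are folded into an in-memory dictionary per chunk, and the database is hit only once per chunk.

```python
import sqlite3


def project_in_chunks(connection, events, chunk_size=1000):
    """Fold events into in-memory rows and flush once per chunk."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS totals (doc_id TEXT PRIMARY KEY, total INTEGER)")
    flushes = 0
    for start in range(0, len(events), chunk_size):
        pending = {}  # the in-memory 'unit of work' for this chunk
        for doc_id, amount in events[start:start + chunk_size]:
            pending[doc_id] = pending.get(doc_id, 0) + amount
        with connection:  # one transaction per chunk
            connection.executemany(
                "INSERT OR IGNORE INTO totals (doc_id, total) VALUES (?, 0)",
                [(doc_id,) for doc_id in pending])
            connection.executemany(
                "UPDATE totals SET total = total + ? WHERE doc_id = ?",
                [(amount, doc_id) for doc_id, amount in pending.items()])
        flushes += 1
    return flushes
```

The gain comes from the fact that many events touching the same row collapse into a single write, which is exactly what an OR/M's unit of work gives you for free.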

True Blue/Green

Even with all these improvements, rebuilding projections can still take a while to complete. However, if your system is HTTP based (e.g. a web application, a microservice or HTTP API), you can exploit load balancers and HTTP response codes in a pretty neat way to completely automate the migration process. Here's what this process can look like:

  1. Deploy the new application side-by-side with the original version. The website will return HTTP 503 (Service Unavailable) until it has been fully provisioned.
  2. Allow the load balancer to serve both the old and new sites from the same URL.
  3. Stage 1 of the out-of-place migration process runs to copy over all events up to the checkpoint that the source database was at when the stage started.
  4. Repeat stage 1 two more times to copy over the remainder of the data.
  5. Stage 2 is started to complete the migration, but not before the source application returns HTTP 503 as well. This is the real downtime.
  6. Stage 2 completes, after which the new application becomes responsive and everybody is happy again.
  7. If stage 2 fails, it simply resets the source application's state so that it no longer returns HTTP 503.

Notice how during the migration there's no manual intervention needed to switch DNS entries or fiddle with the load balancer? That's what I would call true blue-green deployments. Even if you use immutable infrastructure, where the new application is deployed as a pre-baked cloud machine, this will work.
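The interplay between the HTTP 503 responses and the load balancer in the steps above can be modelled with a toy Python sketch. None of this is real infrastructure code: `AppInstance`, `route` and `run_stage_two` are hypothetical names, and the "load balancer" is just a loop that skips any instance answering 503.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "provisioning"
    SERVING = "serving"
    MIGRATING = "migrating"


class AppInstance:
    def __init__(self, name, state):
        self.name, self.state = name, state

    def handle(self, path):
        # Anything other than SERVING answers 503 (Service Unavailable).
        if self.state is not State.SERVING:
            return 503, "Service Unavailable"
        return 200, f"{self.name} served {path}"


def route(instances, path):
    """Toy load balancer: skip any instance that answers 503."""
    for instance in instances:
        status, body = instance.handle(path)
        if status != 503:
            return status, body
    return 503, "Service Unavailable"


def run_stage_two(old, new, migrate):
    old.state = State.MIGRATING  # the real downtime starts here
    try:
        migrate()
    except Exception:
        old.state = State.SERVING  # roll back: the old version takes traffic again
        return False
    new.state = State.SERVING  # the old version keeps returning 503
    return True
```

Since the routing decision is driven entirely by what each instance reports about itself, the switch from old to new (or the rollback) happens without anyone touching the load balancer configuration.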

What about you?

So what do you think? Do these solutions make sense to you? Do you even recognize these problems? And if so, what other solutions did you employ to resolve the long rebuilding times? I'd love to know what you think about this by commenting below. Oh, and follow me at @ddoomen to get regular updates on my everlasting quest for knowledge that significantly improves the way you build your Event Sourced systems in an agile world.