Thursday, February 11, 2016

A least recently used cache that you can use without worrying

At the beginning of this year, I reported on my endeavors to use RavenDB as a projection store for an event sourced system. One of the things I tried in order to speed up RavenDB's projections was the Least Recently Used cache developed by Brian Agnes a couple of years ago. This cache gave us a nice speed increase, but the original author appears to be unreachable and seems to have abandoned the project. So I started looking for a way to distribute the code in a way that makes it painless to consume in other projects. With this in mind, I decided to start a new open-source project: Fluid Caching. As of August 10th, version 1.0.0 is officially available as a source-only package on NuGet.

The cache itself met our requirements quite well. It supports putting a limit on its capacity, and it allows you to specify the minimum amount of time objects must be kept in the cache (even if that would exceed the capacity), as well as a maximum amount of time. It's completely safe to use in multi-threaded scenarios and uses an algorithm that keeps performance under control, regardless of the number of items in the cache. It also supports multiple indexes based on different keys on top of the same cache, and is pretty flexible in how you get keys from objects. All credit for the initial implementation goes to Brian, and I encourage you to read his original article from 2009 on some of the design choices.

What does it look like

Considering a User class, in its simplest form, you can use the cache like this:

var cache = new FluidCache<User>(1000, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(10), () => DateTime.Now);

This will create a cache with a maximum capacity of 1000 items, but considering the minimum age of 1 minute, the actual capacity may exceed that for a short time. After an object hasn't been requested for 10 minutes, it becomes eligible for garbage collection. To retrieve objects from the cache, you need to create an index:

IIndex<User, string> indexById = cache.AddIndex("byId", user => user.Id);

The lambda you pass in will be used to extract the key from the User object. Notice that you can have multiple indexes at the same time without causing duplication of the User instances. To retrieve an object from the cache (or create it on the spot), use the cache like this:

User user = await indexById.GetItem("dennisd", id => Task.FromResult(new User { Id = id }));

Obviously, that factory will be invoked only once, and only if the object isn't in the cache already. The current API is async-only, but as you can see, in cases where async/await isn't really needed, that results in some ugly code. I'm considering providing both a synchronous and an asynchronous API. Also, instead of passing the factory method in the call to GetItem, you can pass a more global factory into the AddIndex method.
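
For example, assuming AddIndex accepts that factory as an optional extra argument (do verify the exact signature against the latest source), an index with a cache-wide factory could look like this:

IIndex<User, string> indexById = cache.AddIndex(
    "byId",
    user => user.Id,
    id => Task.FromResult(new User { Id = id }));

// With a cache-wide factory in place, the per-call factory can be omitted
User user = await indexById.GetItem("dennisd");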

Why am I doing this

Good question. It's 2016, so some of the custom thread synchronization primitives are part of the .NET Framework these days. Next to that, we all write asynchronous code and thus need support for async/await. These days, being able to compile the code against any modern .NET version, or even as a Portable Class Library or for .NET Core, isn't a luxury either. Other features I'm adding include thread-safe factory methods and some telemetry for tuning the cache to your needs. In terms of distribution, my preferred method for a library like this is a source-only NuGet package, so that you don't burden the consumers of your packages with yet another dependency. Also, the code itself needs some Object Calisthenics love as well as some sprinkles of my Coding Guidelines. And finally, you can't ship a library without at least a decent amount of properly scoped unit tests and a fully automated build pipeline.

So how does it work

As I mentioned before, I highly recommend the original article if you want to understand some of the design decisions, but let me share some of the inner workings right now. The FluidCache class is the centerpiece of the solution, but it's more of a factory for other objects than a cache per se. Instead, all items in the cache are owned by the LifeSpanManager. Its responsibility is to track when an item has been 'touched' and to use that to calculate when it becomes eligible for garbage collection, while accounting for the provided minimum age.

Each item in the cache is represented by a Node, which itself is part of a linked list of such nodes. The LifeSpanManager maintains an internal collection of 256 bags, only one of which is open at a time. Each bag has a limited lifetime and contains a pointer to the first node in its linked list. Whenever a new item is added to the cache, it is inserted at the head of that linked list. Similarly, if an existing item is requested, it is moved to the current bag as well. But whenever an item is added, requested or removed and a certain internal interval has passed, a (synchronous) clean-up operation is executed. This clean-up iterates over all the bags, starting with the oldest one, and tries to remove all nodes from each bag, provided that the bag's end date respects the configured minimum and maximum age. When the clean-up has completed, the current bag is closed (its end date and time is set) and the next one is marked as open.
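
To illustrate the mechanism, here's a heavily simplified sketch of the aging-bag idea. This is not the actual FluidCaching code: it replaces the intrusive linked list with a set per bag, ignores the maximum age and the capacity limit, and all names are mine.

using System;
using System.Collections.Generic;

// Simplified sketch of the aging-bag technique; names and details are illustrative only
class CacheNode<T>
{
    public T Value;
    public AgeBag<T> Bag;                 // the bag that currently owns this node
}

class AgeBag<T>
{
    public readonly HashSet<CacheNode<T>> Nodes = new HashSet<CacheNode<T>>();
    public DateTime? ClosedAt;            // set when the bag is closed
}

class LifespanSketch<T>
{
    private readonly Queue<AgeBag<T>> closedBags = new Queue<AgeBag<T>>();
    private AgeBag<T> openBag = new AgeBag<T>();
    private readonly TimeSpan minAge;
    private readonly Func<DateTime> getNow;

    public LifespanSketch(TimeSpan minAge, Func<DateTime> getNow)
    {
        this.minAge = minAge;
        this.getNow = getNow;
    }

    // Adding or touching an item moves its node into the currently open bag
    public void Touch(CacheNode<T> node)
    {
        if (node.Bag != null)
        {
            node.Bag.Nodes.Remove(node);
        }

        openBag.Nodes.Add(node);
        node.Bag = openBag;
    }

    // Runs after a certain number of cache operations: drop closed bags (oldest first)
    // whose items haven't been touched for at least the minimum age, then close the
    // open bag and start a new one
    public void CleanUp()
    {
        DateTime now = getNow();
        while (closedBags.Count > 0 && now - closedBags.Peek().ClosedAt.Value > minAge)
        {
            AgeBag<T> expired = closedBags.Dequeue();
            expired.Nodes.Clear();        // strong references gone; the GC can collect the items
        }

        openBag.ClosedAt = now;
        closedBags.Enqueue(openBag);
        openBag = new AgeBag<T>();
    }
}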

Through the FluidCache class, you can create multiple indexes and provide an optional factory method. However, indexes are nothing more than simple dictionaries that connect the key of the index to a weak reference to the right node. They will never prevent the garbage collector from cleaning up your item. Only after the LifeSpanManager removes the reference from its internal collection of aging bags can the GC do its job.
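
To make that concrete, here's a conceptual sketch of such a weak index; the class and member names are mine, not the library's.

using System;
using System.Collections.Concurrent;

// Keys map to weak references, so the index by itself never keeps an item alive;
// only the lifespan manager's aging bags do
class WeakIndexSketch<T, TKey> where T : class
{
    private readonly ConcurrentDictionary<TKey, WeakReference<T>> entries =
        new ConcurrentDictionary<TKey, WeakReference<T>>();

    private readonly Func<T, TKey> getKey;

    public WeakIndexSketch(Func<T, TKey> getKey)
    {
        this.getKey = getKey;
    }

    public void Add(T item)
    {
        entries[getKey(item)] = new WeakReference<T>(item);
    }

    public T Find(TKey key)
    {
        WeakReference<T> reference;
        T item;

        // Only return the item if the GC hasn't collected it yet
        if (entries.TryGetValue(key, out reference) && reference.TryGetTarget(out item))
        {
            return item;
        }

        return null;
    }
}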

I'm not even close to an algorithmic magician, but I believe the project has a solid foundation. So check out the code base and let me know what you think. And follow me at @ddoomen to get regular updates on my everlasting quest for better solutions.

Thursday, February 04, 2016

How Git can help you prevent building a monolith

During last week's Git Like a Pro talk, I tried to convey the message that switching to Git is much more than introducing a new source control system. It affects not just the way you commit source code, branch or merge; it changes the entire development workflow. In fact, I'm willing to claim that switching from any centralized source control system to Git and a decent hosted Git service such as GitHub, BitBucket or GitLab can help you prevent building a monolith. Or maybe I should reverse that claim and say that not using hosted Git will make it a whole lot more difficult to prevent building monoliths. Why? Let me elaborate on this a bit.

Preventing a monolith is difficult and painful, especially if the team is under a lot of pressure to release early and often. Technology like NuGet can help you distribute components in a controlled way. And if a component needs to expose HTTP APIs, you can use OWIN to host that NuGet component in virtually every kind of host. On an organizational level, you might be aware of Conway's Law and decide to split your teams in a way that aligns with the envisioned components of your architecture. Physical boundaries tend to cause your teams to introduce procedures and API contracts that formalize the interaction between those teams. However, in my opinion, that's not enough.

If I look back at my own projects, a recurring reason for not building some feature as a separate component is that it would cause unexpected planning dependencies between the teams. So unless it was quite obvious that something could be built as a functional slice, we usually just built the feature into the existing codebase. So why this planning dependency? That's simple: because we were using Microsoft Team Foundation Server (TFS). TFS offers a nice and integrated collaboration environment for teams, but it used to offer a centralized source control system only. If you assign a team to own a particular component, usually maintained in a dedicated source code repository, only that team can make changes to the component. Sure, you can grant access to other teams as well, but then you have no easy mechanism to control the code changes that end up in the core code base.

Considering planning dependencies, the worst thing that can happen to a team is having to wait for another team to finish a change request or bug report before they can continue. If your team practices an agile methodology like Scrum, it becomes very hard to plan for these things. Usually, this dependency doesn't pop up until you start to work on some part of the code that relies on such a component. Unless the owning team has nothing planned that day, that user story you carefully planned in your sprint planning meeting will get blocked for a couple of days. If your sprints only run for two weeks, being blocked in such a way is a significant impediment for high-velocity teams.

And even if that other team has plenty of time to spare, in most TFS projects the release strategy for a component is anything but optimized. In many TFS projects I've seen, the most advanced thing they do is add the changeset number of the last source control change to the component version. Anything more advanced than that is pretty painful with TFS's build system. So what is it that Git and hosted Git services offer that TFS doesn't? I've already blogged about this last year, but the features that are essential here are… pull requests and forking.

Forking a repository on GitHub means creating a clone of the original source code repository that is stored under your own account. The crucial feature here is that GitHub creates a link between your clone and the original repository. Your clone gets all the branches and tags from the original, and you're free to create new branches, experiment with the original code base and do other advanced stuff like rebasing. All without affecting the original repo.
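
In practice, a fork-based workflow boils down to a handful of Git commands (the URLs, repository and branch names below are just placeholders):

# Clone your fork (created through the GitHub UI) and keep a link to the original
git clone https://github.com/yourname/somecomponent.git
cd somecomponent
git remote add upstream https://github.com/originalteam/somecomponent.git

# Do your work on a branch of your own, without ever touching the original repository
git checkout -b fix-cleanup-bug
git commit -am "Fix a bug in the clean-up logic"
git push origin fix-cleanup-bug

# Keep your fork in sync with the original repository
git fetch upstream
git rebase upstream/master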

You could add some new extension point to the original code, or fix that bug you found while your team was working on some code changes. If the owners of the original repo added build scripts such as PSake as part of the source, you can even build a NuGet package from it and use that for the time being. Now, if you're confident that the bug was fixed, or that the new extension point serves its purpose well, it's time for that pull request. With such a PR, you're asking the owners of the original repository to pull your changes into the official code base. The PR serves as a central point for discussions and code reviews, and may even include status changes and unit test results from their build system. In essence, forking and PRs facilitate the autonomy needed for successful agile teams. You're never blocked by the other team, simply because you can alter the code to your needs without going your separate ways. That, in itself, is a crucial ingredient for component-driven development in emergent architectures. And I haven't even started about advanced Git-only release strategies such as GitFlow and GitHub Flow, or automatic versioning through GitVersion.

Now, some of the TFS proponents around me might say that TFS has supported Git as a source control system for a couple of years now. Hey, they may even tell you that TFS offers pull requests these days. But it does not support creating cross-repository pull requests yet. Don't get me wrong though; this blog is not a rant against TFS. But over the last two years I've come to realize that we have been held back significantly by using TFS. And although I fully realize that Microsoft has pushed a lot of new features to TFS over the last year or so, we still miss that crucial feature.

If you ask me, TFS won't be a viable option for large-scale agile development anytime soon. So what do you think? Let me know by commenting below. And follow me at @ddoomen to get regular updates on my everlasting quest for better solutions.