Good tests give you confidence to change anything

by toni 26. November 2012 09:04

Introduction

The system has been under development for two years, and the purpose of the project is to replace the server side of an old legacy system. Clients and the server communicate using messages. The following figure shows the message flow.

legacy_message_flow

The normal communication flow is as follows

  1. Start session between client and server.
  2. Initialization with some additional data.
  3. Exchange any number of different messages.
  4. Client sends Stop which ends the exchange of messages.

The first two messages, Start Session and Initialize, affect every other message that the server receives, so any change to them affects all the others. The server keeps some state information and does some additional things based on those two messages. Since this server is supposed to serve old legacy clients, we cannot change that logic.

After two years of development there was a change request that would change the functionality of the Start Session and Initialize messages. Since every other message depends on those, we can list a few challenges:

  • How do we analyze how this change will affect other messages?
  • How long does it take to make the changes?
  • How long does it take to test that everything works?

Test Suite

Imagine a system without any tests. It would be impossible to make this change without a huge effort. Remember, this functionality was implemented two years ago. What if it was five years ago? What if the developers are no longer working for the company?

Fortunately that wasn’t our situation, since our test suite contains a lot of integration tests. Yes, not unit tests but integration tests. In the scope of this post, the term integration test means a test that starts from the input (message) received by the server and ends with the database.

legacy_integration_tests

If you had only unit tests you could tell whether a single message was processed successfully, but you would have no idea how the system as a whole would work. How would that change affect the processing of the following messages?

From the start we built integration tests based on what happens in production. The following figure shows a simple scenario from production.

legacy_message_flow_addpart

In the figure above, Start Session, Initialize and AddProduct must have been sent before AddPart can be sent. This means that in our integration test for AddPart we had the following code, which is executed before each test.

public void Initialize()
{
    this.Given_Session_Start_has_been_sent();
    this.Given_Initialize_has_been_sent();
    this.Given_AddProduct_has_been_sent();
}

This code actually sends the needed messages the same way they would be sent in a production environment. This gives us the following benefits:

  • It is easy to put the system into different states by simply sending messages in a different order.
  • The integration test works just like production, i.e. there is no need to put fake data into the database to put the system into a specific state.
  • Smaller chance of having bugs in the actual tests/test data.
  • Easy to test situations where messages are received in the wrong order.
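
As a rough illustration of the last point, the sketch below shows how a wrong-order scenario can be set up simply by leaving one message out. FakeServer and AddPartScenarios are hypothetical names invented for this example, and the prerequisite table is only an assumption based on the AddPart scenario above; the real tests send the messages to the real server.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real server: it only tracks which
// messages have arrived so far and rejects messages sent too early.
public sealed class FakeServer
{
    private readonly HashSet<string> received = new HashSet<string>();

    // Assumed prerequisites, taken from the AddPart scenario in the post.
    private static readonly Dictionary<string, string[]> Prerequisites =
        new Dictionary<string, string[]>
        {
            { "StartSession", new string[0] },
            { "Initialize", new[] { "StartSession" } },
            { "AddProduct", new[] { "StartSession", "Initialize" } },
            { "AddPart", new[] { "StartSession", "Initialize", "AddProduct" } },
        };

    // Returns true when the message was accepted, false when it arrived too early.
    public bool Send(string message)
    {
        foreach (var required in Prerequisites[message])
        {
            if (!this.received.Contains(required))
            {
                return false;
            }
        }

        this.received.Add(message);
        return true;
    }
}

public static class AddPartScenarios
{
    // Happy path: the same three setup messages as in Initialize() above.
    public static bool AddPartAfterFullSetup()
    {
        var server = new FakeServer();
        server.Send("StartSession");
        server.Send("Initialize");
        server.Send("AddProduct");
        return server.Send("AddPart");
    }

    // Wrong order: AddProduct is skipped, so AddPart should be rejected.
    public static bool AddPartWithoutAddProduct()
    {
        var server = new FakeServer();
        server.Send("StartSession");
        server.Send("Initialize");
        return server.Send("AddPart");
    }
}
```

The only difference between the two scenarios is which setup messages are sent, which is the property that makes this style of test cheap to extend.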

With this kind of integration test setup, making changes to the code is a lot easier. Even if you don’t know the code base that well, you can be confident that you are not breaking anything when making changes.

Conclusion

Unit tests are great. If you have an algorithm that checks that the format of a bank account is correct, you should write unit tests. But unit tests only tell you that a small piece of code works. They don’t tell you whether your system still works, whether all the interactions between different parts work. That’s why you need integration tests.

There is one book that I always recommend: Growing Object-Oriented Software, Guided by Tests. Yes, it is a few years old and the code examples are in Java, but it is a great book. It shows how to build systems by starting with the integration tests so that you know your end-to-end scenarios will work. And most importantly, that they still work after you make changes.

Code Analysis helps you find issues with disposable resources

by toni 20. November 2012 07:14

A while ago I was working with an old code base and I decided to run Code Analysis against it. Since Code Analysis had been used during development I didn’t expect to find anything major. I just wanted to see whether Code Analysis had improved during this time and whether it could find something.

I was pretty surprised when it reported a CA2000 warning.

CA2000 Dispose objects before losing scope In method 'Connection.CreateConnection()', call System.IDisposable.Dispose on object 'resource' before all references to it are out of scope.

The code in question looked like this (the warning refers to the line that creates resource).

public void CreateConnection()
{
    var resource = new DisposableResource();
    this.legacyConnection = new LegacyConnectionWrapper(resource);
    this.legacyConnection.Connect();
}

At first I didn’t understand why Code Analysis thought there was a problem, since

  • Connection, LegacyConnectionWrapper and DisposableResource all implement IDisposable correctly.
  • LegacyConnectionWrapper owns DisposableResource and will dispose it.
  • Connection owns LegacyConnectionWrapper and will dispose it.

Then I decided to see what MSDN says about this warning:

If a disposable object is not explicitly disposed before all references to it are out of scope, the object will be disposed at some indeterminate time when the garbage collector runs the finalizer of the object. Because an exceptional event might occur that will prevent the finalizer of the object from running, the object should be explicitly disposed instead.

Now it all made sense. If an exception were thrown on line 4 or 5 of the method (while creating LegacyConnectionWrapper or calling Connect()), the DisposableResource would not be disposed right away. In some cases that can cause bugs that are really hard to find. Assume you call CreateConnection() with an aggressive retry policy. The call keeps failing and you never release any resources. At some point the whole reason for the failure might be that you have run out of some unmanaged resource because the garbage collector hasn’t freed those resources yet.

It is also worth mentioning that I only got this warning when I used the “All Rules” rule set. Using just the minimum or recommended rules did not produce this warning.
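
One possible way to restructure the method is sketched below: keep the new objects in local variables, hand them over to the field only after Connect() has succeeded, and dispose them in a finally block otherwise. This is a sketch under assumptions, not the fix that was actually applied to the code base. The three classes are simplified stand-ins for the real ones, and the failOnConnect flag and the LastResource property exist only so that the behaviour can be observed.

```csharp
using System;

// Simplified stand-ins for the types in the post (assumptions for this sketch).
public sealed class DisposableResource : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose() { this.IsDisposed = true; }
}

public sealed class LegacyConnectionWrapper : IDisposable
{
    private readonly DisposableResource resource;
    private readonly bool failOnConnect;

    public LegacyConnectionWrapper(DisposableResource resource, bool failOnConnect)
    {
        this.resource = resource;
        this.failOnConnect = failOnConnect;
    }

    public void Connect()
    {
        if (this.failOnConnect) throw new InvalidOperationException("Connect failed.");
    }

    // The wrapper owns the resource and disposes it.
    public void Dispose() { this.resource.Dispose(); }
}

public sealed class Connection : IDisposable
{
    private readonly bool failOnConnect;
    private LegacyConnectionWrapper legacyConnection;

    public Connection(bool failOnConnect) { this.failOnConnect = failOnConnect; }

    // Exposed only so the sketch can be checked; not part of the original code.
    public DisposableResource LastResource { get; private set; }

    public void CreateConnection()
    {
        DisposableResource resource = null;
        LegacyConnectionWrapper wrapper = null;
        try
        {
            resource = new DisposableResource();
            this.LastResource = resource;
            wrapper = new LegacyConnectionWrapper(resource, this.failOnConnect);
            wrapper.Connect();

            // Success: ownership moves to the field, so nothing is disposed below.
            this.legacyConnection = wrapper;
            wrapper = null;
            resource = null;
        }
        finally
        {
            // Non-null locals here mean that something above threw.
            if (wrapper != null) wrapper.Dispose();
            else if (resource != null) resource.Dispose();
        }
    }

    public void Dispose()
    {
        if (this.legacyConnection != null) this.legacyConnection.Dispose();
    }
}
```

With this shape a failing CreateConnection() releases the resource immediately instead of waiting for the garbage collector, which is what matters in the retry scenario described above.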

Beware of JSON over HTTP architecture Part 7

by toni 12. November 2012 05:11

In the previous part we came up with a solution to our problem: Queues.

queue_solution

Steps are

  1. Connect to the bank and get the payment information.
  2. Notify others with “Payment Received” event.
  3. Get the event, match the payment to customer/account and increase the limit accordingly.
  4. Notify others with “Balance Increased” event.
  5. Get the event and mark the invoice as paid.

The new solution looks a lot better, but how about the problems we had? Did it fix any of those?

Solving Original Problems

Dependencies

In our original SOA-based architecture we found out that sooner or later every system knows every other system because we need to call them directly. Now that we have started to use queues, that problem doesn’t exist anymore. Once a system has completed its part it notifies others. It doesn’t have to care what happens next.

Hidden Dependencies

We still have hidden dependencies but they are a bit different now. Our design for the user story looked like this:

soa_get_payments2

In the third part we came to the following conclusion about the hidden dependency:

“…in the scope of the user story the Bank Services actually depends on Invoicing because if the invoicing service is down Bank Services will receive an error and the processing of payment fails.”

Now that we are using queues we don’t have that problem anymore. Of course the whole user story cannot complete if Invoicing is down but we have the following benefits thanks to queues.

  • Other systems can complete their part even if Invoicing is down.
  • Other systems are not affected if Invoicing takes a lot longer to do its job, since they are no longer waiting for it.

Availability

Since there are no direct calls between our systems we don’t have to have 100% availability. If Invoicing is down, balances are still updated and the “Balance Increased” events are stored in the queue. As far as “Customer & Account” or “Bank Services” know, everything is working just fine. They are not affected by the downtime of Invoicing.

Error Handling

Remember how in the previous solution we needed to implement error handling manually, e.g. retry the request three times? Now the queue is actually part of our infrastructure. We can use e.g. mirroring so that in case of a hardware failure our system just keeps on running. If we are running the system in the cloud (Amazon, Azure) we might still have to do some manual error handling (“retry request n times”), but cloud providers have pretty good infrastructure so you can be quite sure that the queue just works.

Now that the different systems are clearly separated it is a lot easier to handle possible errors. We only need to care about a single system.

bugs_in_single_system

If there is a bug in Invoicing it is limited to that system. We don’t make any REST calls so there is no need to handle possible errors that happen in other systems. Easier to debug, fix and maintain.

Long-Term Errors

As long as there is free disk space the queue will store the messages until we are ready to consume them. In case of a long-term error there might be tens of thousands of messages in the queue. Once the problem has been solved we can temporarily increase the number of consumers in order to handle the load as fast as possible.

multiple_invoicing_processes

Solving Problems in Production

It is a lot easier to solve problems when one system does one thing and you don’t have to think about scenarios like “When payment is received we call system B which then calls system C and if the original input was this then this fails because…”.

When Invoicing does not work we can be pretty sure the problem is actually in that service. Also, an error is reported inside a single system. Typically with SOA-based systems each system reports the failure in its own log, since each logs the responses it gets from other systems.

Take it Down

Since we are using queues it is a lot easier to “mess with” the systems that are in production. We can do e.g. the following to a single system:

  • Take it down for hours/days
  • Make changes to it (hot fixes, configuration changes)
  • Connect to it remotely, attach debuggers etc. without affecting other systems

Challenges

It seems we have fixed most of the problems of the original architecture but our solution is not all about sunshine and unicorns. There are some challenges.

Poison messages

When messages are stored in a queue it is very likely that you will encounter poison messages:

A poison message is a message in a queue that has exceeded the maximum number of delivery attempts to the receiving application. This situation can arise, for example, when an application reads a message from a queue as part of a transaction, but cannot process the message immediately because of errors.

If the problem with the message is not corrected, the receiving application can get stuck in an infinite loop, starting transactions to receive the message and then aborting them.

To handle those messages make sure you have the means to do the following:

  • Move poison messages into another queue (retry queue, dead letter queue)
  • Analyze poison messages to see what is wrong (corrupted messages, missing data etc.)
  • Fix the messages and resend them
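
The first of these points can be sketched roughly as follows. This is an in-memory illustration only: QueuedMessage and PoisonAwareConsumer are hypothetical names, and real queue products usually track delivery attempts and dead-letter queues for you.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message envelope that counts delivery attempts.
public sealed class QueuedMessage
{
    public QueuedMessage(string body) { this.Body = body; }

    public string Body { get; private set; }

    public int DeliveryAttempts { get; set; }
}

public sealed class PoisonAwareConsumer
{
    private readonly int maxAttempts;
    private readonly Action<string> handler;
    private readonly Queue<QueuedMessage> deadLetters = new Queue<QueuedMessage>();

    public PoisonAwareConsumer(int maxAttempts, Action<string> handler)
    {
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    // Messages that failed maxAttempts times end up here for later analysis.
    public Queue<QueuedMessage> DeadLetters { get { return this.deadLetters; } }

    public void Drain(Queue<QueuedMessage> queue)
    {
        while (queue.Count > 0)
        {
            var message = queue.Dequeue();
            message.DeliveryAttempts++;
            try
            {
                this.handler(message.Body);
            }
            catch (Exception)
            {
                if (message.DeliveryAttempts >= this.maxAttempts)
                {
                    // Give up: park the message instead of retrying forever.
                    this.deadLetters.Enqueue(message);
                }
                else
                {
                    // Retry later; in this sketch the message goes to the back.
                    queue.Enqueue(message);
                }
            }
        }
    }
}
```

Note that putting a failed message at the back of the queue changes the processing order. Whether that is acceptable depends on how strictly the consumer relies on FIFO.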

Different types of messages in single queue

Once you start using queues it is common to use a single queue to handle different types of messages. There isn’t anything wrong with that, but you have to remember at least one possible downside. Assume the consumer of the messages has been unavailable and you have tens of thousands of messages in the queue. When the consumer comes back up it starts to process those messages. Now, if you are using the same queue to handle messages with different priorities, the following happens.

important_messages_blocked

As you can see, your important messages are stuck in the middle of thousands of less important messages. Some queue solutions support a hybrid FIFO/priority queue, which might be handy in situations like these, or you might just have different queues for different messages. AMQP (Advanced Message Queuing Protocol) also supports priorities. Just make sure that if your queue solution supports AMQP it also implements support for priorities.
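
The “different queues for different messages” option can be sketched like this. TwoLaneQueue is a hypothetical in-memory type made up for this example; with a real queue product you would typically create two named queues and let the consumer poll the important one first.

```csharp
using System.Collections.Generic;

// Two FIFO lanes: the consumer always empties the important lane first.
public sealed class TwoLaneQueue<T>
{
    private readonly Queue<T> important = new Queue<T>();
    private readonly Queue<T> normal = new Queue<T>();

    public int Count
    {
        get { return this.important.Count + this.normal.Count; }
    }

    public void Enqueue(T message, bool isImportant)
    {
        (isImportant ? this.important : this.normal).Enqueue(message);
    }

    // Throws InvalidOperationException when both lanes are empty,
    // just like Queue<T>.Dequeue does.
    public T Dequeue()
    {
        return this.important.Count > 0
            ? this.important.Dequeue()
            : this.normal.Dequeue();
    }
}
```

Order is still first in, first out inside each lane, so an important message only jumps ahead of the less important backlog and not ahead of other important messages.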

Inconsistent State

Since you don’t have direct REST calls between the systems you must plan for “inconsistent state”. In our example it can mean e.g. the following.

  • Even if the payment has been processed by “Bank Services” the balance of the account has not been updated yet.
  • Even if the account balance has been updated the invoices have not been processed yet.

In many cases this means that you need to store some kind of timestamp for different operations because at least the help desk must be able to tell the customer what has happened and when. In case part of the system is down they can tell the customer “We have received your payment but your balance was last updated last Friday so the payment has not been processed yet.”

Overusing Queues

I think this is perhaps the biggest problem. Now that we have our hammer, every problem looks like a nail. Understand why and when a queue is a good solution and do not try to force every user story and system to use it. There are a lot of common examples of situations where using a queue (or service bus) is not a good idea, but they really deserve a blog post of their own.

Final Words

This is the last part of the series. I think we managed to take a pretty good look into the challenges of the typical “JSON over HTTP” / SOA architecture and we found a solution that fixes most of the problems.

Beware of JSON over HTTP architecture Part 6

by toni 8. November 2012 20:09

In the previous part we tried to solve our problems using “Service Router”. That didn’t end well so it is time to try something else.

High Availability

In the world of high availability terms like “five nines” are used to describe how much downtime the system can have e.g. per year.

Availability               Downtime per year
99% (“two nines”)          3.65 days
99.9% (“three nines”)      8.76 hours
99.99% (“four nines”)      52.56 minutes
99.999% (“five nines”)     5.26 minutes

To put this in simple terms, if someone says “Our system has 99% availability” it means it can be down about 3.65 days/year. You might wonder “What does this have to do with our system?”. Actually a lot. Let’s look again at our original user story.

soa_get_payments2

Steps are

  1. Connect to the bank and get the payment information.
  2. Match the payment to customer/account and increase the limit accordingly.
  3. Mark the invoice as paid.

If you think about availability requirements for this user story we could describe them as follows:

All the systems that take part in this user story must be up and running. If any of the systems is down or fails to complete its part, we cannot process any incoming payments, balances are not updated and invoices are not processed.

If you look at the SLA (Service Level Agreement) of Amazon S3, Amazon EC2 or Windows Azure you can see that they talk about uptime of ~99.9%. Even if you are using something like an HP NonStop server, it seems like a bad idea to design a system where we have such high availability requirements for a simple user story.
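
To get a feel for the numbers, here is a small calculation sketch. The Availability class is made up for this example, and the Combined method assumes that the systems fail independently of each other, which real systems only approximate.

```csharp
using System;
using System.Linq;

public static class Availability
{
    private const double HoursPerYear = 365 * 24;

    // Expected downtime per year, in hours, for an availability such as 0.999.
    public static double DowntimeHoursPerYear(double availability)
    {
        return (1.0 - availability) * HoursPerYear;
    }

    // Availability of a chain where every system must be up at the same time.
    // Assumes independent failures (a simplification).
    public static double Combined(params double[] availabilities)
    {
        return availabilities.Aggregate(1.0, (total, a) => total * a);
    }
}
```

For example, Availability.Combined(0.999, 0.999, 0.999) is roughly 0.997, so three “three nines” systems that all have to be up for the user story would together be down for a bit more than a day per year instead of 8.76 hours.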

Baby Steps

From business perspective we want to have our user story but obviously there is no way to have 100% uptime. So what to do? As it happens the answer is right in front of us. I have described the steps of the user story so many times:

  1. Connect to the bank and get the payment information.
  2. Match the payment to customer/account and increase the limit accordingly.
  3. Mark the invoice as paid.

Even though these steps are listed as a sequence of operations there really isn’t any requirement to execute them immediately one by one. So instead of doing this:

soa_sequence_diagram

We would do this:

soa_event_driven

This means that once “Bank Services” gets the payments it would somehow notify “Customer & Account” instead of calling it directly. Once “Customer & Account” has updated the balance it would notify Invoicing, which would execute the last part of the user story.

From Direct Calls To Events

Before looking at how to implement the event (notification) between different systems we should list some requirements for the solution:

  • FIFO – First In First Out. If we send notifications A, B, C, in most cases we want them to be received in the same order. Even if it doesn’t matter in which order the notifications are received, debugging is a lot easier when you can be sure of the order.
  • Reliability: We don’t want to lose notifications or get duplicates. If the receiver is not up, notifications should be stored.
  • Guarantee that (if wanted) each notification is only received and handled once. E.g. if we are going to update the balance of account based on notification we want to be sure it is only done once.
  • Ability to filter out corrupted notifications (poison messages) into a “dead letter” queue so that they don’t halt the processing of following notifications.

There are many other requirements (e.g. routing, high availability, management and tracing) we could list but they are not so interesting in the scope of this post and our system.
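
The “handled only once” requirement is often backed up on the consumer side as well. The sketch below shows one way to do that: remember the ids of the notifications that have already been applied and ignore repeats. BalanceUpdater and the notification id are assumptions made for this example, and a real implementation would keep the handled ids in durable storage rather than in memory.

```csharp
using System;
using System.Collections.Generic;

public sealed class BalanceUpdater
{
    // Ids of notifications that have already changed the balance.
    private readonly HashSet<string> handledIds = new HashSet<string>();

    public decimal Balance { get; private set; }

    // Returns true when the notification changed the balance,
    // false when it was a duplicate and was ignored.
    public bool Handle(string notificationId, decimal amount)
    {
        // HashSet.Add returns false when the id is already present.
        if (!this.handledIds.Add(notificationId))
        {
            return false;
        }

        this.Balance += amount;
        return true;
    }
}
```

With a guard like this an accidental redelivery of the same notification does not increase the balance twice.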

One solution that seems to solve all our problems is some kind of queue. By queue I mean a FIFO queue where a publisher can write the event and a consumer can read it.

queue_publish_consume

There are many products that have all the features we need: RabbitMQ, NServiceBus, MSMQ, Windows Azure Service Bus, Windows Azure Queue, Amazon SQS etc. I’m not going to talk about a specific product. For the sake of this post it is enough that we have a piece of infrastructure that can be used to deliver messages between two systems. Let’s replace the direct calls with queues and see how our original user story looks.

queue_solution

Steps are

  1. Connect to the bank and get the payment information. This hasn’t changed at all since the bank only offers a REST interface.
  2. Notify others with “Payment Received” event.
  3. Get the event, match the payment to customer/account and increase the limit accordingly.
  4. Notify others with “Balance Increased” event.
  5. Get the event and mark the invoice as paid.

Now this looks a lot better. There are no direct calls between our systems and queues take care of storing the messages (events). Notice how I wrote “Notify others” since the publisher of the event doesn’t really care what happens next. As far as it is concerned it has done its job.

Instead of accepting this as the solution we must dive into our original problems and see whether it actually fixed them. That is something we are going to do in the next part.
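
The five steps can be sketched with two in-memory queues. Everything in this example (PaymentFlow, the event classes, the lookup from reference number to account) is a hypothetical illustration; a real system would use one of the queue products listed above and each method would live in a different service.

```csharp
using System.Collections.Generic;

public sealed class PaymentReceived
{
    public long ReferenceNumber { get; set; }
    public decimal Amount { get; set; }
}

public sealed class BalanceIncreased
{
    public int AccountNumber { get; set; }
    public long ReferenceNumber { get; set; }
}

public sealed class PaymentFlow
{
    // In-memory stand-ins for the two queues in the figure.
    public readonly Queue<PaymentReceived> PaymentEvents = new Queue<PaymentReceived>();
    public readonly Queue<BalanceIncreased> BalanceEvents = new Queue<BalanceIncreased>();

    public readonly Dictionary<int, decimal> Balances = new Dictionary<int, decimal>();
    public readonly List<long> PaidReferences = new List<long>();

    // Steps 1-2: "Bank Services" has fetched a payment and publishes the event.
    public void BankServicesPublish(long referenceNumber, decimal amount)
    {
        this.PaymentEvents.Enqueue(
            new PaymentReceived { ReferenceNumber = referenceNumber, Amount = amount });
    }

    // Steps 3-4: "Customer & Account" consumes payments, updates the balance
    // and publishes a "Balance Increased" event for each of them.
    public void CustomerAndAccountRun(IDictionary<long, int> accountByReference)
    {
        while (this.PaymentEvents.Count > 0)
        {
            var payment = this.PaymentEvents.Dequeue();
            var account = accountByReference[payment.ReferenceNumber];

            decimal current;
            this.Balances.TryGetValue(account, out current); // stays 0 for a new account
            this.Balances[account] = current + payment.Amount;

            this.BalanceEvents.Enqueue(
                new BalanceIncreased { AccountNumber = account, ReferenceNumber = payment.ReferenceNumber });
        }
    }

    // Step 5: Invoicing consumes balance events whenever it is running.
    public void InvoicingRun()
    {
        while (this.BalanceEvents.Count > 0)
        {
            this.PaidReferences.Add(this.BalanceEvents.Dequeue().ReferenceNumber);
        }
    }
}
```

If InvoicingRun is simply not called for a while, the balance is still updated and the “Balance Increased” event waits in the queue until Invoicing is back.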

Beware of JSON over HTTP architecture Part 5

by toni 8. November 2012 07:29

In previous posts I have described some of the challenges that you might encounter when working with a solution built using the “JSON over HTTP” Service Oriented Architecture pattern. In this part we are going to look at one common solution which tries to fix the problems.

Fixing Dependencies

Let’s look at the original figure that shows our imaginary system and all the dependencies. Remember, each colored line shows a different user story.

soa_lot_of_connections

When looking at this figure it is easy to say “We have way too many dependencies”. If you don’t analyze why too many dependencies is a problem, the common solution is just to remove all of them. This can be done by introducing another system into the architecture. This system is often called a Proxy or a (Service) Router. You could compare this to the Mediator or Facade pattern in software development, or perhaps to a reverse proxy in web architecture.

service_router_introduced

As you can see our new Service Router has just removed all dependencies between the systems. Since it is hard to look at an imaginary figure of the architecture, let’s see what our original user story would look like when using Service Router.

First we have the original user story from part two.

soa_get_payments2

And now the same thing using “Service Router”. Note the “double” numbering of the steps. There are now two calls per step because everything goes through the “Service Router”.

user_story_with_service_router

As you can see dependencies are (at least on paper) gone but what does that actually mean? Let’s look at the problems listed in previous parts and see which of those we have fixed.

Every System Knows Every Other System

We could argue that this has been fixed. After all, there is no longer a direct connection between e.g. “Bank Services” and “Customer & Account”, but the reality is that all that has changed is the address where we send the request.

Before HTTP POST account.com/account/123/deposit
After HTTP POST servicerouter.com/account/123/deposit

If we traced all our user stories by drawing lines from system to system, we would see that all that has changed is that every request now goes through the “Service Router”.

Dependencies

Again, even though on paper we removed all the dependencies, they still exist. Remember in part three we wrote down the “definition of a dependency”:

  • There is a dependency if a system cannot complete the user story without calling another system.
  • There is a dependency if a system cannot complete the request without calling another system.

The “Bank Services” still needs to call “Customer & Account”. The fact that it does it by using “Service Router” doesn’t change anything.

Hidden Dependencies

In part three we also wrote down the definition for hidden dependency:

There is a hidden dependency between systems if taking system A offline affects system B even though there is no direct connection between them.

Same thing here. Remember the hidden dependency between “Bank Services” and “Invoicing”. It is still there. If you take down Invoicing you will see an error in “Bank Services” no matter how many “Service Routers” you put between them.

Big Ball of Mud

As you can see the figure looks so much nicer without all the lines. Each system only touches the new “Service Router” because that is the only visible dependency.

soa_service_router_dependencies

I think this is even worse now. We have introduced yet another system into our architecture and now every single system depends on that. In addition to that, it looks like there are no dependencies between the “blue parts” even though in reality the dependencies / hidden dependencies are still in place.

Error Handling

Instead of having system-specific error handling in every system we could put it into “Service Router”. It would take care of the retry logic etc. Now there are only a few issues with that:

  • We still need system specific error handling because the “Service Router” might be down and we need to resend the request.
  • The actual error handling logic is just code. We can put it into a shared library and every system can use it. There is no need to copy-paste code. If you are using something like HttpClient you can implement the whole logic pretty nicely.
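
As a rough idea of what such shared code could look like, here is a minimal retry helper. The Retry class and its parameters are assumptions made for this example; it is not tied to HttpClient and it retries on any exception, which a real library would probably narrow down.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs the operation up to maxAttempts times and waits between attempts.
    // The exception from the last attempt is passed on to the caller.
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException("maxAttempts");
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Not the last attempt yet: wait a moment and try again.
                Thread.Sleep(delay);
            }
        }
    }
}
```

A caller wraps the actual request in a lambda, e.g. Retry.Execute(() => SendDeposit(request), 3, TimeSpan.FromSeconds(1)), where SendDeposit is whatever hypothetical method performs the HTTP call.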

Long-Term Errors

The “Service Router” could store the requests (local storage) and we could use some other tool/scheduler to resend them. This would work just fine if you only need to store the actual HTTP request and nothing else.

service_router_save_for_retry

With this kind of solution there is no need to have system-specific storage for the failed requests. It is not without challenges, though.

Return Codes

When “Customer & Account” service is down what should the “Service Router” return to Streaming? Since the (long-term) retry logic is now baked into “Service Router” this is not an error case. One way is to “lie” and just return the same HTTP 200 or we could use HTTP 202:

The request has been accepted for processing, but the processing has not been completed.  The request might or might not eventually be acted upon, as it might be disallowed when processing actually takes place.

In case you need/want to change the actual return codes beware of the leaky abstraction that we talk about next.

Leaky Abstraction

The leaky abstraction is defined as follows:

A leaky abstraction is any implemented abstraction, intended to reduce (or hide) complexity, where the underlying details are not completely hidden. The term is most frequently used to call attention to a flaw in a software or hardware abstraction.

I would say there is a high probability that the details of “Service Router” will quickly leak into other systems. One good example is the retry logic and HTTP codes. At some point there will be a piece of code in the Streaming service that looks like this:

var response = httpClient.Send(request);
if (HttpStatusCode.Accepted == response.StatusCode)
{
    // HACK!
    // The Service Router accepted the request but
    // it was not executed. The "Customer & Account"
    // is probably down so we need to...
    SomeMethodWeCall(Status.CustomerAndAccountIsDown);
}

God Object

The God object is defined as follows:

In object-oriented programming, a god object is an object that knows too much or does too much. The god object is an example of an anti-pattern.

Even though we are not talking about OOP the same applies to “Service Router”. Since it already knows each and every system there is a high probability that some day the logic inside it will contain exceptions based on systems calling it or systems it is calling.

Solving Problems in Production

At first it might sound like this would be a lot simpler with “Service Router” since we have a single place which contains:

  • Logs for all the requests
  • Logs for all the responses
  • Logs (and implementation) of retry logic

In many cases those will certainly help you understand why something failed but they are missing one key piece of information: The actual business logic that happens inside each service. You get the raw HTTP requests and responses but there is no context.

Let’s look at the original user story implemented with “Service Router”.

user_story_with_service_router

Steps are

  1. Connect to the bank and to get the payment information.
  2. Match the payment to customer/account and increase the limit accordingly.
  3. Mark the invoice as paid.

Let’s look at the log files of raw HTTP requests

HTTP GET bank.com/payments/ Successful with body:
{
    "ReferenceNumber": 940403940,
    "Amount": 100.00
}
Received HTTP 200 from bank.com/payments/

HTTP POST from bankservices.com to account.com/payments
{
    "ReferenceNumber": 940403940,
    "Amount": 100.00
}

HTTP PUT from account.com to invoicing.com/account/39302/payments
{
    "ReferenceNumber": 940403940,
    "Amount": 100.00
}

Received HTTP 200 from invoicing.com/account/39302/payments
Received HTTP 200 from account.com/payments

As you can see we are logging every request/response in the “Service Router”. Now let’s compare this logging to traditional service specific log files.

Payment ($100.00) with reference number 940403940 received from bank.
Sending deposit request to Account service (Ref #940403940, $100.00).

Received payment (Ref #940403940, Amount $100.00)
Account #39302 uses reference number 940403940.
Account #39302 current balance is $47.50. Adding $100.00 to it.
Notifying Invoicing of account update (Account #39302).

Account #39302 has received payment. Looking for invoices.
Found invoice #203093000 and marking it as paid.

It is quite obvious that solving problems in production is pretty hard with just the logs “Service Router” provides. Raw HTTP logs can help you trace other things like performance problems or configuration issues but they don’t help you track down business logic related problems.

Take it Down

How easy is it to mess with a single system now that we have “Service Router”? As you may remember from the previous part, “mess with” means

  • Take the system down since other systems don’t depend on it
  • Connect to it remotely, attach debugger etc. without much affecting the other parts of the system or the system as a whole
  • Bring the system back up and expect things just work after that

The “Service Router” doesn’t really help us. Sure we can use it to return something like HTTP 503 Service Unavailable but then we would need to implement additional logic into the calling service to gracefully handle it. That is what the “Service Router” should have done for us.

A fix that doesn’t really fix anything

When I started to write this part my initial thought was “This Service Router is going to be a disaster”. I was really surprised to find out that in some cases (handling long-term errors) it can provide functionality that might actually work and be useful. Sure, it is just one case and not without its own challenges, but still it was nice to find out that the whole solution is not as bad as I thought it would be.

The fact that “Service Router” doesn’t fix most of the initial challenges we had with Service Oriented Architecture (and sometimes makes them even worse) means there will be at least one more part in the series.