In the previous part we came up with a solution to our problem: Queues.
- Connect to the bank and get the payment information.
- Notify others with “Payment Received” event.
- Get the event, match the payment to customer/account and increase the limit accordingly.
- Notify others with “Balance Increased” event.
- Get the event and mark the invoice as paid.
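The event flow above can be sketched with in-process queues standing in for a real message broker (the queue names and handler functions here are illustrative, not from the original design):

```python
import queue

# In-process stand-ins for the broker queues; a real system would use
# something like RabbitMQ or Azure Service Bus instead.
payment_received = queue.Queue()
balance_increased = queue.Queue()

def bank_services(payment):
    # Fetch the payment information, then publish "Payment Received".
    payment_received.put({"event": "PaymentReceived", "payment": payment})

def customer_account():
    # Consume the event, match the payment to an account, increase the
    # limit, then publish "Balance Increased".
    event = payment_received.get()
    balance_increased.put({"event": "BalanceIncreased",
                           "account": event["payment"]["account"]})

def invoicing(invoices):
    # Consume the event and mark the invoice as paid.
    event = balance_increased.get()
    invoices[event["account"]] = "paid"

invoices = {"acc-1": "open"}
bank_services({"account": "acc-1", "amount": 100})
customer_account()
invoicing(invoices)
print(invoices["acc-1"])  # "paid"
```

Note that no function calls another directly; each one only reads from and writes to a queue, which is the decoupling the rest of this part builds on.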
The new solution looks a lot better, but what about the problems we had? Did it fix any of those?
Solving Original Problems
In our original SOA-based architecture we found out that sooner or later every system knows every other system, because we need to call them directly. Now that we have started to use queues, that problem doesn’t exist anymore. Once a system has completed its part, it notifies the others. It doesn’t have to care what happens next.
We still have hidden dependencies, but they are a bit different now. Our design for the user story looked like this:
In the third part we came to the following conclusion about the hidden dependency:
“…in the scope of the user story the Bank Services actually depends on Invoicing because if the invoicing service is down Bank Services will receive an error and the processing of payment fails.”
Now that we are using queues we don’t have that problem anymore. Of course the whole user story cannot be completed if Invoicing is down, but thanks to queues we get the following benefits:
- Other systems can complete their part even if Invoicing is down.
- Other systems are not affected if Invoicing takes a lot longer to do its job, since they are no longer waiting for it.
Since there are no direct calls between our systems, we don’t need 100% availability. If Invoicing is down, balances are still updated and the “Balance Increased” events are stored in the queue. As far as “Customer & Account” or “Bank Services” know, everything is working just fine. They are not affected by the downtime of Invoicing.
Remember how in the previous solution we needed to implement error handling manually, e.g. retry the request three times? Now the queue is actually part of our infrastructure. We can set up e.g. mirroring, so in case of a hardware failure our system just keeps on running. If we are running the system in the cloud (Amazon, Azure) we might still have to do some manual error handling (“retry the request n times”), but cloud providers have pretty good infrastructure, so you can be quite sure the queue just works.
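As a rough illustration of that remaining manual error handling, a retry helper might look like this (the `with_retry` name and the flaky operation are hypothetical):

```python
import time

def with_retry(operation, attempts=3, delay=0.1):
    # Hypothetical helper: retry a transient failure a few times
    # before giving up and surfacing the error to the caller.
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay)

calls = []
def flaky_send():
    # Simulated queue operation that fails twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient broker hiccup")
    return "sent"

result = with_retry(flaky_send)
print(result)  # "sent" after two transient failures
```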
Now that the different systems are clearly separated, it is a lot easier to handle possible errors. We only need to care about a single system.
If there is a bug in Invoicing, it is limited to that system. We don’t make any REST calls, so there is no need to handle errors that happen in other systems. This makes it easier to debug, fix and maintain.
As long as you have free disk space, the queue will store the messages until we are ready to consume them. In case of a long-term error there might be tens of thousands of messages in the queue. Once the problem has been solved, we can temporarily increase the number of consumers to handle the load as fast as possible.
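Scaling out consumers to drain a backlog can be sketched like this (an in-process queue and a thread count chosen purely for illustration):

```python
import queue
import threading

# Simulate a backlog that piled up while the consumer was down.
backlog = queue.Queue()
for i in range(10_000):
    backlog.put({"event": "BalanceIncreased", "id": i})

processed = []
lock = threading.Lock()

def consumer():
    # Drain messages until the queue is empty, then exit.
    while True:
        try:
            msg = backlog.get_nowait()
        except queue.Empty:
            return
        with lock:
            processed.append(msg["id"])

# Temporarily run several consumers instead of one to clear the backlog.
workers = [threading.Thread(target=consumer) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(processed))  # 10000
```

Once the backlog is cleared, the extra consumers can simply be shut down again.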
Solving Problems in Production
It is a lot easier to solve problems when one system does one thing and you don’t have to think about scenarios like “When a payment is received we call system B, which then calls system C, and if the original input was this then this fails because…”.
When Invoicing does not work, we can be pretty sure the problem is actually in that service. Also, in case of an error it is reported inside a single system. Typically with SOA-based systems each system reports the failure in its own log, since they log the responses they get from other systems.
Take it Down
Since we are using queues, it is a lot easier to “mess with” the systems that are in production. We can do e.g. the following to a single system:
- Take it down for hours/days
- Make changes to it (hot fixes, configuration changes)
- Connect to it remotely, attach debuggers etc. without affecting other systems
It seems we have fixed most of the problems of the original architecture, but our solution is not all sunshine and unicorns. There are some challenges.
When messages are stored in a queue, it is very likely that you will encounter poison messages:
A poison message is a message in a queue that has exceeded the maximum number of delivery attempts to the receiving application. This situation can arise, for example, when an application reads a message from a queue as part of a transaction, but cannot process the message immediately because of errors.
If the problem with the message is not corrected, the receiving application can get stuck in an infinite loop, starting transactions to receive the message and then aborting them.
To handle those messages, make sure you have the means to do the following:
- Move poison messages into another queue (retry queue, dead letter queue)
- Analyze poison messages to see what is wrong (corrupted messages, missing data etc.)
- Fix the messages and resend them.
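A minimal sketch of that pattern, assuming a simple delivery-attempt counter carried on the message (the field names and `MAX_ATTEMPTS` limit are made up for the example):

```python
import queue

MAX_ATTEMPTS = 3
work = queue.Queue()
dead_letter = queue.Queue()

def process(msg):
    # Stand-in business logic: fails on corrupted messages, which here
    # means a missing "amount" field.
    if "amount" not in msg:
        raise ValueError("corrupted message")

def consume(msg):
    msg["attempts"] = msg.get("attempts", 0) + 1
    try:
        process(msg)
    except ValueError:
        if msg["attempts"] >= MAX_ATTEMPTS:
            dead_letter.put(msg)  # park the poison message for analysis
        else:
            work.put(msg)         # requeue for another attempt

work.put({"id": 1, "amount": 100})
work.put({"id": 2})               # poison: no amount field
while not work.empty():
    consume(work.get())

print(dead_letter.qsize())  # 1
```

Real brokers such as RabbitMQ offer dead letter queues as a built-in feature, but the attempt-count-then-park logic is the same idea.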
Different types of messages in single queue
Once you start using queues, it is common to use a single queue to handle different types of messages. There isn’t anything wrong with that, but you have to remember at least one possible downside. Assume the consumer of the messages has been unavailable and you have tens of thousands of messages in the queue. When the consumer comes back up, it starts to process those messages. Now if you are using the same queue to handle messages with different priorities, the following happens.
As you can see, your important messages are stuck in the middle of thousands of less important messages. Some queue solutions support a hybrid FIFO/priority queue, which might be handy in situations like these, or you might just have different queues for different messages. AMQP (Advanced Message Queuing Protocol) also supports priorities; just make sure that if your queue solution supports AMQP, it also implements support for priorities.
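For example, with a priority queue the important message jumps ahead of a backlog of routine ones (sketched here with Python’s in-process `queue.PriorityQueue`, where a lower number means higher priority; the message names are illustrative):

```python
import queue

# PriorityQueue pops the entry with the lowest priority number first.
q = queue.PriorityQueue()

# A backlog of low-priority messages arrives first...
for i in range(5):
    q.put((10, f"routine-{i}"))

# ...then one important message. In a plain FIFO queue it would wait
# behind the backlog; with priorities it jumps ahead.
q.put((1, "payment-received"))

first = q.get()
print(first[1])  # "payment-received"
```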
Since you don’t have direct REST calls between the systems, you must plan for “inconsistent state”. In our example this can mean e.g. the following:
- Even if the payment has been processed by “Bank Services” the balance of the account has not been updated yet.
- Even if the account balance has been updated, the invoices have not been processed yet.
In many cases this means that you need to store some kind of timestamp for the different operations, because at the very least the help desk must be able to tell the customer what has happened and when. If part of the system is down, they can tell the customer: “We have received your payment, but your balance was last updated last Friday, so the payment has not been processed yet.”.
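A sketch of what that record keeping could look like (the field names and status texts are illustrative, not from the original design):

```python
from datetime import datetime, timezone

# Hypothetical record keeping one timestamp per processing step, so
# support staff can see how far a payment has progressed.
payment_status = {
    "payment_received_at": datetime(2016, 5, 2, 10, 0, tzinfo=timezone.utc),
    "balance_updated_at": None,  # Customer & Account has not run yet
    "invoice_paid_at": None,     # Invoicing has not run yet
}

def describe(status):
    # Report progress based on which steps have a timestamp.
    if status["balance_updated_at"] is None:
        return "Payment received, balance not updated yet"
    if status["invoice_paid_at"] is None:
        return "Balance updated, invoice not processed yet"
    return "Fully processed"

print(describe(payment_status))  # "Payment received, balance not updated yet"
```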
I think this is perhaps the biggest problem. Now that we have our hammer, every problem looks like a nail. Understand why and when a queue is a good solution, and do not try to force every user story and system to use it. There are lots of common examples of situations where using a queue (or a service bus) is not a good idea, but they really deserve a blog post of their own.
This is the last part of the series. I think we managed to take a pretty good look at the challenges of the typical “JSON over HTTP” / SOA architecture, and we found a solution that fixes most of the problems.