You have to deal with the consequences of microservices in your monolith

I strongly disagree with implementing a Microservices architecture for the sake of implementing a Microservices architecture. When I say "for the sake", I mean many different cases:

The team is small (fewer than 20 developers) - just use a monolith and don't waste time dealing with synchronization.
We want to enforce proper modularization - just do proper code reviews.
We need it for scaling, even though we currently have around 20 clients - yup, do it after you have maxed out the biggest server your cloud provider offers.
The investors demand it - just tell them you are doing microservices, they won't check it. They still don't know why they want it.
Microservices will look great on my resume - just put it on your resume without doing it, no one seems to care if you know how to do it right.

But there are situations where some of the complexity of the Microservices architecture is thrown in your direction, even if you try to evade it. The most common one is that you most likely use some kind of third-party service. It may be the credit card payment provider, or the inventory provider, or the credit scoring system, or any other of the million third-party services out there.

A third-party service that requires synchronization

If this third-party service is something from which you only read, you may be able to still skip the extra complexity. Just read from it, and when the service is unavailable, either retry it or forward the failure back to whoever asked for the data.
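For the read-only case, "retry it or forward the failure" can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `ServiceUnavailable`, `read_with_retry` and its parameters are hypothetical names made up for this example, and the doubling delay is just one common choice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical error raised when the third-party service cannot be reached."""


def read_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` up to `attempts` times, doubling the delay between tries.

    Retrying is safe here only because a read has no side effects.
    If every attempt fails, the last error is forwarded to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: forward the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The same blind retry would be wrong for a write such as a credit card charge, which is exactly where the trouble starts.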

The problem arises when you need some kind of synchronization. You must write to it only once. For example, you need to charge your client's credit card. But you need to charge it just once, because no one likes paying twice for the same service. And you also need to check that the charge succeeded, in order to start providing the service, or start shipping merchandise.

This inevitably forces a synchronization between your application and the payment gateway. You would love to have transaction semantics there, but the payment gateway guys just refuse to use your database as their main database. That was kinda expected. Still, that doesn't change the fact that you now have to deal with potentially failed transactions. Let's see how this would work:

Your system calls the payment gateway, and tells it to charge $XXX to the client.
Your system hands over control to the payment gateway.
The payment gateway does its thing with the client, charging $XXX.
The payment gateway returns control to your system, telling you the payment succeeded.
Your system starts to provide the service paid for.

So far so good. But what if something fails in the middle? Being more explicit: what if communication fails in the middle? There are way too many cases, let's go over them.

Case 1: your system fails to call the payment gateway to tell it to charge the client

Ok, this is one of the easy cases. The payment provider is down, or the connection to it fails. Your system is unable to call it, so the charge is never initiated, and your system never hears from the payment provider that the charge has been initiated. Easy. We just show the "please come back" error, and everything is in proper sync. On your system, the client has not been charged; and on the payment provider, no charge exists for the client. Furthermore, no charges are shown on the credit card of the client. This is a happy sad path.

Case 2: your system fails to hand control to the payment gateway after telling it to charge the client

The payment gateway knows it has to charge the client, but the actual charge was never initiated. Something just failed when the client was being sent to the payment provider. Your system has a record of the payment being initiated, and so does the payment gateway. But it never succeeded. The client's credit card was not charged, so we are still somewhat well off. The synchronization failures start to show up: your system says "charge initiated", the payment gateway says "charge initiated", but the client never saw the payment gateway, and may not understand that they have to do something, such as completing or rejecting the payment. Your system also doesn't know that the client will never complete their part of the deal: approving or rejecting the payment. So it may just wait forever for a confirmation that will never arrive. Depending on how you implement your system, this may be a happy sad path, or just a sad sad path.

Bear in mind that this step says "your system fails to hand control to the payment gateway", but there are many other situations that are identical to this one, such as:

After seeing the payment gateway page, the client closes the browser tab and goes to do something different.
While loading the payment gateway page, the client's mobile loses signal and is unable to load the payment gateway page.
The client is a shop, and is doing credit card payments with a tablet, and the Wi-Fi is lousy sometimes, and terrible other times. They keep losing connection all the time, and all they tell you is that your system doesn't work.
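One way to avoid waiting forever is to give every initiated payment a deadline. The sketch below is only an illustration under assumptions: the `Payment` record, its status strings and the 30-minute default are invented for this example, and a real system would keep these rows in its database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Payment:
    payment_id: str
    status: str            # "initiated", "paid", "rejected" or "expired"
    initiated_at: datetime


def expire_stale_payments(
    payments: Iterable[Payment],
    now: datetime,
    max_wait: timedelta = timedelta(minutes=30),
) -> List[Payment]:
    """Mark payments stuck in "initiated" for longer than `max_wait` as "expired".

    Returns the payments that were expired, so the caller can notify the
    client or cancel the pending charge on the gateway side.
    """
    expired = []
    for payment in payments:
        if payment.status == "initiated" and now - payment.initiated_at > max_wait:
            payment.status = "expired"
            expired.append(payment)
    return expired
```

Run periodically, this turns "wait forever" into a known state. From your side, a payment that was completed but never confirmed looks exactly like an abandoned one, so before expiring it is worth asking the gateway what really happened, if it offers a way to ask.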

Case 3: the payment gateway fails while charging the client

The client entered their credit card details, clicked "Pay now!", and then something went wrong. Maybe they were charged, maybe they weren't. This is bad, as the payment gateway may not know if it succeeded or not, and this means your system also doesn't know if it succeeded or not. But, at least, this happened completely within the payment provider, so, in some sense, it's their problem.

Case 4: the payment gateway fails to return to your system to report that the payment succeeded

Now we start to go into nasty territory. So your system told the payment gateway to charge the client, the payment gateway charged the client, but now your system doesn't know about the charge. There are multiple options here:

Your system retries asking the payment gateway to charge the client. Client gets charged twice. Client becomes angry, issues chargeback, and proceeds to screw up your company's reputation with the payment gateway.
Your system claims the client didn't pay. But the client paid. Client becomes angry, issues chargeback, and proceeds to screw up your company's reputation with the payment gateway.
Your system asks the payment gateway if the payment succeeded. Hopefully that request goes through, considering that communication with the very same gateway just failed a second ago. I mean, this is the premise of this case.
Your system assumes the client paid, and starts providing the service. Hopefully the client actually paid.

As you can see, there are two terrible options, and two bad options. No good option to be seen here. This is the situation where distributed systems become messy, because they are messy in reality. Of course, someone at the back of the room will shout "that doesn't happen in reality". Well, I just had a 4-hour netsplit in a cluster of 3 VMs in the same datacenter, and you are telling me that I don't have to worry about network failures between two unrelated systems in two different datacenters. I'm just not going to believe you.
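The retry option becomes less terrible if the gateway deduplicates requests by an idempotency key, which some payment gateways support. The sketch below assumes such a gateway and fakes it in memory: `FakeGateway`, `charge_once` and the `keys` dictionary are hypothetical stand-ins, not any real provider's API.

```python
import uuid
from typing import Dict


class FakeGateway:
    """In-memory stand-in for a gateway that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}   # idempotency key -> amount in cents
        self.total_charged = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original outcome instead of charging again.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
            self.total_charged += amount_cents
        return "succeeded"


def charge_once(
    gateway: FakeGateway, order_id: str, keys: Dict[str, str], amount_cents: int
) -> str:
    """Charge for `order_id`, reusing the same key on every attempt.

    `keys` stands in for a table in your own database: the key is stored
    before the first call, so a retry after a timeout or crash reuses it.
    """
    key = keys.setdefault(order_id, str(uuid.uuid4()))
    return gateway.charge(key, amount_cents)
```

Calling `charge_once` twice for the same order charges the client once, because the second call reuses the stored key. This only works if the gateway really honours the key, so check what your provider guarantees before relying on it.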

And, again, if you are inclined to say this will never happen, let me show you other situations that are identical to this:

The payment gateway changed the format of the message that is sent to your system, and now your system throws an exception instead of ingesting it.
The payment gateway depends on their frontend to redirect back to your website to send the success message, and, again, mobile/wifi signal lost.
The payment gateway sends the message via calling an endpoint somewhere on your system, and the last infrastructure changes changed the listening address.
And my favorite: the pod that is expected to receive the message failed, and the system was in the process of restarting it when the message arrived.
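On the receiving side, it helps to assume that a notification may arrive twice, or in a shape you did not expect. The following is a rough sketch under assumptions: the JSON fields `event_id`, `payment_id` and `status` are invented for this example, and the in-memory sets and lists stand in for database tables.

```python
import json
from typing import Dict, List, Set


class NotificationHandler:
    """Applies gateway notifications at most once and never drops a message."""

    def __init__(self) -> None:
        self.seen_events: Set[str] = set()
        self.payment_status: Dict[str, str] = {}
        self.unparsed: List[str] = []   # kept for manual review or replay

    def handle(self, raw_body: str) -> str:
        try:
            event = json.loads(raw_body)
            event_id = str(event["event_id"])
            payment_id = str(event["payment_id"])
            status = str(event["status"])
        except (ValueError, KeyError, TypeError):
            # Unknown format: store the raw message instead of losing it.
            self.unparsed.append(raw_body)
            return "stored_for_review"
        if event_id in self.seen_events:
            return "duplicate"          # already applied, nothing to do
        self.seen_events.add(event_id)
        self.payment_status[payment_id] = status
        return "applied"
```

Keeping the raw body of anything that fails to parse means a format change on the gateway side costs you a delay, not a lost payment. It does nothing for messages that never reach you, such as the restarting pod: for those you still have to ask the gateway.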

Case 5: you fail to provide the service after confirming payment

Well, if you are in this case, you should pivot to become a state-sanctioned internet provider. Jokes aside, this failure mode is all on your side, so no cosmic law prevents you from knowing that it is happening.

This is way more complex than a database transaction

This is way more complex than a database transaction. The fact that we have all these steps, and surprising failure modes, guarantees that it is not as simple as you would like it to be. These failure modes are nothing more than multiple systems interacting with each other, and failing to interact at some step, which is something that a Microservices architecture has to deal with all the time.

Shall we adopt the Microservices architecture to deal with this problem?

The standard approach for Microservices doesn't deal with these failure modes. Most times it pretends they don't happen, and when they do happen it calls them an unexpected glitch in the system. So no, Microservices will not fix this problem. But also, if you still adopt Microservices, you will have this problem in your system all the time as different parts of it fail to interact with other different parts. You just made the problem worse.

What can we actually do to deal with this problem?

You are starting to understand the problem. This is already an improvement. Then you should start designing a state machine that allows you to model how the external piece is working, noting the tradeoffs you have to make when each step may be interrupted. You may look for properties, such as idempotence, to help you, but in any case a good solution will take some time and good design.
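As a starting point, such a state machine might look like the sketch below. The states and event names are assumptions chosen to mirror the five cases in this article, not a complete model of any real gateway, and your own design will likely need more of both.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    INITIATED = "initiated"      # the gateway was told to charge
    UNKNOWN = "unknown"          # we lost track and must ask the gateway
    PAID = "paid"
    FAILED = "failed"
    FULFILLED = "fulfilled"


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.CREATED, "gateway_accepted"): State.INITIATED,
    (State.CREATED, "gateway_unreachable"): State.FAILED,      # case 1
    (State.INITIATED, "client_timeout"): State.UNKNOWN,        # cases 2 and 3
    (State.INITIATED, "notified_paid"): State.PAID,
    (State.INITIATED, "notified_rejected"): State.FAILED,
    (State.UNKNOWN, "gateway_says_paid"): State.PAID,          # case 4, resolved by asking
    (State.UNKNOWN, "gateway_says_not_paid"): State.FAILED,
    (State.PAID, "service_provided"): State.FULFILLED,
}


def advance(state: State, event: str) -> State:
    """Return the next state for `event`, or raise on an illegal jump."""
    nxt = TRANSITIONS.get((state, event))
    if nxt is not None:
        return nxt
    # An event that would have led to the state we are already in is
    # treated as a repeat and ignored (idempotence).
    if any(ev == event and target is state for (_, ev), target in TRANSITIONS.items()):
        return state
    raise ValueError(f"event {event!r} is not allowed in state {state.name}")
```

The explicit `UNKNOWN` state is the design choice that matters here: it admits that your system sometimes does not know whether the client paid, and it only leaves that state by asking the gateway. Anything not in the table raises, so surprising sequences show up as errors instead of silent corruption.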

Conclusion

Unless you have a very good reason, you should be trying to prevent your software from becoming a distributed architecture, because distributed architectures have many extra failure modes. But, every now and then, the universe will throw parts of a distributed architecture in your direction, and the best you can do is understand how they work and how they fail in order to construct a decent solution.

Javier Casas

A random walk through computer science