A transaction is a unit of work that needs to be processed in a coherent way. Traditionally, transactions are seen as:
- a storage issue, particularly for databases, with little to do with UI, networking or processing, and
- a technical issue for developers and DBAs, and not particularly interesting for BAs or end-users.
I’m going to challenge both of those premises and suggest we need a new way of looking at transactions. We have placed the ACID model for transactions on a pedestal and, without a doubt, it has been incredibly useful in the 50-odd years since it was invented. However, I have yet to see a reason why this, of all transaction models, is either necessary or sufficient. Most authors today just assume that each of its four properties is self-evidently virtuous. However, I have neither built nor seen a system that uses ACID. In 1992 ANSI sanctioned four SQL ‘isolation levels’ (see this for a lucid description). These run from the highest level, SERIALIZABLE, which actually is ACID, down to READ UNCOMMITTED, which is just Durable. ANSI effectively said, “ACID has no practical implementation, so you can selectively turn it off”. Most tellingly, the textbook example for ACID is transferring money. However, in real banks, that is never ACID; instead, it uses a complex hierarchy of compensating actions.
Undoubtedly, ACID does prevent one definition of database inconsistency. But that’s not sufficient – so can BASE, SALT, optimistic locking, semaphores, compensations and such. Nor is it necessary: only the Durable property seems universally desirable. All the transaction technologies are flawed – ACID has no performant implementation, BASE is a weak statement, SALT has high overhead, optimistic locking allows inconsistent reads, semaphores have no fault handling … I don’t have a problem with that – ALL useful technology is flawed. I feel that, as architects, we have tended to over-focus on the virtues and limitations of transaction technology.
I propose to return to the ultimate judge of value, the users. I am offering a high-level definition of a user’s view of transactionality.
Transactions let a user move on with their life after they use our system.
If I successfully book a taxi, I will wait for it; if the booking is declined, then I will find other transport. I cannot handle a non-atomic booking, such as a car but no driver. I get annoyed by “your request is in a queue” and angered by a non-durable booking, such as a follow-up SMS saying, “We apologise, but your booking has been cancelled”. Similarly, I am angered by an isolation failure, such as when my taxi arrives with another customer inside. I used the language of the ACID model here but applied it to the user experience, not to the database.
In implementation terms, this means that the other parts of the system have to be included: the UI device (such as a mobile app), multiple databases (the rider’s database and the driver’s database in a microservice architecture), and the external systems that you use (such as a payment provider). And it has to apply when parts of the system fail, including during a transaction.
I want to emphasise that the user is often not particularly committed to our system – there are other taxi companies, trains, and walking. They would like an OK response but can handle rejection. We must remember that our transaction is nested in the higher-order transactions that the user runs with their life. They can execute retries (Oh, taxis are busy, I’ll wait till the morning), compensating actions (I’ll cancel that appointment), or other transactions (I’ll walk). I propose we design our systems so that the users can effectively execute their life’s higher-order transactions. Sometimes, they want an ACID result, but those are never the only transactional properties they want.
I’m calling the new model “LIFE” transactions (avoiding the corrosive acronyms used by previous authors), and I’ll start with a user’s view.
LIFE transactions – user model
Transactions let a user move on with their life after they use our system. Let’s start with the standard LIFE user stories. As a User…
- Keep me informed. I need to know where the transaction is up to so I don’t get frustrated. You have my precious attention, and my life is now on hold. I know that systems sometimes stall or get busy or break, just tell me and don’t lie about it, and don’t give me some meaningless pacifier.
- Let me cancel. I want to cancel an uncommitted transaction at five seconds’ notice so I can do something else with my life other than wait for you. I thought we had a deal, and I want out when I think you broke it. If you don’t deliver, then I’m annoyed, if you keep me trapped, I’m angry.
- Let me retry. I want to retry without consequences, so if I lose contact with your system, I can choose to let you have another slice of my life. Please don’t spend my time heroically retrying for me, that was not part of the deal.
- Give up fast. If you cannot process in a reasonable time (say 10 seconds), then fail fast so that I don’t have to wait and then cancel. I don’t need you to heroically deliver, as I have alternate transactions I can execute.
- Give me support. I understand the world is imperfect, so if you fail the LIFE transactions, give me a contact so I can get you to sort it out so I don’t have to spend my life sorting out your mess.
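The “Give up fast” story could be approximated with a deadline wrapper around each operation. The following is a minimal Python sketch, not the author’s implementation: `call_with_deadline`, the shared thread pool, and the choice of BUSY as the fallback result are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical shared worker pool for backend operations.
_pool = ThreadPoolExecutor(max_workers=4)

def call_with_deadline(operation, deadline_seconds=10.0):
    """Return the operation's result, or "BUSY" if the deadline passes first."""
    future = _pool.submit(operation)
    try:
        return future.result(timeout=deadline_seconds)
    except FutureTimeout:
        # cancel() only stops work that has not started yet; work already
        # running continues in the background and would need compensating.
        future.cancel()
        return "BUSY"

def slow_booking():
    time.sleep(0.3)  # stand-in for a stalled backend
    return "SUCCESS"

fast_result = call_with_deadline(lambda: "SUCCESS", deadline_seconds=1.0)
slow_result = call_with_deadline(slow_booking, deadline_seconds=0.05)
```

Answering BUSY here is only honest if whatever the abandoned work later does is undone, which is what the compensation mechanism described later in the article is for.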
I want to emphasise the UI here. From the user’s perspective, the transaction starts and ends in the UI. If they click ‘Go’ and the Internet loses that AJAX call, that’s an Atomic failure. If the service is delivered, but the UI is not updated, then that transaction has a Consistent failure. Because LIFE is defined in terms of the whole system from the user’s perspective, it’s not surprising that little of it is directly supported by DB-level ACID, which:
- says nothing about the external systems or the UI,
- does not mention Cancel (however, RDBMS do support this to some degree),
- is silent on idempotent retry (some SQLs support INSERT OR UPDATE),
- pretty well forbids Keep me informed,
- has no insight into operation times.
So, let’s bust the myth that user transactions map to DB transactions (in most systems). It’s not that DB-level ACID is useless (see Note 1), any more than code-level locks are useless. No, the point is that DB ACID is a small part of a modern user transaction. Developers have to build out the other bits. That’s why we need a new model and a new implementation pattern.
Is BASE better? Maybe. At least BASE recognises that all DB storage involving two data centres is eventually consistent (think about replication), whereas RDBMS vendors try to hide that fact. But at its core, BASE is still just a storage transaction.
What about XA? That gives hope of including back-end systems as part of the LIFE transaction. Sadly, XA has issues of global locking, undefined behaviour during failures, and tight inter-system coupling. As a more practical issue, I’ve never seen a SaaS backend service that supports XA. In principle, the UI could also be part of the XA transaction, but that is entirely impractical.
Let’s now move from a user’s view to an implementation view of the LIFE model.
LIFE transactions – implementation model
A premise of ACID is that the transaction either succeeds or fails. LIFE takes a more nuanced view and defines five results (a sixth, FAILURE, is discussed in Note 2):
Result | Meaning | User’s life action | Service’s action |
SUCCESS | Atomic durable success | Move on and enjoy. | Service fulfilled and consistent in internal storage and external systems. |
REJECT | Atomic durable rejection | Move on and do something else. | The transaction could not be completed for business reasons, or defects meant that the system could not reliably produce SUCCESS. |
BUSY | Atomic soft rejection | Move on OR try again later (user’s choice). | Compensation or replication is in progress, or defects meant that the system could not reliably produce SUCCESS, or a conflicting operation is in progress, or there are insufficient resources to process. |
ERROR | Non-atomic | Retry to get SUCCESS or REJECT, but if that does not work, get help from the vendor. | Defects in the system mean that SUCCESS, REJECT or BUSY cannot be produced. The user needs to know this. |
TIMEOUT* | No results | Retry | Service might or might not have happened. |
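As an illustration, the results in the table could be modelled as a small enumeration that both the API and the UI share. This is a hedged Python sketch: the names `Result`, `NEXT_STEP` and `is_atomic` are hypothetical, and the action strings simply paraphrase the table.

```python
from enum import Enum

class Result(Enum):
    SUCCESS = "SUCCESS"
    REJECT = "REJECT"
    BUSY = "BUSY"
    ERROR = "ERROR"
    TIMEOUT = "TIMEOUT"

# What the UI could suggest to the user for each result (paraphrasing the table).
NEXT_STEP = {
    Result.SUCCESS: "move on and enjoy",
    Result.REJECT: "move on and do something else",
    Result.BUSY: "move on or try again later",
    Result.ERROR: "retry, then get help from the vendor",
    Result.TIMEOUT: "retry",
}

def is_atomic(result):
    """True when the table describes the outcome as atomic."""
    return result in {Result.SUCCESS, Result.REJECT, Result.BUSY}
```

Keeping the set closed makes it harder for an endpoint to quietly return some other outcome that the UI has no advice for.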
This leads to the six implementation properties of a LIFE transaction:
- Progressive. The user only experiences SUCCESS, REJECT, BUSY, TIMEOUT* or ERROR results. This means the user can progress with their other life transactions, including higher-order compensations.
- Idempotent. The user can retry any operation (including during the operation), and will experience no side effects, and will receive a progressive result. This means that a user who loses track of the operation through network failure or being over-eager can simply re-establish the state of the operation without performing complex compensating actions (either via code or UI).
- Gracious. The system favours fast REJECT or BUSY, over ERROR or even delayed SUCCESS. This means that the system makes design choices that allow the user to progress their life, rather than heroically trying for SUCCESS and keeping the user tied up. Such a system will produce more REJECT results and happier users than a heroic system.
- Compensated. Supporting the gracious property means the system will spend the effort to compensate for internal errors. The system will execute compensations to favour REJECT or BUSY responses. The compensation happens after the operation returns and might be temporarily detectable by the user (such as a payment made and then refunded).
- Prioritised. Transactionality is a high-priority user requirement, and when the system is under stress, less important features should be dropped. This is a conscious act of analysis and design. For example, it could be acceptable to have inconsistent reports under stress, but it would be unacceptable to double bill.
- Controlled. Each SET operation has a GET operation that allows the user to see the current state, including the existence of state-changing operations. If there is a possibility of slow implementations, each SET will have a CANCEL.
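The Idempotent and Controlled properties can be sketched together: every SET is keyed by a client-supplied request id, and has a matching GET and CANCEL. The Python below is a minimal in-memory illustration under those assumptions; `RideBookings`, its method names and the PENDING marker are hypothetical, not part of the LIFE model itself.

```python
class RideBookings:
    """In-memory stand-in for a booking service keyed by request id."""

    def __init__(self):
        self._status = {}  # request_id -> "PENDING" | "SUCCESS" | "REJECT"

    def set_booking(self, request_id):
        # Idempotent SET: the first call records the request, and a retry
        # with the same id changes nothing and reports the current state.
        self._status.setdefault(request_id, "PENDING")
        return self.get_booking(request_id)

    def get_booking(self, request_id):
        # GET exposes in-flight work as BUSY; unknown ids return None.
        status = self._status.get(request_id)
        return "BUSY" if status == "PENDING" else status

    def cancel_booking(self, request_id):
        # CANCEL only affects uncommitted work.
        if self._status.get(request_id) == "PENDING":
            self._status[request_id] = "REJECT"
        return self.get_booking(request_id)

    def finish(self, request_id, ok):
        # Internal completion; a cancelled request stays rejected.
        if self._status.get(request_id) == "PENDING":
            self._status[request_id] = "SUCCESS" if ok else "REJECT"
```

A user who loses contact can call `set_booking` again with the same id and learn the outcome without creating a second booking.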
If our system exhibits these LIFE properties through its API, then the UI can pass them to the user (assuming we are all building UI-over-API these days). Remember, in LIFE, the user is part of the transaction and will execute compensations. We just need to let them do that. For example, the old payment message “While the payment is in progress, do not press the back button, or close your browser” is banished and replaced with “If you think the payment has not worked, just press the PAY button again”.
How to build it
The user stories seem to result in happy users, and the implementation model’s six properties support those user stories. The obvious remaining question is, “How easily can it be built?”. In another article, I will give details about a general implementation of the six implementation properties of LIFE for a microservice architecture relying on external systems. That’s the hard case. By general solution, I mean reusable code that can be built as infrastructure outside the application code. In yet another article, I will show that this solution essentially eliminates transactional failures, even when the microservices and external services are very flaky.
As a quick sketch of the design:
- All system operations are broken down into internal operations that SET or GET state, but not both.
- SET operations first store a compensating action, then do their real function, and if it works, remove the compensation.
- Compensation is always asynchronous (i.e. after the operation has returned to the user).
- If a service fails or an operation takes too long, the inflight operations are terminated back to the user, and the compensation cleans up.
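The design sketched in the list above (store the compensation first, do the work, remove the compensation on success, clean up asynchronously) might look roughly like this in Python. It is a simplified, single-process illustration: the `pending` dictionary stands in for a table in the service’s own database, and `run_set`, `compensation_scan` and the age threshold are assumed names and parameters.

```python
import time

# Stand-in for a compensation table in the service's own DB (hypothetical).
pending = {}  # op_id -> (stored_at, undo_callable)

def run_set(op_id, action, undo):
    """Store the compensation first, run the real work, clear it on success."""
    pending[op_id] = (time.monotonic(), undo)
    try:
        action()
    except Exception:
        # Terminate back to the caller; the stored compensation cleans up later.
        return "REJECT"
    del pending[op_id]
    return "SUCCESS"

def compensation_scan(min_age_seconds):
    """Asynchronous clean-up: run compensations left behind by failed operations."""
    now = time.monotonic()
    compensated = []
    for op_id, (stored_at, undo) in list(pending.items()):
        if now - stored_at >= min_age_seconds:
            undo()  # must be safe to run even if the action only partly happened
            del pending[op_id]
            compensated.append(op_id)
    return compensated

# Example: the payment is taken, then a later step fails.
charges = {}

def charge_then_fail():
    charges["op-1"] = 25
    raise RuntimeError("driver service unavailable")

result = run_set("op-1", charge_then_fail, lambda: charges.pop("op-1", None))
```

After `run_set` returns REJECT the charge is still briefly visible, and a later call to `compensation_scan` removes it, which matches the “temporarily detectable” behaviour described under the Compensated property.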
Consider an example of a simplified ride-booking service with two microservices, one to manage riders and one to manage drivers, and a link to a bank for payments. As per good microservice design, each has its own DB, and no DB transactions extend outside each service. This results in a pretty standard microservice flow, with the following extensions:
- Steps 3 and 4 write a protective rollback and response into the Riders DB before any processing is done.
- Step 13 removes the rollback on success.
- The DoBookRide operation returns no data at step 14 other than a SUCCESS indicator, which prompts the API Gateway to issue a GET for the result.
Now let’s see what happens when something goes really wrong. Say the Rider microservice crashes while the payment is being processed.
In this flow, the API Gateway gets a TIMEOUT (Step 12), so it does what it always does and issues the GetBookedRide operation. This returns the pre-emptive rejected response set in Step 4. Sometime later (say 20 seconds), Riders triggers a rollback scan, which finds the rollback record written in Step 3. This releases the driver and refunds the money.
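The failure flow can be imitated in a few lines of Python. This is a toy simulation under several assumptions: the `riders_db` dictionary stands in for the Riders DB, the crash is modelled as a `TimeoutError` seen by the gateway, and GetBookedRide is assumed to be answerable (for example by a restarted instance) after the crash. The function names follow the operations named in the text, and the step numbers in the comments are the ones quoted there.

```python
riders_db = {"responses": {}, "rollbacks": {}}

def do_book_ride(booking_id, crash=False):
    # Steps 3 and 4: protective rollback and pre-emptive REJECT written first.
    riders_db["rollbacks"][booking_id] = ["release_driver", "refund_payment"]
    riders_db["responses"][booking_id] = "REJECT"
    if crash:
        # Riders dies while the payment is being processed.
        raise TimeoutError("no answer from Riders")
    riders_db["responses"][booking_id] = "SUCCESS"
    del riders_db["rollbacks"][booking_id]  # Step 13: remove rollback on success
    return "SUCCESS"

def get_booked_ride(booking_id):
    return riders_db["responses"].get(booking_id)

def gateway_book_ride(booking_id, crash=False):
    try:
        do_book_ride(booking_id, crash)
    except TimeoutError:
        pass  # TIMEOUT: the gateway still just asks for the recorded result
    return get_booked_ride(booking_id)

def rollback_scan():
    """Later, asynchronously: run whatever rollbacks were left behind."""
    executed = []
    for booking_id, actions in list(riders_db["rollbacks"].items()):
        executed.extend((booking_id, action) for action in actions)
        del riders_db["rollbacks"][booking_id]
    return executed
```

The gateway takes the same path for success and for a crash; only the stored response differs.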
This flow supports all the LIFE user stories by implementing the six LIFE transaction properties, most notably:
- A catastrophic internal error has been hidden from the user by turning an ERROR into a REJECT, as required by the Gracious property.
- The user is quickly given a REJECT that looks atomic to the user, as required by the Progressive property.
- The compensation will make it atomic within a short period, as specified by the Compensated property.
It also leaves the two databases and the bank consistent. In contrast, for a microservice architecture that ignores transactions, the money will be taken and the driver allocated, but no ride will happen, and the user’s booking app will eventually time out. This results in an angry user, and angry users are not good.
In summary
- LIFE is a transaction model focussing on a happy user, not consistent storage (though it does that too).
- LIFE handles transactionality from the UI through the server processing and storage out to your external service providers.
- There are six properties of a LIFE transaction that have to be built. That can be provided as infrastructure and taught to teams.
- The LIFE transaction model can be incorporated into your existing microservice architecture; you don’t need exotic tech.
In subsequent articles, I will show more detail on implementing LIFE in a simple and low-impact way. I will also show that it effectively provides a great UX, even when there are really high levels of disruption in your tech stack.
Notes
Note 1
Before the 1980s, transactionality was primarily a problem to be solved by the architects of each system. Those architects had some underlying theory, such as Petri nets, Lampson and Sturgis’ two-phase commit, and Dijkstra’s semaphores, and manifestations of those in tooling, most notably IBM’s CICS and IMS. There were (and still are) specialised solutions outside mainstream development, such as MUMPS for medical records.
Härder and Reuter’s seminal 1983 paper lucidly described a general approach with the ACID model in the scope of database transactions. Many other researchers contributed in the public domain, such as Reed describing multi-version consistency, and proprietary research occurred within database vendors. All this was in the scope of a single database active at a single location – which was the leading technology of that time.
Note 2
There is one other result, namely FAILURE. This is not part of the model but indicates a system design error in that none of the transactional results happened. It differs from ERROR, where the user knows that the system is confused and that they need help. ERROR is an undesirable outcome, but humans can tolerate it if it is rare. FAILURE is the user and the system converging on different views of the outcome (for example, the buyer thinks the payment worked, but their bank thinks it did not) and is not tolerable.