Data extraction from microservice architectures

The good

Over the last decade, microservices have emerged as the dominant design pattern for large-scale systems:

  • It’s allowed dozens of dev teams to work efficiently on a large code base without the grinding boredom of N² communication.
  • It’s shortened time to market (TTM) through smaller, more frequent drops of lower-risk code.
  • It lets us pace-layer, where each part of the system can evolve at its own rate.

Freeing us from the database of a thousand tables, the codebase of a million lines, and the endless meetings has made development more fun.  The architect’s role has changed from “how to make it work together” to “how to keep it working apart”.  It’s not all roses: microservices use more resources (compute, memory, LAN) – enabled by Moore’s Law and optic fibre, but paid for by global warming.

In this article, I will show how to do (and not do) microservice ETL and reporting, and how to implement the CQRS pattern (if it applies to your problem).

The bad

First, I’ll go over where the central database wins hands down over microservices – presenting a consistent data store to your company:

  • Reporting.  How do I write an ad-hoc report about some business data?

Sally Saleslead wants a report from OldLegacySystem on the customers who bought more than 1000 widgets last year grouped by reseller.  Oliver Orikle writes the 200 lines of SQL that join the dozen tables with three subqueries.  He is finished before lunch, runs it on the read-replica DB, and Sally is smiling that afternoon.  For SuperMicroSystem, it’s a dev project.  On the backlog.

  • ETL.  How do I get data out of the system to be merged with other data?

Danny Datalake wants to pull the data from SuperMicroSystem and merge it with OldLegacySystem.  With OldLegacySystem, Oliver can use Azure Data Factory, AWS Glue, Tibco, etc.  For SuperMicroSystem, we have MuleSoft or Stitch and a development project.  On the backlog.

The ad-hoc reporting mindset is so prevalent that it’s often not included in the requirements – business owners assume it just works.  And from the mid-1970s up till about now, it sort of did “just work”.  Well, if we spent mega-effort on 3NF (third normal form), it at least limped along.

Why are microservices so bad at ETL?

A microservice is lord of its data store – the technology, table structure, and degree of normalisation.  A service can have one table with a primary key, maybe a couple of search keys, and a column for all the data in a JSON structure (including arrays).  That’s a key-value (KV)  store, somewhere between zero and first normal form, not the third normal form that SQL reports need.  Large tables often store data in different versions, with the code mediating between the old and new versions.  Services can use different technology – the customer data in Oracle and the sales records in Cassandra.
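
To make that concrete, here is a hedged sketch of what a single row in such a service-owned KV store might hold – the column and field names are invented for illustration, not a real schema:

    # One row: a primary key, a search key, and everything else packed into a versioned JSON value.
    ride_row = {
        "ride_id": "549182018",        # primary key
        "customer_ref": "C-001",       # search key
        "value": {                     # the JSON blob only this service understands
            "schema_version": 2,       # the code mediates between version 1 and version 2
            "driver_id": "D-42",       # a foreign key, buried where the DBA can't see it
            "legs": [
                {"from": "Airport", "to": "CBD", "fare_cents": 4350},
            ],
        },
    }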

Let’s look at each of these freedoms:

  • Multiple DB tech.  Good for developers: pick the best tech for the job – RDBMS for transactionality, KV+ stores (Cassandra, Dynamo, Cosmos) for scale, and indexes (Elastic, Redshift, Synapse) for searching.  Bad for data extraction: there is no overarching tool to read and link the various technologies.
  • KV store.  Good for developers: no expensive table joins, lower complexity to roll into production and roll back on failure, and less demand for SQL skills.  Bad for data extraction: foreign keys are typically buried in the (undocumented) JSON values (though modern SQL engines can see into JSON).
  • Stores owned by the service.  Good for developers: this is the tenet underlying modern CI/CD practice – each dev team can make changes without reference to a central authority, which enables productivity and time-to-market.  Bad for data extraction: the DBA sees a sea of semi-meaningful tables, each possibly storing a common concept in a different form, and tomorrow that structure might be different.
  • Versioned data.  Good for developers: it’s impractical to rewrite data in large tables (risky, slow, and it stops rapid releases), so it’s better to version the value and have the code interpret the version.  Bad for data extraction: joining four tables, each holding two versions, gives 16 join combinations – and tomorrow that might be different.

This dichotomy boils down to the different value propositions of the 3NF approach vs the microservice approach:

  • 3NF.  The DB is the integration point (both for the code and for other systems).  Respect and manage the schema, the code follows.
  • Microservice.  APIs are the integration point (both for the code and for other systems).  Respect and tend the APIs, the storage follows.

I think there are two reasons (and hope for a third) why we have shifted from the 3NF to the Microservice model:

  • Decomposition.  3NF never developed a decomposition approach; it was always subject to central management and control.  I remember walking into a DBA office (back when people had offices) and seeing the printed schema (remember printing code?) covering three of the four walls, with updates written in coloured pen.  Ten developers were waiting on a change request, and the DBA debate was whether to preserve 3NF and force re-coding by a hundred developers, or break 3NF to let the ten advance.  We should not have to make such choices.
  • Verbs.   3NF does not support verbs.  In the heyday of 3NF, systems just took tabular input, stored it, and produced reports.  No sophisticated interaction with the world.  The system was data, and data are nouns and adjectives.  We now build systems that do stuff with the world.  You cannot make a payment, dispatch goods, or drive a car by changing DB entries.
  • Privacy.  3NF in an RDBMS gives personal data to anyone with a password.  The disgruntled employee, the malware downloaded by a negligent employee, the over-zealous salesperson coupled with the over-trusting report author, the nitwit with the DB dump on their laptop, the attackers who found that ‘qwerty123’ actually is a password…   I discuss this in depth in the article Personal Data safe at rest.

Microservice solves the first two and is part of the solution to the last.  But we lost simple reporting, and we lost ETL.  At this point, I want to express my greatest respect for Edgar Codd and his invention.  It was our workhorse for three decades and got us where we are today.  But like steam engines …

Bad ideas

Just so you don’t re-invent them, I’m going to describe two Bad Ideas and some Defence Against Bad Ideas.

Bad Idea – ETL from your service data stores.  DBAs can use the traditional ETL tools; they are powerful enough to prise apart your JSON structures.  The problem is that your DB and JSON structures have now suddenly become external interfaces to the system.  You’ll need to regression test and change manage them.  As for PII, HIPAA, or PCI audits …
Another Bad Idea – Keep a global 3NF model and run code-only pseudo-microservices on top of it.  Because you must maintain global control of the database schema (to keep 3NF), all services must release in lock-step on schema changes.  You lose TTM and pace layering, and mire your teams in the mud of coordinated code changes.  All the costs of microservices and none of the benefits.

The pernicious problem is both these Bad Ideas are cheaper in the short term.  Let’s listen in on a conversation between Anna Architect, Polly Project Manager, and Oliver Orikle about the first Bad Idea.

Anna:   We’ve estimated the reporting and billing at 100 Story Points, and we’ll build it in Iteration 6.

Polly:  (Horrified) But that puts us out by four weeks, and we’ll burn through another $150k of our Round Two funding.

Oliver:  (Confidently) Anna will have the schemas pretty stable at the start of Iteration 4.   So I can get my team to configure SuperETL to pull the data, no coding, and go live four weeks earlier!  SuperETL is fantastic, the sales rep gave me a demo last week, and I’ve got my team onto it.  Anna, Polly, you should come and see it!

Anna: (Worried) How will you test it and stop it from breaking when our teams change the code?

Oliver: (Smugly) I don’t need that heavy testing, it’s not code, it’s just configuration.  SuperETL has this drag-and-drop UI builder and it can decode JSON and XML.  We think there are only 300 rules to build, and we can do that in a month.

Anna: (Back on firm ground) But some of the billing data is PII and encrypted!

Oliver: (Triumphantly) No worries, I did a bash script that uses OpenSSL to decrypt the data, and the ops people are happy to RDIST your AES key so we can both use it.  The sales rep was super helpful.

Polly: (Helpfully) And I can put in a quality gate so that Anna’s people have to give Oliver’s team notice when they change the JSON.  And I can coordinate Oliver’s changes with yours at each code drop!  We can defer Anna’s gold-plated solution to next quarter when we have more time.

I’m sure that Anna will argue testing, fragility, key security, key rolling, coordinated CI/CD changes, low velocity and the other dozen problems with this.  But in the end, another project will make a decision that gets them to market four weeks early.  Unsurprisingly, next quarter they did not have time to redo it, and next year they lost MegaCustomer because their bill was wrong for the third straight month.

DB scraping, UI scraping, and log scraping are short-term solutions with terrible long-term consequences.  I think the only approach is to build billing and reporting early.  Why?  Billing is the most important activity for any business, and checking how customers use the product is a close second.  The modern architect’s skill is to win such battles.

So let’s look at how we can extract data without destroying the microservice value prop.

The data warehouse approach

The essential problem is that the service’s store cannot be used as an extract point – the schema is volatile and opaque.  The only robust way is to use the services’ API, preferably the system’s API.  So the data warehouse model works like this:

  • The Transacting system is an uncompromised microservice model.
  • The Data warehouse is a repository for the system state in a form suitable for reporting or other analytics.  It could be built with ELK, Tableau, AWS Redshift, … or a roll-your-own.
  • After a transaction happens (or potentially as part of the transaction), Extractor pulls the data from the Transacting system via its API and pushes it into the Data warehouse.

There are a lot of good things about this approach:

  • You retain all the microservice benefits – TTM, team velocity, pace layering, and fun.
  • For most systems, the API will have GET operations that come for free.  This article gives some good reasons why you should do this.
  • If Extractor is not in the transaction path, then the Reporting system won’t compromise the performance or reliability of the Transacting system.  This drops your test footprint at each code drop.  In most systems, it’s OK for the Reporting functions to offer a lower SLA than the Transacting functions.
  • The Extractor can apply security to protect PII.  For example, the customer’s national number can be stored twice, once masked as XXXXXX1234 for general access and once unencrypted for privileged access.  Alternatively, PII can be stored as an HMAC, which allows searches to find records that contain the PII without storing the PII.  More aggressively, the reporting system could be declared “PII Free” to substantially drop compliance costs, such as GDPR.
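
Here is a minimal sketch of the masking and HMAC options just described, assuming the HMAC key comes from your secrets manager (it is hard-coded here purely for brevity):

    import hashlib
    import hmac

    HMAC_KEY = b"replace-with-a-key-from-your-secrets-manager"

    def mask_national_number(value: str, visible: int = 4) -> str:
        """Mask all but the last few characters, e.g. XXXXXX1234, for general access."""
        return "X" * max(len(value) - visible, 0) + value[-visible:]

    def pii_hmac(value: str) -> str:
        """Keyed hash of the PII: supports equality searches without storing the raw value."""
        return hmac.new(HMAC_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

    warehouse_record = {
        "customer_id": "C-001",
        "national_number_masked": mask_national_number("9876541234"),  # XXXXXX1234
        "national_number_hmac": pii_hmac("9876541234"),  # lookups compare HMACs, never raw PII
    }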

The warehouse also lets you build the Command Query Responsibility Segregation (CQRS) model.  It’s not for all cases, but if you have a relatively simple data model for doing stuff (Commands) and a complex model for searching and reporting (Queries), it makes sense to separate these into two distinct data flows, or even sub-systems.  For example, the Command might be “Book a ticket on QF507 on 2023/1/15 for Fred Bloggs…”, but imagine the myriad ways that people and systems can Query that data.  Your Transacting system does the Command part and the Reporting system does the Query part – each optimized for its own purpose.
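
As a hedged illustration of that split (the class, function, and table names below are invented for this sketch, not from any particular framework), the command side captures a booking as a small write-optimised record handled by the Transacting system, while the query side answers arbitrary questions from the Warehouse copy:

    from dataclasses import dataclass

    # Command side: small, write-optimised, handled by the Transacting system.
    @dataclass
    class BookTicketCommand:
        flight: str      # e.g. "QF507"
        date: str        # e.g. "2023-01-15"
        passenger: str   # e.g. "Fred Bloggs"

    def handle_booking(cmd: BookTicketCommand, transacting_api) -> str:
        """Execute the command against the Transacting system; returns a booking id."""
        return transacting_api.book(cmd.flight, cmd.date, cmd.passenger)

    # Query side: arbitrary, read-optimised, answered from the Warehouse copy.
    def passengers_on_flight(warehouse, flight: str, date: str) -> list[str]:
        return warehouse.query(
            "SELECT passenger FROM bookings WHERE flight = ? AND flight_date = ?",
            (flight, date),
        )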

You might also work out that you can monetise the capability you have built by selling search data to your customers.  Maybe their data, or anonymised across a customer segment – depending on your data ownership situation.   Hence the dotted purple line.

How do we get the data out of the microservice?

Let’s look at some of the complexity of the green line from API Gateway to Extractor, and illustrate it with this example.  Assume that the goal is to extract the Drivers allocation data and the Riders trip data, and there is a GetRide API call that is passed the rideId.  Normally, Drivers and Riders change in response to BookRide API calls (orchestrated by Riders, with delegation to Drivers).  However, if a driver takes too long to arrive, the system can transfer the ride to another driver (orchestrated by Drivers, with delegation to Riders).

Below are some design responses (in all cases, Extractor transforms the GetRide response and puts it in Warehouse).


Three ways forward

  • Inline.  After each change, Riders calls GetRide(rideId) (5) and passes the response to Extractor via an internal API call (6).  Variant: as above, but API Gateway calls GetRide after each state-changing operation.
  • Queued.  After each change, Riders and Drivers send the rideId to Extractor via a message queue (5).  When Extractor gets the message, it calls GetRide(rideId) (7).  Variant: as above, but the rideId is sent by the API Gateway after each state-changing operation.
  • High Water.  Periodically, Extractor calls Drivers and Riders with GetChangesSince(time) (Poll1).  Each responds with a list of rideIds.  Extractor then calls GetRide(rideId) (Poll2) for each.


The strength and weakness of the Inline approach is the close coupling of the Transacting and Reporting subsystems.  This is a good choice if your reporting demands are for immediate and guaranteed delivery.  You can even weave the Reporting flow into your transactionality model (see this for one design approach).  The price for close coupling is significant – your Reporting subsystem must match the performance and reliability of your Transacting subsystem.  Slow Reporting is a slow system, and crashed Reporting is a crashed system.  Practically, this means complete testing of Reporting is part of each software drop, and that’s bound to drop your velocity and increase TTM.  It will also limit your choice of tools for Extractor and Warehouse  – many COTS reporting tools just don’t support real-time SLAs.

In contrast, both the Queued and High Water approaches are decoupled, so Reporting can present much lower reliability and throughput without impacting your Transacting SLAs.  In both approaches, Extractor becomes aware that a datum has been created or changed (via queue or poll), and it then extracts by calling API Gateway.  This means Extractor is in control of when it extracts, and so it can (and should) apply flow control (a sketch follows this list):

  • delay a few seconds (to allow the Transacting system to complete),
  • collapse multiple changes of one resource onto one extraction call,
  • throttle or peak-lop the GetRide call rate, to prevent DOSing the Transacting system with GetRide calls.
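
Here is a minimal sketch of a queued Extractor applying those three controls.  The queue, API gateway, and warehouse objects and their methods (receive_batch, get_ride, upsert) are assumed interfaces for illustration, not any particular product:

    import time

    def run_extractor(queue, api_gateway, warehouse,
                      settle_seconds=5, max_calls_per_second=10):
        """Drain the change queue with flow control: let changes settle,
        collapse duplicate rideIds, and throttle the GetRide call rate."""
        while True:
            ride_ids = queue.receive_batch(wait_seconds=settle_seconds)  # assumed queue interface
            if not ride_ids:
                continue
            for ride_id in set(ride_ids):                # collapse multiple changes of one ride
                ride = api_gateway.get_ride(ride_id)     # the GetRide extraction call
                warehouse.upsert("rides", ride_id, ride)
                time.sleep(1.0 / max_calls_per_second)   # crude throttle so we don't DOS Transacting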

Anonymity

I want to note a little sleight-of-hand that I have done here.  In a microservice architecture (especially one using choreography), no service understands the entire state of a business-level concept.  In this example, both Drivers and Riders contribute to the state of a Ride, and both can change that state.  Another way of saying that is that services are unaware of their clients.  For example, in a future release, when the company starts doing parcel delivery, Drivers will get a new client, Deliveries.  Using the Queued model as an example, the 5b flow should now trigger a GetDelivery call rather than a GetRide call.  So how does that work?

  • One solution is for the API Gateway to do the enqueue, and you make sure that all state changes (including callbacks and time-based ones) go through the API.  The API clearly does have the business knowledge to trigger the right Get call in Extractor.
  • Another solution is to pass the business function to each service so that it can enqueue the right call; this generally seems to make for a more maintainable system.  For example, the AllocateDriver call on Drivers can include, as additional arguments, which GET operation to call and the id to pass to it (a sketch follows).
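
A hedged sketch of that second solution (the message shape, argument names, and queue interface are all illustrative): the caller tells Drivers which extraction call describes the business object, and Drivers simply echoes that hint onto the queue whenever the allocation changes state:

    # What Riders (or, later, Deliveries) passes to Drivers when it delegates work:
    allocate_request = {
        "driver_pool": "inner-city",
        "extract_with": {"operation": "GetRide", "id": "549182018"},  # supplied by the caller
    }

    # Inside Drivers, after any state change on that allocation:
    def notify_extractor(queue, allocate_request):
        """Drivers has no idea what a Ride or a Delivery is; it just forwards
        the extraction hint its client gave it."""
        queue.send(allocate_request["extract_with"])  # e.g. {"operation": "GetRide", "id": "549182018"}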

Eventual consistency

Another thing to be really careful about is the integrity of the data in the Warehouse.  It usually does not matter if your Data Warehouse does not track every state change of each datum, but it must be eventually consistent.  In the queued model, data is irretrievably lost if there is a Warehouse write failure, if Extractor crashes during an extract, if there is an error in the queue-sending code, or if the Warehouse is restored from a DB backup – even if your queue itself never loses data.

On that happy note, I want to introduce the other decoupled model, High Water.  This is more complex but can guarantee eventual consistency and is robust with respect to the above failure modes.

In High Water, Extractor polls each service for recent changes, advancing a high water mark on each poll.  I’ll illustrate this with the polling loop for Riders; a code sketch follows the steps.

  1. Extractor reads the last HighWaterTime from the Warehouse (say 1675064676).
  2. Extractor calls Riders.getSince(1675064676), which responds with rideIds=[549182018, 685549055] and t=1675064781.
  3. Extractor calls Api.getRide(549182018) and then Api.getRide(685549055), and stores into Warehouse those two responses and the new HighWaterTime=1675064781.
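
A minimal sketch of one pass of that loop, assuming a getSince operation on Riders and a simple key-value interface on the Warehouse (get, put, upsert are illustrative, not a specific product):

    def poll_riders_once(riders, api, warehouse):
        """One pass of the High Water loop for the Riders service."""
        high_water = warehouse.get("HighWaterTime", default=0)      # bootstrap at 0
        ride_ids, new_high_water = riders.get_since(high_water)     # e.g. [549182018, 685549055], 1675064781
        for ride_id in ride_ids:
            warehouse.upsert("rides", ride_id, api.get_ride(ride_id))
        # Advance the mark only after every ride has been written, so any failure
        # simply means the next pass re-extracts the same window.
        warehouse.put("HighWaterTime", new_high_water)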

So Extractor stores and presents the high water mark time, but Riders supplies it.  The system is bootstrapped by starting with HighWaterTime=0.  There are a lot of subtleties in a robust implementation of this algorithm, but at the core:

  • Any coding errors can be erased by rolling back the high water.  For example, if last week’s release was found to have a buggy transform, you can rescan from last week’s high water.
  • Restoring the Warehouse from backup just works – as it also restores the high water mark.
  • Errors in writing to Warehouse are automatically compensated, as the high water mark won’t be written.
  • Many sins can be erased by running two scanning passes, one close to real-time, and one delayed by say one hour.

Other thoughts

  • Keep your raw API responses.  Eventually, you will want to change the data model in the Warehouse – business changes, regulatory changes, mistakes, etc.  If you keep your API responses, you just have to redo the transform and load phases based on the already-extracted data.  In a cloud, creating a new warehouse, deploying a new ETL instance, and waiting are pretty easy.  It might take a few weeks to redo 10 billion records, but it won’t interfere with the real-time extraction or your transaction performance.  If you are certain that you will never change the warehouse data model, you are wrong.  Just build it – it’s a KV store with app-level encryption, and you can build the extract chain later.
  • Consider modern ETL tech – Apache Kafka, AWS Kinesis, Google Dataflow.  These are worth looking at if you are going down the queued path.
  • Consider cloud warehouse tech – Azure Cognitive Search, AWS Redshift, Google BigQuery.  My feeling is that the cloud vendors will overrun all the specialist warehouse vendors on price, function, TTM and cost of ownership.
  • Sell it better.  The 3NF solution merely lets you do SQL; it’s a bit yesterday.  The data warehouse approach offers advanced analytics, data mining, AI, and monetising data.  You should sell that idea to your product owners.
  • Don’t ask.  Don’t ask your product colleagues what reports they want; they don’t know.  Build a generic capability – developers and DBAs don’t code reports in 2023.

In conclusion

Reporting and ETL don’t “just work” in a microservice model like they do in a DB-centric design.  Applying the traditional DB-centric ETL approaches will lose you your microservice benefits.  The solution is:

  • make a data warehouse a key component of your architecture,
  • design an ETL model (inline, queued, high water …) that matches your business needs,
  • build it with modern cloud tech to open up new possibilities and lower dev times,
  • build it from the start of the project, not at the end,
  • treat compliance as a first-class requirement – build in data security, data destruction, and the right to be forgotten,
  • build it generic – the users are going to write the reports.

Happy reporting.
