6 min read

IT Architecture and data

This week, I spent a lot of time thinking about internal IT architectures. To clarify my thinking, I will use this article as an excuse to consider what good looks like and how to achieve it. This is tangentially related to some work I'm doing for a client, but for obvious reasons, I won't mention them or their specific concerns or use cases.

The old: point-to-point connectivity

A truly archaic IT architecture has a set of wholly isolated systems that work off different data sets and concern themselves only with their domain.

I don't think many businesses are there any more, and instead, I would argue most have moved to what I call point-to-point connectivity. I've shown what I mean by this schematically in the diagram below:

In this setup, systems talk directly to other systems. For example, in the diagram above, the CRM talks to the IT systems (perhaps for access management) and Customer Support (possibly to push customer details). Often, these connections are over some pre-built API that comes with the systems; these pre-built connectors are a big part of how this sort of setup develops, as they allow the owners of the systems to bypass any active design choices. However, it may also use (in order of increasing awfulness) custom APIs, scheduled file transfers, or manual file transfers.

These connections do offer some benefits by keeping data in sync; for example, the Finance team no longer needs to maintain its own copy of the business's headcount data and can instead use the copy from the HR system. However, the result is a network of tightly coupled systems, all dependent on each other. Changing how data is structured in one system can have a knock-on impact on other systems, and that impact is very likely only understood by people with expertise in both. Worse, some impacted systems might not connect to the changed system at all, but instead be indirectly affected by changes propagating through an intermediary system.
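To make the coupling problem concrete, here is a minimal, hypothetical sketch (the system, field, and function names are all invented for illustration): a Finance report that reads headcount data directly from an HR system's export. When HR renames a field, the failure surfaces in Finance's code, and only someone who knows both systems can diagnose it.

```python
# Hypothetical point-to-point integration: Finance reads directly
# from an HR system's export format. All field names are invented.

hr_export_v1 = {"employee_id": 1, "dept": "Finance", "fte": 1.0}

def finance_headcount(records):
    # Finance's report is hard-coded to HR's internal field names.
    return sum(r["fte"] for r in records)

print(finance_headcount([hr_export_v1]))  # 1.0

# If HR later renames "fte" to "full_time_equivalent", this
# integration breaks, and the breakage surfaces in Finance, not HR:
hr_export_v2 = {"employee_id": 1, "dept": "Finance", "full_time_equivalent": 1.0}
try:
    finance_headcount([hr_export_v2])
except KeyError as missing:
    print(f"Broken downstream report: missing field {missing}")
```

The same fragility multiplies with every direct connection added to the network.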

The target: decoupling

There are two reasons to create integrations between systems:

  • The first is that they might genuinely need to be kept in sync for operational reasons. For example, IT needs to get a list of employees from HR to give the right people access to the right systems.
  • The second is for reporting or other use cases using that data. If Finance and HR give different answers to the same question, it's not only embarrassing but actively limits what decision-makers can do with the information.

Separating these two things in practice is surprisingly tricky because most functions will want a complete data set to do their jobs and will argue that their connections fall into the first bucket. The architect responsible for the work must challenge this, and systems should be kept in sync directly only where necessary. The alternative is to build reporting on top of the data lake/warehouse, to be discussed later.

Keeping systems in sync

As I said above, the first bucket is where systems genuinely need to be kept in sync for operational reasons. A business should limit the instances where this is the case as far as possible, but it likely won't be able to eliminate all of them (particularly at the beginning).

This synchronisation is best achieved by introducing a decoupling layer over the top of all the systems, which can then be used to manage the connections between them. This is shown schematically in the diagram below:

This layer can manage the connectivity between the different systems in a centralised way that can be policed by the relevant internal teams (likely a technology or IT function) to ensure it meets any architectural principles or guardrails they have put in place. This layer also makes all API connections across the business immediately visible: in principle, anyone can log in and see a list or visualisation of all data flowing between systems.
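A minimal sketch of what such a decoupling layer does, assuming a simple publish/subscribe model (the class, topic, and system names are illustrative, not any real product's API): systems publish changes to a central layer instead of calling each other directly, and every flow is recorded centrally, which is what makes the "log in and see all data flows" visibility possible.

```python
# Minimal sketch of a decoupling layer: systems publish changes to a
# central broker instead of calling each other directly.
from collections import defaultdict

class IntegrationLayer:
    def __init__(self):
        self._subscribers = defaultdict(list)
        self.audit_log = []  # every data flow is visible centrally

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        self.audit_log.append((topic, payload))
        for handler in self._subscribers[topic]:
            handler(payload)

layer = IntegrationLayer()

# IT subscribes to employee changes; it never talks to HR directly.
access_list = []
layer.subscribe("hr.employee_created", lambda e: access_list.append(e["name"]))

# HR publishes once; any number of systems can consume the event.
layer.publish("hr.employee_created", {"name": "Alice", "dept": "Finance"})

print(access_list)           # ['Alice']
print(len(layer.audit_log))  # 1
```

Because HR only knows about the layer, adding or removing a consumer (say, Customer Support) requires no change to HR at all.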

Making data available for use cases

Our second bucket was reporting and other use cases that need this data. This is served by adding another layer to the diagram: the data layer.

The data layer stores the 'single source of truth' data, which should be used across the business for reporting, decision-making, and other use cases. The data warehouse holds the structured data, which is typically where data from backend systems ends up. The data lake then holds all the unstructured data.
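To illustrate the structured side of the data layer, here is a sketch using SQLite as a stand-in for a real warehouse (the table and field names are invented): backend systems land their data once, and every consumer reports from that single copy.

```python
# Sketch of the warehouse half of the data layer, with SQLite
# standing in for a real data warehouse. Schema is illustrative.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE employees (id INTEGER, function TEXT, fte REAL)"
)

# A backend system (here, HR) lands its structured data once...
warehouse.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Finance", 1.0), (2, "Finance", 0.5), (3, "IT", 1.0)],
)

# ...and every consumer reports from this single source of truth,
# so Finance and HR cannot give different answers to the same question.
rows = warehouse.execute(
    "SELECT function, SUM(fte) FROM employees "
    "GROUP BY function ORDER BY function"
).fetchall()
print(rows)  # [('Finance', 1.5), ('IT', 1.0)]
```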

Tech capabilities

The final layer to consider is the tech capabilities layer. The data in the data layer is valuable in itself, but a set of reusable tech components is needed to put that data into action quickly and efficiently, as shown in the image below.

Data analytics/visualisation, for example, is the ability to turn data into graphs and other visualisations. This capability will be needed by several use cases, mainly reporting ones, so it makes sense for it to exist as a reusable component. This saves teams from thinking about what tool to use and how to use it every time they want to visualise data, allowing them to move significantly faster.
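The shape of such a reusable component can be sketched as a single shared helper that any team can call, rather than each team choosing and learning its own tooling. Here a text bar chart stands in for a real charting tool; the function name and interface are invented for illustration.

```python
# Sketch of a reusable visualisation component: one shared helper,
# called by every use case that needs a chart. A text bar chart
# stands in for a real charting capability.

def bar_chart(series, width=20):
    """Render a {label: value} mapping as horizontal text bars."""
    peak = max(series.values())
    lines = []
    for label, value in series.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<10} {bar} {value}")
    return "\n".join(lines)

print(bar_chart({"Finance": 15, "IT": 6, "HR": 3}))
```

A team with a new reporting use case calls the shared component and gets a consistent output, instead of evaluating tools from scratch.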

Getting to the target state - a use case-driven approach

That all sounds like a lot, and it would be if you tried to build it all at once. That can be a tempting approach; technology and IT teams often want to get all of these components in place and fully operational, usually over many months, before they start to deliver anything with them. A much better approach is an incremental, use case-driven one, represented in the image below:

In the example above, we want to create a headcount report. That headcount data needs to have several dimensions, such as function, tenure, gender, etc. I've assumed all this data is available in the HR system.

All the components traced by the blue line need to be created to deliver this use case, and nothing else. In other words, what needs to happen is:

  • An API management platform should be added and connected to HR data only.
  • A data warehouse should be added, and the API management platform should push data into it.
  • An analytics/visualisation capability must be added and connected to the data warehouse.
  • The teams then configure the analytics platform to pull, cut and display the relevant data.
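The steps above can be sketched end to end under the same illustrative assumptions as before (all names and data are invented): a function standing in for the API management platform pulls from HR, the data lands in a SQLite "warehouse", and an analytics step cuts headcount by any dimension.

```python
# End-to-end sketch of the headcount use case: pull from HR, land
# in a warehouse, cut by dimension. All names and data are invented.
import sqlite3

def pull_from_hr():
    # Stand-in for the API management platform calling the HR system.
    return [
        {"id": 1, "function": "Finance", "gender": "F", "tenure": 4},
        {"id": 2, "function": "IT", "gender": "M", "tenure": 2},
        {"id": 3, "function": "IT", "gender": "F", "tenure": 7},
    ]

def load_to_warehouse(records):
    # Stand-in for the data warehouse receiving HR data.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE headcount "
        "(id INTEGER, function TEXT, gender TEXT, tenure INTEGER)"
    )
    db.executemany(
        "INSERT INTO headcount VALUES (:id, :function, :gender, :tenure)",
        records,
    )
    return db

def headcount_by(db, dimension):
    # Stand-in for the analytics capability: cut by any dimension.
    query = (
        f"SELECT {dimension}, COUNT(*) FROM headcount "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return dict(db.execute(query).fetchall())

db = load_to_warehouse(pull_from_hr())
print(headcount_by(db, "function"))  # {'Finance': 1, 'IT': 2}
print(headcount_by(db, "gender"))    # {'F': 2, 'M': 1}
```

Note that nothing here knows about Finance systems, Customer Support, or a data lake; only what the use case needs gets built.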

The next use case might then reuse some of these existing components and add a new data set, like Finance data. The second use case can therefore be delivered much more quickly while still making progress towards the target architecture.

Separation of responsibilities

The tech and business teams will be inseparable for many early use cases. The tech teams deliver only the capabilities and data sets that are actively needed by use cases, and are closely integrated into the teams delivering them. This avoids the risk of them building functionality or data that isn't required.

However, as the part of this map below the dotted line matures over time, it becomes possible to deliver some use cases without requiring changes, and the separation becomes more significant. At this point, the technology function can focus on two things beyond just the capabilities needed for a particular use case, namely:

  • The speed with which use cases are delivered.
  • The overall cost of the architecture.

They should be incentivised to improve both and rewarded accordingly.

Meanwhile, the business functions that deliver use cases can focus on (and be rewarded in line with) the absolute value they provide to the business by delivering their use case. This value is not always monetary and, therefore, not always easy to measure directly, but it should be assessed by the impacted stakeholders of each use case.

Conclusion

This incremental approach to value delivery, combined with a clear target architecture and a set of principles underlying it, allows the IT architecture to evolve towards the target state without requiring a significant upfront investment in capabilities that may or may not add value down the line.