The Curse of Conway and the Data Space | by Jack Vanlightly | Oct, 2024 | Towards Data Science

Oct 27, 2024

Jack Vanlightly

Towards Data Science

Listen

This article was originally posted on my blog https://jack-vanlightly.com.

The article was triggered by and riffs on the “Beware of silo specialisation” section of Bernd Wessely’s post Data Architecture: Lessons Learned. It brings together a few trends I am seeing plus my own opinions after twenty years experience working on both sides of the software / data team divide.

Conway’s Law:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” — Melvin Conway

This is playing out worldwide across hundreds of thousands of organizations, and it is no more evident than in the split between software development and data analytics teams. These two groups usually have a different reporting structure, right up to, or immediately below, the executive team.

This is a problem now and is only growing.

Jay Kreps remarked five years ago that organizations are becoming software:

“It isn’t just that businesses use more software, but that, increasingly, a business is defined in software. That is, the core processes a business executes — from how it produces a product, to how it interacts with customers, to how it delivers services — are increasingly specified, monitored, and executed in software.” — Jay Kreps

The effectiveness of this software is directly tied to the organization’s success. If the software is dysfunctional, the organization is dysfunctional. The same can play out in reverse, as organizational structure dysfunction plays out in the software. All this means that a company that wants to win in its category can end up executing poorly compared to its competitors and being too slow to respond to market conditions. This kind of thing has been said umpteen times, but it is a fundamental truth.

When “software engineering” teams and the “data” teams operate in their own bubbles within their own reporting structures, a kind of tragic comedy ensues where the biggest loser is the business as a whole.

There are more and more signs that point to a change in attitudes to the current status quo of “us and them”, of software and data teams working at cross purposes or completely oblivious to each other’s needs, incentives, and contributions to the business’s success. There are three key trends that have emerged over the last two years in the data analytics space that have the potential to make real improvements. Each is still quite nascent but gaining momentum:

After reading this article, I think you’ll agree that all three are tightly interwoven.

Data engineering has evolved as a separate discipline from that of software engineering for numerous reasons:

The dust has largely settled, though technologies are still evolving. We’ve had time to consolidate and take stock of where we are. The data community is starting to realize that many of the current problems are not actually so different from the problems of the software development side. Data teams are writing software and interacting with software systems just as software engineers do.

The types of software can look different, but many of the practices from software engineering apply to data and analytics engineering as well:

It’s time for data and analytics engineers to identify as software engineers and regularly apply the practices of the wider software engineering discipline to their own sub-discipline.

Data contracts exploded onto the data scene in 2022/2023 as a response to the frustration of the constant break-fix work of broken pipelines and underperforming data teams. It went viral and everyone was talking about data contracts, though the concrete details of how one would implement them were scarce. But the objective was clear: fix the broken pipelines problem.

Broken pipelines for many reasons:

I recently listened to Super Data Science E825 with Chad Sanderson, a big proponent of data contracts. I loved how he defined the term:

My definition of data quality is a bit different from other people’s. In the software world, people think about quality as, it’s very deterministic. So I am writing a feature, I am building an application, I have a set of requirements for that application and if the software no longer meets those requirements that is known as a bug, it’s a quality issue. But in the data space you might have a producer of data that is emitting data or collecting data in some way, that makes a change which is totally sensible for their use case. As an example, maybe I have a column called timestamp that is being recorded in local time, but I decide to change that to UTC format. Totally fine, makes complete sense, probably exactly what you should do. But if there’s someone downstream of me that’s expecting local time, they’re going to experience a data quality issue. So my perspective is that data quality is actually a result of mismanaged expectations between the data producers and data consumers, and that is the function of the data contract. It’s to help these two sides actually collaborate better with each other. — Chad Sanderson

What constitutes a data contract is still somewhat open to interpretation and implementation regarding actual concrete technology and patterns. Schema management is a central theme, though only one part of the solution. A data contract is not only about specifying the shape of the data (its schema); it’s also about trust and dependability, and we can look to the REST API community to understand this point:

In software engineering, when Service A needs the data of Service B, what Service A absolutely doesn’t do is just access the private database of Service B. What happens is the following:

Why does the team of Service A do this for the team of Service B? Is it out of altruism? No. They collaborate because it is valuable for the business for them to do so. A well-run organization is run with the mantra of #OneTeam, and the organization does what is necessary to operate efficiently and effectively. That means that team Service A sometimes has to do work for the benefit of another team. It happens because of alignment of incentives going up the management chain.

It is also well known in software engineering that fixing bugs late in the development cycle, or worse, in production, is significantly more expensive than addressing them early on. It is disruptive to the software process to go back to previous work from a week or a month before, and bugs in production can lead to all manner of ills. A little upfront work on producing well-modeled, stable APIs makes life easier for everyone. There is a saying for this: an ounce of prevention is worth a pound of cure.

These APIs are contracts. They are established by opening communication between software teams and implemented when it is clear that the ROI makes it worth it. It really comes down to that. It generally works like this inside a software engineering department due to the aligned incentives of software leadership.

The term API (or Application Programming Interface) doesn’t quite fit “data”. Because the product is the data itself, rather than interface over some business logic, the term “data product” fits better. The word product also implies that there is some kind of quality attached, some level of professionalism and dependability. That is why data contracts are intimately related to data products, with data products being a materialization of the more abstract data contract.

Data products are very similar to the REST APIs on the software side. It comes down to the opening up of communication channels between teams, the rigorous specification of the shape of the data (including the time zone from Chad’s words earlier), careful evolution as inevitable changes occur, and the commitment of the data producers to maintain stable data APIs for the consumers. The difference is that a data product will typically be a table or a stream (the data itself), rather than an HTTP REST API, which typically drives some logic or retrieves a single entity per call.

Another key insight is that just as APIs make services reusable in a predictable way, data products make data processing work more reusable. In the software world, once the Orders API has been released, all downstream services that need to interact with the orders sub-system do so via that API. There aren’t a handful of single-use interfaces set up for each downstream use case. Yet that is exactly what we often see in data engineering, with single-use pipelines and multiple copies of the source data for different use cases.

Simply put, software engineering promotes reusability in software through modularity (be it actual software modules or APIs). Data products do the same for data.

Shift Left came out of the cybersecurity space. Security has also historically been another silo where software and security teams operate under different reporting structures, use different tools, have different incentives, and share little common vocabulary. The result has been a growing security crisis that we’ve become so used to now that the next multi-million record breach barely gets reported. We’re so used to it that we might not even consider it a crisis, but when you look at the trail of destruction left by ransomware gangs, information stealers, and extortionists, it’s hard to argue that this should be business as usual.

The idea of Shift Left is to shift the security focus left to where software is being developed, rather than being applied after the fact, by a separate team with little knowledge of the software being developed, modified, and deployed. Not only is it about integrating security earlier in the development process, it’s also about improving the quality of cyber telemetry. The heterogeneity and general “messiness” of cyber telemetry drive this movement of shifting processing, clean up, and contextualization to the left where the data production is. Reasoning about this data becomes so challenging once provenance is lost. While cyber data is unusually challenging, the lessons learned in this space are generalizable to other domains, such as data analytics.

The similarity of the silos of cybersecurity and data analytics is striking. Silos assume that the silo function can operate as a discrete unit, separated from other business functions. However, both cybersecurity and data analytics are cross-functional and must interact with many different areas of a business. Cross-functional teams can’t operate to the side, behind the scenes, or after the fact. Silos don’t work, and shift-left is about toppling the silos and replacing them with something less centralized and more embedded in the process of software development.

Bernd Wessely wrote a fantastic article on TowardsDataScience about the silo problem. In it he argues that the data analytics silo can be so engrained that the current practices are not questioned. That the silo comprised of an ingest-then-process paradigm is “only a workaround for inappropriate data management. A workaround necessary because of the completely inadequate way of dealing with data in the enterprise today.”

The sad thing is that none of this is new. I’ve been reading articles about breaking silos all my career, and yet here we are in 2024, still talking about the need to break them! But break them we must!

If the data silo is the centralized monolith, separated from the rest of an organization’s software, then shifting left is about integrating the data infrastructure into where the software lives, is developed, and operated.

Service B didn’t just reach into the private internals of Service A; instead, an interface was created that allowed Service A to get data from Service B without violating encapsulation. This interface, an API, queue, or stream, became a stable method of data consumption that didn’t break every time Service A needed to change its hidden internals. The burden of providing that interface was placed on the team of Service A because it was the right solution, but there was also a business case to do so. The same applies with Shift Left; instead of placing the ownership of making data available on the person who wants to use the data, you place that ownership upstream to where the data is produced and maintained.

At the center of this shift to the left is the data product. The data product, be it an event stream or an Iceberg table, is often best managed by the team that owns the underlying data. This way, we avoid the kludges, the rushed, jerry-rigged solutions that bypass good practices.

To make this a reality, we need the following:

We see a lot happening in this space, from catalogs, governance tooling, table formats such as Apache Iceberg, and a wealth of event streaming options. There is a lot of open source here but also a large number of vendors. The technologies and practices for building data products are still early in their evolution, but expect this space to develop rapidly.

You’d think that the majority of data platform engineering is solving tech problems at large scale. Unfortunately it’s once again the people problem that’s all-consuming. — Birdy

Organizations are becoming software, and software is organized according to the communication structure of the business; ergo, if we want to fix the software/data/security silo problem, then the solution is in the communication structure.

The most effective way to make data analytics more impactful in the enterprise is to fix the Conway’s Law problem. It has led to both a cultural and technological separation of data teams from the wider software engineering discipline, as well as weak communication structures and a lack of common understanding.

The result has been:

The barriers to achieving the vision of a more integrated software and data analytics world are the continued isolation of data teams and the misalignment of incentives that impede the cooperation between software and data teams. I believe that organizations that embrace #OneTeam, and get these two sides talking, collaborating, and perhaps even merging to some extent will see the greatest ROI. Some organizations may already have done so, but it is by no means widespread.

Things are changing; attitudes are changing. Data engineering is software engineering, data contracts/products, and the emergence of Shift Left are all leading indicators.

Previous: Deeper Connect Air takes VPN security on the road | Mashable Next: The best 5G Windows laptops in 2024 with LTE support | Windows Central

Send inquiry

Send