The world is full of situations where one size doesn’t fit all – shoes, health care, the number of sprinkles you want on a fudge sundae, to name a few. You can add data pipelines to the list.
Traditionally, a data pipeline handles the connectivity to business applications, controls the requests and flow of data into new data environments, and manages the steps needed to cleanse, organize, and present a refined data product to consumers inside or outside the company's walls. These results have become indispensable in helping decision-makers drive their business forward.
Lessons from Big Data
Everyone knows the big data success stories: how companies like Netflix build pipelines that manage more than a petabyte of data every day, or how Meta analyzes more than 300 petabytes of clickstream data in its analytics platform. It's easy to assume that once we've reached that scale, all the difficult problems have already been solved.
Unfortunately, it's not that simple. Just ask anyone who works with pipelines for operational data. They will be the first to tell you that one size definitely does not fit all.
When it comes to operational data (that is, the data underlying core business functions such as finance, supply chain, and human resources), organizations routinely fail to extract value from analytics pipelines that are designed the same way as big data environments.
Why? Because they’re trying to solve a fundamentally different data challenge with essentially the same approach, and it doesn’t work.
The problem is not the size of the data, but how complex it is.
Large social or digital streaming platforms often store massive datasets as a series of simple, serialized events. One row of data lands in the pipeline for each user watching a TV show; another row records each Like button clicked on a social media profile. All of this data flows through data pipelines at tremendous speed and scale using cloud technology.
The datasets themselves are enormous, but that's manageable because the underlying data is extremely well organized from the start. The highly regular structure of clickstream data means billions of records can be analyzed in very little time.
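To make that contrast concrete, here is a minimal sketch in Python (with invented field names; no particular platform's schema is implied) of how flat, self-describing clickstream events reduce analytics to a single scan:

```python
from collections import Counter

# Hypothetical clickstream events: each user action is one flat,
# self-contained record -- no joins are needed to interpret it.
events = [
    {"user_id": 101, "action": "play",  "asset_id": "show_42", "ts": 1700000000},
    {"user_id": 102, "action": "like",  "asset_id": "post_7",  "ts": 1700000005},
    {"user_id": 101, "action": "pause", "asset_id": "show_42", "ts": 1700000060},
]

# Because every row is complete on its own, analysis reduces to a
# single pass over the data -- work that parallelizes trivially at scale.
plays_per_asset = Counter(
    e["asset_id"] for e in events if e["action"] == "play"
)
print(plays_per_asset)  # Counter({'show_42': 1})
```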
Data pipelines and ERP platforms
For operational systems such as ERP (Enterprise Resource Planning) platforms, which most companies use to control their essential daily processes, it is a completely different data scenario.
Since their inception in the 1970s, ERP systems have evolved to optimize every ounce of performance to capture raw transactions from the business environment. Every sales order, financial ledger entry and inventory item in the supply chain must be captured and processed as quickly as possible.
To achieve this feat, ERP systems were designed around thousands of individual database tables that track business data elements, plus even more tables that track the relationships between those objects. This data architecture is effective at ensuring that a customer or supplier record is always up to date.
But as it turns out, what's great for transaction speed within a business process usually isn't so great for analytics. Instead of the clean, simple, well-organized event streams that define a modern online application, there's a spaghetti-like tangle of data spread across a complex, mission-critical, real-time application.
For example, analyzing a single financial transaction in a company’s ledgers may require data from more than 50 different tables in a backend ERP database, often involving multiple searches and calculations.
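As a toy illustration (the table and column names below are invented, not drawn from any actual ERP product), here is how even a drastically simplified version of that lookup chains joins across normalized tables:

```python
import pandas as pd

# Toy normalized ERP tables -- a real system spreads this across dozens more.
ledger     = pd.DataFrame({"entry_id": [1], "order_id": [10], "amount": [250.0]})
orders     = pd.DataFrame({"order_id": [10], "customer_id": [7], "currency_id": [2]})
customers  = pd.DataFrame({"customer_id": [7], "name": ["Acme Corp"], "region_id": [3]})
regions    = pd.DataFrame({"region_id": [3], "region": ["EMEA"]})
currencies = pd.DataFrame({"currency_id": [2], "code": ["EUR"], "usd_rate": [1.08]})

# Reconstructing the business meaning of ONE ledger entry already takes
# four joins and a calculation; production questions chain far more.
result = (
    ledger
    .merge(orders, on="order_id")
    .merge(customers, on="customer_id")
    .merge(regions, on="region_id")
    .merge(currencies, on="currency_id")
    .assign(amount_usd=lambda df: df["amount"] * df["usd_rate"])
)
print(result[["entry_id", "name", "region", "code", "amount_usd"]])
```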
To answer questions that span hundreds of tables and relationships, business analysts must write increasingly complex queries that often take hours to run. Too often, these queries never return an answer in time, leaving the company flying blind at crucial decision-making moments.
To compensate, companies try to redesign their data pipelines to route data into increasingly simplified business views, which reduces the complexity of queries and makes them faster to execute.
This can work in theory, but it comes at a cost to the fidelity of the data. Rather than letting analysts ask and answer questions against the full data, this approach condenses or reshapes the data to improve performance. That means analysts get faster answers to predefined questions and wait longer for everything else.
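A hypothetical sketch of that trade-off: once the pipeline condenses data into a predefined view, the detail needed to answer a new question is simply gone.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EMEA", "EMEA", "APAC"],
    "product": ["A", "B", "A"],
    "amount":  [100.0, 50.0, 75.0],
})

# The pipeline pre-aggregates into a simplified business view:
# fast to query, but it only answers the question it was built for.
view_by_region = sales.groupby("region", as_index=False)["amount"].sum()

# Predefined question ("sales by region?"): instant.
print(view_by_region)

# New question ("sales by product?"): the view can't answer it --
# the product column was condensed away, so it's back to the source system.
assert "product" not in view_by_region.columns
```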
With inflexible data pipelines, asking a new question means going back to the source system, which is time-consuming and quickly becomes expensive. And if something changes in the underlying ERP application, the pipeline breaks down entirely.
It's important to design for this layer of connectedness from the start, rather than implementing a static pipeline model that can't respond effectively to highly connected data.
Instead of breaking the pipeline into smaller and smaller pieces to work around the problem, the design should embrace those connections. In practice, this means tackling the root cause behind the pipeline: making data accessible to users without the time and expense of costly analytical queries.
Every joined table in a complex analysis puts additional pressure on the underlying platform, and on the people who keep the business running by fine-tuning and optimizing those queries. To rethink the approach, look at how everything can be optimized when the data is loaded, before any queries are executed. This is commonly referred to as query acceleration, and it provides a useful shortcut.
This approach to query acceleration delivers many times the performance of traditional data analysis, and it does so without requiring analysts to prepare or model the data in advance. By scanning the entire dataset and preparing it at load time, before any queries run, there are far fewer limits on the questions that can be asked. It also improves usability by exposing the full scope of raw business data for investigation.
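One way to picture the load-time-preparation idea (a generic sketch with invented names, not a description of any vendor's actual implementation): resolve the relationships once at ingest, keep every raw column, and let arbitrary questions run against the enriched result.

```python
import pandas as pd

def load_and_prepare(ledger, orders, customers):
    """Resolve table relationships once, at load time, keeping all raw columns."""
    return ledger.merge(orders, on="order_id").merge(customers, on="customer_id")

# Toy source tables (invented names).
ledger    = pd.DataFrame({"entry_id": [1, 2], "order_id": [10, 11], "amount": [250.0, 90.0]})
orders    = pd.DataFrame({"order_id": [10, 11], "customer_id": [7, 8]})
customers = pd.DataFrame({"customer_id": [7, 8], "region": ["EMEA", "APAC"]})

prepared = load_and_prepare(ledger, orders, customers)

# Any question -- predefined or brand new -- is now a scan, filter, or
# group-over the prepared data, with no join cost at query time.
print(prepared.groupby("region")["amount"].sum())
print(prepared[prepared["amount"] > 100])
```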
By challenging the fundamental assumptions about how we receive, process, and analyze our operational data, it is possible to simplify and streamline the steps required to move from expensive, fragile data pipelines to quick business decisions. Remember: one size doesn’t fit all.
Nick Jewell is senior director of product marketing at Incorta.