Comparing enterprise data storage infrastructure


Date: 28 June 2022


Data storage is a critical task for every business these days. Fast querying needs combined with large data volumes have made data storage a tough nut to crack. Traditional solutions such as data warehouses can host large volumes of data, but they are rigid and rarely support quick querying. They also struggle to host unstructured data.

On the flip side, newer storage solutions offer fast querying but struggle to cope with large data volumes, which limits their scope. The best organisations combine both approaches to create a data-driven process. The concept of a data mesh has gathered momentum, further diversifying opinions on all things storage.

So which storage option offers the best solution for small businesses? Are data meshes the future? Here are some key considerations when choosing a data storage solution.

Agility versus scale

Data warehouses come in all sizes and offer a range of functionalities. You could choose a time series database or a hybrid option (such as ClickHouse or Druid) depending on your needs. Whatever you decide, there's no getting away from the fact that a data warehouse or data lake centralises your data.

Your engineers will have to go through disparate information and organise it each time you initiate a data dump. While this scenario sounds like a nightmare, the fact is you can store huge quantities of data with either of these options. A centralised database also gives you a firm repository of data you can rely on.

Security and other downstream processes also become easier to design since you won't have to account for multiple data sources. By contrast, a data mesh decentralises data by segregating it based on domain usage. Every team can load its individual data and access it quickly without waiting for a central database to respond.

This means teams can enhance or create data products faster. Data meshes also democratise data analytics better than lakes. Lakes technically achieve this goal, but you would need to move data from one lake to another to give your teams the access they need.

A mesh removes this need. However, it isn't a simple solution. Given the complex network of warehouses, lakes, and other storage infrastructure within your mesh, you'll need strong data governance and ETL (extract, transform, load) processes. Collaboration guidelines and access controls will need to be kept under constant review since the mesh introduces the possibility of configuration errors.
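As a rough illustration of what keeping access controls under review could look like, here is a minimal sketch of an automated policy audit. The `DomainPolicy` structure, its fields, and the team names are hypothetical, invented for this example; a real mesh would read its policies from whatever catalogue or governance tool you use.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DomainPolicy:
    """Hypothetical access policy for one domain in the mesh."""
    domain: str
    owner: Optional[str]
    readers: Set[str] = field(default_factory=set)


def audit_policies(policies, known_teams):
    """Return human-readable findings for policies that look misconfigured."""
    findings = []
    for policy in policies:
        # Every domain should have an owner that is a real team.
        if not policy.owner:
            findings.append(f"{policy.domain}: no owner assigned")
        elif policy.owner not in known_teams:
            findings.append(
                f"{policy.domain}: owner '{policy.owner}' is not a known team"
            )
        # Read access granted to a team that no longer exists is a red flag.
        for reader in sorted(policy.readers - known_teams):
            findings.append(
                f"{policy.domain}: reader '{reader}' is not a known team"
            )
    return findings


if __name__ == "__main__":
    teams = {"sales-team", "finance-team"}
    policies = [
        DomainPolicy("sales", "sales-team", {"finance-team", "old-contractor"}),
        DomainPolicy("hr", None),
    ]
    for finding in audit_policies(policies, teams):
        print(finding)
```

Running a check like this on a schedule is one way to catch configuration drift early, though it only covers the two failure modes assumed here.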

These negatives can be handled with careful planning. However, it's important to understand the pros and cons of both choices before jumping in.

Ownership silos

Data ownership is a tricky subject for most enterprises. Who owns data when everything is redirected to a centralised location? Access is usually determined via risk parameters and needs. However, access rules don't define ownership.

Most organisations create a central data team and transfer responsibility to them. This team cleans, monitors, and transforms data as needed. However, the team sits away from the business and often lacks context, so errors can occur due to a lack of understanding.

Data meshes solve this issue by localising domain-based data. End-users and business specialists are closer to their data and can offer quick insights. This boosts analytics speed, and insights are delivered faster at a local level. However, issues arise when rolling out analytics to an organisational level.

Because data is shared across teams, duplication issues arise. One team doesn't have full visibility over another team's data, leading to rollup issues that need to be corrected by an engineering team. So, even if an enterprise localises data, it still needs an engineering team.
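To make the rollup problem concrete, here is a small sketch under assumed data: two domain teams each hold a copy of the same order, so a naive organisation-wide total counts it twice. The record layout (`order_id`, `amount`) and the domain names are hypothetical.

```python
def naive_rollup(domain_records):
    """Sum every amount in every domain, duplicates included."""
    return sum(
        record["amount"]
        for records in domain_records.values()
        for record in records
    )


def deduplicated_rollup(domain_records):
    """Sum amounts across domains, counting each order_id only once."""
    amounts_by_order = {}
    for records in domain_records.values():
        for record in records:
            # Keep the first copy seen; later copies of the same order are ignored.
            amounts_by_order.setdefault(record["order_id"], record["amount"])
    return sum(amounts_by_order.values())


if __name__ == "__main__":
    data = {
        "sales": [
            {"order_id": 1, "amount": 100},
            {"order_id": 2, "amount": 50},
        ],
        "finance": [
            {"order_id": 2, "amount": 50},  # same order, held by a second team
            {"order_id": 3, "amount": 25},
        ],
    }
    print(naive_rollup(data))         # 225, order 2 is counted twice
    print(deduplicated_rollup(data))  # 175
```

This assumes both teams agree on a shared key and on the amount for each order. In practice, agreeing on that key and reconciling conflicting copies is exactly the work that still falls to an engineering team.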

A central solution such as a data warehouse doesn't offer the same agility. However, it simplifies querying. The engineering team becomes the central conduit through which all data is accessed and analysed, limiting potentially false analyses. Collaboration is also simpler since the central team offers data access based on needs.

This means there's a trade-off in terms of data ownership. Centralised solutions offer clear-cut ownership but less agility. A mesh offers higher agility but might introduce errors into analytics.

Alignment with DevOps

DevOps is a culture, not just a set of processes. Collaboration is a central pillar, and many proponents of a data mesh stress that it's the best route to building a DevOps culture. However, as you've already seen, a data mesh can reduce collaboration too.

There's no doubt that distributed data ownership makes it easier to share and maintain data integrity. However, there's also the risk of data silos forming. The key is to build the right processes that prevent this scenario from occurring.

Faster data product delivery times can also aid in CI/CD pipeline goals, something data warehouses struggle with. However, if carefully controlled, data warehouses can enhance collaboration, even if insights are slower.

A wide range of pros and cons

As you can see, decentralised data meshes and centralised data warehouses offer multiple pros and cons. At the end of the day, a data mesh will leave you with a more agile posture. However, you must create processes that help you avoid the common pitfalls in this architecture.

Copyright 2022. Featured post made possible by Rene Mulyandari.
