Building a Modern Data Platform for the Artificial Intelligence Era

A blueprint for an agile Big Data platform based on the Data Lakehouse architecture, built with state-of-the-art technology (DataOps, Kubernetes, and a Cloud-Native ecosystem) and providing the base for Machine Learning and Artificial Intelligence.

Explore the Research

SpringerLink

Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem - Discover Applied Sciences

In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, a modern data platform that is portable, resilient, and efficient is required to manage organisations’ data and support their growth. Furthermore, the change in the data management architectures has been accompanied by changes in storage formats, particularly open standard formats like Apache Hudi, Apache Iceberg, and Delta Lake. With many alternatives, organisations are unclear on how to combine these into an effective platform. Our work investigates capabilities provided by Kubernetes and other Cloud-Native software, using DataOps methodologies to build a generic data platform that follows the Data Lakehouse architecture. We define the data platform specification, architecture, and core components to build a proof of concept system. Moreover, we provide a clear implementation methodology by developing the core of the proposed platform, which are infrastructure (Kubernetes), ingestion and transport (Argo Workflows), storage (MinIO), and finally, query and processing (Dremio). We then conducted performance benchmarks using an industry-standard benchmark suite to compare cold/warm start scenarios and assess Dremio’s caching capabilities, demonstrating a 12% median enhancement of query duration with caching.
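
As a flavour of how the query-and-processing layer is used, the sketch below runs a SQL query against Dremio over its Arrow Flight endpoint from Python. It is a minimal illustration under assumed values: the host, port, credentials, and table name are placeholders rather than the paper's configuration, and the actual benchmark pipelines live in the repository linked under Resources below.

```python
# Minimal sketch: querying the Dremio query layer over Arrow Flight.
# The host, port (32010 is Dremio's usual Flight port), credentials, and
# table name are placeholders, not values from the paper.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio.example.internal:32010")

# Dremio's Flight endpoint accepts basic authentication and returns a
# bearer token that must accompany every subsequent call.
token = client.authenticate_basic_token("demo_user", "demo_password")
options = flight.FlightCallOptions(headers=[token])

sql = 'SELECT COUNT(*) AS row_count FROM lakehouse."store_sales"'
descriptor = flight.FlightDescriptor.for_command(sql)

# Ask Dremio where the result can be fetched from, then stream it back
# as an Arrow table.
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all().to_pandas())
```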

Artificial Intelligence without data is just Artificial! All of the recent AI breakthroughs would have been impossible without proper data management, because across domains, data is the gateway to every data-driven activity, such as Artificial Intelligence, Machine Learning, and Business Intelligence. Effectively managing data to support AI was the main goal of the paper “Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem”, published on 22 February 2025 (just after Love Data Week 2025, isn't it lovely?).

Paper Highlights

  • The research creates a blueprint that uses DataOps, Kubernetes, and a Cloud-Native ecosystem to build a resilient Big Data platform following the Data Lakehouse architecture, providing the base for Machine Learning and Artificial Intelligence.
  • Using an iterative approach, we architected and implemented the core of the platform, which is composable and cloud-agnostic. This avoids vendor lock-in and enables seamless deployment on any Cloud provider or on-premises.
  • Initial benchmarking showed that the platform can efficiently handle massive numbers of records, benefitting from the features of the Apache Iceberg format (a minimal reading sketch follows this list).
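
To make the Iceberg point concrete, here is a hypothetical sketch of reading an Iceberg table with PyIceberg against an S3-compatible MinIO endpoint. The catalog type, endpoint, credentials, and table name are assumptions for illustration only; on the platform itself, the same tables are reached through Dremio.

```python
# Hypothetical sketch: reading an Apache Iceberg table stored on MinIO
# (S3-compatible) with PyIceberg. The catalog URI, credentials, and table
# name are illustrative placeholders, not the paper's configuration.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",  # assumes a REST catalog service in front of the lakehouse
        "uri": "http://iceberg-rest.example.internal:8181",
        "s3.endpoint": "http://minio.example.internal:9000",
        "s3.access-key-id": "demo-access-key",
        "s3.secret-access-key": "demo-secret-key",
    },
)

table = catalog.load_table("tpcds.store_sales")

# Iceberg keeps rich table-level metadata, so snapshots (useful for time
# travel and auditing) can be listed without touching the data files.
for snapshot in table.metadata.snapshots:
    print(snapshot.snapshot_id, snapshot.timestamp_ms)

# Push a simple filter down to the scan and materialise the result as Arrow.
arrow_table = table.scan(row_filter="ss_quantity > 10").to_arrow()
print(arrow_table.num_rows)
```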

Personal Background

This paper is based on my dissertation for the Master of Science degree in Data Engineering at Edinburgh Napier University, which was a unique learning experience on many levels.

My intention to enrol in a master's program was not a recent one; it began in my final year of college. In 2010, I heard that a university-wide research competition had been launched, so I formed and led a three-member research group and entered the competition. Our research took third place, only four marks behind the first-place project.

At that time, I knew I wanted a similar experience again, but I decided to gain hands-on experience in the technology industry first. A decade later, in 2020, I enrolled in the Data Engineering master's program at Edinburgh Napier University, where I grew significantly in my personal, academic, and professional skills. During the dissertation, my supervisor, Dr. Peter J. Barclay, suggested writing a peer-reviewed paper based on it.

Later, in 2024, I started working with Dr. Peter J. Barclay, Dr. Christos Chrysoulas, and Dr. Nikolaos Pitropakis on a peer-reviewed paper summarizing the dissertation findings to create a blueprint using DataOps, Kubernetes, and the Cloud-Native ecosystem to build a resilient Big Data platform following the Data Lakehouse architecture, for use as a base for Machine Learning and Artificial Intelligence.

I faced many challenges on that journey; however, each one was a chance to step up and become a better version of myself. I persevered and achieved my goals. The research group therefore chose these quotes from our respective languages and cultures to emphasize the importance of perseverance and diligence:


عِندَ الصَّباحِ يَحمَدُ القومُ السُّرَى
(In the morning, the people praise the night’s journey)
- Arabic Proverb

Αρχή ήμισυ παντός
(The beginning is half of everything)
- Greek Proverb

Is obair latha tòiseachadh
(Beginning is a day's work)
- Scottish Gaelic Proverb


Motivation

Data sizes and types have changed dramatically in recent years, and older methods and architectures cannot cope with data that differs in both quantity and quality. Data Warehouses emerged in the 1980s, Data Lakes followed around 2010, and finally the Data Lakehouse, a relatively new hybrid architecture first introduced in 2020, combines the capabilities of a Data Warehouse and a Data Lake.

Still, based on the reviewed related work, no prior study had covered building a generic data platform that focuses on openness, portability, and averting vendor lock-in (cloud-agnosticism) while emphasising flexibility and extensibility. Our paper fills this gap by proposing and evaluating a data platform that leverages modern practices and technology and helps create a data-centric business, where different personas interact smoothly with the data platform while managing and benefitting from Big Data.

In short, the motivation for this work was the lack of any reference for building a data platform that uses state-of-the-art technology to handle AI and Big Data challenges.

Challenges

The main challenges could be narrowed down to:

  • Fast-paced changes in the data landscape.
  • Massive dataset size.

First, writing an academic paper in a fast-paced domain like Data Engineering and Artificial Intelligence is challenging, as the landscape shifts constantly. Staying ahead requires continuous monitoring of emerging trends, preprints, and cutting-edge developments. This dynamic environment makes producing impactful academic papers an uphill battle; however, with the industrial and academic skills available within the research team, it was possible to produce a high-quality paper.

Second, handling massive datasets is another Big Data challenge, and a demanding one for academic research because it requires powerful computational resources. Working with data at that scale also requires different types of systems for storage, processing, and scalability, all while ensuring accuracy and reproducibility, which limits the ability to iterate quickly. Leveraging Cloud Computing capabilities and the Cloud-Native ecosystem helped to optimize costs and speed up processing by providing computational resources on demand.

Conclusion

Undoubtedly, “data is the new gold” as everything now revolves around data. Therefore, modern data platforms must prioritize openness, portability, agility, and extensibility, while avoiding vendor lock-in through a cloud-agnostic approach to keep up with the dynamic changes in the data landscape.

Resources

All technical resources are available in the paper's GitHub repository, “modern-data-platform-research-paper”. The repository includes, but is not limited to: Cloud infrastructure as code, Kubernetes manifests, data pipelines, TPC-DS data generation scripts, and a benchmarking Jupyter Notebook.
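
As an illustration of the kind of analysis the benchmarking notebook performs, the short sketch below compares cold-start and warm-start (cached) query durations and reports the median improvement. The file name and column layout are hypothetical; the actual notebook and data are in the repository.

```python
# Hypothetical sketch of the cold-vs-warm comparison behind the caching
# result. The CSV layout (one row per query with cold and warm durations
# in seconds) is an assumption for illustration, not the repository's schema.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")  # columns: query, cold_s, warm_s

# Per-query relative improvement of the warm (cached) run over the cold run.
results["improvement_pct"] = (results["cold_s"] - results["warm_s"]) / results["cold_s"] * 100

print(f"Median improvement with caching: {results['improvement_pct'].median():.1f}%")
```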
