Context
The global data lake market is projected to triple between 2019 and 2024, reaching $20.1 billion [1], with Europe accounting for about one quarter of this amount [2]. Although data lakes are more than a promising approach, today’s solutions do not fully unleash the potential of data analysis, especially at a large cross-organization scale, for several reasons. Firstly, data lakes are usually deployed and managed by a single party, and such a centralized approach can fail because of the complexity of the diverse data sources [3]. Secondly, the computing continuum (i.e., the resources located at the edge, fog, and cloud) is not fully exploited. To minimize the impact of data transfer, data should be processed where they are generated, but doing so raises security/privacy and governance concerns. Thirdly, data sovereignty [4] must be preserved: personal or business-sensitive data cannot leave the boundaries of the organization unless they undergo a proper data transformation compliant with the organization’s policies and general norms, which often limits data sharing. Finally, data lake implementations are not sustainable: because of the illusion created by low-cost storage devices and the assumption that all data have huge value for companies [5], operators do not check whether a piece of data is already stored or whether it is useful, resulting in data duplication and the storage of unused data.
Stretched Data Lakes aim to leverage Data Mesh and Data Fabric [6] concepts to address these challenges by enabling trusted, verifiable, and energy-efficient data flows across the edge-cloud continuum. They are based on a shared but decentralized approach for defining, enforcing, and tracking data governance requirements, with specific emphasis on privacy/confidentiality. Moreover, by applying the principles of the circular economy to data governance, i.e., reusing data, applications, and computation resources, Stretched Data Lakes will enable the creation of platforms for more energy-efficient and sustainable data analytics.
Topic
The overall aim of this thesis is to propose, implement, and evaluate novel AI-based strategies for trustworthy, energy-efficient management of data flows within Stretched Data Lakes. The elaborated approach will enable the definition of gravity- and friction-aware data privacy requirements that will drive data governance throughout the data lake. The enforcement of these policies will be explored by leveraging state-of-the-art AI, distributed learning, and federated learning techniques to implement specific data operations that can be applied seamlessly whenever data are accessed or moved. Finally, the approach will also optimize the energy footprint of the data flows by exploiting predictive and optimization AI models.
Keywords
Stretched Data Lakes, Data Mesh, Data Fabric, Cloud-Edge Continuum, Trustworthiness, Privacy-aware Data Management, Energy-efficient Data Operations
References
[1] MarketsAndMarkets, BigData Market - Global Forecast to 2025, 2020.
[2] Data Intelligence, Global Data Lakes Market 2019-2026, 2019.
[3] Z. Dehghani, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, https://martinfowler.com/articles/data-monolith-to-mesh.html, accessed in Aug. 2022.
[4] L. Nagel, D. Lycklama, Design Principles for Data Spaces Position Paper, v1.0, DOI: https://doi.org/10.5281/zenodo.5105744, 2021.
[5] F. Lucivero, Big Data, Big Waste? A Reflection on the Environmental Sustainability of Big Data Initiatives, Science and Engineering Ethics. 26, 2020, DOI: https://doi.org/10.1007/s11948-019-00171-7.
[6] A. Woodie, Data Mesh Vs. Data Fabric: Understanding the Differences, https://www.datanami.com/2021/10/25/data-mesh-vs-data-fabric-understanding-the-differences/, accessed in Aug. 2022.
Responsibilities
Environment
The selected candidate will be jointly mentored for three years by a professor from the Universitat Politècnica de Catalunya (UPC) and a senior researcher from the Distributed AI department of i2CAT. After surveying the state of the art, the selected candidate will conceive and implement, both algorithmically and architecturally, different strategies for trustworthy and energy-efficient data flow management within Stretched Data Lakes. These strategies will be evaluated through analysis and prototyping of pilot use cases in at least two verticals (Health, Mobility, Agriculture, Industry 4.0, and Energy).
During the PhD program, the candidate must publish their work in scientific conferences and journals, and may contribute to patents. The successful candidate will also contribute to the TEADAL project tasks, meetings, presentations, and deliverables.
The PhD position is fully funded by the TEADAL project, which is funded by the European Union’s Horizon Europe research and innovation programme under Grant Agreement No. xxx (TBD).
Application
Who we are:
The i2CAT Foundation is a non-profit research and innovation center that promotes mission-driven R&D activities on advanced Internet architectures, applications, and services. More than 15 years of international research define our expertise in the fields of 5G, IoT, VR, and Immersive Technologies, Cybersecurity, Blockchain, AI, and Digital Social Innovation. The center partners with companies, public administration, academia, and end-users to leverage this knowledge in order to meet real social and business challenges.
The greatest value of i2CAT is the talent of the people who make up our team. Our team brings together people of more than 13 nationalities, and we work every day to create and foster a work environment where we all feel comfortable creating, innovating, and growing.
Want to know more? Visit our webpage! www.i2cat.net
What will you enjoy?
Where will you do it?
At i2CAT we have had an established ‘work-from-home’ policy for some time. You can work from home or from the office, whichever suits you best. We expect you to come to the office two days per week: one to stay connected with your team and another to engage with other colleagues.
If you decide to come to the office, we are located in Zona Universitària, next to the Campus Nord of the UPC, within a multidisciplinary and multicultural environment. It is a very well-connected area (metro, tram, bus) with bars and restaurants around.
Our offices follow an open-office concept built around light and transparency. We have a variety of workspaces so that you don't have to sit at the same desk all day.
i2CAT is an organization committed to equal opportunities. That is why we seek to increase the number of women in those areas where they are underrepresented, and therefore explicitly encourage female candidates to apply.
i2CAT is an organization committed to creating an environment where we celebrate diversity and provide the dedicated support that our employees need, regardless of disability.
If what you have read sounds good to you... let’s have a coffee and we will tell you more!
If you like what you have read but this is not the right offer for you, perhaps you know someone who would be a perfect fit and whom you would like to recommend!