
PhD Position in Artificial Intelligence for Data Lakes Management


  • Location: Barcelona (Spain)
  • Contract type: Temporary
  • Working hours: Full-time
  • Sector: Internet and technology
  • Vacancies: 1
  • Discipline: Innovation
  • Work arrangement: Hybrid

Fundació i2cat

Job description

Context

The global data lake market size is projected to triple from 2019 to 2024, reaching $20.1 billion [1], and the European share is about one quarter of this amount [2]. Although data lakes are more than a promising approach, today’s solutions do not fully unleash the potential of data analysis, especially at a large cross-organization scale, for several reasons. Firstly, data lakes are usually deployed and managed by a single party, and a centralized approach can lead to failure due to the complexity of the diverse data sources [3]. Secondly, the computing continuum (i.e., the resources located at the edge, fog, and cloud) is not fully exploited. To minimize the impact of data transfer, data should be processed where they are generated, but doing so raises security, privacy, and governance concerns. Thirdly, data sovereignty [4] must be preserved: personal or business-sensitive data cannot leave the boundaries of the organization unless they first undergo a transformation compliant with the organization’s policies and general norms, which often limits data sharing. Finally, data lake implementations are not sustainable: because of the illusion created by low-cost storage devices and the assumption that all data have huge value for companies [5], operators do not check whether a piece of data is already stored or whether it is useful, resulting in duplicated data and the storage of unused data.

Stretched Data Lakes aim to leverage Data Mesh and Data Fabric [6] concepts to address these challenges by enabling trusted, verifiable, and energy-efficient data flows across the edge-cloud continuum. They are based on a shared but decentralized approach for defining, enforcing, and tracking data governance requirements with specific emphasis on privacy/confidentiality. Moreover, by applying the principles of circular economy to data governance, i.e., to reuse data, application, and computation resources, Stretched Data Lakes will enable the creation of platforms for more energy-efficient and sustainable data analytics.

Topic

The overall aim of this thesis is to propose, implement and evaluate novel AI-based strategies for trustworthy, energy-efficient management of data flows within Stretched Data Lakes. The proposed approach will enable the definition of gravity- and friction-aware data privacy requirements that will drive data governance throughout the data lake. The enforcement of these policies will be explored by leveraging state-of-the-art AI and distributed and federated learning techniques to implement specific data operations that can be seamlessly applied when data are accessed or moved. Finally, the approach will also optimize the energy footprint of the data flows by exploiting predictive and optimization AI models.

Keywords

Stretched Data Lakes, Data Mesh, Data Fabric, Cloud-Edge Continuum, Trustworthiness, Privacy-aware Data Management, Energy-efficient Data Operations

References

[1] MarketsAndMarkets, BigData Market - Global Forecast to 2025, 2020.

[2] Data Intelligence, Global Data Lakes Market 2019-2026, 2019.

[3] Z. Dehghani, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, https://martinfowler.com/articles/data-monolith-to-mesh.html, accessed in Aug. 2022.

[4] L. Nagel, D. Lycklama, Design Principles for Data Spaces Position Paper, v1.0, DOI: https://doi.org/10.5281/zenodo.5105744, 2021. 

[5] F. Lucivero, Big Data, Big Waste? A Reflection on the Environmental Sustainability of Big Data Initiatives, Science and Engineering Ethics, 26, 2020, DOI: https://doi.org/10.1007/s11948-019-00171-7.

[6] A. Woodie, Data Mesh Vs. Data Fabric: Understanding the Differences, https://www.datanami.com/2021/10/25/data-mesh-vs-data-fabric-understanding-the-differences/, accessed in Aug. 2022. 

Responsibilities

  • Conduct a state-of-the-art study on AI-based privacy awareness and energy efficiency in data operations in the edge-cloud continuum.
  • Identify, document and prioritize a set of scientific and technical challenges affecting the viability of privacy-aware, energy-efficient data operations in the continuum.
  • Propose and implement innovative approaches to overcome the identified challenges.
  • Develop a validation prototype and evaluate the proposed approaches.
  • Participate in the TEADAL project tasks (meetings, deliverables, integration, etc.).

Environment

The selected candidate will be jointly mentored by a professor from the Universitat Politècnica de Catalunya (UPC) and a senior researcher from the Distributed AI department of i2CAT for three years. After establishing the state of the art, the selected candidate will conceive and implement algorithmically and architecturally different strategies for trustworthy and energy-efficient data flow management within Stretched Data Lakes. These strategies will be evaluated through analysis and prototyping of pilot use cases in at least two verticals (Health, Mobility, Agriculture, Industry 4.0 and Energy).

During the PhD program, the candidate must publish their work in scientific conferences and journals, and may contribute to patents. The successful candidate will also contribute to the TEADAL project tasks, meetings, presentations and deliverables.

The PhD position is fully funded by the TEADAL project, which is funded by the European Union’s Horizon Europe research and innovation programme under Grant Agreement No. xxx (TBD).

Application

  • A curriculum vitae, highlighting research experience and education.
  • The official transcripts of records for all undergraduate and graduate studies, with ranking if possible.
  • Optionally, one or two recommendation letters.

Who we are:

The i2CAT Foundation is a non-profit research and innovation center that promotes mission-driven R&D activities on advanced Internet architectures, applications, and services. More than 15 years of international research define our expertise in the fields of 5G, IoT, VR, and Immersive Technologies, Cybersecurity, Blockchain, AI, and Digital Social Innovation. The center partners with companies, public administration, academia, and end-users to leverage this knowledge in order to meet real social and business challenges.

The greatest value of i2CAT is the talent of the people who make up our team. Our team includes people of more than 13 nationalities, and we work every day to create and foster a work environment where we all feel comfortable creating, innovating and growing.

Want to know more? Visit our webpage! www.i2cat.net

What will you enjoy?

  • Work from our offices or from home, whichever works best for you. We ask for two days in person at the office to coordinate with the rest of the team.
  • This is a full-time vacancy
  • We have a flexible work schedule respecting your work-life balance
  • Reduced working hours on Fridays and in July and August
  • Fixed + variable salary
  • Optional benefits: Travel pass, restaurant vouchers, nursery services support, medical insurance
  • Annual leave of 27 working days
  • We have fruit in the office to promote a healthy lifestyle
  • If you are interested, you can participate in events of your sector.
  • You will work with a laptop. You can choose your operating system: Mac, Linux or Windows.
  • Company social and team-building events (virtual & in-person)
  • You can develop your own personal training programme with our support
  • We will work so that you have a career plan to promote your growth and development

Where will you do it?

At i2CAT we have had an established ‘work-from-home’ policy for some time. You can work from home or from the office, whichever suits you best. We expect you to attend the office two days per week: one to stay connected with your team and another to engage with other colleagues.

If you decide to come to the office, we are located in Zona Universitària, next to the Campus Nord of the UPC, within a multidisciplinary and multicultural environment. It is a very well-connected area (metro, tram, bus) with bars and restaurants around.

Our offices are designed with an open-office concept where everything is light and transparent. We have a variety of workspaces so that you don't have to be at the same table all day.

i2CAT is an organization committed to equal opportunities. That is why we seek to increase the number of women in those areas where they are underrepresented, and therefore explicitly encourage female candidates to apply.

i2CAT is an organization committed to creating an environment where we celebrate diversity and provide the dedicated support that our employees need, regardless of any disability.

If what you have read sounds good to you... let’s have a coffee and we will tell you more!

If you liked it but it is not the right job offer for you, you may know someone who would fit perfectly and whom you would like to recommend!

Requirements

  • Minimum requirements:
    • MSc or equivalent in Computer Science, AI, Telecommunications or related fields.
    • Sound understanding of the concepts and main functioning of Data Lakes, ETLs and distributed Big Data architectures.
    • Understanding of basic AI and Machine Learning concepts, particularly classification and optimization problems and distributed / federated learning.
    • Proven experience with at least one high-level programming language: Java / Scala, Python, C/C++ or similar.
    • Proven experience with Big Data tools such as Hadoop, Spark, Kafka, Flink or similar.
    • Excellent analytical, technical, and problem-solving skills.
    • Good technical writing, communication and presentation skills.
    • Excellent spoken and written English.
    • Curiosity, autonomy, proactivity and open-mindedness.
  • Desired requirements:
    • Knowledge about Data Mesh, Data Fabric, Data Spaces and / or Edge-Cloud continuum concepts.
    • Proven experience with virtualization technologies such as Docker, Swarm and Kubernetes.
    • Experience in a research environment such as a university or research center.
    • Experience in extending and contributing to Open Source projects.
    • Existing scientific articles and publications (include references with application).
Position closed