Matthias Boehm

Technische Universität Berlin

Date: 03 April 2023

Time: 16:00 (CET)

Title: System Infrastructure for Data-centric ML Pipelines - Balancing Automation and Manual Control

Abstract: Data-centric machine learning (ML) pipelines include - besides the training and hyper-parameter tuning of ML models - primitives for data cleaning, data augmentation, data validation, and model debugging in order to construct high-quality datasets with good coverage. Interestingly, state-of-the-art techniques for data integration, cleaning, and augmentation as well as model debugging are often based on machine learning themselves, which motivates their integration into ML systems. In this talk, we make a case for optimizing compiler infrastructure in Apache SystemDS, an open-source ML system for the end-to-end data science lifecycle. However, instead of full automation - which is rather unrealistic - we aim to automate the mechanical aspects of various tasks in data-centric ML pipelines while retaining manual control. As two concrete examples, we discuss SAGA for automatically enumerating data cleaning pipelines, and SliceLine for model debugging with regard to sub-groups of the input dataset.

Speaker Biography: Matthias Boehm is a full professor for large-scale data engineering at Technische Universität Berlin and the BIFOLD research center. His cross-organizational research group focuses on high-level, data science-centric abstractions as well as systems and tools to execute these tasks in an efficient and scalable manner. From 2018 through 2022, Matthias was a BMK-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the co-located Know-Center GmbH. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany, in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing.

