Download my resume or academic CV.
DSTI School of Engineering
Paris, France
Research Fellow & Lecturer
Jul 2024 - Present
Research on AI topics (see also research section)
Teaching "Data Pipeline II" (demo slides)
Code is publicly available here
European Central Bank
Frankfurt, Hessen, Germany
Senior Data Engineer @ Supervisory Technologies and Data
Jul 2024 - Present
Agora: a single data lake for European banking supervision that puts all prudential information in one accessible location. It functions as a “one-stop shop” for data-intensive supervisory analyses. It is available to all staff in the Single Supervisory Mechanism, i.e. the system of banking supervision in Europe. Agora operates on Amazon Web Services (AWS) and is developed using Apache Spark, the Cloudera stack, Python, and Kubernetes.
Navi: a network analytics service that enhances interconnected data analysis by providing advanced analytical capabilities. Navi runs on AWS and has been constructed using the Neo4j graph database, React, Python, and Kubernetes.
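For a flavor of the kind of interconnectedness query such a service supports, here is a minimal sketch using the official neo4j Python driver; the labels, relationship types, and connection details are illustrative assumptions, not Navi's actual data model:

```python
from neo4j import GraphDatabase

# Hypothetical schema: (:Bank)-[:EXPOSED_TO]->(:Counterparty).
# Connection details are placeholders, not a real deployment.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

def shared_counterparties(tx, bank_id):
    # Rank other banks by the number of counterparties they share
    # with the given bank -- a basic interconnectedness measure.
    query = (
        "MATCH (b:Bank {id: $bank_id})-[:EXPOSED_TO]->(c:Counterparty)"
        "<-[:EXPOSED_TO]-(other:Bank) "
        "RETURN other.id AS bank, count(c) AS shared ORDER BY shared DESC"
    )
    return [record.data() for record in tx.run(query, bank_id=bank_id)]

with driver.session() as session:
    print(session.execute_read(shared_counterparties, "BANK_A"))
driver.close()
```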
Software Developer @ Stress Test Production Engineering
Aug 2022 - Jun 2024
As a software engineer in the agile/SCRUM analytics team, my tasks revolved around developing a complex bespoke SupTech application (data quality checks/risk models) using Python, Spark, and SQL (Exadata).
Parsing business requirements to draft user stories while managing relationships with banking supervision to ensure a shared understanding of the product's vision, with a focus on business value. Presenting solutions to broad (non-technical) audiences.
Troubleshooting failures and unexpected behavior of large-scale data processing pipelines for stress test calculations by analyzing logs and writing complex queries against the relational database to identify issues
Streamlining the deployment process through automation using GitLab CI/CD pipelines
Developing a data pipeline to export data from Oracle to Cloudera (DEVO) using cluster computing (Spark via AWS Glue), with infrastructure set up in Terraform and the Spark application code deployed through GitLab
Automating several data processing tasks using orchestration and data integration tools (mostly Kubernetes cron jobs and Camunda)
Leading a workstream in the STAR cloud migration, focusing on optimizing the execution of analytics code on AWS. For example, I drafted a technical proposal for an efficient, serverless, and scalable solution for executing risk models using cloud-native technologies. To validate it, I implemented a proof of concept in which I containerized the code using Docker and executed it on AWS Lambda.
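A stylized version of that proof of concept: the entry point below would be baked into a Docker image built from an AWS Lambda Python base image. Function and payload names are hypothetical; only the overall pattern (containerized code invoked as a Lambda handler) reflects the actual work.

```python
# lambda_handler.py -- entry point of a container image built from
# public.ecr.aws/lambda/python:3.11 (CMD ["lambda_handler.handler"]).
import json

def run_risk_model(**params):
    # Stand-in for the containerized analytics code (hypothetical).
    return {"status": "ok", "params": params}

def handler(event, context):
    # One risk-model partition per invocation, so Lambda's horizontal
    # scaling replaces a long-running batch job.
    result = run_risk_model(**event.get("model_params", {}))
    return {"statusCode": 200, "body": json.dumps(result)}
```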
Data Engineer ∪ Data Analyst @ SRF/MAY Data Center
Apr 2020 - Jul 2022
I refactored the entire Cloudera Hadoop/Spark AnaCredit data pipeline by identifying bottlenecks (such as unnecessary data shuffling/joins and inefficient file formats), troubleshooting them through the Spark UI, and tuning performance: partitioning data more sensibly, avoiding wide transformations where possible, switching to the Parquet file format, broadcasting small tables to avoid costly data shuffling, and caching intermediate tables to avoid multiple reads (sketched below). These optimizations reduced the runtime from over 12 hours to less than 1 hour and yielded a considerably more robust pipeline; the old one occasionally crashed with OOM errors caused by data skew in ever-evolving datasets.
Proactively driving improvements to SHS data handling, I initiated the migration of a Stata/network-share-based ETL data pipeline to the Cloudera technology stack using Spark and Hadoop (Hive), automating procedures with cloud-native tools (CDSW scheduled jobs + Oozie). I introduced and developed CI/CD pipelines using GitLab and implemented a new procedure compatible with Apache Airflow to leverage DEVO.
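Condensed into a schematic PySpark snippet, the main optimization techniques from the AnaCredit refactoring look roughly like this (paths, tables, and columns are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("anacredit-refactor").getOrCreate()

# Columnar Parquet instead of row-oriented formats: less I/O,
# predicate pushdown for free.
loans = spark.read.parquet("/data/anacredit/loans")
banks = spark.read.parquet("/data/ref/banks")  # small reference table

# Broadcast the small table so the join avoids a full shuffle;
# cache the intermediate result that several downstream jobs reuse,
# instead of recomputing it each time.
enriched = loans.join(F.broadcast(banks), on="bank_id", how="left").cache()

# Repartition on the aggregation key to mitigate skew before the wide op.
summary = (
    enriched.repartition("country")
    .groupBy("country")
    .agg(F.sum("exposure").alias("total_exposure"))
)
summary.write.mode("overwrite").parquet("/data/out/exposure_by_country")
```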
I developed a Python application for the Bank Lending Survey (BLS) that consolidated over 20 VBA scripts used for BLS data validation and dissemination. I replaced the existing data layer with DISC/SPACE to enhance data management.
In my role at DGMF/SRF, I was responsible for significantly upgrading the analysis capacity of the BICG contagion model by replacing bank-borrower data from COREP Large Exposures reporting with the newly available AnaCredit credit register. Confronted with the 32 GB memory limit of a Cloudera Data Science Workbench session, I took the initiative to transition our data pipeline from pure R/Python to PySpark on Cloudera CDH. Using the Spark UI, I diagnosed performance problems caused by the creation of many intermediate tables, which resulted in excessive data shuffling, and optimized performance by minimizing the use of wide operations. Overall, this expanded the coverage of the model by a factor of more than 640 while increasing the runtime by only a factor of 2.5; the move to cluster computing allowed us to manage the large data volume effectively and greatly improved the model's accuracy by extending coverage to a much broader set of entities.
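The "fewer wide operations" idea in miniature (toy data; the real pipeline computed risk exposures, not these columns): collapsing several groupBy passes over the same DataFrame into a single aggregation removes all but one shuffle.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 100.0), ("A", 250.0), ("B", 75.0)], ["borrower", "amount"]
)

# Before: three separate groupBy calls -> three shuffles.
# After: one wide operation produces all three measures at once.
stats = df.groupBy("borrower").agg(
    F.sum("amount").alias("total_exposure"),
    F.count("*").alias("n_loans"),
    F.max("amount").alias("peak_exposure"),
)
stats.show()
```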
I contributed to ecb-connectors, an in-house open-source project enhancing data interoperability between the ECB's Cloudera Hadoop storage (DISC), DARWIN, FAME, and Oracle Exadata databases. My work focused on developing reliable, seamless data transfer mechanisms, facilitating better data accessibility and integration for analytical applications.
International Monetary Fund
Washington D.C., United States
Graduate Research Fellow (Fund Internship Program)
Jun 2019 - Sep 2019
Participant in the 2019 Fund Internship Program (FIP), working in the Monetary and Capital Markets (MCM) division. Coauthoring the 2019 Global Financial Stability Report (GFSR) analytical chapter titled “Banks' Dollar Funding: A Source of Financial Vulnerability” by providing state-of-the-art data preprocessing and analysis. The GFSR is the IMF's flagship policy document on financial stability and systemic risk.
The internship led me to pursue a follow-up project in which I developed a cloud-based web crawler. This application, powered by AWS, extracted data from the Securities and Exchange Commission's online reporting engine (EDGAR) for the research community. It utilized a serverless, managed Spark solution (AWS Glue) for efficient data processing, a NoSQL database (AWS DynamoDB) for storing metadata, and cloud object storage (AWS S3) for persisting large data files. The design followed cloud-native principles, prioritizing lightweight serverless managed services. The resulting application, as well as the data, was open-sourced, promoting collaboration and accessibility. In a related research project, the information on fund ownership structures was modeled in a graph database (Neo4j) to study the impact of fund mergers, for which accurate information on time-varying organizational (ownership) structures is crucial.
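The storage layer of that crawler followed a common serverless pattern, sketched here with boto3; bucket, table, and attribute names are invented for illustration:

```python
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("edgar-filings-metadata")  # hypothetical

def persist_filing(accession_no: str, cik: str, raw_bytes: bytes) -> None:
    """Store the raw EDGAR filing in S3 and index its metadata in DynamoDB."""
    key = f"filings/{cik}/{accession_no}.txt"
    # Large object goes to cheap, durable object storage...
    s3.put_object(Bucket="edgar-crawler-data", Key=key, Body=raw_bytes)
    # ...while the queryable metadata record points back to it.
    table.put_item(
        Item={
            "accession_no": accession_no,  # partition key
            "cik": cik,
            "s3_key": key,
        }
    )
```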
Bank for International Settlements
Basel, Switzerland
Graduate Research Fellow
Dec 2016 - May 2017
Implementing the entire data ingestion and analysis pipeline for a research project on corporate bond funds:
Using explorative data science techniques to understand and structure a large dataset received from Reuters (Lipper eMAXX)
Constructing a graph-database-like master/reference dataset on mutual fund ownership structures (mapping fund families, fund portfolios, and fund share classes across time), later remodeled in Neo4j for a subsequent project, by implementing fuzzy string matching in Python to match fund names and using the SEC's EDGAR API to query reference data (see the sketch after this list)
Cleaning the 45GB+ TRACE dataset by setting up the required data manipulation tasks (“Dick-Nielsen deduplication”) on the WRDS cloud using SAS
Implementing bond liquidity measures according to business specifications
Implementing and running regression models to support the data analysis of the business side
Ingesting the pre-processed CSV data dumps into a relational database system (Oracle)
Drafting technical documentation for the eMAXX database, considering the potentially different needs and diverse backgrounds of other analysts and economists
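A minimal sketch of the fund-name matching step, using the standard library's difflib (the original implementation may well have used a dedicated fuzzy-matching library; the names and cutoff are illustrative):

```python
import difflib

def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate fund name, or None if below the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Toy example: map an abbreviated eMAXX fund name onto an EDGAR reference name.
edgar_names = ["Vanguard Total Bond Market Index Fund", "PIMCO Income Fund"]
print(best_match("Vanguard Tot Bond Mkt Idx Fd", edgar_names))
```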
The data analysis resulted in a research paper, Debt Derisking, published in Management Science
Goethe University
Frankfurt, Hessen, Germany
Ph.D. Candidate (Research & Teaching Assistant)
Jun 2017 - Jul 2020
Providing research assistance for various policy-oriented projects, e.g., a report for the German Federal Ministry of Finance
Teaching both undergraduate and graduate-level classes
Supervising bachelor's and master's theses
SAFE
Frankfurt, Germany
Dataroom Supervisor
Aug 2015 - Jun 2016 (part-time)
Providing level-1 support for data room users (students/researchers)
Contributing to a natural language processing project based on Java and ANTLR
Coding parts of the basic data infrastructure for the System Financial Risk Platform (SFRP) using the Thomson Reuters Datastream Advance Nightshift Server and Python
Supervising undergraduate and graduate students' usage of the SAFE data room
German Institute for Economic Research
Berlin, Germany
Internship
Aug 2014 - Oct 2014
Data collection, verification, and examination; Excel programming and report writing
Using these data to estimate input-output models in Excel for impact analyses of companies in the healthcare sector and the renewable energy industry.