(BIP) Applied Data Modelling: A Case Study in Atmospheric Pollution (3cr)

Course unit code: ICB018AS2AE

General information

ECTS credits: 3 cr

Teaching language: English

Learning objectives

After this course, students have acquired the following skills:
-Students can link theoretical modelling skills to applied data problems in Python.
-Students have honed their investigative skills to identify pitfalls of real-life modelling.
-Students can translate applied research questions to specific modelling tasks, can extract answers from given data and evaluate the quality of these answers.
-Students can visualize and present their findings in an appropriate manner to stakeholders.
-Students are well-prepared to handle a data analytics task in the future, e.g., for their thesis.

Concept

This Problem-Based Learning course teaches the fundamentals of applied data analytics and modelling in Python. In preparation for the intensive course, students work through asynchronous lessons in which they learn how to use Python to handle and visualize data and to implement models. They review how classic modelling methods such as regression analysis, decision trees, and random forests are used in practice. The lecturer provides a comprehensive overview of these topics, as well as selected subjects from exploratory data analytics and diagnostic methods, focussing on maintaining explainable and parsimonious models. This foundational knowledge prepares students for the hands-on portion of the course.

During the intensive course, students are divided into small and diverse groups, emphasizing collaboration between universities. With guidance and support from the lecturer, these groups will apply the discussed methods to a provided dataset. The central research question they aim to address focuses on investigating how human activities influence the intensity and spread of various atmospheric pollutants at different locations within a city. By collaborating, students will utilize their analytical skills to explore the data, identify patterns, and interpret their findings. Finally, students will present their results to each other and discuss the different approaches chosen within each group to identify differences between the provided locations.

The class emphasizes collaborative learning and practical application. Each group will work through the data analysis process, from initial exploration to final interpretation. Students will review their collective results, compare methodologies, and discuss their conclusions. This approach not only reinforces their understanding of data analytics and modelling techniques but also enhances their problem-solving abilities and teamwork skills. The ultimate goal is for students to formulate well-supported answers to the research question, demonstrating their ability to apply theoretical concepts to real-world problems.

Implementation methods, demonstration and Work&Study

Teaching Plan

1) Asynchronous Part: Students are provided with a selection of Jupyter notebooks to work through in their own time; these must be finished before the in-person part begins. These lectures include:

-A Python Warmup
-A Pandas Warmup
-Data Visualization in Python (seaborn)
-Regression Analysis in Python (statsmodels)
-Tree-Based Methods in Python (sklearn)
-Explainability Methods (learning to use SHAP values)
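
As a rough illustration of what the regression and tree-based notebooks cover, the following minimal sketch fits a linear model with statsmodels and a random forest with sklearn. The data are simulated and the column names (temperature, wind_speed, pm10) are assumptions made for this example only; they are not taken from the course dataset. The forest's built-in feature importances stand in for a simple explainability check and are not SHAP values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import RandomForestRegressor

# Simulated data; column names are hypothetical, not the course dataset.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "temperature": rng.normal(10, 8, n),
    "wind_speed": rng.gamma(2.0, 1.5, n),
})
# Assumed linear relationship plus noise, chosen only for illustration.
df["pm10"] = (30 - 0.4 * df["temperature"] - 2.0 * df["wind_speed"]
              + rng.normal(0, 3, n))

# Regression analysis: ordinary least squares via the formula interface.
ols = smf.ols("pm10 ~ temperature + wind_speed", data=df).fit()
print(ols.params)

# Tree-based method: random forest on the same features.
features = ["temperature", "wind_speed"]
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(df[features], df["pm10"])

# Impurity-based importances (a simple stand-in, not SHAP values).
importances = dict(zip(features, forest.feature_importances_))
print(importances)
```

The regression coefficients can be read directly as effect sizes, which is one reason the course stresses explainable and parsimonious models; the forest needs an additional explainability step to give comparable insight.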

2) In-Person Part: Students are provided with the atmospheric_pollution dataset. In the intensive program, they will go through the following steps:

-Form a team of 2-3 people from different universities.
-Review the provided data with respect to quality (missing values, outliers, data understanding).
-Identify relationships between potential features and the responses pm10 and no2. Consider the nature of these relationships and let it inform the modelling.
-Model the response for 2015 - 2019 with respect to the previously identified features. Evaluate and diagnose these models and attempt to improve them. This is a cyclic step. Investigate which features influence the response and how.
-Provide predictions for 2020 and check whether a lockdown effect (start: March 16th) can be identified for either of the responses.
-Summarize findings, present, and discuss with colleagues to learn from their chosen paths and pitfalls.
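
The modelling and prediction steps above could be sketched roughly as follows. This is a minimal illustration on simulated daily data, not the course solution: the column names, the seasonal features, the linear model, and the artificial drop of 10 units in no2 from March 16th, 2020 onwards are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Simulated daily series for 2015-2020; all names and numbers are hypothetical.
rng = np.random.default_rng(1)
df = pd.DataFrame({"date": pd.date_range("2015-01-01", "2020-12-31", freq="D")})
day_of_year = df["date"].dt.dayofyear
df["season_sin"] = np.sin(2 * np.pi * day_of_year / 365.25)
df["season_cos"] = np.cos(2 * np.pi * day_of_year / 365.25)
df["weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)
df["no2"] = (40 + 8 * df["season_cos"] - 6 * df["weekend"]
             + rng.normal(0, 4, len(df)))

# Inject an artificial lockdown effect so there is something to detect.
lockdown_start = pd.Timestamp("2020-03-16")
df.loc[df["date"] >= lockdown_start, "no2"] -= 10

# Data quality review: count missing values per column.
print(df.isna().sum())

# Fit on 2015-2019, then predict 2020 as if no lockdown had happened.
features = ["season_sin", "season_cos", "weekend"]
train = df[df["date"].dt.year <= 2019]
holdout = df[df["date"].dt.year == 2020].copy()
model = LinearRegression().fit(train[features], train["no2"])
holdout["residual"] = holdout["no2"] - model.predict(holdout[features])

# Compare observed minus predicted before and after the lockdown start.
is_after = holdout["date"] >= lockdown_start
mean_before = holdout.loc[~is_after, "residual"].mean()
mean_after = holdout.loc[is_after, "residual"].mean()
print(f"mean residual before: {mean_before:.2f}, after: {mean_after:.2f}")
```

A clearly negative mean residual after the start date, together with a near-zero one before it, would hint at a lockdown effect. On real data, the comparison is only as trustworthy as the model's diagnostics for 2015 - 2019, which is why the modelling step is cyclic.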

Learning materials

Bibliography
-Hörmann, S., Jammoul, F., Kuenzer, T., & Stadlober, E. (2021, February). Separating the impact of gradual lockdown measures on air pollutants from seasonal variability. Atmospheric Pollution Research, 12. doi: 10.1016/j.apr.2020.10.011.
-James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer.

Starting level and links to other course units

Prerequisites:
-Fundamentals in Statistics, Data Analytics or Machine Learning
-Fundamental programming skills in Python (pandas, matplotlib, seaborn) are helpful, but not necessary

Assessment criteria - grade 1

When the implementation type of the course is CONTACT, ONLINE or BLENDED, students are required to be present during the teaching hours marked in the study schedule. If you are absent for more than 25 % of these hours, your grade will be lowered by one. If you are absent for more than 50 %, you fail the course.