FUNDAMENTAL OF DATA ANALYSIS AND LABORATORY
Module FUNDAMENTAL OF DATA ANALYSIS

Academic Year 2023/2024 - Teacher: ANTONINO FURNARI

Expected Learning Outcomes

The objectives of the course are:

  1. Provide a solid understanding of the fundamental principles needed to collect, organize, model, analyze, and interpret data. The course aims to provide this understanding through the presentation of a theoretical-mathematical framework and numerous examples of application of this framework to real data sets.
  2. Guide the student in acquiring the technical skills needed to construct, manage, and analyze real data sets in order to build, through the most appropriate techniques, data models and decision support systems.
  3. Provide adequate knowledge for the choice of the most appropriate techniques to solve a problem of data analysis and knowledge extraction, evaluating pros and cons.
  4. Train students for the preparation of complete, rigorous, and visually adequate reports that correctly and effectively communicate to the end user the results of the analysis and exploration of a set of data, clearly justifying the conclusions.
  5. Provide the necessary skills to allow students to update themselves independently on the use of techniques, software, and programming languages useful for data analysis.

Course Structure

Lectures in the classroom.

If teaching is delivered in blended or remote mode, the arrangements stated above may be adjusted as needed in order to comply with the program provided and reported in the syllabus.

Required Prerequisites

The course includes the following curricular prerequisites, which must be met prior to taking the exam:

  • Programmazione I e Laboratorio
  • Algebra lineare e Geometria
  • Elementi di Analisi Matematica I
  • Strutture Discrete

Attendance of Lessons

Attending lectures is not mandatory, but strongly recommended.

Detailed Course Content

The course is divided into six main modules:

  • Introduction to data analysis
  • Descriptive and exploratory data analysis
  • Inferential data analysis
  • Elements of causal data analysis
  • Predictive data analysis
  • Introduction to time series analysis

The following paragraphs detail the contents of the various modules.

Introduction to Data Analysis

  • Data Analysis Overview, Purpose and Applications
  • Main types of data analysis: descriptive, exploratory, inferential, causal, predictive, temporal data analysis
  • Examples of data analysis and applications (notable examples of data analysis and how these have been useful for solving real problems)
  • Different data types: nominal, ordinal, interval, and ratio data
  • Data collection techniques: surveys, experiments, observational studies, sampling
  • Difference between sample and population
  • Data pre-processing techniques: data cleansing, missing data management, data standardization, categorical variable encoding (dummy variables), data noise reduction (filtering, smoothing, outlier removal, normalization)
  • Use of probability for data analysis: basic concepts of probability (joint, marginal, and conditional probability, independence and conditional independence), Bayes' theorem and its use in data analysis, discrete and continuous probability distributions, cumulative distribution functions. Notable probability distributions.

Descriptive and Exploratory Data Analysis

  • Measures of central tendency: mean, median, and mode
  • Measures of dispersion: variance, standard deviation, quartiles, and interquartile range
  • Gaussian fit to data
  • Covariance, correlation (Pearson, Spearman), use of linear regression (simple and multiple) and logistic regression (simple and multinomial) to study the relationship between variables
  • Density estimation techniques and cluster analysis: Parzen window, kernel density estimation, Gaussian mixture models (GMM), K-Means
  • Dimensionality reduction techniques: principal component analysis (PCA)
  • Data visualization techniques: pie charts, histograms, box plots, scatter plots, hexbin plots, density maps, contour lines, scatter matrices, regression plots

Inferential Data Analysis

  • Point and interval estimation
  • Hypothesis testing, null and alternative hypothesis, p-value and statistical significance
  • Confidence intervals, significance levels and how to interpret them
  • Assess the significance of correlation coefficients
  • Statistical significance of linear and logistic regression
  • Model selection techniques, including stepwise regression and backward elimination
  • Normality tests: Q-Q plot and Pearson's chi-squared test

Elements of Causal Data Analysis

  • Definition of causality. Difference between correlation and causality and importance of determining the causal relationship between variables.
  • Experiments vs. observations. Differences between controlled experiments and observational studies. Importance of the former to establish causality. Randomized Controlled Experiments.
  • Counterfactuals and confounders.
  • Simple techniques of causal inference: linear regression with control of confounders

Introduction to Predictive Data Analysis

  • Predictive analytics fundamentals: training, validation and test sets, cross validation, and how to use these sets to evaluate the performance of a model. Generative and discriminative algorithms. Parameters and hyper-parameters. Parametric and nonparametric methods. Overfitting and underfitting, bias and variance. Linear and nonlinear models.
  • Regression techniques. Evaluation measures for regression problems: root mean squared error (RMSE) and mean absolute error (MAE).
  • Classification techniques. Performance evaluation of a classification model: confusion matrix, precision, recall, and F1 score. ROC curve and AUC for evaluating binary classification performance. Discriminant functions. Fisher Discriminant Analysis (FDA), Linear Discriminant Analysis (LDA), Mahalanobis distance, K-Nearest Neighbor (KNN) as a nonparametric classification method. MAP and Naive Bayes. One-vs-rest (also known as one-vs-all) strategy for multi-class classification. Data balancing techniques.
  • Hyper-parameter optimization techniques: grid search.

Introduction to Time Series Analysis

  • Introduction to time series data. Definitions and problems.
  • Decomposition of time series into trend and seasonal components
  • Basic techniques and models for time series analysis

Textbook Information

Chapters from these books:

  • Peck, Roxy, Chris Olsen, and Jay L. Devore. Introduction to Statistics and Data Analysis. Cengage Learning, 2015.
  • James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. An Introduction to Statistical Learning: with Applications in Python, 2023. https://www.statlearning.com
  • Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
  • Hernán, Miguel A., and James M. Robins. Causal inference, 2010. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Teaching material is shared through Microsoft Teams (team code: i87g4nb).

Course Planning

 #  | Subjects | Text References
 1  | Introduction to the course | Teaching material provided by the teacher, specific chapters of the recommended books.
 2  | Main Data Analysis Concepts | Teaching material provided by the teacher, specific chapters of the recommended books.
 3  | Descriptive Statistics and Graphical Representation of Data | Teaching material provided by the teacher, specific chapters of the recommended books.
 4  | Uncertainty and data as the observation of random events | Teaching material provided by the teacher, specific chapters of the recommended books.
 5  | Probability Distributions | Teaching material provided by the teacher, specific chapters of the recommended books.
 6  | Introduction to statistical inference: generalizing to the population | Teaching material provided by the teacher, specific chapters of the recommended books.
 7  | Associations of Two Variables | Teaching material provided by the teacher, specific chapters of the recommended books.
 8  | Introduction to causal inference | Teaching material provided by the teacher, specific chapters of the recommended books.
 9  | Simple causal inference techniques to analyze observational data | Teaching material provided by the teacher, specific chapters of the recommended books.
 10 | Clustering & Density Estimation | Teaching material provided by the teacher, specific chapters of the recommended books.
 11 | Dimensionality Reduction | Teaching material provided by the teacher, specific chapters of the recommended books.
 12 | Predictive Data Analysis | Teaching material provided by the teacher, specific chapters of the recommended books.
 13 | Probabilistic Models for Classification | Teaching material provided by the teacher, specific chapters of the recommended books.
 14 | Discriminant Functions for Classification | Teaching material provided by the teacher, specific chapters of the recommended books.
 15 | Time series data analysis | Teaching material provided by the teacher, specific chapters of the recommended books.

Learning Assessment

Learning Assessment Procedures

The examination is divided into two distinct parts:

  • A project that consists of the analysis of a dataset agreed upon with the teacher. The project will involve the application of the most appropriate data analysis techniques, depending on the dataset considered, as discussed during the lectures.
  • An oral interview for the presentation of the project and the assessment of the knowledge of the course topics.

The assessment of learning can also be conducted remotely if the conditions require it.

The grading is expressed on a scale of thirty points according to the following scheme:

Score 29-30 with honors

The student has a deep understanding of the concepts and techniques of data analysis. They can promptly analyze data analysis problems, identifying the most suitable data analysis techniques for the given problem independently and critically, and indicating the most suitable methodological practices for their application. They have excellent communication skills and language proficiency.

Score 26-28

The student has a good understanding of the concepts and techniques of data analysis. They can analyze data analysis problems, identifying appropriate data analysis techniques for the given problem and indicating suitable methodological practices for their application. They have good communication skills and language proficiency.

Score 22-25

The student has a fair knowledge of the concepts and techniques of data analysis, although it may be limited to the main topics. They can analyze data analysis problems, albeit not always in a linear manner, identifying suitable data analysis techniques for the given problem. They have fair communication skills and language proficiency.

Score 18-21

The student has minimal knowledge of the concepts and techniques of data analysis. They have limited ability to analyze data analysis problems. They have sufficient communication skills, although not always appropriate language proficiency.

Examination not passed

The student does not possess the minimum required knowledge of the main content of the course. Their ability to use specific language is very poor or nonexistent, and they are unable to independently apply the acquired knowledge.

Examples of frequently asked questions and/or exercises

The data analysis project is generally based on medium-to-large datasets that are publicly available online.

Examples of typical exam questions:

  • Define the classification problem, discuss the differences with respect to the regression problem and give practical examples.
  • Explain the K-NN algorithm for classification. Discuss the effect of parameter K on algorithm performance. Give graphical examples of how the algorithm works and the effect of K.
  • Discuss evaluation measures for classification problems: accuracy, confusion matrix, precision, recall, and F1 score. Discuss the pros and cons of each measure, also in relation to the characteristics of the test dataset.
  • Illustrate the main techniques useful for studying the correlation between variables.
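
As an illustration of the kind of reasoning expected in the question on evaluation measures, the following minimal Python sketch computes accuracy, precision, recall, and F1 score from the confusion matrix counts of a binary problem. The function names (confusion_counts, binary_metrics) and the toy labels are hypothetical and are not part of the course material; the convention of returning 0.0 when a denominator is zero is also an assumption of this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts.

    Assumption of this sketch: a metric whose denominator is zero is set to 0.0.
    """
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced test set (made-up labels): 90 negatives and 10 positives.
y_true = [0] * 90 + [1] * 10

# A trivial classifier that always predicts the majority (negative) class.
y_pred_majority = [0] * 100
majority_report = binary_metrics(y_true, y_pred_majority)

# A classifier that finds 8 of the 10 positives at the cost of 5 false alarms.
y_pred_model = [1] * 5 + [0] * 85 + [1] * 8 + [0] * 2
model_report = binary_metrics(y_true, y_pred_model)

print(majority_report)
print(model_report)
```

On this toy test set the always-negative classifier reaches high accuracy while its recall and F1 score are zero, which is one way to show why accuracy alone can be misleading when the classes are imbalanced.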