FUNDAMENTALS OF DATA ANALYSIS AND LABORATORY
Module FUNDAMENTALS OF DATA ANALYSIS

Academic Year 2024/2025 - Teacher: ANTONINO FURNARI

Expected Learning Outcomes

  • Knowledge and Understanding: The student will gain a solid understanding of the fundamental principles needed to collect, organize, model, analyze, and interpret data. This will be achieved through the presentation of a theoretical-mathematical framework and numerous examples of its application to real datasets. The student will develop a deep understanding of the conceptual foundations of data analysis.
  • Applying Knowledge and Understanding: The student will acquire technical skills for constructing, managing, and analyzing real datasets, with the goal of building models and decision support systems. They will be able to apply the acquired knowledge to solve practical problems using tools and techniques for data analysis.
  • Making Judgements: The student will be able to independently choose the most appropriate techniques for solving a data analysis problem, evaluating their pros and cons. They will be capable of justifying their choices and critically assessing various methodologies for data analysis and knowledge extraction.
  • Communication Skills: The student will be trained to produce complete, rigorous, and visually appropriate reports that effectively and correctly communicate the results of data analysis and exploration. Conclusions will be clearly justified and communicated effectively to both technical and non-technical audiences.
  • Learning Skills: The student will develop the necessary skills to update themselves independently on the use of techniques, software, and programming languages useful for data analysis, ensuring continuous learning even after the course ends.

Course Structure

Lectures in the classroom.

If teaching is delivered in mixed or remote mode, the arrangements stated above may be changed as needed in order to cover the program reported in the syllabus.

Required Prerequisites

Basic skills in programming, calculus, and linear algebra are required.

Attendance of Lessons

Attending lectures is not mandatory, but strongly recommended.

Detailed Course Content

The course is structured into five main modules:

  • Introduction to Data Analysis
  • Descriptive and Exploratory Data Analysis
  • Inferential Data Analysis
  • Data as N-Dimensional Points
  • Predictive Data Analysis 

The following paragraphs detail the contents of each module:

Introduction to Data Analysis

  • Overview of data analysis: purpose and applications
  • Main types of data analysis: descriptive, exploratory, inferential, predictive
  • Examples of data analysis and applications (notable examples of data analysis and how they have been useful in solving real-world problems)
  • Different types of data: nominal, ordinal, interval, and ratio data
  • Data collection techniques: surveys, experiments, observational studies, sampling
  • Difference between sample and population
  • Data preprocessing techniques: data cleaning, handling missing data, data standardization, encoding categorical variables (dummy variables), noise reduction in data (filtering, smoothing, outlier removal, normalization)
  • Use of probability in data analysis: basic probability concepts (joint, marginal, conditional probability, independence, and conditional independence), Bayes' theorem and its use in data analysis, discrete and continuous probability distributions, cumulative distributions, notable probability distributions.

Descriptive and Exploratory Data Analysis

  • Measures of central tendency: mean, median, and mode
  • Measures of dispersion: variance, standard deviation, quartiles, and interquartile range
  • Covariance and correlation
  • Data visualization techniques: pie charts, histograms, box plots, scatter plots, hexbin plots, density maps, contour plots, scatter matrices, regression plots

Inferential Data Analysis

  • Objectives of Inferential Data Analysis
  • Use of confidence intervals in data analysis, significance levels, and how to interpret them
  • Use of hypothesis testing for data analysis, null and alternative hypotheses, p-value, and statistical significance. Main statistical tests: comparison of means, t-test, chi-square
  • Assessing the significance of correlation coefficients with hypothesis tests
  • Use of linear and logistic regression to study the relationship between variables
  • Statistical significance of linear and logistic regression
  • Regression model selection techniques, backward elimination
  • Introduction to causal data analysis: correlation vs. causality, randomized controlled experiments, observational studies, counterfactuals and confounders, linear regression with confounder control

Data as N-Dimensional Points

  • Features, representation functions, feature spaces, metrics
  • Clustering techniques: definitions and K-Means
  • Fitting Gaussians to data, Maximum Likelihood
  • Density estimation techniques: Parzen window, kernel density estimation, Gaussian Mixture Models (GMM)
  • Dimensionality reduction techniques: Principal Component Analysis (PCA)

Predictive Data Analysis

  • Fundamental concepts of predictive analysis: training, validation, and test sets, cross-validation. Generative and discriminative algorithms. Parameters and hyperparameters. Parametric and non-parametric methods. Overfitting and underfitting, bias, and variance. Linear and non-linear models.
  • Regression techniques. Evaluation metrics for regression problems: mean squared error (MSE) and mean absolute error (MAE).
  • Classification techniques. Performance evaluation of a classification model: confusion matrix, precision, recall, and F1 score. ROC curves for evaluating binary classification performance. Discriminant functions. Fisher Discriminant Analysis (FDA), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Mahalanobis Distance, K-Nearest Neighbor (KNN) as a non-parametric classification method. MAP and Naive Bayes.

Textbook Information

Chapters from these books:

  • Peck, Roxy, Chris Olsen, and Jay L. Devore. Introduction to statistics and data analysis. Cengage Learning, 2015.
  • James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. An Introduction to Statistical Learning: with Applications in Python. Springer, 2023. https://www.statlearning.com
  • Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
  • Hernán, Miguel A., and James M. Robins. Causal inference, 2010. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Teaching material is shared by the teacher through Microsoft Teams (Team code: i87g4nb) and through the website https://antoninofurnari.github.io/fadlecturenotes/.

Course Planning

| #  | Subjects | Text References |
|----|----------|-----------------|
| 1  | Introduction to the course | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 2  | Main Data Analysis Concepts | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 3  | Descriptive Statistics and Graphical Representation of Data | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 4  | Uncertainty and data as the observation of random events | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 5  | Probability Distributions | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 6  | Introduction to statistical inference: generalizing to the population | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 7  | Associations of Two Variables | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 8  | Introduction to causal inference | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 9  | Simple causal inference techniques to analyze observational data | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 10 | Clustering & Density estimation | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 11 | Dimensionality Reduction | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 12 | Predictive Data Analysis | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 13 | Probabilistic Models for Classification | Teaching material provided by the teacher, specific chapters of the recommended books. |
| 14 | Discriminant Functions for Classification | Teaching material provided by the teacher, specific chapters of the recommended books. |

Learning Assessment

Learning Assessment Procedures

The exam is divided into the following tests:

  • A written test, aimed at verifying the student's skills regarding the topics covered in the "Fundamentals of Data Analysis" module, from a theoretical and methodological point of view. The test is evaluated with a mark out of thirty. 
  • A project, agreed upon with the teacher and carried out independently by the student, aimed at verifying the skills acquired in the "Laboratory" module. The project is presented to the teacher in an interview and evaluated with a mark out of thirty.

Students with disabilities and/or specific learning disorders (DSA) must contact the teacher, the CInAP representative of the DMI (Prof. Daniele), and CInAP well in advance of the exam date to communicate that they intend to take the exam using the appropriate compensatory measures.

Two written mid-term (in itinere) tests are scheduled during the course. Passing both tests grants exemption from the final written exam.

The final grade is the weighted average of the marks obtained in the two tests, with weights of 2/3 for the written test and 1/3 for the laboratory project.
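
As a rough illustration of the 2/3 and 1/3 weighting stated above, the calculation can be sketched as follows. The function name `final_grade` and the example marks are hypothetical, and the syllabus does not say how a fractional result is rounded or how honors are awarded, so the sketch leaves both out.

```python
def final_grade(written: float, project: float) -> float:
    """Weighted average: 2/3 written test, 1/3 laboratory project."""
    # Written as (2 * written + project) / 3, which is algebraically
    # the same as 2/3 * written + 1/3 * project.
    return (2 * written + project) / 3


# Illustrative marks only (not taken from the syllabus)
print(final_grade(27, 30))  # 28.0
```

With these example marks, the written test counts twice as much as the project: (2 × 27 + 30) / 3 = 28.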

The assessment of learning can also be conducted remotely if the conditions require it.

The grading of each test is expressed on a scale of thirty points according to the following scheme:

Score 29-30 with honors

The student has a deep understanding of the concepts and techniques of data analysis. They can promptly analyze a given problem, independently and critically identifying the most suitable data analysis techniques and indicating the most suitable methodological practices for their application. They have excellent communication skills and language proficiency.

Score 26-28

The student has a good understanding of the concepts and techniques of data analysis. They can analyze data analysis problems, identifying appropriate data analysis techniques for the given problem and indicating suitable methodological practices for their application. They have good communication skills and language proficiency.

Score 22-25

The student has a fair knowledge of the concepts and techniques of data analysis, although it may be limited to the main topics. They can analyze data analysis problems, albeit not always in a linear manner, identifying suitable data analysis techniques for the given problem. They have fair communication skills and language proficiency.

Score 18-21

The student has minimal knowledge of the concepts and techniques of data analysis. They have limited ability to analyze data analysis problems. They have sufficient communication skills, although not always appropriate language proficiency.

Examination not passed

The student does not possess the minimum required knowledge of the main content of the course. Their ability to use specific language is very poor or nonexistent, and they are unable to independently apply the acquired knowledge.

Examples of frequently asked questions and/or exercises

The data analysis project is generally based on medium-large datasets obtainable on the internet.

Examples of typical exam questions:

  • Define the classification problem, discuss the differences with respect to the regression problem and give practical examples.
  • Explain the K-NN algorithm for classification. Discuss the effect of parameter K on algorithm performance. Give graphical examples of how the algorithm works and the effect of K.
  • Discuss evaluation measures for classification problems: accuracy, confusion matrix, precision, recall and F1 score. Discuss the pros and cons of these measures, also in relation to the characteristics of the test dataset.
  • Illustrate the main techniques useful for studying the correlation between variables.