FUNDAMENTAL OF DATA ANALYSIS AND LABORATORY
Module Laboratory
Academic Year 2024/2025 - Teacher: ANTONINO FURNARI
Expected Learning Outcomes
- Knowledge and Understanding: The student will gain a solid understanding of the fundamental principles needed to collect, organize, model, analyze, and interpret data. This will be achieved through the presentation of a theoretical-mathematical framework and numerous examples of its application to real datasets. The student will develop a deep understanding of the conceptual foundations of data analysis.
- Applying Knowledge and Understanding: The student will acquire technical skills for constructing, managing, and analyzing real datasets, with the goal of building models and decision support systems. They will be able to apply the acquired knowledge to solve practical problems using tools and techniques for data analysis.
- Making Judgements: The student will be able to independently choose the most appropriate techniques for solving a data analysis problem, evaluating their pros and cons. They will be capable of justifying their choices and critically assessing various methodologies for data analysis and knowledge extraction.
- Communication Skills: The student will be trained to produce complete, rigorous, and visually appropriate reports that effectively and correctly communicate the results of data analysis and exploration. Conclusions will be clearly justified and communicated effectively to both technical and non-technical audiences.
- Learning Skills: The student will develop the necessary skills to update themselves independently on the use of techniques, software, and programming languages useful for data analysis, ensuring continuous learning even after the course ends.
Course Structure
Lectures in the classroom.
If teaching is delivered in mixed or remote mode, the arrangements described above may be adjusted as needed to cover the program reported in the syllabus.
Required Prerequisites
Basic skills in programming, calculus, and linear algebra are required.
Attendance of Lessons
Attending lectures is not mandatory but is strongly recommended.
Detailed Course Content
The course is structured into five main modules:
· Introduction to Data Analysis
· Descriptive and Exploratory Data Analysis
· Inferential Data Analysis
· Data as N-Dimensional Points
· Predictive Data Analysis
The following paragraphs detail the contents of each module:
Introduction to Data Analysis
· Overview of data analysis: purpose and applications
· Main types of data analysis: descriptive, exploratory, inferential, predictive
· Notable examples of data analysis and how they have helped solve real-world problems
· Different types of data: nominal, ordinal, interval, and ratio data
· Data collection techniques: surveys, experiments, observational studies, sampling
· Difference between sample and population
· Data preprocessing techniques: data cleaning, handling missing data, data standardization, encoding categorical variables (dummy variables), noise reduction in data (filtering, smoothing, outlier removal, normalization)
· Use of probability in data analysis: basic probability concepts (joint, marginal, conditional probability, independence, and conditional independence), Bayes' theorem and its use in data analysis, discrete and continuous probability distributions, cumulative distributions, notable probability distributions.
Descriptive and Exploratory Data Analysis
· Measures of central tendency: mean, median, and mode
· Measures of dispersion: variance, standard deviation, quartiles, and interquartile range
· Covariance and correlation
· Data visualization techniques: pie charts, histograms, box plots, scatter plots, hexbin plots, density maps, contour plots, scatter matrices, regression plots
Inferential Data Analysis
· Point estimation and interval estimation
· Hypothesis testing: null and alternative hypotheses, p-value, and statistical significance
· Confidence intervals, significance levels, and how to interpret them
· Assessing the significance of correlation coefficients
· Use of linear and logistic regression to study the relationship between variables
· Statistical significance of linear and logistic regression
· Model selection techniques, including backward elimination
· Normality tests: Q-Q Plot and Pearson’s Chi-Squared Test
· Introduction to causal data analysis: correlation vs. causation, randomized controlled experiments, observational studies, counterfactuals and confounders, linear regression with confounder control
Data as N-Dimensional Points
· Features, representation functions, feature spaces, metrics
· Clustering techniques: definitions and K-Means
· Fitting Gaussians to data, Maximum Likelihood
· Density estimation techniques: Parzen window, kernel density estimation, Gaussian Mixture Models (GMM)
· Dimensionality reduction techniques: Principal Component Analysis (PCA)
Predictive Data Analysis
· Fundamental concepts of predictive analysis: training, validation, and test sets, cross-validation. Generative and discriminative algorithms. Parameters and hyperparameters. Parametric and non-parametric methods. Overfitting and underfitting, bias, and variance. Linear and non-linear models.
· Regression techniques. Evaluation metrics for regression problems: mean squared error (MSE) and mean absolute error (MAE).
· Classification techniques. Performance evaluation of a classification model: confusion matrix, precision, recall, and F1 score. ROC curves for evaluating binary classification performance. Discriminant functions. Fisher Discriminant Analysis (FDA), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Mahalanobis Distance, K-Nearest Neighbor (KNN) as a non-parametric classification method. MAP and Naive Bayes.
Textbook Information
Chapters from these books:
- Peck, Roxy, Chris Olsen, and Jay L. Devore. Introduction to statistics and data analysis. Cengage Learning, 2015.
- James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. An introduction to statistical learning: with applications in Python, 2023. https://www.statlearning.com
- Bishop, Christopher M. Pattern recognition and machine learning, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
- Hernán, Miguel A., and James M. Robins. Causal inference, 2010. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Teaching material shared by the teacher through Microsoft Teams (Team code: i87g4nb) and through the https://antoninofurnari.github.io/fadlecturenotes/ website.
Course Planning
No. | Subjects | Text References |
---|---|---|
1 | Introduction to the course | Teaching material provided by the teacher, specific chapters of the recommended books. |
2 | Main Data Analysis Concepts | Teaching material provided by the teacher, specific chapters of the recommended books. |
3 | Descriptive Statistics and Graphical Representation of data | Teaching material provided by the teacher, specific chapters of the recommended books. |
4 | Uncertainty and data as the observation of random events | Teaching material provided by the teacher, specific chapters of the recommended books. |
5 | Probability Distributions | Teaching material provided by the teacher, specific chapters of the recommended books. |
6 | Introduction to statistical inference: generalizing to the population | Teaching material provided by the teacher, specific chapters of the recommended books. |
7 | Associations of Two Variables | Teaching material provided by the teacher, specific chapters of the recommended books. |
8 | Introduction to causal inference | Teaching material provided by the teacher, specific chapters of the recommended books. |
9 | Simple causal inference techniques to analyze observational data | Teaching material provided by the teacher, specific chapters of the recommended books. |
10 | Clustering & Density estimation | Teaching material provided by the teacher, specific chapters of the recommended books. |
11 | Dimensionality Reduction | Teaching material provided by the teacher, specific chapters of the recommended books. |
12 | Predictive Data Analysis | Teaching material provided by the teacher, specific chapters of the recommended books. |
13 | Probabilistic Models for Classification | Teaching material provided by the teacher, specific chapters of the recommended books. |
14 | Discriminant Functions for Classification | Teaching material provided by the teacher, specific chapters of the recommended books. |
Learning Assessment
Learning Assessment Procedures
The exam is divided into the following tests:
- A written test, aimed at verifying the student's skills regarding the topics covered in the "Fundamentals of Data Analysis" module, from a theoretical and methodological point of view. The test is evaluated with a mark out of thirty.
- A project, agreed with the teacher and carried out independently by the student, aimed at verifying the skills acquired in the "Laboratory" module. The project is presented to the teacher through an interview and evaluated with a mark out of thirty.
Students with disabilities and/or specific learning disorders (DSA) must contact the teacher, the CInAP representative of the DMI (Prof. Daniele), and CInAP well in advance of the exam date to communicate that they intend to take the exam using the appropriate compensatory measures.
Two written mid-term (in itinere) tests are scheduled during the course. Passing both tests grants exemption from the final written exam.
The final grade is a weighted average of the marks obtained in the two tests, with weights of 2/3 for the written test and 1/3 for the laboratory project.
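As an illustration of the weighting, the calculation can be sketched in Python. The function name `final_grade` and the example marks are hypothetical, and the syllabus does not state how a non-integer result is rounded, so the sketch returns the exact value without rounding.

```python
from fractions import Fraction


def final_grade(written: int, project: int) -> Fraction:
    """Weighted average of two marks out of thirty.

    Weights follow the syllabus: 2/3 for the written test and
    1/3 for the laboratory project. No rounding is applied,
    because the syllabus does not specify a rounding rule.
    """
    return Fraction(2, 3) * written + Fraction(1, 3) * project


# Hypothetical marks: 27 in the written test, 30 in the project.
grade = final_grade(27, 30)
print(float(grade))  # 28.0
```

With these example marks the written test contributes 18 points and the project 10 points.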
The assessment of learning can also be conducted remotely if the conditions require it.
The results of the tests will be communicated to students via Microsoft Teams (Team code: i87g4nb).
The grading of each test is expressed on a scale of thirty points according to the following scheme:
Score 29-30 with honors
The student has a deep understanding of the concepts and techniques of data analysis. They can promptly analyze data analysis problems, identifying the most suitable data analysis techniques for the given problem independently and critically, and indicating the most suitable methodological practices for their application. They have excellent communication skills and language proficiency.
Score 26-28
The student has a good understanding of the concepts and techniques of data analysis. They can analyze data analysis problems, identifying appropriate data analysis techniques for the given problem and indicating suitable methodological practices for their application. They have good communication skills and language proficiency.
Score 22-25
The student has a fair knowledge of the concepts and techniques of data analysis, although it may be limited to the main topics. They can analyze data analysis problems, albeit not always in a linear manner, identifying suitable data analysis techniques for the given problem. They have fair communication skills and language proficiency.
Score 18-21
The student has minimal knowledge of the concepts and techniques of data analysis. They have limited ability to analyze data analysis problems. They have sufficient communication skills, although not always appropriate language proficiency.
Examination not passed
The student does not possess the minimum required knowledge of the main content of the course. Their ability to use specific language is very poor or nonexistent, and they are unable to independently apply the acquired knowledge.
Examples of frequently asked questions and/or exercises
The data analysis project is generally based on medium-large datasets obtainable on the internet.
Examples of typical exam questions:
- Define the classification problem, discuss the differences with respect to the regression problem and give practical examples.
- Explain the K-NN algorithm for classification. Discuss the effect of parameter K on algorithm performance. Give graphical examples of how the algorithm works and the effect of K.
- Discuss evaluation measures for classification problems: accuracy, confusion matrix, precision, recall and F1 score. Discuss the pros and cons of the measures considered, also in relation to the characteristics of the test dataset.
- Illustrate the main techniques useful for studying the correlation between variables.