FUNDAMENTAL OF DATA ANALYSIS AND LABORATORY
Module FUNDAMENTAL OF DATA ANALYSIS
Academic Year 2023/2024 - Teacher: ANTONINO FURNARI
Expected Learning Outcomes
The objectives of the course are:
- Provide a solid understanding of the fundamental principles needed to collect, organize, model, analyze, and interpret data. The course aims to provide this understanding through the presentation of a theoretical-mathematical framework and numerous examples of application of this framework to real data sets.
- Guide the student in acquiring the technical skills needed to construct, manage, and analyze real data sets in order to build, through the most appropriate techniques, data models and decision support systems.
- Provide adequate knowledge for the choice of the most appropriate techniques to solve a problem of data analysis and knowledge extraction, evaluating pros and cons.
- Train students for the preparation of complete, rigorous, and visually adequate reports that correctly and effectively communicate to the end user the results of the analysis and exploration of a set of data, clearly justifying the conclusions.
- Provide the necessary skills to allow students to update themselves independently on the use of techniques, software, and programming languages useful for data analysis.
Course Structure
Lectures in the classroom.
If the course is taught in mixed or remote mode, changes to the arrangements stated above may be introduced in order to cover the program set out in the syllabus.
Required Prerequisites
The course includes the following curricular prerequisites, which must be met prior to taking the exam:
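- Programmazione I e Laboratorio (Programming I and Laboratory)
- Algebra lineare e Geometria (Linear Algebra and Geometry)
- Elementi di Analisi Matematica I (Elements of Mathematical Analysis I)
- Strutture Discrete (Discrete Structures)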
Attendance of Lessons
Attending lectures is not mandatory, but strongly recommended.
Detailed Course Content
The course is divided into six main modules:
- Introduction to data analysis
- Descriptive and exploratory data analysis
- Inferential Data Analysis
- Elements of causal data analysis
- Predictive analytics
- Introduction to Time Series Analysis
Introduction to Data Analysis
- Data Analysis Overview, Purpose and Applications
- Main types of data analysis: descriptive, exploratory, inferential, causal, predictive, temporal data analysis
- Examples of data analysis and applications (notable examples of data analysis and how these have been useful for solving real problems)
- Different data types: nominal, ordinal, interval, and ratio data
- Data collection techniques: surveys, experiments, observational studies, sampling
- Difference between sample and population
- Data pre-processing techniques: data cleansing, missing data management, data standardization, categorical variable encoding (dummy variables), data noise reduction (filtering, smoothing, outlier removal, normalization)
- Use of probability for data analysis: basic concepts of probability (joint, marginal, and conditional probability, independence and conditional independence), Bayes' theorem and its use in data analysis, discrete, continuous, and cumulative probability distributions. Notable probability distributions.
- Measures of central tendency: mean, median, and mode
- Measures of dispersion: variance, standard deviation, quartiles, and interquartile range
- Gaussian fit to data
- Covariance, correlation (Pearson, Spearman), use of linear regression (simple and multiple) and logistic regression (simple and multinomial) to study the relationship between variables
- Density estimation techniques and cluster analysis: Parzen window, kernel density estimation, Gaussian mixture models (GMM), K-Means
- Dimensionality reduction techniques: principal component analysis (PCA)
- Data visualization techniques: pie charts, histograms, box plots, scatter plots, hexbin plots, density maps, contour lines, scatter matrices, regression plots
- Point and interval estimation
- Hypothesis testing, null and alternative hypothesis, p-value and statistical significance
- Confidence intervals, significance levels and how to interpret them
- Assessing the significance of correlation coefficients
- Statistical significance of linear and logistic regression
- Model selection techniques, including stepwise regression and backward elimination
- Normality tests: Q-Q plot and Pearson's chi-squared test
- Definition of causality. Difference between correlation and causality and importance of determining the causal relationship between variables.
- Experiments vs. observations. Differences between controlled experiments and observational studies. Importance of the former to establish causality. Randomized Controlled Experiments.
- Counterfactuals and confounders.
- Simple techniques of causal inference: linear regression with control of confounders
- Predictive analytics fundamentals: training, validation and test sets, cross validation, and how to use these sets to evaluate the performance of a model. Generative and discriminative algorithms. Parameters and hyper-parameters. Parametric and nonparametric methods. Overfitting and underfitting, bias and variance. Linear and nonlinear models.
- Regression techniques. Evaluation measures for regression problems: root mean squared error and mean absolute error.
- Classification techniques. Performance evaluation of a classification model: confusion matrix, precision, recall and F1 score. ROC curves and AUC for evaluating binary classification performance. Discriminant functions. Fisher Discriminant Analysis (FDA), Linear Discriminant Analysis (LDA), Mahalanobis Distance, K-Nearest Neighbor (KNN) as a nonparametric classification method. MAP and Naive Bayes. One vs rest and one vs all classification for multi-class classification. Data balancing techniques.
- Optimization techniques of model hyper-parameters: grid search.
- Introduction to time series data. Definitions and problems.
- Decomposition of time series into trends and seasonality
- Basic techniques and models for time series analysis
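As a small illustration of the measures of central tendency and dispersion listed among the course topics, the following is a minimal sketch using only Python's standard `statistics` module. It is not part of the official teaching material: the helper name `describe` and the sample values are hypothetical and chosen purely for illustration.

```python
import statistics


def describe(values):
    """Return common measures of central tendency and dispersion for a numeric sample."""
    # Quartile cut points; statistics.quantiles uses the 'exclusive' method by default.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1 denominator)
        "std_dev": statistics.stdev(values),      # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                           # interquartile range
    }


# Hypothetical sample, used only for illustration.
sample = [2, 4, 4, 5, 7, 9, 10, 12]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Other quartile conventions exist (for example `method="inclusive"`), so the quartiles and the interquartile range may differ slightly between tools.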
Textbook Information
Chapters from these books:
- Peck, Roxy, Chris Olsen, and Jay L. Devore. Introduction to statistics and data analysis. Cengage Learning, 2015.
- James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. An introduction to statistical learning: with applications in Python. Springer, 2023. https://www.statlearning.com
- Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
- Hernán, Miguel A., and James M. Robins. Causal inference, 2010. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Teaching material is shared through Microsoft Teams (team code: i87g4nb).
Course Planning
No. | Subjects | Text References |
---|---|---|
1 | Introduction to the course | Teaching material provided by the teacher, specific chapters of the recommended books. |
2 | Main Data Analysis Concepts | Teaching material provided by the teacher, specific chapters of the recommended books. |
3 | Descriptive Statistics and Graphical Representation of data | Teaching material provided by the teacher, specific chapters of the recommended books. |
4 | Uncertainty and data as the observation of random events | Teaching material provided by the teacher, specific chapters of the recommended books. |
5 | Probability Distributions | Teaching material provided by the teacher, specific chapters of the recommended books. |
6 | Introduction to statistical inference: generalizing to the population | Teaching material provided by the teacher, specific chapters of the recommended books. |
7 | Associations of Two Variables | Teaching material provided by the teacher, specific chapters of the recommended books. |
8 | Introduction to causal inference | Teaching material provided by the teacher, specific chapters of the recommended books. |
9 | Simple causal inference techniques to analyze observational data | Teaching material provided by the teacher, specific chapters of the recommended books. |
10 | Clustering & Density estimation | Teaching material provided by the teacher, specific chapters of the recommended books. |
11 | Dimensionality Reduction | Teaching material provided by the teacher, specific chapters of the recommended books. |
12 | Predictive Data Analysis | Teaching material provided by the teacher, specific chapters of the recommended books. |
13 | Probabilistic Models for Classification | Teaching material provided by the teacher, specific chapters of the recommended books. |
14 | Discriminant Functions for Classification | Teaching material provided by the teacher, specific chapters of the recommended books. |
15 | Time series data analysis | Teaching material provided by the teacher, specific chapters of the recommended books. |
Learning Assessment
Learning Assessment Procedures
The examination is divided into two distinct parts:
- A project that consists of the analysis of a dataset agreed upon with the teacher. The project will involve the application of the most appropriate data analysis techniques, depending on the dataset considered, as discussed during the lectures.
- An oral interview for the presentation of the project and the assessment of the knowledge of the course topics.
The assessment of learning can also be conducted remotely if the conditions require it.
The grading is expressed on a scale of thirty points according to the following scheme:
Score 29-30 with honors
The student has a deep understanding of the concepts and techniques of data analysis. They can promptly analyze data analysis problems, identifying the most suitable data analysis techniques for the given problem independently and critically, and indicating the most suitable methodological practices for their application. They have excellent communication skills and language proficiency.
Score 26-28
The student has a good understanding of the concepts and techniques of data analysis. They can analyze data analysis problems, identifying appropriate data analysis techniques for the given problem and indicating suitable methodological practices for their application. They have good communication skills and language proficiency.
Score 22-25
The student has a fair knowledge of the concepts and techniques of data analysis, although it may be limited to the main topics. They can analyze data analysis problems, albeit not always in a linear manner, identifying suitable data analysis techniques for the given problem. They have fair communication skills and language proficiency.
Score 18-21
The student has minimal knowledge of the concepts and techniques of data analysis. They have limited ability to analyze data analysis problems. They have sufficient communication skills, although not always appropriate language proficiency.
Examination not passed
The student does not possess the minimum required knowledge of the main content of the course. Their ability to use specific language is very poor or nonexistent, and they are unable to independently apply the acquired knowledge.
Examples of frequently asked questions and/or exercises
The data analysis project is generally based on medium-large datasets obtainable on the internet.
Examples of typical exam questions:
- Define the classification problem, discuss the differences with respect to the regression problem and give practical examples.
- Explain the K-NN algorithm for classification. Discuss the effect of parameter K on algorithm performance. Give graphical examples of how the algorithm works and the effect of K.
- Discuss evaluation measures for classification problems: accuracy, confusion matrix, precision, recall and F1 score. Discuss the pros and cons of each measure, also in relation to the characteristics of the test dataset.
- Illustrate the main techniques useful for studying the correlation between variables.
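To make the K-NN and evaluation-measure questions above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the reference solution expected at the exam: the function names `knn_predict` and `precision_recall_f1`, the Euclidean distance, the majority-vote rule, and the toy data are all choices made for this example.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote, most_common returns the label encountered first.
    return votes.most_common(1)[0][0]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 score for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical two-class toy data, used only for illustration.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(1.2, 1.5), (6.2, 6.4), (4.5, 4.5)]
test_y = [0, 1, 1]

for k in (1, 3):
    predictions = [knn_predict(train_X, train_y, x, k) for x in test_X]
    precision, recall, f1 = precision_recall_f1(test_y, predictions)
    print(f"k={k}: predictions={predictions}, "
          f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Small values of K tend to follow individual training points closely, while larger values smooth the decision boundary. An odd K avoids tied votes in two-class problems.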