FUNDAMENTAL OF DATA ANALYSIS AND LABORATORY
Module Laboratory
Academic Year 2023/2024 - Teacher: ANTONINO FURNARI
Expected Learning Outcomes
The laboratory module attached to the theoretical course aims to provide practical experience in data analysis. The main tools used are the Python language, the scientific computing libraries of the SciPy stack, other well-known Python-based data analysis libraries, and Jupyter notebooks in their various forms (local installations, Google Colab and Kaggle).
The objectives of the course are:
- Provide in-depth knowledge of the main technologies, languages, libraries and software useful for collecting, organizing, modeling, analyzing and interpreting data.
- Guide the student in the construction, management and analysis of real datasets, and in the definition of data models and decision support systems using the most appropriate techniques.
- Guide the student in choosing the most appropriate techniques to solve a given problem of data analysis and knowledge extraction, evaluating their pros and cons.
- Guide the student in drafting complete, rigorous and visually adequate reports that communicate the results of the analysis and exploration of a dataset correctly and effectively to the end user, clearly justifying the conclusions.
- Provide the skills students need to keep up to date independently with data analysis software and techniques.
Course Structure
Classroom lectures and individual computer-based work in the classroom.
If the teaching is delivered in mixed or remote mode, the arrangements described above may be adjusted as needed in order to cover the program set out in the syllabus.
Required Prerequisites
The course includes the following curricular prerequisites, which must be met prior to taking the exam:
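- Programmazione I e Laboratorio (Programming I and Laboratory)
- Algebra lineare e Geometria (Linear Algebra and Geometry)
- Elementi di Analisi Matematica I (Elements of Mathematical Analysis I)
- Strutture Discrete (Discrete Structures)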
Attendance of Lessons
Attending lectures is not mandatory, but strongly recommended.
Detailed Course Content
The course is divided into six main modules:
- Introduction to data analysis: introduction to using Python for scientific computing, the SciPy stack and the pandas library. Introduction to Jupyter notebooks and Google Colab as tools for running and documenting data analyses. Examples of datasets and their pre-processing. Using the SciPy and NumPy libraries to calculate probabilities and generate random values from different distributions.
- Descriptive and exploratory data analysis: using Python libraries to perform descriptive and exploratory analysis and create data visualizations.
- Inferential data analysis: introduction to the statsmodels library. Using Python libraries such as statsmodels to perform hypothesis testing, estimate confidence intervals, fit linear regression models and perform model selection.
- Elements of causal data analysis: using Python libraries to perform simple causal analyses.
- Predictive analytics: introduction to the scikit-learn library. Using Python and libraries such as scikit-learn to perform classification and regression tasks and evaluate the performance of a model.
- Introduction to time series analysis: using Python to perform time series analysis and forecasting.
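As an illustration of the kind of exercise covered in the first two modules, the following minimal sketch computes a probability from a distribution, draws random values from it, and summarizes the sample. It is not taken from the course material: the normal model for heights, its parameters and all variable names are assumptions made for this example, and Python's standard library (`statistics.NormalDist`, `random`) is used here as a dependency-free stand-in for the SciPy and NumPy routines the course refers to.

```python
import random
import statistics

# Hypothetical model: heights distributed as Normal(mean=170 cm, sd=10 cm).
heights = statistics.NormalDist(mu=170, sigma=10)

# Probability that a value falls within one standard deviation of the mean,
# computed as a difference of cumulative distribution values.
p_within_one_sd = heights.cdf(180) - heights.cdf(160)

# Draw a reproducible random sample from the same distribution.
rng = random.Random(0)
sample = [rng.gauss(170, 10) for _ in range(1000)]

# Descriptive statistics of the sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

print(f"P(160 < X < 180) = {p_within_one_sd:.4f}")  # about 0.68
print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
```

With a sample of this size, the sample mean and standard deviation should land close to the model parameters, which is a simple way to check that the generated values behave as expected.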
Textbook Information
Chapters from these books:
- Peck, Roxy, Chris Olsen, and Jay L. Devore. Introduction to statistics and data analysis. Cengage Learning, 2015.
- James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. An introduction to statistical learning: with applications in Python. Springer, 2023. https://www.statlearning.com
- Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
- Hernán, Miguel A., and James M. Robins. Causal inference, 2010. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Teaching material is shared through Microsoft Teams (Team code: i87g4nb).
Course Planning
No. | Subjects | Text References |
---|---|---|
1 | Introduction to the course | Teaching material provided by the teacher; specific chapters of the recommended books. |
2 | Main Data Analysis Concepts | Teaching material provided by the teacher; specific chapters of the recommended books. |
3 | Descriptive Statistics and Graphical Representation of Data | Teaching material provided by the teacher; specific chapters of the recommended books. |
4 | Uncertainty and data as the observation of random events | Teaching material provided by the teacher; specific chapters of the recommended books. |
5 | Probability Distributions | Teaching material provided by the teacher; specific chapters of the recommended books. |
6 | Introduction to statistical inference: generalizing to the population | Teaching material provided by the teacher; specific chapters of the recommended books. |
7 | Associations of Two Variables | Teaching material provided by the teacher; specific chapters of the recommended books. |
8 | Introduction to causal inference | Teaching material provided by the teacher; specific chapters of the recommended books. |
9 | Simple causal inference techniques to analyze observational data | Teaching material provided by the teacher; specific chapters of the recommended books. |
10 | Clustering & Density estimation | Teaching material provided by the teacher; specific chapters of the recommended books. |
11 | Dimensionality Reduction | Teaching material provided by the teacher; specific chapters of the recommended books. |
12 | Predictive Data Analysis | Teaching material provided by the teacher; specific chapters of the recommended books. |
13 | Probabilistic Models for Classification | Teaching material provided by the teacher; specific chapters of the recommended books. |
14 | Discriminant Functions for Classification | Teaching material provided by the teacher; specific chapters of the recommended books. |
15 | Time series data analysis | Teaching material provided by the teacher; specific chapters of the recommended books. |
Learning Assessment
Learning Assessment Procedures
The examination is divided into two distinct parts:
- A project that consists of the analysis of a dataset agreed upon with the teacher. The project will involve the application of the most appropriate data analysis techniques, depending on the dataset considered, as discussed during the lectures.
- An oral interview for the presentation of the project and the assessment of the knowledge of the course topics.
The assessment of learning can also be conducted remotely if the conditions require it.
The grading is expressed on a scale of thirty points according to the following scheme:
Score 29-30 with honors
The student has a deep understanding of the concepts and techniques of data analysis. They can promptly analyze data analysis problems, identifying the most suitable data analysis techniques for the given problem independently and critically, and indicating the most suitable methodological practices for their application. They have excellent communication skills and language proficiency.
Score 26-28
The student has a good understanding of the concepts and techniques of data analysis. They can analyze data analysis problems, identifying appropriate data analysis techniques for the given problem and indicating suitable methodological practices for their application. They have good communication skills and language proficiency.
Score 22-25
The student has a fair knowledge of the concepts and techniques of data analysis, although it may be limited to the main topics. They can analyze data analysis problems, albeit not always in a linear manner, identifying suitable data analysis techniques for the given problem. They have fair communication skills and language proficiency.
Score 18-21
The student has minimal knowledge of the concepts and techniques of data analysis. They have limited ability to analyze data analysis problems. They have sufficient communication skills, although their language proficiency is not always adequate.
Examination not passed
The student does not possess the minimum required knowledge of the main content of the course. Their ability to use specific language is very poor or nonexistent, and they are unable to independently apply the acquired knowledge.
Examples of frequently asked questions and/or exercises
The data analysis project is generally based on medium-to-large datasets available online.
Examples of typical exam questions:
- Define the classification problem, discuss the differences with respect to the regression problem and give practical examples.
- Explain the K-NN algorithm for classification. Discuss the effect of parameter K on algorithm performance. Give graphical examples of how the algorithm works and the effect of K.
- Discuss evaluation measures for classification problems: accuracy, confusion matrix, precision, recall and F1 score. Discuss the pros and cons of these measures, also in relation to the characteristics of the test dataset.
- Illustrate the main techniques useful for studying the correlation between variables.
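To make the questions on K-NN and on evaluation measures more concrete, here is a minimal sketch in plain Python. It is an illustrative assumption, not the course's reference solution: the toy 2D dataset, the helper names `knn_predict` and `binary_metrics`, and the choice of Euclidean distance with majority voting are all hypothetical. In practice the course points to scikit-learn for these tasks.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Label x by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], x))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the measures derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: two well-separated groups of 2D points.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = [0, 0, 0, 1, 1, 1]

# Test points; the last one is labeled 0 but lies closer to class 1.
test_X = [(2, 2), (5, 6), (1, 0), (7, 7), (4, 5)]
y_true = [0, 1, 0, 1, 0]

y_pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
metrics = binary_metrics(y_true, y_pred)
print(y_pred)
print(metrics)
```

On this toy data the last test point is misclassified as positive, so recall stays at 1.0 while precision drops to 2/3. That is the kind of trade-off between measures the exam question asks students to discuss. Changing `k` changes how many neighbors vote and therefore how smooth the decision boundary is.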