FUNDAMENTALS OF DATA ANALYSIS AND LABORATORY
Module Laboratory

Academic Year 2023/2024 - Teacher: ANTONINO FURNARI

Expected Learning Outcomes

The laboratory module attached to the theoretical course aims to provide practical experience in data analysis. The main tools are the Python language, the scientific computing libraries of the SciPy stack, other well-known Python data analysis libraries, and Jupyter notebooks in their various forms (local installations, Google Colab and Kaggle).

The objectives of the course are:

  1. Provide an in-depth knowledge of the main technologies, languages, libraries and software useful for collecting, organizing, modeling, analyzing and interpreting data.
  2. Guide the student in the construction, management and analysis of real datasets and in the definition, through the most appropriate techniques, of data models and decision support systems.
  3. Guide the student in choosing the most appropriate techniques to solve a given problem of data analysis and knowledge extraction, evaluating their pros and cons.
  4. Guide the student in the drafting of complete, rigorous, and visually adequate reports that communicate correctly and effectively to the end user the results of the analysis and exploration of a set of data, clearly justifying the conclusions.
  5. Provide the necessary skills to allow students to update themselves independently on the use of software and data analysis techniques.

Course Structure

Classroom lectures and individual hands-on work at the computer in the classroom.

If the course is taught in mixed or remote mode, the arrangements described above may be adjusted as needed in order to cover the program reported in the syllabus.

Required Prerequisites

The course includes the following curricular prerequisites, which must be met prior to taking the exam:

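  • Programmazione I e Laboratorio (Programming I and Laboratory)
  • Algebra lineare e Geometria (Linear Algebra and Geometry)
  • Elementi di Analisi Matematica I (Elements of Mathematical Analysis I)
  • Strutture Discrete (Discrete Structures)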

Attendance of Lessons

Attending lectures is not mandatory, but strongly recommended.

Detailed Course Content

The course is divided into six main modules:

  • Introduction to data analysis: introduction to using Python for scientific computing, the SciPy stack and the pandas library. Introduction to Jupyter notebooks and Google Colab as tools for running and documenting data analyses. Examples of datasets and their pre-processing. Using the SciPy and NumPy libraries to calculate probabilities and generate random values from different distributions.
  • Descriptive and exploratory data analysis: using Python libraries to perform descriptive and exploratory analysis and create data visualizations.
  • Inferential data analysis: introduction to the statsmodels library. Using Python libraries such as statsmodels to perform hypothesis testing, estimate confidence intervals, fit linear regression models, and perform model selection.
  • Elements of causal data analysis: using Python libraries to perform simple causal analysis.
  • Predictive analytics: introduction to the scikit-learn library. Using Python and libraries such as scikit-learn to perform classification and regression tasks and evaluate the performance of a model.
  • Introduction to time series analysis: using Python to perform time series analysis and forecasting.
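
As an illustration of the kind of laboratory exercise covered in the first and third modules, the following is a minimal sketch (not taken from the course material) that uses NumPy and SciPy to compute probabilities, draw random values from a distribution, estimate a confidence interval and run a hypothesis test. The distribution parameters, the sample size and the random seed are arbitrary assumptions chosen only for the example.

```python
import numpy as np
from scipy import stats

# Probability that a standard normal variable falls below 1.96.
p_below = stats.norm.cdf(1.96)

# Probability of exactly 3 heads in 10 flips of a fair coin (binomial distribution).
p_three_heads = stats.binom.pmf(3, n=10, p=0.5)

# Draw a reproducible random sample from a normal distribution.
# The seed, mean, standard deviation and size are illustrative assumptions.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

# 95% confidence interval for the population mean, based on the t distribution.
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

# One-sample t-test of the null hypothesis that the population mean equals 5.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

print(f"P(Z < 1.96) = {p_below:.4f}")
print(f"P(3 heads in 10 flips) = {p_three_heads:.4f}")
print(f"sample mean = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```

The same steps could also be carried out with statsmodels, which the inferential module introduces; SciPy is used here only to keep the sketch short.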

Textbook Information

Chapters from these books:

  • Peck, Roxy, Chris Olsen, and Jay L. Devore. Introduction to statistics and data analysis. Cengage Learning, 2015.
  • James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. An introduction to statistical learning: with applications in Python. Springer, 2023. https://www.statlearning.com
  • Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006. https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
  • Hernán, Miguel A., and James M. Robins. Causal inference, 2010. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Teaching material is shared through Microsoft Teams (team code: i87g4nb).

Course Planning

 # | Subjects | Text References
 1 | Introduction to the course | Teaching material provided by the teacher, specific chapters of the recommended books.
 2 | Main Data Analysis Concepts | Teaching material provided by the teacher, specific chapters of the recommended books.
 3 | Descriptive Statistics and Graphical Representation of data | Teaching material provided by the teacher, specific chapters of the recommended books.
 4 | Uncertainty and data as the observation of random events | Teaching material provided by the teacher, specific chapters of the recommended books.
 5 | Probability Distributions | Teaching material provided by the teacher, specific chapters of the recommended books.
 6 | Introduction to statistical inference: generalizing to the population | Teaching material provided by the teacher, specific chapters of the recommended books.
 7 | Associations of Two Variables | Teaching material provided by the teacher, specific chapters of the recommended books.
 8 | Introduction to causal inference | Teaching material provided by the teacher, specific chapters of the recommended books.
 9 | Simple causal inference techniques to analyze observational data | Teaching material provided by the teacher, specific chapters of the recommended books.
10 | Clustering & Density estimation | Teaching material provided by the teacher, specific chapters of the recommended books.
11 | Dimensionality Reduction | Teaching material provided by the teacher, specific chapters of the recommended books.
12 | Predictive Data Analysis | Teaching material provided by the teacher, specific chapters of the recommended books.
13 | Probabilistic Models for Classification | Teaching material provided by the teacher, specific chapters of the recommended books.
14 | Discriminant Functions for Classification | Teaching material provided by the teacher, specific chapters of the recommended books.
15 | Time series data analysis | Teaching material provided by the teacher, specific chapters of the recommended books.

Learning Assessment

Learning Assessment Procedures

The examination is divided into two distinct parts:

  • A project that consists of the analysis of a dataset agreed upon with the teacher. The project will involve the application of the most appropriate data analysis techniques, depending on the dataset considered, as discussed during the lectures.
  • An oral interview for the presentation of the project and the assessment of the knowledge of the course topics.

The assessment of learning can also be conducted remotely if the conditions require it.

The grade is expressed on a thirty-point scale according to the following scheme:

Score 29-30 with honors

The student has a deep understanding of the concepts and techniques of data analysis. They can promptly analyze data analysis problems, identifying the most suitable data analysis techniques for the given problem independently and critically, and indicating the most suitable methodological practices for their application. They have excellent communication skills and language proficiency.

Score 26-28

The student has a good understanding of the concepts and techniques of data analysis. They can analyze data analysis problems, identifying appropriate data analysis techniques for the given problem and indicating suitable methodological practices for their application. They have good communication skills and language proficiency.

Score 22-25

The student has a fair knowledge of the concepts and techniques of data analysis, although it may be limited to the main topics. They can analyze data analysis problems, albeit not always in a linear manner, identifying suitable data analysis techniques for the given problem. They have fair communication skills and language proficiency.

Score 18-21

The student has minimal knowledge of the concepts and techniques of data analysis. They have limited ability to analyze data analysis problems. They have sufficient communication skills, although their language proficiency is not always adequate.

Examination not passed

The student does not possess the minimum required knowledge of the main content of the course. Their ability to use specific language is very poor or nonexistent, and they are unable to independently apply the acquired knowledge.

Examples of frequently asked questions and/or exercises

The data analysis project is generally based on medium-to-large datasets available on the internet.

Examples of typical exam questions:

  • Define the classification problem, discuss the differences with respect to the regression problem and give practical examples.
  • Explain the K-NN algorithm for classification. Discuss the effect of parameter K on algorithm performance. Give graphical examples of how the algorithm works and the effect of K.
  • Discuss evaluation measures for classification problems: accuracy, confusion matrix, precision, recall and F1 score. Discuss the pros and cons of each measure, also in relation to the characteristics of the test dataset.
  • Illustrate the main techniques useful for studying the correlation between variables.
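
To make the second and third questions concrete, the following is a minimal from-scratch sketch of K-NN classification and of the evaluation measures listed above. It is an illustration only, not a model answer: the helper names `knn_predict` and `binary_metrics` and the toy dataset (including one deliberately mislabeled training point) are hypothetical. In practice the course uses libraries such as scikit-learn for these tasks.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and derived measures for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: two clusters, plus one noisy class-0 point inside the class-1 cluster.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (5.0, 5.0), (5.5, 6.0), (6.0, 5.5), (5.4, 5.2)]
train_labels = [0, 0, 0, 1, 1, 1, 0]
test_points = [(1.2, 1.4), (2.2, 2.0), (5.3, 5.3), (5.8, 5.8)]
test_labels = [0, 0, 1, 1]

# Compare a small and a larger K on the same test set.
preds_k1 = [knn_predict(train_points, train_labels, q, k=1) for q in test_points]
preds_k3 = [knn_predict(train_points, train_labels, q, k=3) for q in test_points]
metrics_k1 = binary_metrics(test_labels, preds_k1)
metrics_k3 = binary_metrics(test_labels, preds_k3)

print("K=1:", preds_k1, metrics_k1)
print("K=3:", preds_k3, metrics_k3)
```

On this toy data, K=1 follows the mislabeled training point and misclassifies one test sample, while K=3 outvotes it. This is the smoothing effect of a larger K that the second question asks about; a K that is too large would instead blur the boundary between classes.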