Introduction to Data Mining

Academic Year 2022/2023 - Teacher: GIOVANNI MICALE

Expected Learning Outcomes

General learning objectives, expressed as expected learning outcomes:

  1. Knowledge and understanding: The course aims to provide the knowledge and the basic and advanced skills needed for data analysis.
  2. Applying knowledge and understanding: The student will acquire knowledge of models and algorithms for data analysis, such as high-support mining, recommendation systems, similarity search in high dimensions, network analysis, neural networks, classification, and clustering.
  3. Making judgments: Through concrete examples and case studies, the student will be able to independently develop solutions to specific data analysis problems.
  4. Communication skills: The student will acquire the communication skills needed to use technical language appropriately in the general area of data analysis.
  5. Learning skills: The course aims to provide students with the theoretical and practical methods needed to independently tackle and solve new problems that may arise in a professional setting. To this end, several topics will be covered in class by involving students in the search for possible solutions to real problems, using benchmarks available in the literature.

Course Structure

Lectures.

Should teaching be carried out in mixed mode or remotely, it may be necessary to change some of what is stated above, while remaining in line with the programme outlined in the syllabus.

Required Prerequisites

Programming and data structures.

Attendance of Lessons

Attendance of lectures is highly recommended.

To make lectures easier to follow, slides are provided by the professor.

Slides are not meant as standalone study material, but as an aid to learning the topics explained during lectures.

For further study of the topics covered by the course, references to textbooks and online resources will be provided during lectures.

Detailed Course Content

The course includes a theoretical part, in which the main data mining problems will be explained, and a practical part, in which we will introduce the R programming language and show how to solve those problems using R. The two parts of the course will be carried out in parallel.

The following topics will be covered:

  • Introduction to Data Mining
  • Brief review of probability theory
  • R programming language
  • High-support data mining (Apriori algorithm, frequent itemsets, association rules)
  • Classification (decision trees, SVM, Bayesian classifiers, lazy classifiers, rule extractors)
  • Clustering (hierarchical, k-means, density-based)
  • Recommendation systems
  • Markov chains and Hidden Markov Models (HMMs)
  • Introduction to networks (centrality measures, clustering coefficient)
  • Random network models
  • Graph matching
  • Graph mining
  • Neural networks (Feed-Forward, Convolutional, Recurrent, Long Short-Term Memory)

As for the R language, we will cover base functions for data analysis as well as several packages for data mining, such as "caret" (for classification), "igraph" (for network analysis and visualization) and "keras" (for building neural networks).

Textbook Information

For the theoretical description of data mining problems, we will mainly refer to selected chapters of the following book:

  • Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman (http://www.mmds.org)

Other suggested textbooks for Data Mining are:

  • Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, The Morgan Kaufmann Series in Data Management Systems
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer

For probability theory, the suggested textbook is:

  • Statistical Methods in Bioinformatics: An Introduction (Second Edition), Warren J. Ewens and Gregory R. Grant, Springer.

A suggested book for learning the R programming language (available online) is:

  • The R Book (Second Edition), Michael J. Crawley, Wiley (https://www.cs.upc.edu/~robert/teaching/estadistica/TheRBook.pdf).

Course Planning

 #  | Subject | Text reference
 1  | Introduction to data mining | Leskovec, Chapter 1
 2  | Review of probability theory | Ewens and Grant, Chapters 1 and 2
 3  | R language | Materials provided by the lecturer
 4  | High-support data mining (Apriori, frequent itemsets, association rules) | Leskovec, Chapter 6
 5  | Classification (decision trees, SVM, Bayesian classifiers, lazy classifiers, rule extractors) | Leskovec, Chapter 12 + materials provided by the lecturer
 6  | Clustering (hierarchical, k-means, density-based) | Leskovec, Chapter 7
 7  | Recommendation systems | Leskovec, Chapter 9
 8  | Markov chains and HMMs | Materials provided by the lecturer
 9  | Introduction to networks (centrality measures, clustering coefficient) | Materials provided by the lecturer
 10 | Random models of networks | Materials provided by the lecturer
 11 | Graph matching | Materials provided by the lecturer
 12 | Graph mining | Materials provided by the lecturer
 13 | Neural networks (Feed-Forward, Convolutional, Recurrent, Long Short-Term Memory) | Leskovec, Chapter 13

Learning Assessment

Learning Assessment Procedures

The final exam consists of a written test and an oral examination.

The written test includes 3 open-ended theoretical questions on topics covered during lectures.

The grade obtained in the written test is the starting grade of the exam, which can be increased by up to 2 or up to 4 points after the oral examination, depending on the type of oral examination chosen by the student.

The oral examination can take one of the following forms:

  • A project (proposed by the student or the professor and in any case agreed upon by both), which generally consists of implementing the solution to a data mining problem, preferably in the R language presented during lectures. The project can add up to 4 points to the grade of the written test;
  • A seminar (agreed upon by the student and the professor), which consists of a short PowerPoint presentation (maximum 15 minutes) on the content of a scientific paper that investigates topics covered by the course. The seminar can add up to 2 points to the grade of the written test.

The two parts of the exam (written test and oral examination) can be taken in any order, even in different exam sessions.

Once assigned, the project must be completed within 3 months.

Unless otherwise communicated, the written test will be held at 11 AM and will be 1 hour long.

Notes:

  • The use of any electronic device (calculators, tablets, smartphones, cell phones, Bluetooth earphones, etc.), books or personal documents during the written exam is forbidden;
  • To take the exam, the student must register for it using the appropriate form on the CEA student portal;
  • Late registrations by email are not accepted. Without a registration, the final exam grade cannot be officially recorded;
  • Learning assessment may also be carried out online, should the conditions require it.

Examples of frequently asked questions and/or exercises

Examples of questions for the written exam, projects and seminars will be illustrated during lectures.