Introduction to Data Mining

Academic Year 2024/2025 - Teacher: ANTONIO DI MARIA

Expected Learning Outcomes

General learning objectives, expressed as expected learning outcomes:

  1. Knowledge and understanding: The course aims to provide basic and advanced knowledge and skills for the analysis of data.
  2. Applying knowledge and understanding: The student will learn models and algorithms for analyzing data, such as: mining of high-support (frequent) itemsets, recommendation systems, similarity search in high dimensions, network analysis, neural networks, classification and clustering.
  3. Making judgments: Through concrete examples and case studies, the student will be able to independently develop solutions to specific problems related to data analysis.
  4. Communication skills: The student will acquire the communication skills and the command of technical language needed in the general area of data analysis.
  5. Learning skills: The course aims to provide students with the theoretical and practical methods needed to tackle and solve, independently, new problems that may arise in a professional setting. To this end, several topics will be covered in class by involving students in the search for possible solutions to real problems, using benchmarks available in the literature.

Course Structure

Attendance of lectures is highly recommended.

To better follow lectures, slides are provided by the professor.

The slides are not meant as study material on their own; they are an aid for following the topics explained during lectures.

For further investigation of the topics covered by the course, additional references to textbooks and online resources for each topic are specified in the Syllabus.

Required Prerequisites

Programming and data structures.

Attendance of Lessons

Attendance of lectures is mandatory.

Detailed Course Content

The course includes a theoretical part, in which the main data mining problems will be explained, and a practical part, in which we will introduce the Python programming language and show how to solve those problems using Python. The two parts of the course will be carried out in parallel.

The following topics will be covered:

  • Introduction to Data Mining
  • Python programming language
  • Data preprocessing
  • Mining of frequent itemsets (apriori algorithm, frequent itemsets, association rules)
  • Classification/Regression (decision trees and rules extractors, Naive Bayes, Perceptron, SVM, kNN, random forest, AdaBoost, Linear Regression, Logistic Regression)
  • Clustering (hierarchical, k-means, HDBSCAN, DBSCAN, OPTICS)
  • Introduction to Networks (Centrality measures, Clustering coefficient)
  • Network random models
  • Graph matching
  • Graph mining
  • Neural Networks (Feed-Forward, Convolutional, Recurrent, Long Short-Term Memory, Transformer, LLM)
  • Mining of data streams

Regarding the Python language, we will show basic functions for data analysis as well as several packages for data mining, such as "igraph" for network analysis and visualization and "PyTorch" for building neural networks.

Textbook Information

For the theoretical description of data mining problems, we will mainly refer to different chapters of the following books:

  • "Mining of Massive Datasets". Jure Leskovec, Anand Rajaraman, Jeff Ullman (http://www.mmds.org).
  • "Data Mining: the Textbook". Charu C. Aggarwal, Springer, 2015.
  • "Network Science". Albert-Laszlo Barabasi, Cambridge University Press, 2016. 

    For learning the Python programming language, a valuable resource is the official Python tutorial:

    • https://docs.python.org/3/tutorial/index.html

    Course Planning

    #  | Subjects | Text References
    1  | Introduction to data mining | Leskovec Chapter 1
    2  | Python language | Materials provided by the lecturer
    3  | Data preprocessing | Aggarwal Chapter 2 + Materials provided by the lecturer
    4  | Mining of frequent itemsets (apriori, frequent itemsets, association rules) | Leskovec Chapter 6
    5  | Classification (decision trees and rules extractors, Naive Bayes, Perceptron, SVM, kNN, Random Forest) | Leskovec Chapter 12 + Materials provided by the lecturer
    6  | Clustering (hierarchical, k-means, BFR, CURE, DBSCAN, OPTICS) | Leskovec Chapter 7
    7  | Introduction to networks (centrality measures, clustering coefficient) | Barabasi Chapters 1 and 2
    8  | Random models of networks | Barabasi Chapters 3, 4 and 5
    9  | Graph matching | Materials provided by the lecturer
    10 | Graph mining | Materials provided by the lecturer
    11 | Neural networks (Feed-Forward, Convolutional, Recurrent, Long Short-Term Memory) | Leskovec Chapter 13

    Learning Assessment

    Learning Assessment Procedures

    The final exam consists of two mandatory parts:
       1. Written exam: two open-ended theoretical questions on topics covered during the lectures.
           The grade ranges from 0 to 20.

       2. Project: related to algorithms and methodologies discussed in class. The grade ranges from 0 to 10.
           The written exam and project grades add up to a score from 0 to 30.

    In addition to the mandatory parts, there are three optional parts:

    1. Classwork: in-class exercises related to algorithms and methodologies explained. The score ranges from 0 to 2 points.
    2. Homework: take-home exercises related to algorithms and methodologies explained in class (in collaboration with companies). The score ranges from 0 to 2 points.
    3. Seminar: Presentation on a paper provided by the professor. The score ranges from 0 to 1 point.
    By adding the scores from the mandatory and optional parts, a maximum score of 35 can be obtained. Any score above 30 results in a final grade of 30 with honors ("30L").
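
The arithmetic above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `final_grade`, the `MAX_POINTS` table and the range checks are assumptions made for the example, not an official grading tool. Only the score ranges and the "30L" rule are taken from the text above; rounding and any pass threshold are not specified here and are therefore left out.

```python
# Hypothetical helper: names and validation policy are illustrative assumptions.
# Only the score ranges and the "30L" rule come from the assessment rules above.
MAX_POINTS = {
    "written": 20,    # mandatory written exam
    "project": 10,    # mandatory project
    "classwork": 2,   # optional
    "homework": 2,    # optional
    "seminar": 1,     # optional
}


def final_grade(written, project, classwork=0, homework=0, seminar=0):
    """Return (total_score, grade_label) for the given component scores."""
    scores = {
        "written": written,
        "project": project,
        "classwork": classwork,
        "homework": homework,
        "seminar": seminar,
    }
    # Reject scores outside the range stated for each component.
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    total = sum(scores.values())
    # Any total above 30 is recorded as 30 with honors.
    label = "30L" if total > 30 else f"{total:g}"
    return total, label


if __name__ == "__main__":
    print(final_grade(18, 9))            # (27, '27')
    print(final_grade(20, 10, 2, 2, 1))  # (35, '30L')
```

For instance, a student with 18 on the written exam and 9 on the project, and no optional parts, totals 27; full marks on every part total 35 and are recorded as "30L".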


    Unless otherwise communicated, the written test will be held at 11 AM and will be 1 hour long.

    Notes:

    • The use of any electronic device (calculators, tablets, smartphones, cell phones, Bluetooth earphones, etc.), books or personal documents during the written exam is forbidden;
    • To take the exam, the student must register for it using the appropriate form on the CEA student portal;
    • Late registrations by email are not accepted. Without a registration, the final exam grade cannot be officially recorded;
    • Learning assessment may also be carried out online, should the conditions require it.

    Examples of frequently asked questions and/or exercises

    Examples of questions for the written exam, projects and seminars will be illustrated during lectures.