BIG DATA
Academic Year 2024/2025 - Teacher: ALFREDO PULVIRENTI
Expected Learning Outcomes
The course introduces the main data mining techniques, focusing on the mining of large amounts of data that do not fit in main memory. The examples presented during the course cover the web, social networks and Next Generation Sequencing data produced in the biomedical field. The course also treats the subject from an algorithmic point of view, emphasizing how this differs from machine learning. Among the topics discussed are tools such as map-reduce, which make it possible to work with large amounts of data in a distributed environment. This topic is the common thread across all the mining problems presented. Next, the course introduces similarity search and the use of hashing techniques for large volumes of data. It also addresses the classical problem of frequent itemset (high-support) mining by describing the Apriori algorithm and its variants. Recommendation systems are then introduced. In this context the course also addresses the problem of high-dimensional data and dimensionality reduction techniques such as SVD, CUR and NNMF. The course then introduces the main themes in network analysis: centrality measures, with particular emphasis on PageRank and its variants, and the concept of a null network model that preserves network characteristics such as degree distribution and clustering coefficient. The models presented include Erdős-Rényi, Chung-Lu and Preferential Attachment. The problem of clustering is addressed through modularity and spectral clustering techniques.
General learning objectives in terms of expected learning outcomes:
Knowledge and understanding: the course aims to provide the basic and advanced knowledge and skills needed to analyze large amounts of data.
Applying knowledge and understanding: the student will acquire knowledge of models and algorithms for analyzing data, such as high-support mining, recommendation systems, similarity search in high dimensions, map-reduce and Spark, complex network analysis, text mining and document tagging systems.
Making judgments: through concrete examples and case studies, the student will be able to independently develop solutions to specific problems related to big data.
Communication skills: the student will acquire the communication skills needed to use technical language appropriately in the general area of big data.
Learning skills: the course aims to provide students with the theoretical and practical methods needed to independently tackle and solve new problems that may arise in professional work. To this end, several topics will be covered in class by involving students in the search for possible solutions to real problems, using benchmarks available in the literature.
Course Structure
Lectures and laboratory.
Should teaching be carried out in mixed mode or remotely, it may be necessary to introduce changes with respect to previous statements, in line with the programme planned and outlined in the syllabus.
Learning assessment may also be carried out online, should the conditions require it.
Detailed Course Content
High-support data mining. Recommendation systems. Map-Reduce. Beyond Map-Reduce. Similarity search in high dimensions: shingling, Min-Hashing, LSH, Min-LSH. Dimensionality reduction: SVD, CUR, application to LSI, Johnson-Lindenstrauss theorem. Link analysis: PageRank, link spam, Hub-Authorities, applications on Map-Reduce. Web advertising: online algorithms, AdWords and its implementations. Graph mining: subgraph matching, motif finding, community detection, network alignment and network analysis. Text mining: TF.IDF, bag-of-words, entity annotation.
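As an illustration of the shingling and Min-Hashing topics listed above, the following is a minimal sketch and not course material. It assumes character 3-shingles, hash functions of the form h(x) = (a·x + b) mod p, and CRC32 as a stable way to map shingles to integers; all function names and parameters are hypothetical choices made for this example.

```python
import random
import zlib

# Mersenne prime used as the hash modulus (an assumption of this sketch).
PRIME = (1 << 61) - 1

def shingles(text, k=3):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(items, hash_params):
    """For each hash function, keep the minimum hash value over the set."""
    # CRC32 gives a stable integer id per shingle (built-in hash() of str
    # is randomized between Python processes).
    ids = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

doc_a = shingles("mining of massive datasets with map-reduce")
doc_b = shingles("mining of large datasets with map-reduce")
params = make_hash_params(256)
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)
exact = jaccard(doc_a, doc_b)
approx = estimate_similarity(sig_a, sig_b)
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {approx:.3f}")
```

The signatures have a fixed length regardless of document size, which is what makes the later LSH banding step practical on large collections. More hash functions tighten the estimate at the cost of longer signatures.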
Textbook Information
Course Planning
No. | Subjects | Text References |
---|---|---|
1 | Introduction, Map-Reduce, Spark | Chapters 1 and 2 + supplementary teaching material |
2 | Frequent itemset mining | Chapter 6 + supplementary teaching material |
3 | High-dimensional similarity. Locality-Sensitive Hashing (LSH). | Chapter 3 + supplementary teaching material |
4 | Hands-on session on LSH and its applications | Chapter 3 + supplementary teaching material |
5 | Dimensionality reduction: PCA, SVD, CUR, NNMF | Chapter 11 + supplementary teaching material |
6 | Hands-on session on dimensionality reduction | Chapter 11 + supplementary teaching material |
7 | Recommendation systems: Latent Semantic Indexing, collaborative filtering and network-based inference | Chapter 9 + supplementary teaching material |
8 | Hands-on session on recommendation systems | Chapter 9 + supplementary teaching material |
9 | Link analysis: PageRank, link spam, Hub-Authorities, applications on Map-Reduce | Chapter 5 + supplementary teaching material |
10 | Analysis of large graphs: triangle counting, subgraph matching and motif finding, community detection (overlapping communities), network alignment | Chapter 10 + supplementary teaching material |
11 | Hands-on session on motif finding in large graphs. Applications in finance. | Chapter 10 + supplementary teaching material |
12 | Web advertising: online algorithms, AdWords and its implementations | Chapter 8 + supplementary teaching material |
13 | Text mining: TF.IDF, entity annotation | Supplementary teaching material |
14 | Hands-on session on text mining and recommendation systems for the analysis of citation databases: arXiv, PubMed | Supplementary teaching material |