BIG DATA

Academic Year 2023/2024 - Teacher: ALFREDO PULVIRENTI

Expected Learning Outcomes

The course introduces the main data mining techniques, with a focus on mining large amounts of data that do not fit in main memory. The examples presented during the course cover the web, social networks, and Next Generation Sequencing data produced in the biomedical field. The course also approaches the subject from an algorithmic point of view, emphasizing the differences from machine learning.

Among the topics discussed are tools such as map-reduce, used to work with large amounts of data in a distributed environment. This topic is the common denominator of all the mining problems presented. Next, the course introduces similarity search and the use of hashing techniques for large volumes of data. It also addresses the classical problem of mining frequent (high-support) itemsets, describing the Apriori algorithm and its variants. Recommendation systems are then introduced. In this context the course also addresses the problem of high-dimensional data and dimensionality reduction techniques such as SVD, CUR, and NNMF.

The course then introduces the main themes in network analysis. Centrality measures for networks are presented, with particular emphasis on PageRank and its variants. Null network models, which preserve network characteristics such as degree distribution and clustering coefficient, are introduced. Among the models presented are Erdős-Rényi, Chung-Lu, and Preferential Attachment. The problem of clustering is addressed through modularity and spectral clustering techniques.

General learning objectives in terms of expected learning outcomes:

Knowledge and understanding: The course aims to provide the knowledge and the basic and advanced skills needed to analyze large amounts of data.

Applying knowledge and understanding: The student will acquire knowledge of models and algorithms for analyzing data, such as frequent (high-support) itemset mining, recommendation systems, similarity search in high dimensions, map-reduce and Spark, complex network analysis, text mining, and document tagging systems.

Making judgments: Through concrete examples and case studies, the student will be able to independently develop solutions to specific problems related to big data.

Communication skills: The student will acquire the communication skills and the command of technical language needed in the general area of big data.

Learning skills: The course aims to provide students with the theoretical and practical methods needed to independently tackle and solve new problems that may arise in professional practice. To this end, several topics will be covered in class by involving students in the search for possible solutions to real problems, using benchmarks available in the literature.

Course Structure

Lectures and laboratory.

Should teaching be carried out in mixed mode or remotely, it may be necessary to introduce changes with respect to the statements above, in line with the programme planned and outlined in the syllabus.

Learning assessment may also be carried out online, should the conditions require it.

Detailed Course Content

High-support data mining. Recommendation systems. Map-Reduce. Beyond Map-Reduce. Similarity search in high dimensions: shingling, Min-Hashing, LSH, Min-LSH. Dimensionality reduction: SVD, CUR, application to LSI, Johnson-Lindenstrauss theorem. Link analysis: PageRank, link spam, Hub-Authorities, applications on Map-Reduce. Web advertising: online algorithms, AdWords and its implementations. Graph mining: subgraph matching, motif finding, community detection, network alignment, and network analysis. Text mining: TF.IDF, Bag-of-Words, entity annotation.

Textbook Information

Mining of Massive Datasets

Jure Leskovec, Anand Rajaraman, Jeff Ullman

http://www.mmds.org

Course Planning

| No. | Subjects | Text References |
| --- | --- | --- |
| 1 | Introduction, Map-Reduce, Spark | Chapters 1 and 2 + supplementary teaching material |
| 2 | Frequent itemset mining | Chapter 6 + supplementary teaching material |
| 3 | Similarity in high dimensions. Locality-Sensitive Hashing (LSH) | Chapter 3 + supplementary teaching material |
| 4 | Hands-on session on LSH and its applications | Chapter 3 + supplementary teaching material |
| 5 | Dimensionality reduction: PCA, SVD, CUR, NNMF | Chapter 11 + supplementary teaching material |
| 6 | Hands-on session on dimensionality reduction | Chapter 11 + supplementary teaching material |
| 7 | Recommendation systems: Latent Semantic Indexing, collaborative filtering, and network-based inference | Chapter 9 + supplementary teaching material |
| 8 | Hands-on session on recommendation systems | Chapter 9 + supplementary teaching material |
| 9 | Link analysis: PageRank, link spam, Hub-Authorities, applications on Map-Reduce | Chapter 5 + supplementary teaching material |
| 10 | Analysis of large graphs: triangle counting, subgraph matching and motif finding, community detection (overlapping communities), network alignment | Chapter 10 + supplementary teaching material |
| 11 | Hands-on session on motif finding in large graphs. Applications in finance | Chapter 10 + supplementary teaching material |
| 12 | Web advertising: online algorithms, AdWords and its implementations | Chapter 8 + supplementary teaching material |
| 13 | Text mining: TF.IDF, entity annotation | Supplementary teaching material |
| 14 | Hands-on session on text mining and recommendation systems for the analysis of citation databases: arXiv, PubMed | Supplementary teaching material |