Students should have a solid understanding of linear algebra and programming before taking the course. The minimum requirements in these areas are as follows:
Linear algebra:
- Knowledge of vector operations, such as addition, subtraction, scalar multiplication, and scalar product.
- Understanding of operations with matrices, including multiplication, addition, and subtraction of matrices.
- Familiarity with concepts such as matrix transposes, inverses, determinants, and eigenvalues/eigenvectors.
- Understanding of linear transformations, including their representation by matrices.
- Knowledge of solving systems of linear equations and matrix factorizations.
Programming:
- Knowledge of Object-oriented programming.
- Understanding of fundamental programming concepts, including variables, data types, control flow (e.g., loops, conditionals), functions, and error handling.
- Experience with writing and running code to manipulate data structures, such as arrays, lists, and dictionaries.
- Familiarity with basic input/output file operations.
Upon completion of the course, the student will be able to:
- Design and implement a data mining process for real-world applications.
- Use data warehouse tools and OLAP techniques to perform a given data mining analysis.
- Assess and visualize the results of data mining, and apply the extracted knowledge appropriately to the domain of interest.
- Communicate the application and use of data mining to both expert and non-expert (broad) audiences.
- Describe the main data mining algorithms, implement them, and, where necessary, adapt the most suitable algorithms to the application at hand, improving their efficiency and performance.
In summary, the student will be able to work as a data mining analyst, use current commercial data mining tools, and develop data mining projects.
The syllabus of the subject is as follows:
1. MapReduce
2. PageRank
3. Instance Based Learning (IBL)
4. Optimization: Least Squares and Gradient Descent
5. Other Bioinspired Optimization Methods (Simulated Annealing and Genetic Algorithms)
6. Data Preprocessing
7. Attribute Selection and Regularization
8. Evaluation of DM Models
9. Inductive Learning
10. Association Rules
11. Boosting / Bagging and other Ensemble Methods
12. Bayesian Learning
13. Neural Networks
Additional topics:
14. Clustering and Unsupervised Learning
15. Support Vector Machines (SVM)
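As a taste of topic 3 (Instance-Based Learning), the simplest instance-based classifier just stores the training set and labels a query with the class of its nearest stored example. A minimal sketch in Python (the toy data and function names are illustrative, not course material):

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_classify(train, query):
    # train: list of (feature_vector, label) pairs.
    # Return the label of the training example closest to the query.
    _, label = min(train, key=lambda pair: euclidean(pair[0], query))
    return label

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
print(nn_classify(train, (0.9, 1.1)))  # nearest point is (1.0, 1.0) -> "A"
```

Extending this to k > 1 neighbours with majority voting gives the classic k-NN classifier studied in the IBL readings.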
The course follows the Problem-Based Learning (PBL) methodology. This methodology promotes student learning through the definition of a problem that the student must solve together with a team of other students. Students do not attend classes in the traditional way (lectures), where lectures usually anticipate the knowledge a student should acquire before the student feels the need for it. With PBL, students build their own knowledge of the domain through the guided solution of a properly designed problem.
The benefits of this methodology are deeper learning and the development of the professional skills a student needs when entering the workforce. While working on the problem, students use and improve skills such as teamwork, project management, and communication. In this methodology there is no clear separation between 'theory' and 'practice'; both are interleaved continuously during project development. Students acquire the required knowledge incrementally, as they need it to solve the problem.
The specific implementation of this methodology in this course is detailed as follows:
- For each subject of the course, a problem will be defined. This refers to sections 2, 3, 4, and 5 of the contents detailed above.
- Students will work in teams and will be required to provide a solution for that problem. The professor will act as a supervisor who helps and guides the project development. A list of resources (notes, books, and articles) will be available for each project. Some of these resources will be prepared by the professor, while others will be contributed by the students themselves, thus promoting autonomous learning.
- There are no lectures in the traditional sense. Instead, seminars and meetings will be scheduled to guarantee that 1) the project is properly developed by each of the teams and 2) students have no knowledge gaps.
- Teams will be sized depending on the number of students in the class and the complexity of the projects.
- There might be different problems for a given subject of the course. Each team may work on one of the problems and, at the end, share its findings and results with the rest of the course.
The evaluation is adapted to the PBL methodology. Each student will receive an overall grade for the projects, computed as the average of the submitted mini-projects. The grade for each project will be the weighted average of the team's project grade and an individual grade (which depends on the student's level of participation in the project development). At the end of the course, the student will defend a scientific paper drawing on the general knowledge of the subject.
- Partial exams / checkpoints (2 control tests): 30%
- Practical assignments (3 assignments): 40%
- Defence of a scientific paper: 30%
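With the weights above, the final course grade is a simple weighted sum. A minimal sketch in Python (the 0-10 grading scale and the example marks are assumptions for illustration, not taken from the course rules):

```python
def final_grade(checkpoints_avg, assignments_avg, paper):
    # Weights from the evaluation breakdown:
    # 30% checkpoints, 40% practical assignments, 30% paper defence.
    return 0.30 * checkpoints_avg + 0.40 * assignments_avg + 0.30 * paper

# e.g. averages of 7.0 (checkpoints), 8.0 (assignments) and 6.0 (paper)
print(round(final_grade(7.0, 8.0, 6.0), 2))  # -> 7.1
```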
Please read the previous section.
Instance-Based Learning Algorithms (Aha et al., 1991)
D.W. Aha; D. Kibler; M.K. Albert
"Instance-Based Learning Algorithms"
Machine Learning, 6, 37-66 (1991)
Kluwer Academic Publishers
Case-Based Reasoning (Aamodt & Plaza, 1994)
A. Aamodt & E. Plaza
"Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches"
AI Communications. IOS Press, Vol. 7: 1, pp. 39-59. (1994)
Improved Heterogeneous Distance Functions (Wilson & Martinez, 1997)
D.R. Wilson and T.R. Martinez
"Improved Heterogeneous Distance Functions"
Journal of Artificial Intelligence Research 6 (1997) 1-34
Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions
Andoni, A. and Indyk, P.
Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium
Five Balltree Construction Algorithms
Omohundro, S.M.,
International Computer Science Institute Technical Report (1989)
LSH Forest: Self-Tuning Indexes for Similarity Search
Bawa, M., Condie, T., and Ganesan, P.
WWW '05: Proceedings of the 14th International Conference on World Wide Web, pp. 651-660
Andrew Ng – Stanford Lecture Notes
Machine Learning for Natural Language Processing
Algorithms in Nature (CMU) - Optimization and Search
Algorithms in Nature (CMU) - Genetic Algorithms
Simulated Annealing - Kirill Netreba
Lecture Notes on Perception, Sensing & Instrumentation
Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp
SIAM Rev., Survey and Review section, Vol. 53, num. 2, pp. 217-288, June 2011
Matrix decompositions & latent semantic indexing
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press, chapter 18: Matrix decompositions & latent semantic indexing
ID3 Algorithm (Quinlan, 1986)
J.R. Quinlan
"Induction of Decision Trees"
Machine Learning 1:81-106
Kluwer Academic Publishers
(1986)
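Quinlan's ID3 builds decision trees by choosing, at each node, the attribute with the highest information gain. A minimal sketch of that criterion in Python (the toy weather data is illustrative, not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # rows: list of dicts mapping attribute name -> value.
    # Gain = entropy before the split minus the weighted entropy after it.
    total = len(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    remainder = sum(len(ls) / total * entropy(ls) for ls in split.values())
    return entropy(labels) - remainder

rows = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
labels = ["play", "play", "rest", "rest"]
print(information_gain(rows, labels, "windy"))  # perfect split -> 1.0
```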
C4.5 Algorithm (Quinlan, 1996)
J.R. Quinlan
"Improved Use of Continuous Attributes in C4.5"
Journal of Artificial Intelligence Research 4:77-90
(1996)
Association Rules (R. Agrawal et al., 1993)
R. Agrawal, T. Imielinski and A. Swami
"Mining association rules between sets of items in large databases"
Proceedings of the 1993 ACM SIGMOD International Conference on Management of data (SIGMOD'93), pp. 207-216
ISBN:0-89791-592-5
(1993)
http://dl.acm.org/citation.cfm?id=170072
APRIORI Algorithm (R. Agrawal & R. Srikant, 1994)
R. Agrawal & R. Srikant
"Fast Algorithms for Mining Association Rules"
Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), pp 487-499
Morgan Kaufmann Publishers Inc. San Francisco, CA, USA
ISBN:1-55860-153-8
(1994)
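The core of Apriori is its level-wise search: a (k+1)-itemset can only be frequent if it is built from frequent k-itemsets. A simplified sketch in Python (minimum support is given as an absolute count, and the paper's full subset-pruning step is omitted for brevity):

```python
def frequent_itemsets(transactions, min_support):
    # Level-wise Apriori search: count candidates, keep the frequent ones,
    # and build the next level only from surviving itemsets.
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    candidates = [frozenset([i]) for i in items]
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: unions of frequent k-sets that give (k+1)-sets.
        k += 1
        prev = list(level)
        candidates = list({a | b for a in prev for b in prev if len(a | b) == k})
    return frequent

tx = [frozenset(t) for t in
      [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]]
freq = frequent_itemsets(tx, min_support=2)
print(freq[frozenset({"bread", "milk"})])  # co-occur in 2 baskets -> 2
```

From the frequent itemsets, association rules are then derived by filtering on confidence, as described in the 1993 paper above.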
Bankruptcy forecasting: An empirical comparison of AdaBoost and neural networks
Esteban Alfaro, Noelia García, Matías Gámez, David Elizondo
Decision Support Systems, 45(1):110-122, April 2008. DOI: 10.1016/j.dss.2007.12.002
Self-Organizing Maps (SOM) - Kohonen (1982)
T. Kohonen.
Self-Organized Formation of Topologically Correct Feature Maps.
Biological Cybernetics 43, 59-69 (1982)
Introduction to BPN (INPUT 1996)
E. Pous, M. Roman and E. Golobardes.
“Introducció a les xarxes neuronals, presentació de les Backpropagation”
INPUT 10 (1996)
Deep Learning
Y. LeCun, Y. Bengio, and G. Hinton
Nature 521, 436-444 (2015)
Human gesture recognition using Kinect camera
O. Patsadu, C. Nukoolkit and B. Watanapa, "Human gesture recognition using Kinect camera," 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE), Bangkok, 2012, pp. 28-32.
doi: 10.1109/JCSSE.2012.6261920
New types of deep neural network learning for speech recognition and related applications: an overview
L. Deng, G. Hinton and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 8599-8603.
doi: 10.1109/ICASSP.2013.6639344
A Practical Introduction to Deep Learning with Caffe and Python
The goal of this blog post is to give you a hands-on introduction to deep learning. To do this, we will build a Cat/Dog image classifier using a deep learning algorithm called a convolutional neural network (CNN) and a Kaggle dataset.
K-means (MacQueen, 1967)
J.B. MacQueen
"Some Methods for classification and Analysis of Multivariate Observations"
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, University of California Press (1967)
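MacQueen's idea survives today as the standard Lloyd-style iteration: assign each point to its nearest centroid, recompute each centroid as its cluster mean, and repeat. A minimal sketch in Python (the 2-D toy points, seed, and fixed iteration count are illustrative choices):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's iteration: assign points to nearest centroid, then
    # recompute each centroid as the mean of its assigned points.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))  # two centroids, near (0.05, 0.1) and (5.1, 4.95)
```

The X-means reference below extends exactly this loop by also estimating the number of clusters k.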
X-means (Pelleg & Moore, 2000)
D. Pelleg and A. Moore
"X-means: Extending K-means with Efficient Estimation of the Number of Clusters"
In Proceedings of the 17th International Conference on Machine Learning (ICML’00), pages 727-734, Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, ISBN:1-55860-707-2 (2000)
Slides and articles will be available for each project in eStudy.