Bachelor in Business Intelligence and Data Analytics

Become an expert in data analysis and business decision-making within a technological ecosystem that offers strong networking opportunities

Unstructured Data Analysis

Description: 

This module has been designed to equip students with the necessary resources to learn how to extract business value from unstructured data. The technologies covered are organized into two main sections: an initial part focused on machine learning and associated statistical tools, followed by a more computational segment centered on natural language and image processing—the two most common types of unstructured data.

 

The approach is fundamentally practical, while providing sufficient theoretical grounding to ensure students can assimilate and consolidate their understanding of both core techniques and state-of-the-art methods.

Subject Type
Third - Compulsory
Semester
Second
Course
4
Credits
3.00
Previous Knowledge: 

To successfully engage with this course, students are expected to have a solid foundation in linear algebra and advanced statistics, as well as a clear understanding of the fundamentals of artificial intelligence and core data analysis techniques. Familiarity with the architecture and management of Big Data pipelines is also required, along with advanced proficiency in Python and hands-on experience with commonly used development tools and runtime environments—such as Jupyter Notebooks, VS Code, virtual environments, and package managers.

Objectives: 

The objective of this module is for students to acquire fundamental knowledge and develop sufficient proficiency in the various techniques for processing unstructured data, as well as to build the criteria needed to identify the most appropriate methods for analyzing, processing, and extracting business value from this type of data.

Contents: 

  1. Introduction to Unstructured Data. This session provides a course overview and introduces the challenges associated with handling unstructured data—such as text and images—their prevalence in real-world scenarios, and the importance of analytical techniques for extracting meaningful insights. Foundational concepts from applied neuroscience, standard preprocessing steps, and essential tools for managing unstructured data are also introduced.
  2. Co-occurrence Analysis and High-Dimensional Data Visualization with PCA. This session examines the frequency and patterns of paired elements (e.g., keywords or codes within a dataset) to uncover associations and structural relationships among data components. It also covers projecting multi-feature datasets into lower-dimensional spaces using Principal Component Analysis (PCA), enabling clearer interpretation of data structure and variance in two or three dimensions.
  3. PCA (Continued) and Manifold Learning. This session explores a family of non-linear algorithms designed to discover low-dimensional structures embedded within high-dimensional data. By preserving intrinsic geometric relationships, these methods reveal complex patterns that linear techniques like PCA cannot capture.
  4. Clustering: k-means and Other Models. This session focuses on unsupervised clustering algorithms that partition data into a predefined number of groups. It then extends to probabilistic approaches, modeling data as a combination of multiple distributions to capture more flexible structures and assignment probabilities.
  5. Clustering (Continued): Interpretation and Selecting the Number of Clusters. This session addresses strategies for determining the optimal number of clusters using likelihood-based criteria that balance model fit against complexity, helping to avoid overfitting and ensure robust, interpretable results.
  6. Unstructured Data Review. This session revisits the specific challenges of working with unstructured data—particularly natural language and images—their ubiquity in real-world contexts, and the critical role of analytical techniques in deriving actionable insights. Foundational neuroscience concepts, common preprocessing workflows, and key tools for managing unstructured data are also reviewed.
  7. Rule-Based NLP. This session focuses on rule-based natural language processing methods, which rely on handcrafted patterns and linguistic rules to analyze and manipulate text. Essential techniques such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), and syntactic parsing are covered. Through practical examples, participants explore the strengths, limitations, and appropriate use cases for rule-based approaches—whether in niche applications or as complements to data-driven models.
  8. Neural Networks. This session aims to refresh students' understanding of neural networks and prepare them for advanced topics in NLP and deep learning. Key concepts—including perceptrons, activation functions, backpropagation, and core architectures such as feedforward, convolutional (CNN), and recurrent neural networks (RNN)—are reviewed.
  9. NLP with Deep Learning. This session introduces deep learning approaches to natural language processing, demonstrating how neural networks can tackle complex language tasks such as sentiment analysis, machine translation, and question answering. Techniques including RNNs, Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) are explained, highlighting their advantages over rule-based methods, with practical demonstrations included.
  10. Embeddings and Vectorization. This session covers the representation of textual data in numerical form for machine learning applications. Word embedding techniques such as Word2Vec and GloVe are introduced, alongside contextual embeddings from models like BERT. Students learn how vectorization methods capture semantic relationships and contextual information, enabling sophisticated language modeling and deeper linguistic understanding.
  11. Transformers and Generative AI. This session explores the evolution of the transformer architecture. Core concepts—including self-attention and multi-head attention mechanisms—are explained, along with the standard transformer structure and its operational principles. Models such as BERT and GPT are introduced, accompanied by practical examples of their application across diverse domains.
  12. Generative AI and Business Applications. This session focuses on real-world applications of generative AI in business contexts. Case studies illustrate how AI can enhance customer experiences, streamline workflows, and enable innovative solutions. Ethical considerations, implementation challenges, and best practices for deploying generative AI systems are also discussed. Additionally, emerging architectures such as RAG (Retrieval-Augmented Generation) and Agentic RAG are covered.
  13. Image Processing: From CNNs to Transformers. This closing session covers the fundamentals of image processing, beginning with Convolutional Neural Networks (CNNs) and their role in tasks such as image classification and object detection. It then addresses the transition to transformer-based architectures in computer vision, demonstrating how these models have surpassed traditional CNNs in tasks requiring contextual understanding and global relationship modeling within images.
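To make the session topics above concrete, a few minimal sketches follow. The PCA material of sessions 2–3 can be illustrated with the closed-form 2-D case: project data onto the leading eigenvector of the covariance matrix. This is a toy illustration with invented data and no library dependencies; in practice one would use a tool such as scikit-learn's `PCA`.

```python
import math

def pca_2d(points):
    """First principal axis of 2-D data, via the closed-form
    eigen-decomposition of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)
    # Matching eigenvector (fall back to an axis if the data are uncorrelated)
    v = (lam1 - syy, sxy) if abs(sxy) > 1e-12 else ((1.0, 0.0) if sxx >= syy else (0.0, 1.0))
    norm = math.hypot(v[0], v[1])
    direction = (v[0] / norm, v[1] / norm)
    explained = lam1 / tr  # share of total variance captured by the first axis
    return direction, explained

# Strongly correlated toy data: one component captures almost all the variance
direction, explained = pca_2d([(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)])
```

Here `explained` plays the role of the "explained variance ratio" students will later read off library output.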
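For the clustering sessions (4–5), the sketch below is a bare-bones Lloyd's k-means with a deterministic initialization (the first k points), plus the inertia-versus-k comparison behind elbow-style selection of the number of clusters. All data are invented; a production workflow would use a library implementation and likelihood-based criteria as covered in session 5.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns centroids and inertia
    (the within-cluster sum of squared errors)."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic init for reproducibility
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each non-empty centroid to its cluster mean
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    inertia = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
                  for p in points)
    return centroids, inertia

# Two well-separated blobs, interleaved so the first two points seed both groups
blobs = [(0.0, 0.1), (5.0, 5.1), (0.2, -0.1), (5.2, 4.9), (-0.1, 0.0), (4.9, 5.0)]
inertias = {k: kmeans(blobs, k)[1] for k in (1, 2, 3)}  # the "elbow" appears at k=2
```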
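Session 7's rule-based techniques can be illustrated with a regex tokenizer and a deliberately naive capitalization-based NER rule. This is a toy stand-in for real rule-based toolkits such as NLTK or spaCy, and its brittleness (e.g. on sentence-initial words) is exactly the kind of limitation the session discusses.

```python
import re

# One regex handles words (with optional apostrophes) and punctuation tokens
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens with a single regex."""
    return TOKEN_RE.findall(text)

def find_capitalized_entities(tokens):
    """Toy NER rule: maximal runs of capitalized word tokens."""
    entities, run = [], []
    for tok in tokens:
        if re.fullmatch(r"[A-Z][a-z]+", tok):
            run.append(tok)
        else:
            if run:
                entities.append(" ".join(run))
            run = []
    if run:
        entities.append(" ".join(run))
    return entities

tokens = tokenize("Alan Turing proposed the imitation game in London.")
entities = find_capitalized_entities(tokens)  # ["Alan Turing", "London"]
```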
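Session 8 reviews neural-network basics. A single neuron (perceptron with a sigmoid activation) is just a weighted sum passed through a nonlinearity; the weights below are hand-picked, hypothetical values chosen to approximate an AND gate, not learned by backpropagation.

```python
import math

def neuron(x, w, b):
    """Single neuron: weighted sum plus bias, then a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked weights roughly implementing a soft AND gate
w, b = [4.0, 4.0], -6.0
outputs = {inp: neuron(inp, w, b) for inp in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Only the (1, 1) input pushes the weighted sum above zero, so only that output lands near 1.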
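The core idea of session 10 — semantic similarity as vector geometry — can be shown in miniature with cosine similarity. The 3-d "embeddings" below are invented for illustration (real Word2Vec or BERT vectors have hundreds of learned dimensions), but the geometry is the same: related words point in similar directions.

```python
import math

def cosine(u, v):
    """Cosine similarity: closeness of direction between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented 3-d "embeddings"; dimensions loosely read as (royalty, animacy, fruitiness)
emb = {
    "king":  [0.9, 0.7, 0.0],
    "queen": [0.8, 0.6, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
sim_kq = cosine(emb["king"], emb["queen"])  # high: related words
sim_ka = cosine(emb["king"], emb["apple"])  # low: unrelated words
```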
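Session 11's central mechanism, scaled dot-product self-attention, can be written out with plain lists for a handful of token vectors: each output row is a similarity-weighted mixture of the value vectors. This sketch omits the learned query/key/value projections and multi-head structure of a real transformer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per token, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Three toy token vectors; with Q = K = V, each output row mixes the inputs
# according to their similarity to that row's query
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = attention(X, X, X)
```

Because the weights sum to 1, each output coordinate is a convex combination of the input coordinates.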
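Session 12 introduces RAG (Retrieval-Augmented Generation). The two-step shape — retrieve relevant passages, then augment the prompt with them — can be sketched with word overlap standing in for embedding-based vector search, and with the assembled prompt returned instead of being sent to an LLM. All documents, names, and the prompt template are invented for illustration.

```python
import re

def retrieve(query, docs, k=1):
    """Retrieval step: rank documents by word overlap with the query
    (a toy stand-in for embedding similarity search)."""
    q = set(re.findall(r"\w+", query.lower()))
    return sorted(docs,
                  key=lambda d: len(q & set(re.findall(r"\w+", d.lower()))),
                  reverse=True)[:k]

def build_prompt(query, docs):
    """Augmentation step: stuff retrieved passages into a prompt template.
    A real RAG system would send this prompt to an LLM for generation."""
    context = " ".join(retrieve(query, docs))
    return f"Answer using only this context: {context}\nQuestion: {query}"

docs = [
    "The module carries 3.00 credits.",
    "The final exam accounts for 40 percent of the grade.",
]
prompt = build_prompt("How many credits does the module carry?", docs)
```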
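Session 13 starts from CNNs, whose basic building block is the convolution. In 1-D the operation is easy to see whole (most deep learning libraries actually compute the cross-correlation shown here): sliding a small kernel over a signal, so that an edge-detector kernel lights up exactly where the signal jumps.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D convolution (cross-correlation, as in most DL libraries):
    slide the kernel over the signal and take dot products."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A [-1, 1] kernel responds only where consecutive samples differ
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
edges = conv1d(signal, [-1.0, 1.0])  # nonzero only at the step
```

In 2-D the same idea applied to pixel neighborhoods yields the edge and texture detectors that early CNN layers learn.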

Methodology: 

This course is delivered through a weekly session divided into two parts. The first part focuses on introducing the content descriptively and providing theoretical or conceptual explanations for aspects that require mathematical or computational justification. The second part is practical in nature and is dedicated to exploring the concepts covered through demonstrations or exercises that help students assimilate the material, understand its utility, and identify relevant application scenarios.

Evaluation: 

Continuous assessment for this course follows the structure below: 


| Evaluation Type              | Weight | Content                     | Importance           |
| ---------------------------- | ------ | --------------------------- | -------------------- |
| Attendance and Participation | 20%    | All course content          | Moderately important |
| Individual Assignments       | 40%    | Approximately 8 submissions | Highly important     |
| Final Exam                   | 40%    | Full module content         | Highly important     |


Evaluation criteria apply to all students; those enrolled in the retake session are also required to attend class. Any exceptional circumstances must be communicated to the teaching staff in advance and validated by the academic tutor.

The course will be considered passed when the final grade is equal to or higher than 5 out of 10.

RETAKE POLICY

 

The retake assessment will consist of a comprehensive exam covering all course content.

The maximum grade that can be awarded in the retake session is 6.0 out of 10.

Evaluation Criteria: 

---

Basic Bibliography: 

  • Jurafsky, D., & Martin, J. H. (2022). Speech and Language Processing.
  • Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
  • Vaswani, A., et al. (2017). Attention Is All You Need.
  • Tunstall, L., von Werra, L., & Wolf, T. (2022). Natural Language Processing with Transformers: Building Language Applications with Hugging Face.
  • Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architectures.

Additional Material: 

---