Big data analysis

Description

The goal of Big Data Analysis is to teach you how to use tools that can handle the avalanche of data generated today. This is done through a combination of Python, Hadoop and Spark. By the end of this course, you should be able to process large data files and manipulate data to generate statistics, metrics and graphs.

Type Subject

Third - Compulsory

Semester

First

Course

Credits

6.00

Titular Professors

Didier Grimaldi

Professor

Professors

Didier Grimaldi

Andrew Ashton

Previous Knowledge

Objectives

The learning outcomes of this subject are:

LO1. Use Python to read and transform data into different formats
LO2. Generate basic statistics and metrics using data on disk
LO3. Work with computing tasks distributed over a cluster
LO4. Convert data from various sources into storage or querying formats
LO5. Prepare data for statistical analysis, visualization and machine learning
LO6. Present data in the form of effective visuals

Contents

- Big data
- Cloud and big data
- Distributed systems
- Mass processing
- Mass storage
- Python data analysis tools: NumPy, Pandas, Matplotlib, SciPy
- Hadoop: overview of Hadoop and the Hadoop ecosystem, HDFS architecture (NameNode, DataNode), MapReduce, YARN, HBase and NoSQL databases
- Spark: Spark architecture and core components, Spark programming (with Python), data processing with Spark SQL, Spark Streaming
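
The MapReduce model listed above can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are hypothetical, and they only mimic the map, shuffle and reduce stages that a framework such as Hadoop would distribute across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; on a cluster each line could live on a different node.
    sample = ["big data", "Big cluster"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run in parallel on different machines, and the shuffle step moves data over the network; here everything runs in one process so the data flow is easy to follow.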

Methodology

The subject has two teaching sessions every week. Each session is divided into two parts: in part one, which is predominantly teacher-led, the teacher explains the new content and theory; in part two, the students work on exercises to consolidate what they have learnt. Every two sessions, students are evaluated individually or in groups by means of written tests, individual or group activities, collection of exercises done at home, etc.

The following list relates the learning outcomes to the areas and the content taught to achieve them:
LO1. Use Python to read and transform data into different formats:
- Develop solutions on your own using standard libraries such as NumPy, Pandas, Matplotlib or SciPy.
LO2. Generate basic statistics and metrics using data on disk:
- Retrieve data from disk storage, load it into an appropriate format, and clean and preprocess the data as needed.
- Calculate basic statistics (e.g., mean, median, standard deviation) and relevant metrics (e.g., average, percentage) based on the prepared data, and present the findings.
LO3. Work with computing tasks distributed over a cluster:
- Set up a computing cluster environment, including selecting appropriate hardware, configuring software frameworks and establishing network communication between cluster nodes.
- Develop and execute computing tasks, covering aspects such as data parallelism, task distribution, fault tolerance, and resource management.
LO4. Convert data from various sources into storage or querying formats:
- Identify diverse data sources such as files and databases and implement procedures to extract data from these sources while handling any format-specific challenges.
- Develop conversion processes to transform and standardize data from different sources into a common format or structure, ensuring data quality, consistency, and compatibility for downstream analysis or storage.
LO5. Prepare data for statistical analysis, visualization and machine learning:
- Identify and address missing values, outliers, and inconsistencies in the dataset by applying techniques such as imputation, scaling, and encoding categorical variables to ensure the data is ready for analysis and modeling.
- Create new relevant features and select informative variables, optimizing the dataset for statistical analysis, visualization, and machine learning model training, while preserving the integrity of the data's information.
LO6. Present data in the form of effective visuals:
- Identify the most appropriate types of data visualizations (e.g., bar charts, scatter plots, heatmaps) based on the nature of the data and the insights to be conveyed, taking into consideration factors like data distribution, relationships, and patterns.
- Design and create compelling visualizations by choosing suitable colors, labels, and titles, ensuring clarity, accuracy, and aesthetic appeal, and then integrate these visuals into reports or presentations to effectively communicate data-driven insights to stakeholders.
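
As an illustration of the activities described for LO2 and LO5, the following minimal sketch loads tabular data, fills a missing value by mean imputation, and computes basic statistics. It uses only the Python standard library so that it runs without extra packages; the same steps map onto Pandas. The sample data, the column names (`sensor`, `reading`) and the function names are hypothetical.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a file on disk; one reading is missing.
RAW_CSV = """sensor,reading
a,10
b,
c,14
d,12
"""


def load_readings(text):
    """Parse CSV text and return readings as floats, with None for missing values."""
    rows = csv.DictReader(io.StringIO(text))
    return [float(r["reading"]) if r["reading"].strip() else None for r in rows]


def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def summarize(values):
    """Return basic statistics: mean, median and sample standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


if __name__ == "__main__":
    cleaned = impute_mean(load_readings(RAW_CSV))
    print(summarize(cleaned))
```

Mean imputation is only one of several possible strategies; it keeps the mean unchanged but reduces the variance of the column, which is worth keeping in mind before modeling.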

Evaluation

To evaluate whether students have achieved the objectives of the subject, several evaluation activities are used, approximately once a week.
The following breakdown shows the weight of each activity in the final grade:
Midterm (40%):
- 15%: Individual assignments
- 25%: Group assignments
Final exam (60%):
- 30%: Group assignments
- 30%: Final exam (ordinary call)
Students who do not pass the ordinary call may sit an extraordinary call in July. Students who do not take any of the resit exams will receive a final grade of NP (Not Presented) for the subject in the extraordinary call.
Objectives of the continuous evaluation:
- The main objective is to help students keep up to date with the subject and develop a good working method, so that they assimilate the material progressively and obtain good academic results.
- It also makes it possible to value the work that students do day by day, so that their grade does not depend only on the exams taken during the semesters of the academic year.
- It gives the teacher more information about the work done by students and a better knowledge of them, both academically and personally.
Artificial Intelligence: The use of Artificial Intelligence tools such as ChatGPT is prohibited. Using an AI tool will be considered cheating and will be sanctioned with a grade of 0. Moreover, the professor will inform the academic director, which could be the basis for additional disciplinary measures.

Retake policy: Should you fail the course overall, you will have the opportunity to sit a retake exam, provided that the individual and group assignments have been submitted. The final grade will not exceed 6/10.
The retake grade will then be 40% the retake exam and 60% the continuous assessment obtained during the course.
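
The retake rule above can be read as a small calculation. The sketch below is one interpretation, not an official formula: it assumes that both grades are on a 0 to 10 scale and that the 6/10 cap applies to the weighted result. The function name `retake_grade` and the example grades are hypothetical.

```python
def retake_grade(retake_exam, continuous_assessment, cap=6.0):
    """Weighted retake grade: 40% retake exam, 60% continuous assessment, capped."""
    weighted = 0.4 * retake_exam + 0.6 * continuous_assessment
    # Assumption: the 6/10 ceiling is applied after weighting.
    return min(weighted, cap)


if __name__ == "__main__":
    # Hypothetical grades: 8 in the retake exam, 5 in continuous assessment.
    print(round(retake_grade(8, 5), 2))
```

Under these assumptions, a retake exam grade of 8 and a continuous assessment grade of 5 give a weighted value of 6.2, which the cap reduces to 6.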

Evaluation Criteria

Basic Bibliography

Marin, I., Shukla, A., & VK, S. (2019). Big Data Analysis with Python. Packt Publishing.

Additional Material

Degree in Business Intelligence and Data Analytics

Lead the transformation of companies through the use and analysis of data.
