Big Data Emerging Technologies

Yonsei University via Coursera

Go to Course: https://www.coursera.org/learn/big-data-emerging-technologies

Introduction

### Course Review: Big Data Emerging Technologies on Coursera

#### Overview

In the digital age, the importance of big data has surged to unprecedented heights. The course **"Big Data Emerging Technologies"** on Coursera reflects this surge, guiding students through the big data technologies that power the services we use daily, from Google searches to social media interactions and more. It’s an essential course for anyone interested in understanding how big data systems operate and the technologies they rely upon.

#### Course Structure and Syllabus

The course is structured across six modules, each designed to provide both theoretical knowledge and practical insights into various aspects of big data technology:

1. **Big Data Rankings & Products** This module sets the stage by exploring the landscape of big data hardware, software, and professional services, emphasizing the market shares of major companies like IBM, SAP, and AWS. Students learn how these technologies influence industries, investment strategies, and even governmental organizations. The introduction to the 4 V's of big data (volume, variety, velocity, and veracity) provides a solid foundation for understanding subsequent topics.
2. **Big Data & Hadoop** Here, the course dives into Hadoop, the open-source big data system modeled on Google's MapReduce and distributed file system designs. This module covers core concepts such as MapReduce, HDFS, and the operational roles within a Hadoop cluster. It contrasts Hadoop's processes with traditional database approaches, giving students clarity on how large datasets are managed and analyzed.
3. **Spark** Spark has become a significant player in the big data arena. This module covers Spark’s core functionalities, emphasizing Resilient Distributed Datasets (RDDs) and detailing the architecture that allows for fast and efficient data processing. The course highlights the differences between Spark and Hadoop, particularly how Spark optimizes processing through lazy evaluation and DAG operations.
4. **Spark ML & Streaming** Building on Spark's foundation, this module introduces Spark ML and streaming capabilities. Participants explore machine learning algorithms and data streaming techniques, gaining an understanding of how Spark can be applied to real-time data processing and analysis.
5. **Storm** The focus shifts to another pivotal technology: Storm. This module provides insights into its architecture and the differences between Storm, Spark, and Hadoop. It covers Storm’s data processing model and its applications in real-time analytics.
6. **IBM SPSS Statistics Project** Rounding out the course, this module provides hands-on experience with IBM SPSS Statistics, a leading tool in big data statistical analysis. By working on projects that use SPSS for data processing and visualization, learners can consolidate their theoretical insights through practical application.

#### Why You Should Take This Course

1. **Comprehensive Coverage**: The course content spans the foundational concepts of big data systems to advanced technologies like Spark and Storm, making it suitable for both beginners and those looking to deepen their existing knowledge.
2. **Industry-Relevant Skills**: Given the rapid adoption of big data technologies in various sectors, this course equips learners with skills that are highly sought after in the job market.
3. **Practical Learning**: The module on IBM SPSS allows students to apply their learning in a practical context, enhancing their analytical capabilities with a widely used tool in the industry.
4. **Flexible Learning**: As it is hosted on Coursera, you can learn at your own pace, making it easy to fit into your schedule.
5. **Expert Instruction**: The modules are constructed and delivered by industry experts, ensuring that students gain insights that reflect current trends and technologies.

#### Conclusion

**"Big Data Emerging Technologies"** on Coursera is an outstanding resource for individuals interested in delving into the world of big data. Whether you are a novice looking to break into the field or a professional seeking to enhance your skills, this course offers valuable knowledge and practical experience. I highly recommend this course to anyone eager to leverage big data in their career or enhance their understanding of the emerging technologies that are reshaping our world.

Syllabus

Big Data Rankings & Products

The first module “Big Data Rankings & Products” focuses on the relationships among, and market shares of, big data hardware, software, and professional services. This information provides insight into how future industries, products, services, schools, and government organizations will be influenced by big data technology. For a deeper view into the world’s top big data product lines and service types, the lecture gives an overview of the major big data companies, which include IBM, SAP, Oracle, HPE, Splunk, Dell, Teradata, Microsoft, Cisco, and AWS. To convey the power of big data technology, the lecture explains how big data analysis differs from traditional data analysis. This is followed by a lecture on the four V challenges of big data technology, which concern the volume, variety, velocity, and veracity of massive data. Building on this introduction, the module shows how Wal-Mart, Amazon, and Citibank use big data technology to add global insight to investments, help locate new stores and factories, and run real-time recommendation systems.

Big Data & Hadoop

The second module “Big Data & Hadoop” focuses on the characteristics and operations of Hadoop, an early open-source big data system modeled on the MapReduce and distributed file system designs published by Google. The lectures explain the functionality of MapReduce, HDFS (Hadoop Distributed File System), and the processing of data blocks. These functions are executed on a cluster of nodes that are assigned the role of NameNode or DataNode, where the data processing is conducted by the JobTracker and TaskTrackers, which are explained in the lectures. In addition, the characteristics of metadata types and the differences between the data analysis processes of Hadoop and SQL (Structured Query Language) are explained. Then the Hadoop Release Series is introduced, which includes descriptions of the Hadoop YARN (Yet Another Resource Negotiator), HDFS Federation, and HDFS HA (High Availability) technologies.
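The map, shuffle, and reduce steps behind MapReduce can be illustrated with a minimal single-process sketch. This is plain Python rather than the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, as well as the sample input, are hypothetical and chosen only for illustration. In a real cluster each block would be mapped on a separate node.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one data block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(blocks):
    # In a cluster each block would be mapped on a different node;
    # here the map calls simply run one after another.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))

# Hypothetical input, split into two "blocks" for illustration.
blocks = ["big data big clusters", "data blocks on data nodes"]
counts = word_count(blocks)
print(counts)
```

The sketch keeps the three phases as separate functions so that the grouping step, which Hadoop performs between map and reduce, stays visible.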

Spark

The third module “Spark” focuses on the operations and characteristics of Spark, one of the most widely used big data technologies. The lecture first covers the differences between the data analysis characteristics of Spark and Hadoop, then goes into the features of Spark big data processing based on the RDD (Resilient Distributed Dataset), Spark Core, Spark SQL, Spark Streaming, MLlib (Machine Learning Library), and GraphX core units. The lectures explain the Spark DAG (Directed Acyclic Graph) stages and pipeline processes that are formed from Spark transformations and actions. In particular, the definition and advantages of lazy transformations and DAG operations are described, along with the characteristics of Spark variables and serialization. In addition, Spark cluster operation based on Mesos, Standalone, and YARN is introduced.
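The idea of lazy transformations can be sketched with a toy class. This is not the Spark API: `LazyDataset` is a hypothetical stand-in for an RDD, written in plain Python under the assumption that transformations only record a step and that an action such as `collect` triggers the recorded pipeline.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # the original data
        self._steps = steps     # recorded lineage of pending transformations

    def map(self, fn):
        # Transformation: nothing is computed, a new step is only recorded.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: the recorded steps are pipelined over each element in one pass.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results

calls = []  # records which elements were actually processed

def square(x):
    calls.append(x)
    return x * x

pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
work_before_action = len(calls)  # no transformation has run at this point
result = pipeline.collect()      # the action triggers the whole pipeline
print(work_before_action, result)
```

Because the steps are only recorded until an action runs, the whole chain can be applied to each element in a single pass, which is the kind of pipelining that the DAG stages make possible.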

Spark ML & Streaming

The fourth module “Spark ML & Streaming” focuses on how Spark ML (Machine Learning) works and how Spark streaming operations are conducted. Spark ML includes featurization, pipelines, persistence, and utilities, which operate on RDDs (Resilient Distributed Datasets) to extract information from massive datasets. The lectures explain the characteristics of the DataFrame-based API, which is the primary ML API in the spark.ml package. Spark ML basic statistics algorithms based on correlation and hypothesis testing (p-value) are introduced first, followed by the Spark ML classification and regression algorithms based on linear models, naive Bayes, and decision tree techniques. Then the characteristics of Spark streaming, streaming input and output, and streaming receiver types (basic, custom, and advanced) are explained, followed by how the Spark Streaming process and the DStream (Discretized Stream) enable big data streaming operations for real-time and near-real-time applications.
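A DStream can be thought of as a sequence of small batches, one per time interval. The following sketch shows that discretization step in plain Python. It does not use the Spark Streaming API: the `discretize` function, the one-second interval, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict

def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_seconds)
        batches[batch_index].append(value)
    # Return batches in time order; empty intervals are skipped in this sketch.
    return [batches[i] for i in sorted(batches)]

# Hypothetical sensor readings: (seconds since start, value).
events = [(0.2, 10), (0.9, 12), (1.1, 7), (2.5, 3), (2.7, 5)]
batches = discretize(events, batch_seconds=1)
batch_sums = [sum(batch) for batch in batches]
print(batches, batch_sums)
```

Each batch can then be processed with ordinary batch operations, which is why this model suits near-real-time rather than per-event processing.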

Storm

The fifth module “Storm” focuses on the characteristics and operations of Storm big data systems. The lecture first covers the differences between the data analysis characteristics of Storm, Spark, and Hadoop. Then the features of Storm big data processing based on Nimbus, spouts, and bolts are described, followed by details of Storm streams, the supervisor, and ZooKeeper. Further details on reliable and unreliable spouts and bolts are provided, followed by the advantages of the Storm DAG (Directed Acyclic Graph) and data stream queue management. In addition, the advantages of Storm-based fast real-time applications are introduced, including real-time analytics, online ML (Machine Learning), continuous computation, DRPC (Distributed Remote Procedure Call), and ETL (Extract, Transform, Load).
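The spout and bolt roles can be sketched with Python generators. This is not the Storm API: `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names, and the whole "topology" runs in a single process, whereas Storm distributes these components across worker nodes.

```python
def sentence_spout():
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count per word and emits the updated count."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
emitted = list(topology)
final_counts = dict(emitted)  # later tuples overwrite earlier counts
print(final_counts)
```

Unlike the micro-batch model, each tuple flows through the bolts as soon as it is emitted, which reflects the per-event processing style that Storm is known for.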

IBM SPSS Statistics Project

The sixth and last module “IBM SPSS Statistics Project” focuses on providing experience with one of the most widely used big data statistical analysis systems in the world. The lecture starts with how to set up and use IBM SPSS Statistics, and goes on to describe how IBM SPSS Statistics can be used to gain corporate data analysis experience. Then two projects are carried out using IBM SPSS Statistics, and their statistical results are examined. The projects are designed so that the student can discover new ways to use, analyze, and chart the relationships between datasets, and also compare the statistical results using IBM SPSS Statistics.

Overview

Every time you use Google to search for something, every time you use Facebook, Twitter, Instagram, or any other SNS (Social Network Service), and every time you buy from a recommended list of products on Amazon.com, you are using a big data system. In addition, big data technology supports your smartphone, smartwatch, Alexa, Siri, and automobile (if it is a newer model) every day. The top companies in the world are currently using big data technology, and every company is in need of advanced big data

Reviews

I have learned so much about the Big Data technologies in this course. It is a very useful course.

A lot to learn and remember but a great course overall!

This course is really good. I learned a lot of new things.

This course was really amazing, and I am happy to have earned the certificate. I hope to learn more going forward.

Good course. I gained a lot of knowledge about how data is processed online.