Description


In the world of Hadoop and "Big Data", hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this course, you'll not only understand what those systems are and how they fit together, but you'll also go hands-on and learn how to use them to solve real business problems.

This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It's filled with hands-on activities and exercises, so you get some real experience in using Hadoop - it's not just theory.

At the end of this course, you can expect to leave with a real, deep understanding of Hadoop and its associated distributed systems, and to be able to apply Hadoop to real-world problems.

Course Outline


The course is organized as follows:


  • Introduction
  • Hadoop Overview
  • Overview of the Hadoop Ecosystem
  • Tips and Tricks

2. Using Hadoop's Core: HDFS and MapReduce

  • HDFS Overview
  • Install the MovieLens dataset into HDFS using the Ambari UI
  • Install the MovieLens dataset into HDFS using the command line
  • MapReduce Overview
  • MapReduce distributes processing
  • MapReduce example: Break down movie ratings by rating score
  • Installing Python, MRJob, and nano
  • Coding up the rating histogram MapReduce job
  • Rank movies by their popularity
  • Check your results against mine

3. Programming Hadoop with Pig

  • Introducing Ambari
  • Introducing Pig
  • Find the oldest movie with a 5-star rating using Pig
  • Find old 5-star movies with Pig
  • More Pig Latin
  • Find the most-rated one-star movie
  • Pig Challenge: Compare Your Results to Mine

4. Programming Hadoop with Spark

  • Spark Overview
  • The Resilient Distributed Dataset (RDD)
  • Find the movie with the lowest average rating - with RDDs
  • Datasets and Spark 2.0
  • Find the movie with the lowest average rating
  • Movie recommendations with MLlib
  • Filter the lowest-rated movies by number of ratings
  • Check your results against mine

5. Using relational data stores with Hadoop

  • What is Hive?
  • Use Hive to find the most popular movie
  • How Hive works
  • Use Hive to find the movie with the highest average rating
  • Compare Solutions
  • Integrating MySQL with Hadoop
  • Install MySQL and import our movie data
  • Use Sqoop to import data from MySQL to HDFS/Hive
  • Use Sqoop to export data from Hadoop to MySQL

6. Using non-relational data stores with Hadoop

  • Why NoSQL?
  • What is HBase?
  • Import movie ratings into HBase
  • Use HBase with Pig to import data at scale
  • Cassandra overview
  • Installing Cassandra
  • Write Spark output into Cassandra
  • MongoDB overview
  • Install MongoDB, and integrate Spark with MongoDB
  • Using the MongoDB shell
  • Choosing a database technology
  • Choose a database for a given problem

7. Querying your Data Interactively

  • Overview of Drill
  • Setting up Drill
  • Querying across multiple databases with Drill
  • Overview of Phoenix
  • Install Phoenix and query HBase with it
  • Integrate Phoenix with Pig
  • Overview of Presto
  • Install Presto and query Hive with it
  • Query both Cassandra and Hive using Presto

8. Managing your Cluster

  • YARN explained
  • Tez explained
  • Use Hive on Tez and measure the performance benefit
  • Mesos explained
  • ZooKeeper explained
  • Simulating a failing master with ZooKeeper
  • Oozie explained
  • Set up a simple Oozie workflow
  • Zeppelin overview
  • Use Zeppelin to analyze movie ratings: Part 1
  • Use Zeppelin to analyze movie ratings: Part 2
  • Hue overview
  • Other technologies worth mentioning

9. Feeding Data to your Cluster

  • Kafka explained
  • Setting up Kafka and publishing some data
  • Publishing weblogs with Kafka
  • Flume explained
  • Set up Flume and publish logs with it
  • Set up Flume to monitor a directory and store its data in HDFS

10. Analyzing Streams of Data

  • Spark Streaming: Introduction
  • Analyze weblogs published with Flume using Spark Streaming
  • Monitor Flume-published logs for errors in real time
  • Exercise solution: Aggregating HTTP access codes with Spark Streaming
  • Apache Storm: Introduction
  • Count words with Storm
  • Flink: An Overview
  • Counting words with Flink

11. Designing Real-World Systems

  • The Best of the Rest
  • Review: How the pieces fit together
  • Understanding your requirements
  • Sample application: consume web server logs and keep track of top-sellers
  • Sample application: serving movie recommendations to a website
  • Design a system to report web sessions per day
  • Exercise solution: Design a system to count daily sessions


  • Books and online resources
  • Bonus lecture: Discounts on my other big data/data science courses!


Course Takeaways


  • Design distributed systems that manage "big data" using Hadoop and related technologies.
  • Use HDFS and MapReduce for storing and analyzing data at scale.
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
  • Analyze relational data using Hive and MySQL.
  • Analyze non-relational data using HBase, Cassandra, and MongoDB.
  • Query data interactively with Drill, Phoenix, and Presto.
  • Choose an appropriate data storage technology for your application.
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, ZooKeeper, Zeppelin, Hue, and Oozie.
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume.
  • Consume streaming data using Spark Streaming, Flink, and Storm.

Additional Benefits

  • Learn anything, anytime, anywhere.
  • A dedicated WIISE Learning Buddy will help you achieve your personal and professional goals.

Terms & Conditions

  • WIISE courses are offered in monthly subscription packages.
  • You may take more than one course at a time.
  • The WIISE library has 1000+ short skill-based courses taught by world-class instructors. Please visit for complete payment details.
  • Please visit for the complete terms and conditions.