The course outline is as follows:
1.Introduction
- Introduction
- Hadoop Overview
- Overview of the Hadoop Ecosystem
- Tips and Tricks
2.Using Hadoop's Core: HDFS and MapReduce
- HDFS Overview
- Install the MovieLens dataset into HDFS using the Ambari UI
- Install the MovieLens dataset into HDFS using the command line
- MapReduce Overview
- MapReduce distributes processing
- MapReduce example: Break down movie ratings by rating score
- Installing Python, MRJob, and nano
- Coding up the rating histogram MapReduce job
- Rank movies by their popularity
- Check your results against mine
3.Programming Hadoop with Pig
- Introducing Ambari
- Introducing Pig
- Find the oldest movie with a 5-star rating using Pig
- Find old 5-star movies with Pig
- More Pig Latin
- Find the most-rated one-star movie
- Pig Challenge: Compare Your Results to Mine
4.Programming Hadoop with Spark
- Spark Overview
- The Resilient Distributed Dataset (RDD)
- Find the movie with the lowest average rating - with RDDs
- Datasets and Spark 2.0
- Find the movie with the lowest average rating
- Movie recommendations with MLlib
- Filter the lowest-rated movies by number of ratings
- Check your results against mine
5.Usage of relational data stores with Hadoop
- What is Hive?
- Use Hive to find the most popular movie
- How Hive works
- Use Hive to find the movie with the highest average rating
- Compare Solutions
- Integrating MySQL with Hadoop
- Install MySQL and import our movie data
- Use Sqoop to import data from MySQL to HDFS/Hive
- Use Sqoop to export data from Hadoop to MySQL
6.Usage of non-relational data stores with Hadoop
- Why NoSQL?
- What is HBase?
- Import movie ratings into HBase
- Use HBase with Pig to import data at scale
- Cassandra overview
- Installing Cassandra
- Write Spark output into Cassandra
- MongoDB overview
- Install MongoDB, and integrate Spark with MongoDB
- Using the MongoDB shell
- Choosing a database technology
- Choose a database for a given problem
7.Querying your Data Interactively
- Overview of Drill
- Setting up Drill
- Querying across multiple databases with Drill
- Overview of Phoenix
- Install Phoenix and query HBase with it
- Integrate Phoenix with Pig
- Overview of Presto
- Install Presto and query Hive with it
- Query both Cassandra and Hive using Presto
8.Managing your Cluster
- YARN explained
- Tez explained
- Use Hive on Tez and measure the performance benefit
- Mesos explained
- ZooKeeper explained
- Simulating a failing master with ZooKeeper
- Oozie explained
- Set up a simple Oozie workflow
- Zeppelin overview
- Use Zeppelin to analyze movie ratings: Part 1
- Use Zeppelin to analyze movie ratings: Part 2
- Hue overview
- Other technologies worth mentioning
9.Feeding Data to your Cluster
- Kafka explained
- Setting up Kafka and publishing some data
- Publishing weblogs with Kafka
- Flume explained
- Set up Flume and publish logs with it
- Set up Flume to monitor a directory and store its data in HDFS
10.Analyzing Streams of Data
- Spark Streaming: Introduction
- Analyze weblogs published with Flume using Spark Streaming
- Monitor Flume-published logs for errors in real time
- Exercise solution: Aggregating HTTP access codes with Spark Streaming
- Apache Storm: Introduction
- Count words with Storm
- Flink: An Overview
- Counting words with Flink
11.Designing Real-World Systems
- The Best of the Rest
- Review: How the pieces fit together
- Understanding your requirements
- Sample application: consume web server logs and keep track of top-sellers
- Sample application: serving movie recommendations to a website
- Design a system to report web sessions per day
- Exercise solution: Design a system to count daily sessions
12.BONUS
- Books and online resources
- Bonus lecture: Discounts on my other big data/data science courses!