CVF 2071 - Hadoop Administration

Credits: 3
Hours/Week: Lecture 2, Lab 1
Course Description: This course builds on topics in CVF 1071, Introduction to Big Data Analytics and Security. It provides students with a comprehensive introduction to the steps necessary to install, configure, operate, and maintain Hadoop. The course begins with an overview of the Big Data landscape and then dives into a system administrator’s working view of running Hadoop. Students will also have the opportunity to install Splunk on top of Hadoop and examine how to process and analyze the data using Splunk’s Search Processing Language (SPL) as an implementation of the MapReduce function. This course employs both “open source technology” (Hadoop) and “commercial technology” (Splunk).
MnTC Goals
None

Prerequisite(s): CVF 1071 and CVF 1205 or CSCI 1060 with grades of C or higher, or instructor consent.
Corequisite(s): None
Recommendation: None

Major Content

1. Introduction to Hadoop

History of Hadoop
Core Components of Hadoop
Fundamental Concepts of Hadoop

2. Planning a Hadoop Cluster

Basic Planning Considerations
Choosing Hardware
Network Considerations
Configuring Nodes
Planning for Cluster Management

3. Hadoop Distributed File System

HDFS Features
Reading and Writing Files
NameNode Considerations
HDFS Security
NameNode Web User Interface
Hadoop File Shell

4. Getting Data into HDFS

Pulling Data from External Sources with Flume
Using Sqoop to Import Data from Relational Databases
Best Practices
REST Interfaces

5. MapReduce

Architectural Overview
MapReduce Overview
Features of MapReduce
YARN MapReduce Version 2
Failure Recovery
The JobTracker Web User Interface

6. Installation, Initialization, and Configuration of Hadoop

Configuration and Deployment Types
Installing Hadoop
Specifying the Hadoop Configuration
Initial HDFS and MapReduce Configuration
Log Files

7. Installing/Configuring

Hive
Impala
Pig

8. Hadoop Clients

What Is a Hadoop Client?
Installing and Configuring Hadoop Clients
Installation and Configuration of Hue
Authentication and Configuration of Hue

9. Hadoop Advanced Cluster Configuration

Advanced Configuration Parameters
Configuring Hadoop Ports
Configuring HDFS for Rack Awareness & HDFS High Availability
Explicitly Including and Excluding Hosts

10. Hadoop Security

Importance of Hadoop Security
Hadoop’s Security System Concepts
What Kerberos Is and How it Works
Using Kerberos to Secure a Hadoop Cluster

11. Scheduling and Managing Jobs

Scheduling Hadoop Jobs
Managing Running Jobs
Configuring the FairScheduler

12. Cluster Maintenance

Checking HDFS Status
Copying Data Between Clusters
Removing/Adding Cluster Nodes
Rebalancing the Cluster
NameNode Metadata Backup
Cluster Upgrades

13. Monitoring and Troubleshooting the Cluster

General System Monitoring
Cluster Monitoring
Managing Hadoop’s Log Files
Common Troubleshooting Issues

Learning Outcomes
At the end of this course, students will be able to:

describe the history of Hadoop.
describe the fundamental concepts of using Big Data.
identify where Hadoop fits into a Big Data strategy.
design a plan to create a Hadoop cluster.
explain HDFS features and the NameNode.
demonstrate how to get data into HDFS.
explain how to work with MapReduce.
implement installation and configuration of Hadoop.
install and configure Hadoop clients.
configure HDFS for Rack Awareness and HDFS High Availability.
administer cluster maintenance.
schedule Hadoop jobs.
describe Hadoop cluster maintenance.
monitor and troubleshoot a Hadoop cluster.
identify common integration points.
explain Hadoop Security.

Competency 1 (1-6)
None
Competency 2 (7-10)
None
