
Hadoop For Administrators

Plan, deploy, secure, monitor, and optimize Hadoop clusters.


Audience: Hadoop administrators

Duration: Three or four days

Format: Lectures and hands-on labs; approximately 60% lectures, 40% labs.

Overview

Apache Hadoop is the most popular framework for processing Big Data on clusters of servers. In this course, attendees will learn about the business benefits and use cases for Hadoop and its ecosystem, how to plan cluster deployment and growth, and how to install, maintain, monitor, troubleshoot, and optimize Hadoop.

Objective

By the end of the course, attendees will be able to plan, deploy, secure, monitor, and optimize Hadoop clusters.

What You Will Learn

  • Hadoop & Big Data
  • Installing Hadoop
  • Managing and Monitoring Hadoop
  • Loading data into HDFS
  • Managing the Hadoop ecosystem
  • Securing Hadoop

Course Details


Prerequisites:

  • Comfortable with basic Linux system administration
  • Basic scripting skills

Setup:

  • Zero Install Hadoop cluster
  • SSH client
  • Firefox browser with FoxyProxy (a sample proxy tunnel follows)
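
The SSH client and FoxyProxy are typically combined into a SOCKS tunnel so the cluster's web UIs can be reached from your browser. A minimal sketch; the gateway hostname, username, and local port are placeholders for your lab environment:

    # Open a SOCKS proxy on local port 8157 via the lab gateway (placeholders)
    ssh -N -D 8157 student@gateway.example.com

    # Point FoxyProxy at SOCKS host localhost, port 8157, then browse to the
    # NameNode and ResourceManager web UIs by their internal hostnames.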

Detailed Outline

Introduction to Hadoop

  • Hadoop history, concepts
  • Ecosystem
  • Distributions
  • High-level architecture
  • Hadoop myths
  • Hadoop challenges (hardware/software)
  • Labs: discuss your Big Data projects and problems
Planning and Installation

  • Selecting software and Hadoop distributions
  • Sizing the cluster, planning for growth
  • Selecting hardware and network
  • Rack topology
  • Installation
  • Multi-tenancy
  • Directory structure, logs
  • Benchmarking
  • Labs: cluster install, run performance benchmarks (sample run below)
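
For flavor, a minimal benchmarking sketch using the TeraGen/TeraSort examples that ship with Hadoop; the jar path varies by distribution, so treat it as a placeholder:

    # Generate ten million 100-byte rows (about 1 GB), then sort them
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        teragen 10000000 /bench/teragen
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        terasort /bench/teragen /bench/terasort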
HDFS Operations

  • Concepts (horizontal scaling, replication, data locality, rack awareness)
  • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
  • Health monitoring
  • Command-line and browser-based administration
  • Adding storage, replacing defective drives
  • Labs: getting familiar with HDFS command lines (sample commands below)
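
A few of the commands the HDFS lab exercises, shown as a sketch with placeholder paths:

    hdfs dfs -mkdir -p /user/student/data                 # create a directory
    hdfs dfs -put sample.txt /user/student/data           # copy a local file in
    hdfs dfs -ls /user/student/data                       # list it
    hdfs dfs -setrep -w 2 /user/student/data/sample.txt   # change replication
    hdfs dfsadmin -report        # capacity and live/dead DataNodes (admin)
    hdfs fsck / -files -blocks   # block-level health check (admin)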
Data Ingestion

  • Flume for logs and other data ingestion into HDFS
  • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
  • Hadoop data warehousing with Hive
  • Copying data between clusters (distcp)
  • Using S3 as a complement to HDFS
  • Data ingestion best practices and architectures
  • Labs: setting up and using Flume; the same for Sqoop (sample import below)
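
As a taste of the Sqoop lab, a hedged import sketch; the JDBC URL, credentials, table, and NameNode URIs are placeholders:

    # Import a SQL table into HDFS with four parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbhost.example.com/sales \
        --username etl --password-file /user/etl/.dbpass \
        --table orders \
        --target-dir /data/sales/orders \
        --num-mappers 4

    # Copy a directory between clusters with distcp
    hadoop distcp hdfs://nn-a:8020/data/sales hdfs://nn-b:8020/backup/sales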
MapReduce Operations

  • Parallel computing before MapReduce: comparing HPC and Hadoop administration
  • MapReduce cluster loads
  • Nodes and daemons (JobTracker, TaskTracker)
  • MapReduce UI walkthrough
  • MapReduce configuration
  • Job configuration
  • Optimizing MapReduce
  • Fool-proofing MR: what to tell your programmers
  • Labs: running MapReduce examples (sample run below)
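
The MapReduce lab typically starts from the examples jar bundled with Hadoop; a sketch with placeholder paths, shown against a recent release:

    # Run the classic word-count example
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /user/student/books /user/student/wc-out
    mapred job -list                                        # currently running jobs
    hdfs dfs -cat /user/student/wc-out/part-r-00000 | head  # inspect the output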
YARN

  • YARN design goals and implementation architecture
  • New actors: ResourceManager, NodeManager, ApplicationMaster
  • Installing YARN
  • Job scheduling under YARN
  • Labs: investigate job scheduling (sample commands below)
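
A few YARN commands relevant to the scheduling lab; the application ID is a placeholder:

    yarn node -list           # NodeManagers and their status
    yarn application -list    # submitted and running applications
    yarn logs -applicationId application_1700000000000_0001   # aggregated logs

    # The pluggable scheduler is selected in yarn-site.xml via the
    # yarn.resourcemanager.scheduler.class property (e.g. the CapacityScheduler).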
Advanced Administration

  • Hardware monitoring
  • Cluster monitoring
  • Adding and removing servers, upgrading Hadoop
  • Backup, recovery, and business continuity planning
  • Oozie job workflows
  • Hadoop high availability (HA)
  • HDFS Federation
  • Securing your cluster with Kerberos
  • Labs: set up monitoring (sample HA and Kerberos commands below)
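
For flavor, a hedged sketch of the HA and Kerberos material; the NameNode service IDs, keytab path, and principal are placeholders from a hypothetical hdfs-site.xml and KDC setup:

    hdfs haadmin -getServiceState nn1   # is nn1 Active or Standby?
    hdfs haadmin -failover nn1 nn2      # manually fail over to nn2

    # On a Kerberized cluster, obtain a ticket before talking to HDFS
    kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/node1.example.com@EXAMPLE.COM
    klist                               # verify the ticket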
Cluster Management Tools

  • Cloudera Manager for cluster administration, monitoring, and routine tasks
  • Ambari for cluster administration, monitoring, and routine tasks (sample API call below)
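
Both tools expose REST APIs alongside their web UIs; as a small illustration, a hedged Ambari call (host and credentials are placeholders):

    # List the clusters managed by an Ambari server
    curl -u admin:admin http://ambari.example.com:8080/api/v1/clusters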

Ready to Get Started?

Contact us to learn more about this course and schedule your training.