Course content

Overview:

Apache Hadoop, the open source data management software that helps organizations analyze huge volumes of structured and unstructured data, is a very hot topic across the tech industry. Through technical sessions and hands-on labs, participants can quickly learn to take advantage of the MapReduce framework.

Training Objectives of Hadoop:

The Hadoop course will cover the basic concepts of MapReduce applications developed using Hadoop, including a close look at framework components, the use of Hadoop for a variety of data analysis tasks, and numerous examples of Hadoop in action. The course will further examine related technologies such as Hive, Pig, and Apache Accumulo.

Target Students / Prerequisites:

Students should have an IT background and be familiar with basic concepts of Java and Linux.

Introduction: The Motivation for Hadoop
  • Problems with traditional large-scale systems
  • Requirements for a new approach
Hadoop Basic Concepts
  • An Overview of Hadoop
  • The Hadoop Distributed File System
  • Hands on Exercise
  • How MapReduce Works
  • Hands on Exercise
  • Anatomy of a Hadoop Cluster
  • Other Hadoop Ecosystem Components
Writing a MapReduce Program
  • Examining a Sample MapReduce Program
  • With several examples
  • Basic API Concepts
  • The Driver Code
  • The Mapper
  • The Reducer
  • Hadoop’s Streaming API
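To make the Driver, Mapper and Reducer topics above concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class names and the input/output paths passed on the command line are illustrative, not part of the course material.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper, reducer and input/output paths together.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reduces intermediate data
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner, as above, is only safe because summing is associative and commutative.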
Delving Deeper Into The Hadoop API
  • More About ToolRunner
  • Testing with MRUnit
  • Reducing Intermediate Data With Combiners
  • The configure and close methods for Map/Reduce Setup and Teardown
  • Writing Partitioners for Better Load Balancing
  • Hands-On Exercise
  • Directly Accessing HDFS
  • Using the Distributed Cache
  • Hands-On Exercise
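As a companion to the ToolRunner and Distributed Cache topics above, here is a minimal driver sketch; it reuses the mapper and reducer from the earlier word-count sketch, and the cached file path is an illustrative assumption.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D options parsed by ToolRunner.
    Job job = Job.getInstance(getConf(), "word count via ToolRunner");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Distributed cache: ship a small read-only file to every task (illustrative path).
    job.addCacheFile(new URI("/cache/stopwords.txt"));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
    System.exit(exitCode);
  }
}
```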
Performing Several Hadoop Jobs
  • The configure and close Methods
  • Sequence Files
  • Record Reader
  • Record Writer
  • Role of Reporter
  • Output Collector
  • Processing video files and audio files
  • Processing image files
  • Processing XML files
  • Counters
  • Directly Accessing HDFS
  • ToolRunner
  • Using The Distributed Cache
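To illustrate the Counters topic above, here is a minimal mapper sketch that counts malformed records instead of failing the job; the counter group and name, and the comma-separated record layout, are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 2) {
      // Counted records show up in the job's counter report when the job finishes.
      context.getCounter("DataQuality", "MALFORMED_RECORDS").increment(1);
      return;
    }
    context.write(new Text(fields[0]), ONE);
  }
}
```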
Common MapReduce Algorithms
  • Sorting and Searching
  • Indexing
  • Classification/Machine Learning
  • Term Frequency – Inverse Document Frequency
  • Word Co-Occurrence
  • Hands-On Exercise: Creating an Inverted Index
  • Identity Mapper
  • Identity Reducer
  • Exploring well known problems using MapReduce applications
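For the inverted-index exercise listed above, here is a minimal sketch of a mapper and reducer; it assumes plain-text input files and uses the file name from the input split as the document id. Class names are illustrative.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Mapper: emits (word, filename) for every token in the line.
  public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // The source document name comes from the input split.
      String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
      docId.set(fileName);
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken().toLowerCase());
        context.write(word, docId);
      }
    }
  }

  // Reducer: collects the distinct document names seen for each word.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Set<String> docs = new HashSet<>();
      for (Text doc : values) {
        docs.add(doc.toString());
      }
      context.write(key, new Text(String.join(",", docs)));
    }
  }
}
```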
Using HBase
  • What is HBase?
  • HBase API
  • Managing large data sets with HBase
  • Using HBase in Hadoop applications
  • Hands-on Exercise
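To give a feel for the HBase API topic above, here is a minimal Java client sketch that writes and reads a single cell; the table name "users", column family "info" and qualifier "name" are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "user1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Read it back.
      Get get = new Get(Bytes.toBytes("user1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
    }
  }
}
```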
COURSE DETAILS IN A NUTSHELL

1.    In-depth explanation of the concepts of the HDFS & MapReduce frameworks

2.    What is Hadoop 2.X Architecture & How to set up Hadoop Cluster

3.    How to write complex MapReduce Programs

4.    In-depth explanation of how to load data using tools like Sqoop, Flume & Solr

5.    How to perform data analysis using tools like PIG, HIVE & YARN

6.    How to implement & integrate HBASE & MapReduce

7.    How to execute Advanced Usage and Indexing

8.    How to schedule jobs using Oozie

9.    What are the best practices for overall Hadoop development

10. RTAs on Data Analytics

11. What is Spark, a brief look at its ecosystem & how to work on RDDs using Spark

Programming languages: Java & Scala

Frameworks: Hadoop Distributed File System (HDFS), MapReduce & Spark

Loading Tools: Sqoop & Flume

Analytical Tools: Pig, Hive and YARN

Scheduling Tools: Oozie

 

CURRICULUM for HADOOP 2.X

Each module below lists its serial number, concept, syllabus objectives, topics and RTAs.

1. Understanding Big Data and Hadoop

The syllabus for this lecture would brief about:

1.      Big Data

2.      Big Data problems & solutions, their limitations

3.      Hadoop’s solutions that handle the Big Data issue

4.      Common Hadoop Ecosystem and its Architecture

5.      Introduction to HDFS

6.      What a file is and how to write & read one

7.      Brief on MapReduce Framework and its working style.

1. Big Data, Limitations and Solutions of existing Data Analytics Architecture

2. Hadoop

3. Hadoop Features

4. Hadoop Ecosystem

5. Hadoop 2.x core components

6. Hadoop Storage: HDFS

7. Hadoop Processing: MapReduce Framework

8. Hadoop Different Distributions

 
2. Hadoop Requirements

The syllabus for this lecture would brief about:

Prerequisites to learn Hadoop

1. Linux commands

·         30 essential basic Linux commands

2. VMware

·         Basics

·         Installations

·         Backups

3. SQL basics

·         Introduction to SQL

·         MySQL Essentials

·         Database Fundamentals

4. Hands on exercise and Assignments

 
3. Hadoop Architecture and HDFS

The syllabus for this lecture would brief about:

1.      What is Hadoop Cluster Architecture

2.      What are the important Configuring files in a Hadoop Cluster

3.      What are the various Data loading techniques

4.      What are Single node and Multi nodes and their setups

1. Hadoop 2.x Cluster Architecture

2. Federation and High Availability

3. A Typical Production Hadoop Cluster

4. Hadoop Cluster Modes

5. Common Hadoop Shell Commands

6. Hadoop 2.x Configuration Files

7. Single node cluster and Multi node cluster set up, Hadoop Administration

8. Hands on exercise and Assignments

 
4. Hadoop MapReduce Framework

The syllabus for this lecture would brief about:

1.      In-depth analysis on Hadoop MapReduce Framework

2.      How MapReduce works on data stored in HDFS.

3.      What are Splits, Combiner & Partitioner.

4.      How to work on MapReduce using different data sets

1. MapReduce Use Cases

2. Traditional way Vs MapReduce way

3. Why MapReduce

4. Hadoop 2.x MapReduce Architecture

5. Hadoop 2.x MapReduce Components

6. YARN MR Application Execution Flow

7. YARN Workflow

8. Anatomy of MapReduce Program

9. Demo on MapReduce

10. Input Splits

11. Relation between Input Splits and HDFS Blocks

12. MapReduce Combiner & Partitioner

13. Hands on exercise and Assignments
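To make the Combiner & Partitioner topic above concrete, here is a minimal custom Partitioner sketch; the class name and the routing rule (first letter of the key) are illustrative. It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class) together with an appropriate number of reduce tasks.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Spread keys across reducers by their first character, keeping the result non-negative.
    char first = Character.toLowerCase(key.toString().charAt(0));
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}
```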

 
5. Pig

The syllabus for this lecture would brief about:

1.      What is PIG, its types of use & a demo case

2.      How to couple PIG with MapReduce

3.      What is PIG Latin scripting

4.      What are the PIG running modes, PIG UDFs, Pig Streaming & Testing PIG Scripts

1. About Pig

2. MapReduce Vs Pig

3. Pig Use Cases

4. Programming Structure in Pig

5. Pig Running Modes

6. Pig Components

7. Pig Execution

8. Pig Latin Program

9. Data Models in Pig

10. Pig Data Types

11. Shell and Utility Commands

12. Pig Latin Relational Operators

13. File Loaders

14. Group Operator

15. COGROUP Operator

16. Joins and COGROUP

17. Union

18. Diagnostic Operators

19. Specialized joins in Pig

20. Hands on exercise and Assignments
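As one way to make the Pig Latin topics above concrete, here is a minimal sketch that runs a small Pig Latin script from Java through the PigServer API in local mode; the input file, field layout and filter are illustrative assumptions, not part of the course material.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
  public static void main(String[] args) throws Exception {
    // LOCAL mode for the sketch; MAPREDUCE mode would run the script on the cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Load a comma-separated file, keep rows with amount > 100, group by product and sum.
    pig.registerQuery("sales = LOAD 'sales.csv' USING PigStorage(',') "
        + "AS (product:chararray, amount:int);");
    pig.registerQuery("big = FILTER sales BY amount > 100;");
    pig.registerQuery("by_product = GROUP big BY product;");
    pig.registerQuery("totals = FOREACH by_product GENERATE group, SUM(big.amount);");

    // Materialize the result; this is when the underlying MapReduce (or local) job runs.
    pig.store("totals", "totals_out");
  }
}
```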


 
6. Hive

The syllabus for this lecture would brief about:

1.      What are the HIVE concepts

2.      What are the HIVE data types

3.      What are Loading & Querying in HIVE

4.      How to run HIVE scripts

5.      What are Hive UDFs

1. Hive Background

2. Hive Use Case

3. About Hive

4. Hive Vs Pig

5. Hive Architecture and Components

6. Metastore in Hive

7. Limitations of Hive

8. Comparison with Traditional Database

9. Hive Data Types and Data Models

10. Partitions and Buckets

11. Hive Tables (Managed Tables and External Tables)

12. Importing Data

13. Querying Data

14. Managing Outputs

15. Hive Script

16. Hive UDF

17. Retail use case in Hive

18. Hands on exercise and Assignments
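To illustrate the querying topics above, here is a minimal sketch of issuing HiveQL from Java over JDBC against HiveServer2; the connection URL, credentials, "sales" table and query are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Run a simple aggregation on an assumed "sales" table.
      ResultSet rs = stmt.executeQuery(
          "SELECT product, SUM(amount) AS total FROM sales GROUP BY product");
      while (rs.next()) {
        System.out.println(rs.getString("product") + " -> " + rs.getDouble("total"));
      }
    }
  }
}
```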

 
7. Advanced Hive and HBase

The syllabus for this lecture would brief about:

1.      What are Advanced HIVE concepts

2.      What are UDF, Dynamic Partitioning, HIVE indexes & Views

3.      What are Optimizations in HIVE

4.      In-depth analysis on HBase, its Architecture, components and its running modes

1. Hive QL: Joining Tables

2. Dynamic Partitioning

3. Custom Map/Reduce Scripts

4. Hive Indexes and Views

5. Hive Query Optimizers

6. User Defined Functions

7. HBase: Introduction to NoSQL Databases and HBase

8. HBase v/s RDBMS

9. HBase Components

10. HBase Architecture

11. Run Modes & Configuration

12. HBase Cluster Deployment

13. Hands on exercise and Assignments

 
8. Advanced HBase

The syllabus for this lecture would brief about:

1.      What are Advanced HBase Concepts

2.      How to perform bulk loading

3.      What are filters

4.      What is ZooKeeper and how it helps in cluster monitoring

5.      Why HBase utilizes Zookeeper

1. HBase Data Model

2. HBase Shell

3. HBase Client API

4. Data Loading Techniques

5. ZooKeeper

6. Demos on Bulk Loading

7. Getting and Inserting Data

8. Filters in HBase

9. Hands on exercise and Assignments
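For the "Filters in HBase" topic above, here is a minimal scan sketch that restricts results to row keys beginning with a given prefix; the table name and prefix are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFilterScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Only rows whose key starts with "user" are returned by the scanner.
      Scan scan = new Scan();
      scan.setFilter(new PrefixFilter(Bytes.toBytes("user")));

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}
```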

 
9. Sqoop

The syllabus for this lecture would brief about:

1. Import data from other databases to HDFS

2. Import data from other databases to Hive

3. Export data from Hadoop to other databases

1. Introduction

2. Import Data

3. Export Data

4. Sqoop Syntax

5. Databases connection

6. Hands on exercise and Assignments

 
10. Impala

The syllabus for this lecture would brief about:

Impala

1. Introduction to Impala

2. Impala Configuration

3. Comparison between Hive and Impala

4. Impala Commands

5. Hands on exercise and Assignments

 
11. Processing Distributed Data with Apache Spark

The syllabus for this lecture would brief about:

1.      What is Spark Ecosystem

2.      What is Scala and its utility in Spark

3.      What is SparkContext

4.      How to work on RDD in Spark

5.      How to run a Spark Cluster

6.      Comparison of MapReduce vs Spark

1. What is Apache Spark

2. Spark Ecosystem

3. Spark Components

4. History of Spark

5. Spark Versions/Releases

6. What is Scala?

7. Why Scala?

8. SparkContext

9. Spark SQL

10. Hands on exercise and Assignments
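To make the RDD topics above concrete, here is a minimal word-count sketch using Spark's Java API (the course also covers Scala); the local master setting and the input/output paths are illustrative assumptions.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // local[*] keeps the sketch self-contained; drop setMaster when launching via spark-submit.
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

      // Split lines into words, map each word to (word, 1), and sum the counts per word.
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);

      counts.saveAsTextFile("hdfs:///data/output");
    }
  }
}
```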

 
12. Flume & Solr

The syllabus for this lecture would brief about:

Flume and Solr

1. Introduction

2. Configuration and Setup

3. Flume Sink with example

4. Channel

5. Flume Source with example

6. Complex Flume architecture

7. Streaming data storing into Solr

8. Customization of Solr

9. Hands on exercise and Assignments

 
13. Hue

The syllabus for this lecture would brief about:

Hue

1. Introduction to Hue

2. Advantages of Hue

3. Hue Web Interface

4. Ecosystems in Hue

5. Hands on exercise and Assignments

 
14. Oozie

The syllabus for this lecture would brief about:

1.      How multiple Hadoop ecosystem components work

2.      How they should be implemented to solve Big Data Issues

1. Oozie

2. Oozie Components

3. Oozie Workflow

4. Scheduling with Oozie

5. Demo on Oozie Workflow

6. Oozie Co-ordinator

7. Oozie Commands

8. Oozie Web Console

9. Oozie for MapReduce, PIG, Hive, and Sqoop

10. Combined flow of MR, PIG, Hive in Oozie

11. Hands on exercise and Assignments

 
15. Tableau

The syllabus for this lecture would brief about:

Tableau


1. Tableau Fundamentals

2. Tableau Analytics

3. Visual Analytics

4. Hands on exercise and Assignments

 
      

PROJECTS:

1. Hadoop Project 1: Hadoop-Tableau live integration

Topics: This is a project that gives you the opportunity to work on retail data analytics.

·         Hadoop Integration with Tableau
2. Hadoop Project 2: Multi-node cluster setup

Topics: This is a project that gives you the opportunity to work on a real-world Hadoop multi-node cluster setup in a distributed environment.

·         Running a Hadoop multi-node setup using a 4-node cluster

·         Deploying a MapReduce job on the Hadoop cluster

·         A complete demonstration of working with the various Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installing Hadoop, and mapping the nodes in the Hadoop cluster.

 
3. Hadoop Project 3: Social media analytics

Topics: This is a project that gives you the opportunity to work on social media analytics.

·         Streaming Twitter data

·         Storing the data into Hadoop

·         Processing the social media data

·         Sentiment analysis on the Twitter data

·         Storing the final result in a table

·         Connecting a BI tool

 

Mode of Training: Online

Total duration of the course: 5 to 7 weeks

Training duration per day: 50 mins - 90 mins

Communication Mode: GoToMeeting, WebEx

Software access: Software will be installed / server access will be provided, whichever is possible

Material: Soft copy of the material will be provided during the training.

Training: Both weekdays and weekends

Training Fee: $500