YARN in Hadoop

YARN: Yet Another Resource Negotiator.
Introduction:  
With the introduction of YARN, Hadoop became more powerful. YARN was introduced in Hadoop 2.x and provides several advantages over earlier versions of Hadoop, including better scalability, improved cluster utilization, and greater user agility. YARN maintains full backward compatibility with existing MapReduce tasks and applications. The YARN project was started by the Apache community to give Hadoop the ability to run non-MapReduce programs on the Hadoop framework.
Fig.1 illustrates how YARN fits into the new Hadoop ecosystem.
Fig.1 YARN in Hadoop


Services in Hadoop (1.x):
There are 5 services:
  1. NameNode
  2. SecondaryNameNode
  3. JobTracker
  4. TaskTracker
  5. DataNode 

Services in Hadoop (2.x):
After installing Hadoop 2.x, format the NameNode and start all services using the start-all.sh command in a terminal. Then run the jps command. If all the services shown in Fig.2 are running, the Hadoop installation is successful (a sample terminal session follows Fig.2).


Fig.2 Hadoop Services
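As a quick check, a jps listing on a single-node Hadoop 2.x setup typically looks like the sketch below (the process IDs are made up, and exact daemon names can vary with the distribution):

    $ start-all.sh
    $ jps
    2401 NameNode
    2517 DataNode
    2703 SecondaryNameNode
    2856 ResourceManager
    2964 NodeManager
    3105 Jps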
In Hadoop 1.x, the JobTracker was responsible for both resource management and job scheduling/monitoring. The fundamental idea of YARN is to split these two major responsibilities of the JobTracker into separate daemons: a global ResourceManager and a per-application ApplicationMaster.
Fig.3 YARN Architecture
The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager. The ResourceManager divides the resources among all the applications in the system. The ResourceManager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications. The scheduler performs its scheduling function based on the resource requirements of an application by using the abstract notion of a resource container, which incorporates resource dimensions such as memory, CPU, disk, and network. The per-application ApplicationMaster is responsible for negotiating resources from the ResourceManager and for working with the NodeManager(s) to execute and monitor the component tasks.
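To make this division of labor concrete, the sketch below uses the stock YarnClient API to ask the ResourceManager for its view of the running NodeManagers (illustrative only; it assumes a running cluster whose address is configured in yarn-site.xml):

    import java.util.List;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterView {
        public static void main(String[] args) throws Exception {
            // Connects to the ResourceManager configured in yarn-site.xml.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // The RM learns each NodeManager's capacity from its heartbeats.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + "  memory=" + node.getCapability().getMemory() + "MB"
                        + "  vcores=" + node.getCapability().getVirtualCores());
            }
            yarnClient.stop();
        }
    }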
Various scheduler options (the active scheduler is selected in yarn-site.xml; see the configuration sketch after this list):
1) FIFO Scheduler:-
Basically a simple “first come, first served” scheduler that pulls jobs from a work queue, oldest job first.
2) Capacity Scheduler:-
The Capacity scheduler is another pluggable scheduler for YARN that allows for multiple groups to securely share a large Hadoop cluster.
3) Fair Scheduler:-
Fair scheduling is a method of assigning resources to applications such that all applications get, on average, an equal share of resources over time.
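Which of these the ResourceManager runs is controlled by a single property in yarn-site.xml. A minimal sketch (the class names are the stock implementations shipped with Hadoop 2.x; the Capacity scheduler is the usual default):

    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <!-- Swap in ...capacity.CapacityScheduler or ...fifo.FifoScheduler
           as needed; the Fair scheduler is shown here. -->
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>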
Containers:
A container is a collection of physical resources such as RAM, CPU cores, and disks on a single node. There can be multiple containers on a single node (or a single large one). Every node in the system is considered to be composed of multiple containers of a minimum size of memory (e.g., 512 MB or 1 GB) and CPU. The ApplicationMaster can request any container sized as a multiple of this minimum (see the sketch below).
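For instance, an ApplicationMaster built on the AMRMClient API queues a container request like this (a minimal sketch; the 2048 MB / 2 vcore figures are arbitrary):

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class RequestSketch {
        // Queue one request for a 2048 MB / 2 vcore container; the scheduler
        // rounds sizes up to a multiple of the configured minimum allocation.
        static void requestContainer(AMRMClient<ContainerRequest> amRMClient) {
            Resource capability = Resource.newInstance(2048, 2);
            Priority priority = Priority.newInstance(0);
            amRMClient.addContainerRequest(
                    new ContainerRequest(capability, null, null, priority));
        }
    }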

NodeManager:
The NodeManager is YARN’s per-node “worker” agent, taking care of the individual compute nodes in a Hadoop cluster. Its duties include keeping up-to-date with the ResourceManager, overseeing application containers’ life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, log management, and auxiliary services that may be exploited by different YARN applications. On start-up, the NodeManager registers with the ResourceManager; it then sends heartbeats with its status and waits for instructions. Its primary goal is to manage application containers assigned to it by the ResourceManager.

ApplicationMaster (AM):
The AM is the process that coordinates an application’s execution in the cluster. Each application has its own unique AM, which is tasked with negotiating resources (containers) from the ResourceManager and working with the NodeManager to execute and monitor the tasks. In the YARN design, MapReduce is just one application framework; this design permits building and deploying distributed applications using other frameworks. Once the AM is started (as a container), it will periodically send heartbeats to the ResourceManager to affirm its health and to update the record of its resource demands.
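A bare-bones AM skeleton using the AMRMClient API looks roughly like this (a sketch only: error handling, container launch, and a real exit condition are omitted):

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MinimalAM {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
            amRMClient.init(new YarnConfiguration());
            amRMClient.start();

            // Register with the RM (host, RPC port, and tracking URL omitted).
            amRMClient.registerApplicationMaster("", 0, "");

            boolean done = false;
            while (!done) {
                // Each allocate() call doubles as a heartbeat: it reports
                // progress and carries any outstanding container requests.
                AllocateResponse response = amRMClient.allocate(0.5f);
                // Newly granted containers arrive in
                // response.getAllocatedContainers(); launch work on them here.
                done = true; // placeholder exit condition
            }

            amRMClient.unregisterApplicationMaster(
                    FinalApplicationStatus.SUCCEEDED, "", "");
            amRMClient.stop();
        }
    }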

ResourceManager (RM):
The RM is primarily a pure scheduler. It is strictly limited to arbitrating requests for available resources in the system made by the competing applications. It optimizes for cluster utilization (i.e., keeps all resources in use all the time) against various constraints such as capacity guarantees, fairness, and service-level agreements (SLAs). To allow for different policy constraints, the RM has a pluggable scheduler that enables different algorithms, such as those focusing on capacity and fair scheduling, to be used as necessary.
