
Creating a Hadoop cluster using Ambari

January 5, 2017 | Ambari, Big Data, Hadoop

Introduction

In this guide we will be installing a Hadoop cluster using Ambari. It seems this is a common approach nowadays. I personally like to install Hadoop manually because it gives me more control; however, Ambari has some nice features that make the initial install simpler.

First we will set up a set of hosts that we will use in our cluster. Each of these hosts will be an LXC container, with Ubuntu as the host server. Each LXC guest will be assigned an IP address from the default lxcbr0 interface using dnsmasq.conf.

The host name/IP address combinations we will use in this example are as follows:

Host name    IP address
hadoop24     10.0.3.124
hadoop25     10.0.3.125
hadoop26     10.0.3.126
hadoop27     10.0.3.127

hadoop24 will be our head node on which Ambari will be installed.

Change to the root user on the ubuntu host.
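For example:

sudo su -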

 

Create LXC containers

First we will create the LXC containers using the following commands:
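The exact commands were lost in publishing; a minimal sketch, assuming the stock Ubuntu LXC template and the host names above (the dnsmasq entries assume LXC_DHCP_CONFILE in /etc/default/lxc-net points at /etc/lxc/dnsmasq.conf):

# create the four guests from the Ubuntu template
for n in 24 25 26 27; do lxc-create -n hadoop$n -t ubuntu; done

# pin each guest to its IP address via the lxc dnsmasq config
cat >> /etc/lxc/dnsmasq.conf <<EOF
dhcp-host=hadoop24,10.0.3.124
dhcp-host=hadoop25,10.0.3.125
dhcp-host=hadoop26,10.0.3.126
dhcp-host=hadoop27,10.0.3.127
EOF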

Using the following command, we can list the containers we have created:

lxc-ls --fancy

Result:


Next we will start all the guests:
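Something along these lines:

for n in 24 25 26 27; do lxc-start -n hadoop$n -d; done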

Result:


We will then log into hadoop24 using ssh from our Ubuntu host, and then we will issue a command to copy the SSH keys.
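A sketch of those steps, assuming the default ubuntu account on the guest:

ssh ubuntu@10.0.3.124      # log into hadoop24 from the Ubuntu host
ssh-keygen -t rsa          # generate a key pair for the ubuntu user, accepting the defaults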

 

Result:

 

Configure password-less SSH

 

We need to copy the contents of "id_rsa.pub" into the "authorized_keys" file. This can be done using the command below:
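For example:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys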

Verify

Result

Modify the permissions as follows:
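The usual permissions for SSH to accept the key file are:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys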


We will now ensure that hadoop24 has password-less access to all hosts using the ubuntu user.

Log into each host and accept the key.

Let’s install pdsh on hadoop24, and test the ssh key access
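A sketch of the install and a quick single-host test:

sudo apt-get install -y pdsh
pdsh -R ssh -l ubuntu -w hadoop25 uptime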

Create a file that stores our hosts

Add these hosts:

Save and exit vi
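The file contents were not captured in the post; presumably it simply lists the cluster hosts, one per line, for example in /home/ubuntu/hosts.txt (file name assumed):

hadoop24
hadoop25
hadoop26
hadoop27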

Add this host

Save and exit vi

Result

 

Result:

Verify that pdsh works using the manual command shell option
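The command was along these lines, using pdsh's host-range expansion:

pdsh -R ssh -l ubuntu -w hadoop[24-27] uname -a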

We can see that I have used the command line to specify the host name prefix hadoop and then passed in the numbers of the hosts using character substitution.

Result

Verify pdsh using the file option
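For example, pointing pdsh at the hosts file created earlier (path assumed):

pdsh -R ssh -l ubuntu -w ^/home/ubuntu/hosts.txt date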

Result

In this case, I have specified the file containing the host names on the command line and told pdsh to execute the same command on each host and exit.

We now know that we have access to all the hosts using ubuntu as the user.

In another version of this guide, I will discuss using a user called hadoop.

Creating access for Root

We also wish to allow Ambari to run as root, so we need to do this for root as well

Copy the SSH Public Key (id_rsa.pub) to the root account on your target hosts.

I followed this process

From hadoop24, I used this command to copy the two files to each host's /tmp:
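The command was something like the following, assuming a key pair has already been generated for root on hadoop24 (exact file list assumed):

for n in 25 26 27; do scp /root/.ssh/id_rsa /root/.ssh/id_rsa.pub ubuntu@hadoop$n:/tmp/; done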

I then logged on to each host from hadoop24 using ssh as the ubuntu user.
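Presumably something like the following on each host, moving the copied keys into root's .ssh directory:

ssh ubuntu@hadoop25
sudo su -
mkdir -p /root/.ssh
cp /tmp/id_rsa /tmp/id_rsa.pub /root/.ssh/
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys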

This process essentially allowed me to use the same root public key for all the hosts and allowed password-less access as root.

Also change the permissions on each host's .ssh folder and authorized_keys file as root.
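For example:

chmod 700 /root/.ssh
chmod 600 /root/.ssh/authorized_keys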

 

 

The important thing is that I need to have the same id_rsa on all hosts

Installing Ambari

Log in to the head host, for example hadoop24, as root.

At this time I used sudo as the ubuntu user.

Install wget
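For example:

apt-get install -y wget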

 

Set the repo

Download the Ambari repository file to a directory on your installation host.
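The repository URL is version-specific; a sketch for an Ambari 2.x release on Ubuntu 14 (substitute the exact version and key id from the Hortonworks install guide):

cd /etc/apt/sources.list.d
wget http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/<ambari-version>/ambari.list
apt-key adv --recv-keys --keyserver keyserver.ubuntu.com <hortonworks-key-id>
apt-get update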

 

Confirm that Ambari packages downloaded successfully by checking the package name list.
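For example:

apt-cache showpkg ambari-server
apt-cache showpkg ambari-agent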

You should see the Ambari packages in the list.

Result:

 

 

Install Ambari Server

Install the Ambari bits. This also installs the default PostgreSQL Ambari database.
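That is:

apt-get install -y ambari-server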

 

Setup Ambari Server

Before starting the Ambari Server, you must set up the Ambari Server. Setup configures Ambari to talk to the Ambari database, installs the JDK, and allows you to customize the user account the Ambari Server daemon will run as. The ambari-server setup command manages the setup process. Run the following command on the Ambari Server host to start the setup process. You may also append setup options to the command.
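That command is:

ambari-server setup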

 

Respond to the setup prompt:

  1. If you have not temporarily disabled SELinux, you may get a warning. Accept the default (y), and continue.
  2. By default, Ambari Server runs under root. Accept the default (n) at the Customize user account for ambari-server daemon prompt to proceed as root. If you want to create a different user to run the Ambari Server, or to assign a previously created user, select y at the prompt and provide a user name. Refer to the Ambari Security Guide > Configuring Ambari for Non-Root for more information about running the Ambari Server as non-root.
  3. If you have not temporarily disabled iptables, you may get a warning. Enter y to continue.
  4. Select a JDK version to download. Enter 1 to download Oracle JDK 1.8.

PostgreSQL is installed as the embedded default database, and the username is ambari with password bigdata.

 

If we want to set up a separate database, which would be done in an enterprise environment, we can refer to this page for some insights.

 

Start the Ambari Server

Run the following command on the Ambari Server host:

 

To check the Ambari Server processes:

 

To stop the Ambari Server:
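The three commands referred to above are the standard ambari-server controls:

ambari-server start     # start the server
ambari-server status    # check the server process
ambari-server stop      # stop the server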

 

Note: If you plan to use an existing database instance for Hive or for Oozie, you must complete the preparations described in Using Non-Default Databases-Hive and Using Non-Default Databases-Oozie before installing your Hadoop cluster.

 

Installing, Configuring, and Deploying an HDP Cluster

Log In to Apache Ambari

 

After starting the Ambari service, open Ambari Web using a web browser.

Point your browser to http://<your.ambari.server>:8080, where <your.ambari.server> is the name of your Ambari server host. For example, our Ambari server host is located at http://hadoop24:8080.

Log in to the Ambari Server using the default user name/password: admin/admin. You can change these credentials later.

Result:

 

For a new cluster, the Ambari install wizard displays a Welcome page from which you launch the Ambari Install wizard.

Name Your Cluster

 

In Name your cluster, type a name for the cluster you want to create. Use no white spaces or special characters in the name.

 

Choose Next.

Select Stack

 

The Service Stack (the Stack) is a coordinated and tested set of HDP components. Use a radio button to select the Stack version you want to install. To install an HDP 2.x stack, select the HDP 2.3, HDP 2.2, HDP 2.1, or HDP 2.0 radio button.

Install Options

 

In order to build up the cluster, the install wizard prompts you for general information about how you want to set it up. You need to supply the FQDN of each of your hosts. The wizard also needs to access the private key file you created in Set Up Password-less SSH. Using the host names and key file information, the wizard can locate, access, and interact securely with all hosts in the cluster.

Use the Target Hosts text box to enter your list of host names, one per line. You can use ranges inside brackets to indicate larger sets of hosts. For example, for host01.domain through host10.domain use host[01-10].domain

Note: If you are deploying on EC2, use the internal Private DNS host names.

If you want to let Ambari automatically install the Ambari Agent on all your hosts using SSH, select Provide your SSH Private Key and either use the Choose File button in the Host Registration Information section to find the private key file that matches the public key you installed earlier on all your hosts or cut and paste the key into the text box manually.

 

Note: If you are using IE 9, the Choose File button may not appear. Use the text box to cut and paste your private key manually.

Fill in the user name for the SSH key you have selected. If you do not want to use root, you must provide the user name for an account that can execute sudo without entering a password.

If you do not want Ambari to automatically install the Ambari Agents, select Perform manual registration. For further information, see Installing Ambari Agents Manually.

Confirm Hosts

Choose Register and Confirm to continue.

I had issues, and the other three hosts did not get installed. Here is the report from the successful host, hadoop24; basically, I needed to install ntp.

I got this error:

==========================

 

Basically the installation of the agents failed, so I decided to log into each box, and install the agents using this process
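A sketch of the manual agent install on each guest, assuming the same Ambari apt repository has been set up there; the hostname value points the agent at the Ambari server:

apt-get install -y ambari-agent
vi /etc/ambari-agent/conf/ambari-agent.ini    # in the [server] section set: hostname=hadoop24
ambari-agent start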

When I retried the wizard, it was successful, so it was some sort of bug; maybe the agent is required to complete the work (i.e. dependencies)? A bit flaky, as it is not very well explained. Typical Linux!

Once all the hosts are successfully registered we see the following

Before we can continue with the wizard we need to set up NTP.

Setting up NTP

We now need to install ntp on all the hosts

as root on all hosts
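For example, from hadoop24 using pdsh as root:

pdsh -R ssh -w hadoop[24-27] "apt-get install -y ntp && service ntp start"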

Re-run the checks

 

Select which services to install. This is where Ambari is very useful; especially when using Hortonworks or Cloudera, the entire ecosystem can be installed very quickly.

Here is a list of all the ecosystem services I have installed.

Service (Version) - Description

HDFS (2.7.1.2.3) - Apache Hadoop Distributed File System
YARN + MapReduce2 (2.7.1.2.3) - Apache Hadoop NextGen MapReduce (YARN)
Tez (0.7.0.2.3) - Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive (1.2.1.2.3) - Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
HBase (1.1.1.2.3) - A non-relational distributed database, plus Phoenix, a high performance SQL layer for low latency applications.
Pig (0.15.0.2.3) - Scripting platform for analyzing large datasets
Sqoop (1.4.6.2.3) - Tool for transferring bulk data between Apache Hadoop and structured data stores such as relational databases
Oozie (4.2.0.2.3) - System for workflow coordination and execution of Apache Hadoop jobs. This also includes the installation of the optional Oozie Web Console, which relies on and will install the ExtJS Library.
ZooKeeper (3.4.6.2.3) - Centralized service which provides highly reliable distributed coordination
Falcon (0.6.1.2.3) - Data management and processing platform
Storm (0.10.0) - Apache Hadoop Stream processing framework
Flume (1.5.2.2.3) - A distributed service for collecting, aggregating, and moving large amounts of streaming data into HDFS
Accumulo (1.7.0.2.3) - Robust, scalable, high performance distributed key/value store
Ambari Metrics (0.1.0) - A system for metrics collection that provides storage and retrieval capability for metrics collected from the cluster
Atlas (0.5.0.2.3) - Atlas Metadata and Governance platform
Kafka (0.8.2.2.3) - A high-throughput distributed messaging system
Knox (0.6.0.2.3) - Provides a single point of authentication and access for Apache Hadoop services in a cluster
Mahout (0.9.0.2.3) - Project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification
Slider (0.80.0.2.3) - A framework for deploying, managing and monitoring existing distributed applications on YARN
Spark (1.4.1.2.3) - Apache Spark is a fast and general engine for large-scale data processing

 

Click Next

Assign Master components

I decided to let the system choose; it looked good enough for this 4-node cluster to have these services running as masters on the default selected hosts.

Click Next

Once again, I selected to install a DataNode on all hosts, giving 4 data nodes.

Click Next

Before we can continue, we need to set specific properties, such as login credentials, for key services, for example Hive.

I am going to use hadoop/bigdata for as many username/password combinations as possible

Note: for secrets I used bigdatasecret, and for all passwords I used bigdata.

 

The wizard will progress through installing the required services on each appropriate host as defined

Note: It will take some time for this process to complete, go get a cup of coffee. If you’re like me and a coffee snob, you will drive to somewhere that has competition-level flat whites.

In this example there was a failure starting all the services; let's have a look.

Looking at the output below

There is no information that is useful

I just clicked retry; often it can be network connectivity, for example your local broadband router timing out for each LXC host.

The retry seems to work well, the state management of Ambari is excellent

Result

We have some warnings; let's take a look. To do this we will complete the wizard and look at the services that are not starting. It could be resource contention due to LXC memory allocation, or something else like disk space.

Click Complete

The metrics will load. I have noticed that the Ambari server (i.e. the web interface) has slowed down, so I guess there is some memory allocation weakness.

I tried to ssh directly to host hadoop24, which is the Ambari server, and it was very slow, so I logged into the Ubuntu host to check swap etc.

I could not log in, so the default memory allocation settings applied at the beginning of the wizard are something we need to consider. HDP is a resource-intensive system, as would be expected with so many services in the ecosystem. This will have an impact on my physical Ubuntu server. I guess I might need to look at using two physical boxes, or carefully look at the resources allocated to each LXC container.

I decided to use lxc-info to interrogate each guest
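For example, on the Ubuntu host:

lxc-info -n hadoop24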

hadoop24

 

Hadoop25

 

Hadoop26

 

Hadoop27

 

I decided to see how I could interrogate the Hadoop ecosystem services; the reason was that my first thought was to stop unneeded services at this time.

To start with, we can check the following using the YARN API.

Service checks can be started via the Ambari API, and it is also possible to start all available service checks with a single API command. To bulk-run these checks it is necessary to use the same API/method that is used to trigger a rolling restart of DataNodes (request_schedules). The "request_schedules" API starts all defined commands in the specified order; it is even possible to specify a pause between the commands.

Available Service Checks:

 

Note: Make sure you replace user, password, clustername and ambari-server with the actual values

Start single service check via Ambari API (e.g. HDFS Service Check):

Syntax:
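The syntax was not preserved here; based on the Ambari REST API, a single service check is triggered with a POST like this (treat the request body keys as version-dependent):

curl -ivk -H "X-Requested-By: ambari" -u <user>:<password> -X POST -d '{"RequestInfo":{"context":"<Service> Service Check","command":"<SERVICE>_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"<SERVICE>"}]}' http://<ambari-server>:8080/api/v1/clusters/<clustername>/requests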

 

Example:
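Filling in our values (the cluster name is a placeholder; use whatever was entered in the wizard):

curl -ivk -H "X-Requested-By: ambari" -u admin:admin -X POST -d '{"RequestInfo":{"context":"HDFS Service Check","command":"HDFS_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"HDFS"}]}' http://hadoop24:8080/api/v1/clusters/<clustername>/requests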

 

In the home directory of hadoop24, I created a file called payload with these contents:

 

Content added to Payload file…

 

Result

 

 

Start bulk Service checks via Ambari API (e.g. HDFS, Yarn, MapReduce2 Service Checks):

curl -ivk -H "X-Requested-By: ambari" -u <user>:<password> -X POST -d @payload http://<ambari-server>:8080/api/v1/clusters/<clustername>/request_schedules

Payload:
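The payload itself was not preserved; a trimmed sketch of the request_schedules format, with two checks and a short pause between batches (field names as I understand this API, so verify against your Ambari version):

[ {
  "RequestSchedule": {
    "batch": [
      { "requests": [
          { "order_id": 1, "type": "POST", "uri": "/api/v1/clusters/<clustername>/requests",
            "RequestBodyInfo": { "RequestInfo": { "context": "HDFS Service Check", "command": "HDFS_SERVICE_CHECK" },
                                 "Requests/resource_filters": [ { "service_name": "HDFS" } ] } },
          { "order_id": 2, "type": "POST", "uri": "/api/v1/clusters/<clustername>/requests",
            "RequestBodyInfo": { "RequestInfo": { "context": "YARN Service Check", "command": "YARN_SERVICE_CHECK" },
                                 "Requests/resource_filters": [ { "service_name": "YARN" } ] } }
      ] },
      { "batch_settings": { "batch_separation_in_seconds": 1, "task_failure_tolerance": 1 } }
    ]
  }
} ]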

 

I did not find this useful; I am left wondering where the resulting information is located.

I decided to try and install ambari-shell to see if the command line could be useful.

I then decided to look for the latest version 2 build

Because we have installed Java 8 as part of Ambari etc., we can use that for our JVM.

 

Download the "ambari-shell" code from Git and compile the code using Gradle:
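Assuming the sequenceiq/ambari-shell project on GitHub:

cd /opt
git clone https://github.com/sequenceiq/ambari-shell.git
cd ambari-shell
gradle clean build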

 

This process failed due to Java 1.8, so I installed OpenJDK 1.7.
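For example:

apt-get install -y openjdk-7-jdk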

Locating the Java install
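For example:

ls /usr/lib/jvm
update-alternatives --list java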

 

I found the JDK 1.7 was installed

I tried a build and it still did not work.

I then decided to install gradle3

 

 

I could not get Gradle to work; the instructions were for CentOS, so I used Maven.

I downloaded the latest tar ball, and installed in /opt/
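Something along these lines (the Maven version is an assumption; pick the current 3.x release):

cd /opt
wget http://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xzf apache-maven-3.3.9-bin.tar.gz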


Log in as the ubuntu user, and modify .bashrc to add the Maven bin path; for example, append this line and save:
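For example, matching the install location sketched above:

export PATH=$PATH:/opt/apache-maven-3.3.9/bin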


 

I reset JAVA_HOME to JDK 1.8 as above.

I then ran mvn package from /opt/ambari-shell

All build processes I tried did not work, so I used the /opt/ambari-shell/latest-snap.sh to download the latest built JAR.

Result:

We have a working shell.

To this day I cannot figure out why the build does not work on Ubuntu 14.04, but I do not need to debug this at this time as the JAR file suffices. Maybe another time I will try CentOS 7.x.

Using the following commands, I was able to use ambari-shell to check the services
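Roughly as follows; the JAR name and connection flags follow the ambari-shell README, so adjust them to match the JAR you downloaded:

java -jar ambari-shell.jar --ambari.host=hadoop24 --ambari.port=8080 --ambari.user=admin --ambari.password=admin
# then, inside the shell, commands such as 'services list' and 'services stop'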

Result:

 

Result

Basically, I used 'services stop' to stop all the services, to give myself some processing power.

I found the ambari-shell a little basic and in need of much improvement; it was very limited. I could not figure out how to stop a named service.

This link shows several commands that can be used to stop specific services.
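For reference, the usual REST call to stop a single named service (HDFS here; the state INSTALLED means "stopped") looks like this:

curl -u admin:admin -H "X-Requested-By: ambari" -X PUT -d '{"RequestInfo":{"context":"Stop HDFS"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' http://hadoop24:8080/api/v1/clusters/<clustername>/services/HDFS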

I then logged back in to Ambari and used the console to restart each service I required.

I started with the HDFS service on all nodes.

 

I then started MapReduce and YARN.

How to delete a service

 

Before we use this command, I wanted to note that the Ambari web interface is great for installing and viewing status, but it doesn't inform you where the configs are located. I suspect most of this is in core-site.xml and hdfs-site.xml sections.

Here is how I am going to list the Falcon services, as I am not using Falcon.

Syntax

 

Example

 

Actual command run as root
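Neither the syntax nor the actual command survived publishing; the pattern is a simple GET against the service resource (cluster name is a placeholder):

# syntax
curl -u <user>:<password> -H "X-Requested-By: ambari" -X GET http://<ambari-server>:8080/api/v1/clusters/<clustername>/services/<SERVICE_NAME>

# example for Falcon, run as root on hadoop24
curl -u admin:admin -H "X-Requested-By: ambari" -X GET http://hadoop24:8080/api/v1/clusters/<clustername>/services/FALCON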

I am now going to parse this JSON using the command line and Python.

Get version of Python, to ensure it is Python 3
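For example:

python3 --version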

 

This should be the command using python3
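A sketch of that pipeline, pulling the component names out of the Falcon service JSON (the key names may vary slightly between Ambari versions):

curl -s -u admin:admin -H "X-Requested-By: ambari" http://hadoop24:8080/api/v1/clusters/<clustername>/services/FALCON | python3 -c "import sys, json; doc = json.load(sys.stdin); print([c['ServiceComponentInfo']['component_name'] for c in doc['components']])"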

 

I used python3, and the result was:

 

I can use this for a variety of things; for example, a script called aComp.sh contains:
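The script contents were lost in publishing; based on the pipeline above, aComp.sh presumably did something along these lines, handing the JSON to the Python helper (credentials and cluster name are placeholders):

#!/bin/bash
# fetch the FALCON service definition and pass the JSON to aComp.py for parsing
curl -s -u admin:admin -H "X-Requested-By: ambari" http://hadoop24:8080/api/v1/clusters/<clustername>/services/FALCON | python3 aComp.py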

 

And the python file called aComp.py contains:

 

The result of running aComp.sh  is below:

I can now issue this command to delete the Falcon service:
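The delete call uses the same resource path with DELETE as the verb (the service generally has to be stopped, and its components removed, before Ambari will accept this):

curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://hadoop24:8080/api/v1/clusters/<clustername>/services/FALCON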

When I run the ./aComp.sh command, we no longer have the service

When we log into Ambari, we see that the Falcon service no longer exists. We will now delete Storm, but first let's change the aComp.sh script to accept a command-line parameter, as shown in the example call below.

Example call
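Assuming the script now takes the service name as its first argument:

./aComp.sh STORM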

 

Result:

 

We can now confirm that all we need to do is delete the service named STORM to remove all the components of STORM.

 

Storm is now deleted

Let's remove Mahout and Flume.

Let's remove Knox and Atlas.

Let's remove Sqoop and HBase.

Let's remove Slider and Accumulo.

Let's remove Tez and Kafka.

Once I removed a tonne of unused services, I was able to redistribute services using Ambari, and my load/performance etc. was much better.

I then tried to move the Oozie service; these were the commands Ambari presented as manual tasks.

Manual commands

I also moved the Secondary NameNode server and ZooKeeper away from hadoop25 to hadoop27.

I had trouble with Oozie after moving it, and also after deleting Falcon, so I removed the Oozie service and re-added it using the Ambari web interface.

Then

 

I then re-added the Oozie service, which required the Tez clients, but this time I put it on hadoop27.

I have a nice working HDP ecosystem.

I now need to look at the under-replicated blocks.

Fix under replicated blocks

Switch to hdfs user
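For example:

su - hdfs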

 

Then
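A common way to list the under-replicated files; this is the standard fsck pattern, not necessarily the exact command used here:

hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' > /tmp/under_replicated_files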

 

Result:

 

then
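Then set the replication on each of those files back to the desired factor (3 here, as an assumption):

for f in $(cat /tmp/under_replicated_files); do hdfs dfs -setrep 3 "$f"; done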

 

Result:

We are now done

 

After two days, I have completed the core admin of a new Ambari-managed HDP cluster using Hortonworks. I have learned loads, especially about the REST API for Ambari, which is very useful.

What I need to do now is understand why these components require each other, then create some use-cases that use the current services in this ecosystem. Once done, I can then add the other, more complex services.

