--D. Thiebaut (talk) 14:35, 22 June 2013 (EDT)

Server Hardware

Here are the specs of our main machine

ASUS M5A97 R2.0 AM3+ AMD 970 SATA 6Gb/s USB 3.0 ATX AMD Motherboard ($84.99)
Corsair Vengeance 16GB (2x8GB) DDR3 1600 MHz (PC3 12800) Desktop Memory (CMZ16GX3M2A1600C10) ($120.22)
AMD FX-8150 8-Core Black Edition Processor Socket AM3+ FD8150FRGUBOX ($175)
2x 1 TB-RAID 1 disks
- used Ctrl-F in POS to call RAID setup.
- setup 2 1TB drives as RAID-1

Software Setup

Ubuntu Desktop V13.04 is the main operating system of the computer. VirtualBox is used to setup the virtual hadoop servers. Since the main machine has an 8-core processor, we create a 6-host virtual cluster.

While installing Ubuntu, use Ubuntu Software Manager GUI and install following packages:

VirtualBox
emacs
eclipse
python 3.1
ddclient (for dynamic hostname)
open-ssh to access remotely
setup keyboard ctrl/capslock swap
setup 2 video displays

Create The First Virtual Server

Bo Feng @ http://b-feng.blogspot.com/2013/02/create-virtual-ubuntu-cluster-on.html describes the setup of a virtual cluster as a series of simple (though high-level steps):

Create a virtual machine
- Change the network to bridge mode
Install Ubuntu Server OS
- Setup the hostname to something like "node-01"
Boot up the new virtual machine
- sudo rm -rf /etc/udev/rules.d/70-persistent-net.rules
Clone the new virtual machine
- Initialize with a new MAC address
Boot up the cloned virtual machine
- edit /etc/hostname
  - change "node-01" to "node-02"
- edit /etc/hosts
  - change "node-01" to "node-02"
Reboot the cloned virtual machine
Redo step 4 to 6 until there are enough nodes

Note: the ip address of each node should start with "192.168." unless you don't have a router.

That's basically it, but we'll go through the fine details here.

Create Virtual Server

This description is based on http://serc.carleton.edu/csinparallel/vm_cluster_macalester.html.

On the Ubuntu physical machine, download ubuntu-13.04-server-amd64.iso from Ubuntu repository
Mount the image so that the virtual machines can install directly from it: Open windows manager, go to Downloads, right click (need 3-button mouse) on ubuntu-13.04-server-amd64.iso and pick open with archive-mounter
Start VirtualBox (should be in Applications/Accessories)
Create new virtual box with these attributes:
- Linux
- Ubuntu 64 bits
- 2 GB virtual RAM
- Create virtual hard disk, VDI
- dynamically allocated
- set the name to hadoop100, with 8.00 GB drive size

VirtualHadoopCluster VirtualBoxCreateHadoop100.png

- install Ubuntu (pick language, keyboard, time zone, etc...)
- use a superuser name and password
- partition the disk using the guided feature. Select LVM
- install security updates automatically
- select the OpenSSH server (leave all others unselected)
Start the Virtual Machine Hadoop100. It works!

Setup Network

Using virtualBox, Stop hadoop100
Right click on the machine and in the settings, go to Network
select Enable Network Adapter
select Bridged Adapter and pick eth0.
Using virtualBox restart hadoop100
from its terminal window, ping google.com and verify that site is reachable from the virtual server.
(get the machine's ip address using ifconfig -a if necessary)
Change server name by editing /etc/hostname and set to hadoop100
similarly, edit /etc/hosts to contain line 127.0.1.1 hadoop100

Setup Dynamic-Host Address

On another computer, connect to dyndns and register hadoop100 with this system.

on Hadoop0, install ddclient

   sudo apt-get install ddclient

and enter the full address of the server (e.g. hadoop100.webhop.net)
If you need to modify the information collected by the installation of ddclient, it can be found /etc/ddclient.conf
wait a few minutes and verify that hadoop100 is now reachable via ssh from a remote computer.

Install Hadoop

Install java

sudo apt-get install openjdk-6-jdk
java -version
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.5) (6b27-1.12.5-1ubuntu1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Disable ipv6 (recommended by many users setting up Hadoop)

 sudo emacs -nw /etc/sysctl.conf

and add these lines at the end:

  #disable ipv6  
  net.ipv6.conf.all.disable_ipv6 = 1  
  net.ipv6.conf.default.disable_ipv6 = 1  
  net.ipv6.conf.lo.disable_ipv6 = 1

reboot hadoop100

Now follow directives in Michael Noll's excellent tutorial for building a 1-node Hadoop cluster.
create hduser user and hadoop group (already existed)
create ssh password-less entry
open /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME

  export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/

Download hadoop from hadoop repository

  sudo -i
  cd /usr/local
  wget http://www.poolsaboveground.com/apache/hadoop/common/stable/hadoop-1.1.2.tar.gz
  tar -xzf hadoop-1.1.2.tar.gz
  mv hadoop-1.1.2 hadoop
  chown -R hduser:hadoop hadoop

setup user hduser by loging in a hduser

  su - hduser
  emacs .bashrc

and add

 # HADOOP SETINGS
 # Set Hadoop-related environment variables
 export HADOOP_HOME=/usr/local/hadoop

 # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
 #export JAVA_HOME=/usr/lib/jvm/java-6-sun
 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

 # Some convenient aliases and functions for running Hadoop-related commands
 unalias fs &> /dev/null
 alias fs="hadoop fs"
 unalias hls &> /dev/null
 alias hls="fs -ls"

 export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin"
 export PATH="$PATH:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:."

This setting of .bashrc should be done by any user wanting to use hadoop.

edit /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME

 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

Edit /usr/local/hadoop/conf/core-site.xml and add

  <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>
 
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>

Edit /usr/local/hadoop/conf/mapred-site.xml and add

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

- Edit /usr/local/hadoop/conf/hdfs-site.xml and add

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

Format namenode. As user hduser enter the command:

  hadoop namenode -format

as super user run the following commands:

  sudo mkdir -p /app/hadoop/tmp
  sudo chown hduser:hadoop /app/hadoop/tmp
  sudo chmod 750 /app/hadoop/tmp

  sudo ls -n   /usr/local/hadoop/hadoop-examples-1.1.2.jar /usr/local/hadoop/hadoop-examples.jar

back as hduser

  cd
  mkdir input
  cp /usr/local/hadoop/conf/*.xml input
  start-all.sh
  hadoop dfs copyFromLocal input/ input

  hadoop dfs -ls
  Found 1 item
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:16 /user/hduser/input

  hadoop jar /usr/local/hadoop/hadoop-examples.jar wordcount /usr/hduser/input/ output
  13/07/24 11:24:46 INFO input.FileInputFormat: Total input paths to process : 7
  13/07/24 11:24:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
  13/07/24 11:24:46 WARN snappy.LoadSnappy: Snappy native library not loaded
  13/07/24 11:24:47 INFO mapred.JobClient: Running job: job_201307241105_0006
  13/07/24 11:24:48 INFO mapred.JobClient:  map 0% reduce 0%
  13/07/24 11:24:57 INFO mapred.JobClient:  map 28% reduce 0%
  13/07/24 11:25:04 INFO mapred.JobClient:  map 42% reduce 0%
  .
  .
  .
  13/07/24 11:25:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1448595456
  13/07/24 11:25:22 INFO mapred.JobClient:     Reduce output records=805
  13/07/24 11:25:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5797994496
  13/07/24 11:25:22 INFO mapred.JobClient:     Map output records=3114

  hadoop dfs -ls 
  Found 2 items
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:16 /user/hduser/input
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:29 /user/hduser/output
 
  hadoop dfs -ls output
  Found 3 items
  -rw-r--r--   1 hduser supergroup          0 2013-07-24 11:29 /user/hduser/output/_SUCCESS
  drwxr-xr-x   - hduser supergroup          0 2013-07-24 11:28 /user/hduser/output/_logs
  -rw-r--r--   1 hduser supergroup      15477 2013-07-24 11:28 /user/hduser/output/part-r-00000

  hadoop dfs -cat output/part-r-00000
  "*"	11
  "_HOST"	1
  "alice,bob	11
  "local"	2
  'HTTP/'	1
  (-1).	2
  (1/4	1
  (maximum-system-jobs	2
  *not*	1
  ,	2
  ...

All set!

Add New User who will run Hadoop Jobs

Give the Right Permissions to New User for Hadoop

This is taken and adapted from http://amalgjose.wordpress.com/2013/02/09/setting-up-multiple-users-in-hadoop-clusters/

as superuser, add new user mickey to Ubuntu

  sudo useradd  -d /home/mickey -m mickey
  sudo passwd mickey

login as hduser

 hadoop fs –chmod -R  1777 /app/hadoop/tmp/mapred/   (this generates errors...)
 chmod 777 /app/hadoop/tmp
 hadoop fs -mkdir /user/mickey/
 hadoop fs -chown -R mickey:mickey /user/mickey/

login as mickey
change the default shell to bash

  chsh

emacs .bashrc and add these lines at the end:

 # HADOOP SETINGS
 # Set Hadoop-related environment variables
 #export HADOOP_HOME=/usr/local/hadoop
 
 # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
 #export JAVA_HOME=/usr/lib/jvm/java-6-sun
 export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
 
 # Some convenient aliases and functions for running Hadoop-related commands
 unalias fs &> /dev/null
 alias fs="hadoop fs"
 unalias hls &> /dev/null
 alias hls="fs -ls"

 export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin"
 export PATH="$PATH:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:."

log out and back in as Mickey

Test

create input directory and copy some random text files there.

  mkdir input
  cp /usr/local/hadoop/conf/*.xml input
  hadoop dfs -copyFromLocal input/ input

run hadoop wordcount job on the files in the new directory to test the setup.

  hadoop jar /usr/local/hadoop/hadoop-examples.jar wordcount /usr/hduser/input/ output
  hadoop dfs -ls
  hadoop dfs -ls output
  hadoop dfs -cat output/part-r-00000
  "*"	11
  "_HOST"	1
  "alice,bob	11
  "local"	2
  'HTTP/'	1
  (-1).	2
  (1/4	1
  ...

It should work correctly and create a list of words and their frequency of occurrence.

Sections below under construction!

Clone Server

This work is done on the physical server, using the VirtualBox Manager

Use hadoop100 as Master for cloning
on hadoop100, remove /etc/udev/rules.d/70-persistent-net.rules

  sudo rm -f /etc/udev/rules.d/70-persistent-net.rules

Not doing this will prevent the network interface from being brought up

right click on hadoop100 in VBox manager
pick new name (i.e. hadoop101)
check ON "Reinitialize the MAC address of all network cards
pick Linked clone
Clone away! (quick, takes about 1/2 a second or so)
right click on new machine, pick settings and adjust these quantities:
- System: select 2GB RAM
- Processor: 1CPU
- Network: bridged adapter
start the virtual machine
it will complain that it is not finding the network, so login as the super user and do this:

ifconfig -a

Notice the first Ethernet adapter id. It should be "eth?"
edit the file at /etc/network/interface. Change the adapter (twice) from "eth0" to the adapter id in the previous step
Save the file
Reboot the virtual appliance
edit /etc/ddclient.conf on virtual machine. Set it to new host defined with the dynamic-host system
edit /etc/hostname and put the name of the new virtual machine there
edit /etc/hosts and put name of new machine after 127.0.1.1

   127.0.1.1     hadoop101

Setup Virtual Hadoop Cluster under Ubuntu with VirtualBox

Contents

Server Hardware

Software Setup

Create The First Virtual Server

Create Virtual Server

Setup Network

Setup Dynamic-Host Address

Install Hadoop

Add New User who will run Hadoop Jobs

Give the Right Permissions to New User for Hadoop

Test

Clone Server

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools