Tuesday, May 19, 2015

SAP - Hadoop data lab project on Raspberry PI.

Hadoop has developed into a key enabling technology for all kinds of Big Data analytics scenarios. Although Big Data applications have started to move beyond the classic batch-oriented Hadoop architecture towards near real-time architectures such as Spark, Storm, etc., [1] a thorough understanding of the Hadoop, MapReduce and HDFS principles and of services such as Hive, HBase, etc. operating on top of the Hadoop core still remains one of the best starting points for getting into the world of Big Data. Renting a Hadoop cloud service or even getting hold of an on-premise Big Data appliance will get you Big Data processing power but no real understanding of what is going on behind the scenes.

To inspire your own little Hadoop data lab project, this four-part blog will provide a step-by-step guide for the installation of open source Apache Hadoop from scratch on Raspberry Pi 2 Model B over the course of the next three to four weeks. Hadoop is designed for operation on commodity hardware, so it will do just fine for tutorial purposes on a Raspberry Pi. We will start with a single node Hadoop setup, move on to the installation of Hive on top of Hadoop, followed by using the Apache Hive connector of the free SAP Lumira desktop trial edition to visually explore a Hive database. We will finish the series with the extension of the single node setup to a Hadoop cluster on multiple, networked Raspberry Pis. If things go smoothly, and depending on your level of Linux expertise, you can expect your Hadoop Raspberry Pi data lab project to be up and running within approximately 4 to 5 hours.

We will use a simple, widely known processing example (word count) throughout this blog series. No prior technical knowledge of Hadoop, Hive, etc. is required. Some basic Linux/Unix command line skills will prove helpful throughout. We are assuming that you are familiar with basic Big Data notions and the Hadoop processing principle. If not, you will find useful pointers in [3] and at: http://hadoop.apache.org/. Further useful references will be provided in due course.

Part 1 - Single node Hadoop on Raspberry Pi 2 Model B (~120 mins)
Part 2 - Hive on Hadoop (~40 mins)
Part 3 - Hive access with SAP Lumira (~30 mins)
Part 4 - A Hadoop cluster on Raspberry Pi 2 Model B(s) (~45 mins)
 
HiveServices5.jpg
 

Part 1 - Single node Hadoop on Raspberry Pi 2 Model B (~120 mins)

 

Preliminaries

To get going with your single node Hadoop setup, you will need the following Raspberry Pi 2 Model B bits and pieces:
  • One Raspberry Pi 2 Model B, i.e. the latest Raspberry Pi model featuring a quad core CPU with 1 GB RAM.
  • 8GB microSD card with NOOBS (“New Out-Of-the-Box Software”) installer/boot loader pre-installed (https://www.raspberrypi.org/tag/noobs/).
  • Wireless LAN USB card.
  • Mini USB power supply, heat sinks and HDMI display cable.
  • Optional, but recommended: A case to hold the Raspberry circuit board.

To make life a little easier for yourself, we recommend going for a Raspberry Pi accessory bundle which typically comes with all of these components pre-packaged and will set you back approx. € 60-70.
RaspberryPiBundle.png
We intend to install the latest stable Apache Hadoop and Hive releases available from any of the Apache Software Foundation download mirror sites, http://www.apache.org/dyn/closer.cgi/hadoop/common/, alongside the free SAP Lumira desktop trial edition, http://saplumira.com/download/, i.e.
  • Hadoop 2.6.0
  • Hive 1.1.0
  • SAP Lumira 1.23 desktop edition

The initial Raspberry setup procedure is described by, amongst others, Jonas Widriksson at http://www.widriksson.com/raspberry-pi-hadoop-cluster/. His blog also provides some pointers in case you are not starting off with a Raspberry Pi accessory bundle but prefer obtaining the hard- and software bits and pieces individually. We will follow his approach for the basic Raspbian setup in this part, but update it to reflect Raspberry Pi 2 Model B-specific aspects and provide some more detail on the various Raspberry Pi operating system configuration steps. To keep things nice and easy, we are assuming that you will be operating the environment within a dedicated local wireless network, thereby avoiding any firewall and port setting (and Hadoop node & rack network topology) discussion. The basic Hadoop installation and configuration descriptions in this part make use of [3].

The subsequent blog parts will be based on this basic setup.
 

Raspberry Pi setup

Powering on your Raspberry Pi will automatically launch the pre-installed NOOBS installer on the SD card. Select “Raspbian”, a Debian 7 Wheezy-based Linux distribution for ARM CPUs, from the installation options and wait for the subsequent installation procedure to complete. Once the Raspbian operating system has been installed successfully, your Raspberry Pi will reboot automatically and you will be asked to provide some basic configuration settings using raspi-config. Note that since we are assuming that you are using NOOBS, you will not need to expand your SD card storage (menu option Expand Filesystem). NOOBS will already have done so for you. By the way, if you want or need to run NOOBS again at some point, press & hold the shift key on boot and you will be presented with the NOOBS screen.
 

Basic configuration

What you might want to do though is to set a new password for the default user “pi” via configuration option Change User Password. Similarly, set your internationalisation options, as required, via option Internationalisation Options.
BasicConfiguration Menu.png
More interestingly in our context, go for menu item Overclock and set a CPU speed to your liking, taking into account any potential implications for your power supply/consumption (“voltmodding”) and the lifetime of your Raspberry hardware. If you are somewhat optimistic about these things, go for the “Pi2” setting featuring 1 GHz CPU and 500 MHz RAM speeds to make the single node Raspberry Pi Hadoop experience a little more enjoyable.
AdvancedOptions_Overclocking.png
Under Advanced Options, followed by submenu item Hostname, set the hostname of your device to “node1”.  Selecting Advanced Options again, followed by Memory Split, set the GPU memory to 32 MB.
AdvancedOptions_Hostname.png
Finally, under Advanced Options, followed by SSH, enable the SSH server and reboot your Raspberry Pi by selecting Finish in the configuration menu. You will need the SSH server to allow for Hadoop cluster-wide operations.

Once rebooted and with your “pi” user logged in again, the basic configuration setup of your Raspberry device has been successfully completed and you are ready for the next set of preparation steps.
 

Network configuration

To make life a little easier, launch the Raspbian GUI environment by entering startx in the Raspbian command line. (Alternatively, you can use, for example, the vi editor, of course.) Use the GUI text editor, “Leafpad”, to edit the /etc/network/interfaces text file as shown to change the local ethernet settings for eth0 from DHCP to the static IP address 192.168.0.110. Also add the netmask and gateway entries shown. This is in preparation for our multi-node Hadoop cluster which is the subject of Part 4 of this blog series.
Ethernet2.png
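In case the screenshot does not render for you, a minimal static eth0 configuration could look like the sketch below. The netmask and gateway values are assumptions for a typical 192.168.0.x home network with the router at 192.168.0.1, so adjust them to your own setup.

     auto lo
     iface lo inet loopback

     auto eth0
     iface eth0 inet static
     address 192.168.0.110
     netmask 255.255.255.0
     gateway 192.168.0.1
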
Check whether the nameserver entry in the file /etc/resolv.conf is present and correct. Restart your device afterwards.
 
 

Java environment

Hadoop is written in Java and requires Java 6 or later to operate. Check whether the pre-installed Java environment is in place by executing:
 
java -version

You should be prompted with a Java 1.8, i.e. Java 8, response.
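Should the java command not be found on your Raspbian image, you can install a JDK via the package manager first. The exact package name depends on your Raspbian version; oracle-java8-jdk or openjdk-7-jdk are typical candidates, so treat the following as a sketch rather than a definitive recipe.

     sudo apt-get update
     sudo apt-get install oracle-java8-jdk
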
 
 

Hadoop user & group accounts

Set up dedicated user and group accounts for the Hadoop environment to separate the Hadoop installation from other services. The account IDs can be chosen freely, of course. We are sticking here with the ID examples in Widriksson’s blog posting, i.e. group account ID “hadoop” and user account ID “hduser”, with the latter also added to the sudo user group.

     sudo addgroup hadoop
     sudo adduser --ingroup hadoop hduser
     sudo adduser hduser sudo
UserGroup2.png

SSH server configuration

Generate an RSA key pair to allow the “hduser” to access slave machines seamlessly with an empty passphrase. The public key will be stored in a file with the default name “id_rsa.pub” and then appended to the list of SSH authorised keys in the file “authorized_keys”. Note that this public key file will need to be shared by all Raspberry Pis in a Hadoop cluster (Part 4).
 
     su hduser
     mkdir ~/.ssh
     ssh-keygen -t rsa -P ""
     cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
SSHKeys2.png

Verify your SSH server access via: ssh localhost
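If ssh localhost still prompts you for a password, overly permissive permissions on the .ssh directory are a likely culprit. Tightening them as follows (a common fix, not part of the original setup steps) usually helps:

     chmod 700 ~/.ssh
     chmod 600 ~/.ssh/authorized_keys
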

This completes the Raspberry Pi preparations and you are all set for downloading and installing the Hadoop environment.
 

Hadoop installation & configuration

Similar to the Raspbian installation & configuration description above, we will talk you through the basic Hadoop installation first, followed by the various environment variable and configuration settings.
 

Basic setup

You need to get your hands on the latest stable Hadoop version (here: version 2.6.0), so initiate the download from any of the various Apache mirror sites (here: spacedump.net).
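If you prefer working on the command line, the download can also be triggered with wget. The URL below points to the Apache release archive and is meant as an example; your mirror of choice may differ.

     cd ~/
     wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
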
 

Once the download has been completed, unpack the archive to a sensible location; /opt represents a typical choice.

     sudo mkdir /opt
     sudo tar –xvzf hadoop-2.6.0.tar.gz -C /opt/

Following extraction, rename the newly created hadoop-2.6.0 folder into something a little more convenient such as “hadoop”.

     cd /opt
     sudo mv hadoop-2.6.0 hadoop

Running, for example, ls -al, you will notice that your “pi” user is the owner of the “hadoop” directory, as expected. To allow for the dedicated Hadoop user “hduser” to operate within the Hadoop environment, change the ownership of the Hadoop directory to “hduser”.

     sudo chown -R hduser:hadoop hadoop

This completes the basic Hadoop installation and we can proceed with its configuration.
 

Environment settings

Switch to the “hduser” and add the export statements listed below to the end of the shell startup file ~/.bashrc. Instead of using the standard vi editor, you could, of course, make use of the Leafpad text editor within the GUI environment again.

     su hduser
     vi ~/.bashrc

Export statements to be added to ~/.bashrc:

     export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
     export HADOOP_INSTALL=/opt/hadoop
     export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

This way, both the Java and Hadoop installation directories as well as the Hadoop binary paths become known to your user environment. Note that you may add the JAVA_HOME setting to the hadoop-env.sh script instead, as shown below.
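To make the new variables effective in your current shell and to verify that the Hadoop binaries are found, something along the following lines should do; hadoop version simply prints the details of the installed release.

     source ~/.bashrc
     hadoop version
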

Apart from these environment variables, modify the /opt/hadoop/etc/hadoop/hadoop-env.sh script as follows. If you are using an older version of Hadoop, this file can be found in /opt/hadoop/conf/. Note that in case you decide to relocate this configuration directory, you will have to pass on the directory location when starting any of the Hadoop daemons (see daemon table below) using the --config option.

     vi /opt/hadoop/etc/hadoop/hadoop-env.sh

Hadoop assigns 1 GB of memory to each daemon, so this default value needs to be reduced via the HADOOP_HEAPSIZE parameter to allow for Raspberry Pi conditions. The JAVA_HOME setting for the location of the Java implementation may be omitted if already set in your shell environment, as shown above. Finally, set the datanode’s Java virtual machine to client mode. (Note that with the Raspberry Pi 2 Model B’s ARMv7 processor, this ARMv6-specific setting is not strictly necessary anymore.)

     # The java implementation to use. Required, if not set in the home shell
     export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

     # The maximum amount of heap to use, in MB. Default is 1000.
     export HADOOP_HEAPSIZE=250
     # Command specific options appended to HADOOP_OPTS when specified
     export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"
 

Hadoop daemon properties

With the environment settings completed, you are ready for the more advanced Hadoop daemon configurations. Note that the configuration files are not held globally, i.e. each node in a Hadoop cluster holds its own set of configuration files, which need to be kept in sync by the administrator using, for example, rsync.

Modify the following files, as shown below, to configure the Hadoop system for operation in pseudo-distributed mode. You can find these files in the directory /opt/hadoop/etc/hadoop. In the case of older Hadoop versions, look for the files in /opt/hadoop/conf.
core-site.xml
Common configuration settings for Hadoop Core.

hdfs-site.xml
Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes.

mapred-site.xml
General configuration settings for the MapReduce daemons. Since we are running MapReduce using YARN, the MapReduce jobtracker and tasktrackers are replaced with a single resource manager running on the namenode.
 
File: core-site.xml

     <configuration>
        <property>
           <name>hadoop.tmp.dir</name>
           <value>/hdfs/tmp</value>
        </property>
        <property>
           <name>fs.default.name</name>
           <value>hdfs://localhost:54310</value>
        </property>
     </configuration>


File: hdfs-site.xml

     <configuration>
        <property>
           <name>dfs.replication</name>
           <value>1</value>
        </property>
     </configuration>


File: mapred-site.xml.template (“mapred-site.xml”, if dealing with older Hadoop versions)

     <configuration>
        <property>
           <name>mapred.job.tracker</name>
           <value>localhost:54311</value>
        </property>
     </configuration>
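Note that Hadoop itself only reads mapred-site.xml, not the .template file, so with Hadoop 2.6.0 you may have to copy the template first, as sketched below. Depending on your setup, you may additionally need to set mapreduce.framework.name to yarn in mapred-site.xml and yarn.nodemanager.aux-services to mapreduce_shuffle in yarn-site.xml for MapReduce jobs to actually run on YARN; both points are additions of ours rather than part of the original walkthrough.

     cd /opt/hadoop/etc/hadoop
     cp mapred-site.xml.template mapred-site.xml
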

Hadoop Distributed File System (HDFS) creation

HDFS has been automatically installed as part of the Hadoop installation. Create the tmp folder configured via hadoop.tmp.dir above to store temporary test data and change the directory ownership to your Hadoop user of choice. A new HDFS installation needs to be formatted prior to use. This is achieved via the -format option.

     sudo mkdir -p /hdfs/tmp
     sudo chown hduser:hadoop /hdfs/tmp
     sudo chmod 750 /hdfs/tmp
     hadoop namenode -format

Launch HDFS and YARN daemons

Hadoop comes with a set of scripts for starting and stopping the various daemons. They can be found in the sbin directory (bin in older Hadoop versions). Since you are dealing with a single node setup, you do not need to tell Hadoop about the various machines in the cluster to execute any script on, and you can simply execute the following scripts straightaway to launch the Hadoop file system (namenode, datanode and secondary namenode) and YARN resource manager daemons. If you need to stop these daemons, use the stop-dfs.sh and stop-yarn.sh scripts, respectively.

     /opt/hadoop/sbin/start-dfs.sh
     /opt/hadoop/sbin/start-yarn.sh
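A quick way to check that all daemons have actually come up is the jps tool that ships with the JDK; on a healthy single node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed alongside jps itself.

     jps
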

Check the resource manager web UI at http://localhost:8088 for a node overview. Similarly, http://localhost:50070 will provide you with details on your HDFS. If you find yourself in need of issue diagnostics at any point, consult the log4j log files in the logs subdirectory of the Hadoop installation first. If preferred, you can separate the log files from the Hadoop installation directory by setting a new log directory via HADOOP_LOG_DIR in the hadoop-env.sh script.
WebUI_NodeOverview2.png
With all the implementation work completed, it is time for a little Hadoop processing example.
 

An example

We will run some word count statistics on the standard Apache Hadoop license file to give your Hadoop core setup a simple test run. The word count executable is part of the Hadoop examples jar file that ships with the installation. To get going, you need to upload the Apache Hadoop license file into HDFS.
 
     hadoop fs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt

Run word count against the license file and write the result into license-out.txt.
 
     hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /license.txt /license-out.txt

You can get hold of the HDFS output file via:
 
     hadoop fs -copyToLocal /license-out.txt ~/

Have a look at ~/license-out.txt/part-r-00000 with your preferred text editor to see the word count results. They should look like the extract shown below.
WordCount_ResultExtract2.png
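Alternatively, you can peek at the output directly in HDFS without copying it to the local file system first, for example:

     hadoop fs -cat /license-out.txt/part-r-00000 | head -n 20
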

We will build on these results in the subsequent parts of this blog series on Hive QL and its SAP Lumira integration.
 
 
Part 2 - Hive on Hadoop (~40 mins)

Following on from the Hadoop core installation on a Raspberry Pi 2 Model B in Part 1 of this blog series, in this Part 2, we will proceed with installing Apache Hive on top of HDFS and show its basic principles with the help of last part's word count Hadoop processing example.

Hive represents a distributed relational data warehouse featuring a SQL-like query language, HiveQL, inspired by the MySQL SQL dialect. A high-level comparison of HiveQL and SQL is provided in [1]. For a HiveQL command reference, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.

The Hive data sits in HDFS, with HiveQL queries getting translated into MapReduce jobs by the Hadoop run-time environment. Whilst traditional relational data warehouses enforce a pre-defined metadata schema when writing data to the warehouse, Hive performs schema on read, i.e. the data is checked when a query is launched against it. Hive and the NoSQL database HBase are frequently used components of the Hadoop data processing layer, allowing external applications to push query workloads towards data in Hadoop. This is exactly what we are going to do in Part 3 of this series when connecting to the Hive environment via the SAP Lumira Apache Hive standard connector and pushing queries through this connection against the word count output file.
 
HiveServices5.jpg
First, let us get Hive up and running on top of HDFS.
 
Hive installation
The latest stable Hive release will operate alongside the latest stable Hadoop release and can be obtained from Apache Software Foundation mirror download sites. Initiate the download, for example, from spacedump.net and unpack the latest stable Hive release as follows. You may also want to rename the binary directory to something a little more convenient.

cd ~/
wget http://apache.mirrors.spacedump.net/hive/stable/apache-hive-1.1.0-bin.tar.gz
tar -xzvf apache-hive-1.1.0-bin.tar.gz
mv apache-hive-1.1.0-bin hive-1.1.0

Add the paths to the Hive installation and the binary directory, respectively, to your user environment.

cd hive-1.1.0
export HIVE_HOME=$(pwd)
export PATH=$HIVE_HOME/bin:$PATH
export HADOOP_USER_CLASSPATH_FIRST=true

Make sure your Hadoop user chosen in Part 1 (here: hduser) has ownership rights to your Hive directory.

sudo chown -R hduser:hadoop ~/hive-1.1.0

To be able to generate tables within Hive, run the Hadoop start scripts start-dfs.sh and start-yarn.sh (see also Part 1). You may also want to create the following directories and access settings.

hadoop fs -mkdir -p /tmp
hadoop fs -chmod g+w /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse

Strictly speaking, these directory and access settings assume that you are intending to have more than one Hive user sharing the Hadoop cluster and are not required for our current single Hive user scenario.

By typing in hive, you should now be able to launch the Hive command line interface. By default, Hive issues status information to standard error in both interactive and non-interactive mode. We will see this effect in action in Part 3 when connecting to Hive via SAP Lumira. The -S parameter of the hive command suppresses these feedback statements.
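For a quick impression of the -S switch, you can combine it with -e, which executes a single HiveQL statement and exits; for example:

hive -S -e 'show tables;'
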

Typing in hive --service help will provide you with a list of all available services [1]:
cli
Command-line interface to Hive. The default service.

hiveserver
Hive operating as a server for programmatic client access via, for example, JDBC and ODBC. Listens on port 10000 by default; port configuration parameter HIVE_PORT.

hwi
Hive web interface for exploring the Hive schemas. HTTP, port 9999; port configuration parameter hive.hwi.listen.port.

jar
Hive equivalent to hadoop jar. Will run Java applications in both the Hadoop and Hive classpath.

metastore
Central repository of Hive meta data.

If you are curious about the Hive web interface, launch hive --service hwi, enter http://localhost:9999/hwi in your browser and you will be shown something along the lines of the screenshot below.
HWI.png

If you run into any issues, check out the Hive error log at /tmp/$USER/hive.log. Similarly, the Hadoop error logs presented in Part 1 can prove useful for Hive debugging purposes.


An example (continued)

Following on from our word count example in Part 1 of this blog series, let us upload the word count output file into Hive's local managed data store. You need to generate the Hive target table first. Launch the Hive command line interface and proceed as follows.

create table wcount_t(word string, count int) row format delimited fields terminated by '\t' stored as textfile;

In other words, we just created a two-column table consisting of a string and an integer field delimited by tabs and featuring newlines for each new row. Note that HiveQL expects a command line to be finished with a semicolon.
 
The word count output file can now be loaded into this target table.

load data local inpath '/home/hduser/license-out.txt/part-r-00000' overwrite into table wcount_t;

Effectively, the local file part-r-00000 is stored in the Hive warehouse directory, which is set to /user/hive/warehouse by default. More specifically, part-r-00000 can be found in the Hive directory /user/hive/warehouse/wcount_t and you may query the table contents.

show tables;
select * from wcount_t;

If everything went according to plan, your screen should show a result similar to the screenshot extract below.
 
ShowTables2.png

If so, it means you managed to both install Hive on top of Hadoop on Raspberry Pi 2 Model B and load the word count output file generated in Part 1 into the Hive data warehouse environment. In the process, you should have developed a basic understanding of the Hive processing environment, its SQL-like query language and its interoperability with the underlying Hadoop environment.
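If you would like to watch Hive translate a query into an actual MapReduce job on your Raspberry Pi, a simple ordered query such as the following top-10 word list makes for a nice test; the column name count is quoted with backticks here since it doubles as a function name.

select word, `count` from wcount_t order by `count` desc limit 10;
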
 
In the next part of this series, we will bring the implementation and configuration effort of Parts 1 & 2 to fruition by running SAP Lumira as a client against the Hive server and will submit queries against the word count result file in Hive using standard SQL, with the Raspberry Pi doing all the MapReduce work. Lumira's Hive connector will translate these standard SQL queries into HiveQL so that things appear pretty standard from the outside. Having worked your way through the first two parts of this blog series, however, you will be very much aware of what is actually going on behind the scenes.