Data Lakes



create group `hadoop` and add user `hduser` to it
$ sudo groupadd hadoop
$ sudo adduser -G hadoop hduser
assign a password to `hduser`
$ sudo passwd hduser
modify ssh configuration
$ sudo nano /etc/ssh/sshd_config
uncomment line PubkeyAuthentication yes
$ sudo service sshd restart
required after every system restart
create ssh key pair and test ssh connection
$ sudo su - hduser
$ ssh-keygen -t rsa -P ""
$ rm $HOME/.ssh/authorized_keys
$ cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys
$ chmod og-wx ~/.ssh/authorized_keys
$ ssh localhost
download and unzip hadoop, copy contents to /opt/hadoop/ and change permissions
$ sudo mkdir /opt/hadoop
$ sudo cp -r ~/Downloads/hadoop-2.7.3/* /opt/hadoop
$ sudo chown -R hduser:hadoop /opt/hadoop
download and unzip data from
$ sudo mv ~/Downloads/2008.csv /home/hduser/tmp/airline/delay

note: the HDFS path `/user/hduser` and the local file system path `/home/hduser` are not to be confused!

on the `hduser` account
$ cd /opt/hadoop
$ bin/hdfs dfsadmin -safemode leave
$ bin/hdfs namenode -format
$ bin/hadoop fs -mkdir /user/hduser/airline
$ bin/hdfs dfs -copyFromLocal /home/hduser/tmp/airline/delay /user/hduser/airline/
$ bin/hdfs dfs -ls /user/hduser/airline/delay
Found 1 items
-rw-r--r-- 1 hduser hduser 689413344 2017-01-13 23:31 /user/hduser/airline/delay/2008.csv
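As a quick sanity check after the upload, the local file's line count can be compared with a count over the HDFS copy (e.g. via a Spark `count()`). A small Python helper for the local side (the path in the comment is the one assumed by this walkthrough):

```python
def count_lines(path):
    """Stream a file and count its lines without loading it into memory."""
    n = 0
    with open(path, "rb") as f:
        for _ in f:
            n += 1
    return n

# e.g. count_lines("/home/hduser/tmp/airline/delay/2008.csv")
```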
if the datanode does not start (check using `jps`), remove its stale storage directory
$ rm -r ~/hadoopinfra/hdfs/datanode/current
test if datanode can be created
$ bin/hdfs datanode -regular

copy wikipedia data

$ mkdir /home/hduser/tmp/wikipedia
$ wget /home/hduser/tmp/wikipedia
$ bin/hadoop fs -mkdir /user/hduser/wikipedia
$ bin/hdfs dfs -copyFromLocal /home/hduser/tmp/wikipedia /user/hduser/wikipedia/
$ bin/hadoop fs -mkdir /user/xps13
$ bin/hadoop fs -chown xps13:xps13 /user/xps13

[hduser@xps13 hadoop]$ /opt/hadoop/bin/hdfs dfs -ls /user/hive/warehouse
17/01/15 13:36:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxrwxr-x   - hduser supergroup          0 2017-01-15 12:17 /user/hive/warehouse/employee
drwxrwxr-x   - hduser supergroup          0 2017-01-15 12:48 /user/hive/warehouse/flights

run spark query on non-`hduser` account, i.e. `xps13`

- need to export `$HADOOP_CONF_DIR` containing `core-site.xml` and `hdfs-site.xml`

$SPARK_HOME/bin/spark-shell

val rdd = sc.textFile("hdfs://localhost:9000/user/hduser/airline/delay/2007.csv")
val count = rdd.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
count.foreach(println)
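The flatMap/map/reduceByKey pipeline above is the classic word count. Its logic, mirrored in plain Python for clarity (no Spark required):

```python
from collections import Counter

def word_count(lines):
    """Equivalent of flatMap(split) -> map((word, 1)) -> reduceByKey(_ + _)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))
    return dict(counts)

print(word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In Spark the same aggregation is distributed: the per-partition counts are combined pairwise by `reduceByKey`.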

### Ports

- there are possible configuration variations for `` in `$HADOOP_HOME/etc/hadoop` aka `$HADOOP_CONF`
- both ports `8020` and `9000` are common values
- e.g. Pig uses port `8020` for HDFS (`hdfs://localhost:8020`) and `8032` for the YARN Resource Manager
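For reference, the HDFS address is configured via the `fs.defaultFS` property in `core-site.xml`; a minimal sketch assuming port `9000` (the value the Spark examples in this document use):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```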

### Apache Hive

- [ Hive Installation](

- download hive from [](

extract and move folder contents to `/opt/hive`
:   `$ sudo mv ~/Downloads/apache-hive-2.1.1-bin/* /opt/hive`

add permissions
:   `$ sudo chown -R hduser:hadoop /opt/hive`

add hive environment variables to `/home/hduser/.bashrc` and source

export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/opt/hadoop/lib/:.
export CLASSPATH=$CLASSPATH:/opt/hive/lib/:.

- copy `` and add `HADOOP_HOME=/opt/hadoop`

### Configuring Metastore of Hive

Configuring the Metastore means specifying to Hive where the database is stored. You can do this by editing the `hive-site.xml` file in the `$HIVE_HOME/conf` directory. First, copy the template file using the following command:

$ cd $HIVE_HOME
$ cp conf/hive-default.xml.template conf/hive-site.xml

make the following modifications to `hive-site.xml`

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive-${}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/${}</value>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/${}_resources</value>
</property>


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:metastore_db;create=true</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide a database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>

init the database
:   `$ cd $HIVE_HOME`  
    `$ bin/schematool -initSchema -dbType derby`

### Make configuration available to Spark

edit `~/.bashrc`


start `dfs` and `yarn`
:   `$ $HADOOP_HOME/sbin/`  
    check availability at `http://localhost:50070/`  
    `$ $HADOOP_HOME/sbin/`

allow writing to HDFS
:   `$ $HADOOP_HOME/bin/hdfs dfsadmin -safemode leave`  
    `$ $HADOOP_HOME/bin/hdfs dfsadmin -safemode wait`  
    should print `Safe mode is OFF`

test query from [](

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  examples/jars/spark-examples*.jar \
  10
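SparkPi estimates π by Monte Carlo sampling (the trailing `10` is the number of partitions to sample over). The underlying computation, sketched in plain Python without Spark:

```python
import random

def estimate_pi(samples, seed=42):
    """Draw points uniformly in the unit square; the fraction landing
    inside the quarter circle approximates pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))  # close to 3.14
```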

stop `dfs` and `yarn`
:   `$ $HADOOP_HOME/sbin/`  
    `$ $HADOOP_HOME/sbin/`

todo: use of `queue` flag `--queue hadoop`

- [`YarnClientSchedulerBackend` - SchedulerBackend for YARN in Client Deploy Mode](

using interactive `spark-shell`

./bin/spark-shell \
  --master yarn \
  --deploy-mode client \
  --driver-memory 2g \
  --executor-memory 1g \
  --executor-cores 1


common startup warnings (harmless):

- `native-hadoop library for your platform... using builtin-java classes where applicable`
- `Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.`

- [excessive memory allocation](

add to `yarn-site.xml`

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

write permission to `/tmp/hive`

bin/hadoop fs -mkdir /user/xps13
bin/hadoop fs -chown xps13:xps13 /user/xps13
bin/hdfs dfs -chmod 777 /user/xps13

bin/hadoop fs -mkdir /tmp
bin/hadoop fs -mkdir /tmp/hive
bin/hadoop fs -chown xps13:xps13 /tmp/hive
bin/hdfs dfs -chmod 777 /tmp/hive

### Test Sparklyr

set the Hive configuration directory
:   `export HIVE_CONF_DIR=/opt/hive/conf`

### Apache Derby (not required)

- download from [](

extract and move folder contents to `/opt/derby`
:   `$ sudo mv ~/Downloads/db-derby-* /opt/derby`

add permissions
:   `$ sudo chown -R hduser:hadoop /opt/derby`

- [ Step 4: Derby Network Server](

$ export DERBY_INSTALL=/opt/derby
$ export CLASSPATH=$DERBY_INSTALL/lib/derbytools.jar:$DERBY_INSTALL/lib/derbynet.jar
$ cd $DERBY_INSTALL/bin
$ . setNetworkServerCP
$ startNetworkServer
Sat Jan 14 16:38:00 CET 2017 : Security manager installed using the Basic server security policy.
Sat Jan 14 16:38:00 CET 2017 : Apache Derby Network Server - - (1765088) started and ready to accept connections on port 1527

- http://localhost:50070/
- http://localhost:8088/

- read in Python using `read_csv_from_hdfs`
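The `read_csv_from_hdfs` helper referenced above is not reproduced here; a minimal sketch of the idea streams the file through the `hdfs` CLI into Python's `csv` module (the function signature and the `hdfs_bin` parameter are assumptions, not the original helper):

```python
import csv
import io
import subprocess

def read_csv_from_hdfs(hdfs_path, hdfs_bin="hdfs"):
    """Stream an HDFS file via `hdfs dfs -cat` and parse it as CSV.

    Assumes the `hdfs` CLI is on PATH and the file is UTF-8 text;
    for very large files, parse the stream incrementally instead.
    """
    out = subprocess.run(
        [hdfs_bin, "dfs", "-cat", hdfs_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return list(csv.reader(io.StringIO(out)))

# the parsing step alone, without a cluster:
rows = list(csv.reader(io.StringIO("Year,Month\n2008,1\n")))
```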

## Apache Hadoop

- [Native Libraries Guide](
- [fedora: Changes/Hadoop](

### Download

- download source from [](
- install `libprotoc 2.5.0`; newer versions such as `libprotoc 2.6.1` cause a build error, see [compile hadoop from source](

install `snappy-devel.x86_64`
:   `$ sudo dnf install -y snappy-devel.x86_64`

install `cmake`
:   `$ sudo dnf install -y cmake`

clean previous builds
:   `$ mvn clean`

build the distribution with native libraries
:   `$ mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy`

### Install

uninstall `hadoop-client`
:   `$ sudo dnf remove -y hadoop-client`

### User Management

- [Install Apache Hadoop on CentOS 7](

remove group `hadoop`
:   `$ sudo groupdel hadoop`

remove `/opt/hadoop`
:   `$ sudo rm -r /opt/hadoop`

copy to `/opt/hadoop`
:   `$ sudo mkdir /opt/hadoop`  
    `$ sudo cp -r /home/xps13/hadoop/hadoop-2.7.3-src/hadoop-dist/target/hadoop-2.7.3/* /opt/hadoop/`

create a `hadoop` user account without root privileges to own the Hadoop installation path and working environment; the new account's home directory will be `/opt/hadoop`

$ sudo useradd -d /opt/hadoop hadoop
$ sudo passwd hadoop
$ sudo chown -R hadoop:hadoop /opt/hadoop/

HADOOP env variables


check installation
$ bin/hadoop checknative -a


start hadoop

Fedora 24

Create a User
$ sudo useradd -m rstudio-user
$ sudo passwd rstudio-user
Create new directory in hdfs
$ hadoop fs -mkdir /user/rstudio-user
$ hadoop fs -chmod 777 /user/rstudio-user


19 November 2016