# Hadoop

*(Figure: Data Lakes; image source: searchtechnologies.com)*
*(Figure: Yarn)*
- hadoop.apache.org: Hadoop: Setting up a Single Node Cluster
- michael-noll.com: Running Hadoop on Ubuntu Linux (Single-Node Cluster)
- How to Install and Configure Apache Hadoop on a Single Node in CentOS 7
- add group `hadoop`, add user `hduser` to this group, and assign a password to `hduser`

~~~
$ sudo groupadd hadoop
$ sudo adduser -G hadoop hduser
$ sudo passwd hduser
~~~
- modify the ssh configuration: uncomment the line `PubkeyAuthentication yes`, then restart the service

~~~
$ sudo nano /etc/ssh/sshd_config
$ sudo service sshd restart
~~~

note: restarting `sshd` is required after every system restart

- create an ssh key pair and test the ssh connection
~~~
$ sudo su - hduser
$ ssh-keygen -t rsa -P ""
$ rm $HOME/.ssh/authorized_keys
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ chmod og-wx ~/.ssh/authorized_keys
$ ssh localhost
~~~
- download and unzip Hadoop, copy the contents to `/opt/hadoop/`, and change permissions

~~~
$ sudo mkdir /opt/hadoop
$ sudo cp -r ~/Downloads/hadoop-2.7.3/* /opt/hadoop
$ sudo chown -R hduser:hadoop /opt/hadoop
~~~
- download and unzip data from stat-computing.org

~~~
$ sudo mv ~/Downloads/2008.csv /home/hduser/tmp/airlines
~~~

note: HDFS `/user/hduser` and the local file system `/home/hduser` are not to be confused!
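To illustrate the distinction (a minimal sketch; the paths assume the layout above), two similar-looking listings go through two different file systems:

~~~
# local Linux file system, visible to ordinary tools
$ ls /home/hduser/tmp/airlines
# HDFS namespace, only visible through the hdfs/hadoop clients
$ bin/hdfs dfs -ls /user/hduser
~~~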
- on the `hduser` account

~~~
$ cd /opt/hadoop
$ bin/hdfs dfsadmin -safemode leave
$ bin/hdfs namenode -format
$ bin/hadoop fs -mkdir /user/hduser/airline
$ bin/hdfs dfs -copyFromLocal /home/hduser/tmp/airline/delay /user/hduser/airline/
$ bin/hdfs dfs -ls /user/hduser/airline/delay
Found 1 items
-rw-r--r--   1 hduser hduser  689413344 2017-01-13 23:31 /user/hduser/airline/delay/2008.csv
~~~
- if the datanode does not start (check using `jps`)

~~~
$ rm -r ~/hadoopinfra/hdfs/datanode/current
~~~
- test if the datanode can be created

~~~
$ bin/hdfs datanode -regular
~~~
- copy Wikipedia data (`wget` needs `-P` to select the target directory)

~~~
$ mkdir /home/hduser/tmp/wikipedia
$ wget -P /home/hduser/tmp/wikipedia http://alaska.epfl.ch/~dockermoocs/bigdata/wikipedia.dat
$ cd $HADOOP_HOME
$ bin/hadoop fs -mkdir /user/hduser/wikipedia
$ bin/hdfs dfs -copyFromLocal /home/hduser/tmp/wikipedia /user/hduser/wikipedia/
$ bin/hadoop fs -mkdir /user/xps13
$ bin/hadoop fs -chown xps13:xps13 /user/xps13
~~~
~~~
[hduser@xps13 hadoop]$ /opt/hadoop/bin/hdfs dfs -ls /user/hive/warehouse
17/01/15 13:36:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Found 2 items
drwxrwxr-x   - hduser supergroup          0 2017-01-15 12:17 /user/hive/warehouse/employee
drwxrwxr-x   - hduser supergroup          0 2017-01-15 12:48 /user/hive/warehouse/flights
~~~
run a Spark query from a non-`hduser` account, i.e. `xps13`
- need to export `$HADOOP_CONF_DIR` pointing at the directory that contains `core-site.xml` and `hdfs-site.xml`; a minimal sketch follows
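Assuming the `/opt/hadoop` install path used above:

~~~
# assumed install path from the steps above; adjust if Hadoop lives elsewhere
$ export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
~~~

With that in place, the word-count query below runs against HDFS from the `xps13` account.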
~~~
$SPARK_HOME/bin/spark-shell

val rdd = sc.textFile("hdfs://localhost:9000/user/hduser/airline/delay/2007.csv")
val count = rdd.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
count.foreach(println)
~~~
### Ports
- the value of `fs.default.name` is configured in `core-site.xml` under `$HADOOP_HOME/etc/hadoop` aka `$HADOOP_CONF_DIR`, and varies between setups
- both ports `8020` and `9000` are common values
- e.g. Pig uses port `8020` for HDFS and `8032` for the YARN Resource Manager
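For illustration, a hypothetical `core-site.xml` entry pinning HDFS to port `9000`, matching the `hdfs://localhost:9000/...` URL used in the spark-shell example above:

~~~
<!-- inside the <configuration> element of core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
~~~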
### Apache Hive
- [tutorialspoint.com: Hive Installation](https://www.tutorialspoint.com/hive/hive_installation.htm)
- download hive from [apache.org](http://www.apache.org/dyn/closer.cgi/hive/)
extract and move folder contents to `/opt/hive`
: `$ sudo mv ~/Downloads/apache-hive-2.1.1-bin/* /opt/hive`
add permissions
: `$ sudo chown -R hduser:hadoop /opt/hive`
add Hive environment variables to `/home/hduser/.bashrc` and source it

~~~
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/opt/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/opt/hive/lib/*:.
~~~
- copy `hive-env.sh.template` to `hive-env.sh` and add `HADOOP_HOME=/opt/hadoop`, as sketched below
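A minimal sketch of that step (the `echo` append is just one way to do it; editing the file manually works equally well):

~~~
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
$ echo "export HADOOP_HOME=/opt/hadoop" >> hive-env.sh
~~~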
### Configuring the Hive Metastore
Configuring the metastore means telling Hive where its backing database is stored. Do this by editing `hive-site.xml` in the `$HIVE_HOME/conf` directory. First, copy the template file:
~~~
$ cd $HIVE_HOME
$ cp conf/hive-default.xml.template conf/hive-site.xml
~~~
make the following modification to `hive-site.xml`: set `javax.jdo.option.ConnectionURL` to the JDBC URL of the metastore database
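For illustration, the stock Derby-backed value (a local metastore created on first use; verify the exact URL against your template):

~~~
<!-- inside the <configuration> element of hive-site.xml -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
~~~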
init the database
: `$ cd $HIVE_HOME`
`$ bin/schematool -initSchema -dbType derby`
### Make configuration available to Spark
edit `~/.bashrc`

~~~
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR
~~~
start `dfs` and `yarn`
: `$ $HADOOP_HOME/sbin/start-dfs.sh`
check availability at `http://localhost:50070/`
`$ $HADOOP_HOME/sbin/start-yarn.sh`
allow writing to HDFS
: `$ $HADOOP_HOME/bin/hdfs dfsadmin -safemode leave`
`$ $HADOOP_HOME/bin/hdfs dfsadmin -safemode wait`
should print `Safe mode is OFF`
test query from [spark.apache.org](http://spark.apache.org/docs/latest/running-on-yarn.html)
~~~
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar \
    10
~~~
stop `dfs` and `yarn`
: `$ $HADOOP_HOME/sbin/stop-dfs.sh`
`$ $HADOOP_HOME/sbin/stop-yarn.sh`
todo: use of the `queue` flag, e.g. `--queue hadoop`; a hypothetical invocation is sketched after the link below
- [`YarnClientSchedulerBackend` - SchedulerBackend for YARN in Client Deploy Mode](https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/yarn/spark-yarn-client-yarnclientschedulerbackend.adoc)
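A hypothetical invocation (the queue name `hadoop` is an assumption; the queue must exist in the YARN scheduler configuration):

~~~
# submit the Pi example to a specific YARN scheduler queue (hypothetical queue name)
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --queue hadoop \
    examples/jars/spark-examples*.jar \
    10
~~~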
using the interactive `spark-shell`

~~~
$ ./bin/spark-shell \
    --master yarn \
    --deploy-mode client \
    --driver-memory 2g \
    --executor-memory 1g \
    --executor-cores 1
~~~
warnings and issues still to fix:
- `native-hadoop library for your platform... using builtin-java classes where applicable`
- `Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.`
- [excessive memory allocation](http://stackoverflow.com/questions/27792839/spark-fail-when-running-pi-py-example-with-yarn-client-mode)
add to `yarn-site.xml` (the snippet below is one common fix for the memory issue)
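The property itself is missing from these notes; based on the Stack Overflow thread linked above, a common fix is disabling YARN's virtual-memory check (an assumption to verify for your setup):

~~~
<!-- inside the <configuration> element of yarn-site.xml -->
<!-- stop the NodeManager from killing containers that exceed virtual-memory limits -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
~~~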
give write permission to `/user/xps13` and `/tmp/hive`

~~~
$ bin/hadoop fs -mkdir /user/xps13
$ bin/hadoop fs -chown xps13:xps13 /user/xps13
$ bin/hdfs dfs -chmod 777 /user/xps13

$ bin/hadoop fs -mkdir /tmp
$ bin/hadoop fs -mkdir /tmp/hive
$ bin/hadoop fs -chown xps13:xps13 /tmp/hive
$ bin/hdfs dfs -chmod 777 /tmp/hive
~~~
### Test Sparklyr
set `HIVE_CONF_DIR`
: `export HIVE_CONF_DIR=/opt/hive/conf`
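The section stops short of an actual test; as a hedged sketch, assuming the `sparklyr` and `dplyr` R packages are installed and `SPARK_HOME` plus the variables above are exported:

~~~
# hypothetical smoke test: connect to YARN and list the tables Hive knows about
$ Rscript -e 'library(sparklyr); library(dplyr); sc <- spark_connect(master = "yarn-client"); print(src_tbls(sc)); spark_disconnect(sc)'
~~~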
### Apache Derby (not required)
- download from [db.apache.org](https://db.apache.org/derby/derby_downloads.html)
extract and move folder contents to `/opt/derby`
: `$ sudo mv ~/Downloads/db-derby-10.13.1.1-bin/* /opt/derby`
add permissions
: `$ sudo chown -R hduser:hadoop /opt/derby`
- [db.apache.org: Step 4: Derby Network Server](http://db.apache.org/derby/papers/DerbyTut/ns_intro.html)
~~~
$ export DERBY_INSTALL=/opt/derby
$ export CLASSPATH=$DERBY_INSTALL/lib/derbytools.jar:$DERBY_INSTALL/lib/derbynet.jar
$ cd $DERBY_INSTALL/bin
$ . setNetworkServerCP
$ startNetworkServer
Sat Jan 14 16:38:00 CET 2017 : Security manager installed using the Basic server security policy.
Sat Jan 14 16:38:00 CET 2017 : Apache Derby Network Server - 10.13.1.1 - (1765088) started and ready to accept connections on port 1527
~~~
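Following the same Derby tutorial, the running server can be smoke-tested with the `ij` tool from another terminal (the database name `testdb` is a placeholder):

~~~
$ cd $DERBY_INSTALL/bin
$ ./ij
ij> connect 'jdbc:derby://localhost:1527/testdb;create=true';
~~~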
- NameNode web UI: http://localhost:50070/
- YARN Resource Manager web UI: http://localhost:8088/
- read the data in Python using the `read_csv_from_hdfs` helper from [this notebook](http://nbviewer.jupyter.org/github/ofermend/IPython-notebooks/blob/master/blog-part-1.ipynb)
## Apache Hadoop
- [Native Libraries Guide](https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/NativeLibraries.html)
- [fedora: Changes/Hadoop](https://fedoraproject.org/wiki/Changes/Hadoop)
### Download
- download source from [hadoop.apache.org/releases](http://hadoop.apache.org/releases.html)
- install `libprotoc 2.5.0`; building against the newer `libprotoc 2.6.1` fails with a version error, see [compile hadoop from source](http://codetips.coloza.com/compile-hadoop-from-source/)
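To verify which version is on the `PATH` before building (the second line is the expected output when 2.5.0 is installed):

~~~
$ protoc --version
libprotoc 2.5.0
~~~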
install `snappy-devel.x86_64`
: `$ sudo dnf install -y snappy-devel.x86_64`
install `cmake`
: `$ sudo dnf install -y cmake`
clean
: `$ mvn clean`
build
: `$ mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy`
### Install
uninstall `hadoop-client`
: `$ sudo dnf remove -y hadoop-client`
### User Management
- [Install Apache Hadoop on CentOS 7](http://www.tecmint.com/install-configure-apache-hadoop-centos-7/)
remove group `hadoop`
: `$ sudo groupdel hadoop`
remove `/opt/hadoop`
: `$ sudo rm -r /opt/hadoop`
copy to `/opt/hadoop`
: `$ sudo mkdir /opt/hadoop`
`$ sudo cp -r /home/xps13/hadoop/hadoop-2.7.3-src/hadoop-dist/target/hadoop-2.7.3/* /opt/hadoop/`
create a `hadoop` user account without root powers, used for the Hadoop installation path and working environment; the new account's home directory will reside in `/opt/hadoop`

~~~
$ sudo useradd -d /opt/hadoop hadoop
$ sudo passwd hadoop
$ sudo chown -R hadoop:hadoop /opt/hadoop/
~~~
Hadoop environment variables

~~~
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
~~~
- check the installation

~~~
$ bin/hadoop checknative -a
~~~
### Usage

- start Hadoop

~~~
$ start-dfs.sh
~~~
### Fedora 24

- create a user

~~~
$ sudo useradd -m rstudio-user
$ sudo passwd rstudio-user
~~~

- create a new directory in HDFS

~~~
$ hadoop fs -mkdir /user/rstudio-user
$ hadoop fs -chmod 777 /user/rstudio-user
~~~