Environment Setup
Info
For the purpose of the environment normalization, we provide a simple docker image for you, which contains most of the software required by this course. We also provide a few scripts to install some optional packages.
The whole progress would seem as follow:
- Make sure you have enough resources:
- It requires at least 8GB of Physical RAM; 16GB or larger would be better
- It requires at least 15GB of hard disk storage
- Install a docker environment in the local machine
- Start Docker Service, pull images, and create an instance
- Just rock it!
- Destroy the containers and images if they are no longer needed
Warning
Since this docker image integrated many related services for the course, it requires at least 4GB RAM for this virtual machine. If you can not meet the minimum requirement, the system could randomly kill one or a few processes due to resource limitation, which causes a lot of strange errors which is even unable to reproduce.
DO NOT TRY TO DO THAT.
Instead, you can use cloud platforms such as Azure.
0. System requirements
You should have enough system resources if you plan to start a container in your local OS.
You need to reserve at least 4 GB RAM for Docker and some extra memory for the host machine. However, you can still start all the Hadoop-related services except Zeppelin, even if you only reserve 4GB for the virtual machine.
1. Install Docker
Docker is a software providing operating-system-level virtualization, also known as containers, promoted by the company Docker, Inc.. Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces, and a union-capable file system such as OverlayFS and others to allow independent "containers" to run within a single Linux instance, avoiding the overhead of starting and maintaining virtual machines (VMs). (from Wikipedia) Basically, you can treat docker as a lightweight virtual machine with pretty high performance.
The installation instructions for different operating systems are provided in the links below. You can also check the official documentation here to get the latest news and detailed explanations.
Once the docker is installed, you can start your docker services and launch your docker container using the following commands:
docker # a tool to control docker
2. Run Docker image
2.1. Start the container
The command to start the container used in the following tutorials is:
docker run -it --privileged=true \
--cap-add=SYS_ADMIN \
-m 8192m -h bootcamp.local \
--name bigbox -p 2222:22 -p 9530:9530 -p 8888:8888\
-v /:/mnt/host \
sunlab/bigbox:latest \
/bin/bash
It may take a while for the system to download and extract the container.
In general, the syntax of docker run
is
docker run [options] image[:tag|@digest] [command] [args]
Below explains the options used above:
param -p <host-port>:<vm-port>
This option is used to map the TCP port vm-port
in the container to port host-port
on the Docker host.
Currently, the ports are reserved to:
- 8888 - Jupyter Notebook
- 9530 - Zeppelin Notebook
Once you have started the Zeppelin service, this service will keep listening port 9530
in docker.
You will be able to visit this service using http://127.0.0.1:9530
or http://DOCKER_HOST_IP:9530
.
The remote IP depends on the Docker Service you are running, which is described above.
- If you are using Linux or Docker.app in macOS, you can visit "localhost:9530", or other ports if you changed
host-port
- If you are using VirtualBox + macOS or Windows, you should get the Docker's IP first
param -v, --volume=[host-src:]container-dest[:<options>]
This option is used to mount a volume.
Currently, we are using -v /:/mnt/host
.
In this case, we can visit the root of your file system for your host machine.
If you are using macOS, /mnt/host/Users/<yourname>/
would be the $HOME
of your MacBook.
If you are using Windows, you can reach your C:
disk from /mnt/host/c
in docker.
Variable host-src
accepts absolute path only.
param -it
- -i : Keep STDIN open even if not attached
- -t : Allocate a pseudo-tty
param -h bootcamp.local
Once you enter this docker environment, you can ping this docker environment itself as bootcamp.local
. This variable is used in some configuration files for Hadoop ecosystems.
param -m 8192m
Memory limit (format: <number>[<unit>]
).
The number should be a positive integer.
The unit can be one of b
, k
, m
, or g
.
This docker image requires at least 4G of RAM, while 8G is recommended when your physical machine has more than 8G of RAM. The local machine is not the same as the remote server. If you are launching a remote server with 8G RAM, you can set this number as 7G.
Info
Please refer to the official documentation for more information and the detailed explanations.
2.2. Start all necessary services
When we have logged into the docker, we can first use the following script.
/scripts/start-services.sh
This script helps you to start the necessary services for the Hadoop ecosystems.
Info
In your terminal, you will generally meet two kinds of prompts.
Prompt '#' indices you are root aka the administrator of this environment now
> [root@bootcamp /]# whoami
> root
Prompt '$' indices you are an ordinary user
> [yu@bootcamp /]$ whoami
> yu
You are in the sudo
mode by default.
We also assume that you are always in the sudo
mode in the following discussions.
Warning
You probably will encounter the "Connection Refused" exception if you forget to start these services.
If you wish to host Zeppelin, you need to install it first by using the command:
/scripts/install-zeppelin.sh
And start the service with the command:
/scripts/start-zeppelin.sh
Then, Zeppelin will listen to the port 9530
.
Info
Please refer to the Zeppelin tutorial for more information about the configuration and usage of Zeppelin Notebook.
If you wish to host Jupyter, you can start it by using the command:
/scripts/start-jupyter.sh
Jupyter will listen to the port 8888
2.3. Stop all services
You can stop the running services with:
/scripts/stop-services.sh
2.4. Detach or Exit
To detach and suspend a Docker instance from the terminal, use the command:
ctrl + p, ctrl + q
To exit,
exit
2.5. Re-attach
If you detached an instance and want to re-attach it, you need to check the CONTAINER ID
or NAMES
of it first.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
011547e95ef5 sunlab/bigbox:latest "/tini -- /bin/bash" 6 hours ago Up 4 seconds 0.0.0.0:8888->8888/tcp, 0.0.0.0:9530->9530/tcp, 0.0.0.0:2222->22/tcp bigbox
If the "STATUS" column is similar to "Exited (0) 10 hours ago", you can restart the container:
$ docker start <CONTAINER ID or NAMES>
And attach it with:
$ docker attach <CONTAINER ID or NAMES>
Every time you restart your container, you need to re-start the services (section 2.2) before any HDFS related operations.
2.6. Destroy instance
If you want to permanently remove the container:
$ docker rm <CONTAINER ID or NAMES>
2.7. Destroy images
If you want to permanently remove any images, you need to list images first:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
sunlab/bigbox latest bfd258e00de3 16 hours ago 2.65GB
And remove them by REPOSITORY or IMAGE ID using the command:
$ docker rmi <REPOSITORY or IMAGE ID>
2.8. Update images
$ docker pull sunlab/bigbox
2.9. Official documentations
Please refer to this link for the introduction of images, containers, and storage drivers.
2.10. Optional: use docker-compose
Docker Compose is a tool for defining and running multi-container Docker applications.
A simple docker-compose.yml
could simplify the parameters and make your life easier.
Please refer to this page for further instructions.
3. Configurations and logs
3.1. System Configurations
$ cat /proc/meminfo | grep Mem ## Current Memory
MemTotal: 8164680 kB ## Note: This value shoud no less than 4GB
MemFree: 175524 kB
MemAvailable: 5113340 kB
$ cat /proc/cpuinfo | grep 'model name' | head -1 ## CPU Brand
model name : Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz
$ cat /proc/cpuinfo | grep 'model name' | wc -l ## CPU Count
4
$ df -h ## List Current Hard Disk Usage
Filesystem Size Used Avail Use% Mounted on
overlay 32G 4.6G 26G 16% /
tmpfs 64M 0 64M 0% /dev
...
$ ps -ef ## List Current Running Process
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 01:38 pts/0 00:00:00 /tini -- /bin/bash
root 7 1 0 01:38 pts/0 00:00:00 /bin/bash
root 77 1 0 01:43 ? 00:00:00 /usr/sbin/sshd
zookeep+ 136 1 0 01:43 ? 00:00:14 /usr/lib/jvm/java-openjdk/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,ROLLINGFILE -cp /usr/lib/zookeeper/bin/../build/classes:/
yarn 225 1 0 01:43 ? 00:00:13 /usr/lib/jvm/java/bin/java -Dproc_proxyserver -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-pr
...
$ lsof -i:9530 ## Find the Process Listening to Some Specific Port
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 3165 zeppelin 189u IPv4 229945 0t0 TCP *:9530 (LISTEN)
3.2. Logs
- hadoop-hdfs -- /var/log/hadoop-hdfs/*
- hadoop-mapreduce -- /var/log/hadoop-mapreduce/*
- hadoop-yarn -- /var/log/hadoop-yarn/*
- hbase -- /var/log/hbase/*
- hive -- /var/log/hive/*
- spark -- /var/log/spark/*
- zookeeper -- /var/log/zookeeper/*
- zeppelin -- /usr/local/zeppelin/logs/*