Yiyu Jia's technical Blog: April 2012

Saturday, April 28, 2012

some notes on Big Data, NoSQL, and RDBMS

Recently, I read a blog post from Chris Swang. I pretty much agree with him. I particularly like his diagram that describes Big Data in a very straightforward way:

As the diagram shows, Big Data, especially from the viewpoint of NoSQL databases, is the class of problems that have a very large data volume but need only relatively simple algorithms for analysis.

In other words, by this definition, a problem may not count as a Big Data problem if it cannot be solved with simple algorithms (NoSQL and map/reduce?). To get a clearer picture for myself, I made the diagram below:
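To make "simple algorithm over a large volume" concrete, here is a minimal single-process sketch of the map/reduce idea in Python. It only illustrates the shape of the computation; the function names are my own and it is not Hadoop's API.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values of one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The per-record logic stays trivial; the difficulty in a real Big Data problem is the volume, which a cluster handles by running many map and reduce tasks in parallel.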

I classify database techniques into four groups, which map to the "problem types" diagram.

Structural database: I think traditional RDBMSs should still be used for operational databases, which need good support for transaction processing. If the data volume is small, it is fine to use an RDBMS for data warehouse applications as well. After all, structural databases have been developed for decades and can handle even tables that are large enough to need partitioning.
Quant: If advanced mathematical knowledge is required, quantitative analysis is called for.
NoSQL: NoSQL became popular as developers found that, for some problems, they could dramatically speed up data I/O if they modeled the data in a non-structural way, which goes against the normalization rules of an RDBMS. Although traditional RDBMSs also have techniques such as materialized tables, materialized views, and proxy tables, these all have limitations and still sit behind the SQL optimization engine. So why not hit the data storage engine directly if we do not need SQL at all? That is where NoSQL comes onto the stage.

Hybrid: In fact, besides popular NoSQL databases such as MongoDB, CouchDB, and HBase, the traditional RDBMS vendors are developing their own NoSQL products too. For example, MySQL Cluster has key/value memcached access and HandlerSocket, which avoid the SQL overhead. In a real project that needs to handle both big data volume and transactions, I believe both a structural database and a NoSQL database are needed.
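The trade-off between a normalized model and a non-structural one can be sketched in a few lines of Python. The data and function names below are made up for illustration; plain dictionaries and lists stand in for tables and for a key/value store.

```python
# Normalized layout: two "tables" that have to be joined at read time.
users = {1: {"name": "Amy"}, 2: {"name": "Ben"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "book"},
    {"order_id": 101, "user_id": 1, "item": "pen"},
    {"order_id": 102, "user_id": 2, "item": "lamp"},
]

def orders_for_user_normalized(user_id):
    """Join users and orders on user_id, as a SQL engine would."""
    name = users[user_id]["name"]
    items = [o["item"] for o in orders if o["user_id"] == user_id]
    return {"name": name, "items": items}

# Denormalized layout: one pre-joined document per user, fetched by key.
user_docs = {
    "user:1": {"name": "Amy", "items": ["book", "pen"]},
    "user:2": {"name": "Ben", "items": ["lamp"]},
}

def orders_for_user_denormalized(user_id):
    """A single key lookup; no join and no query planning."""
    return user_docs[f"user:{user_id}"]
```

Both functions return the same answer. The second one reads faster because the join was paid for at write time, at the cost of duplicated data that has to be kept consistent by the application.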

We can see that some NoSQL platforms are developing their own query languages, which look like structured query languages too. However, the result of such a query should be unstructured data; otherwise, you are using a NoSQL platform to do a SQL platform's job. It will be interesting to see how these emerging products can handle data better than the traditional database vendors. If a distributed file system and map/reduce can help with this, I foresee that traditional database vendors will add NoSQL platforms to their products. In fact, we can see that they are working on this now. I even hope they can integrate NoSQL and SQL databases more seamlessly in terms of security, performance, cost, etc.

I always try to be careful not to be fooled by marketing language. NoSQL and RDBMS each have their own place in a large system, where a hybrid solution is required. We need to choose the right tool for each task.

http://www.pythian.com/news/27367/hadoop-and-nosql-mythbusting/

HBase is Hadoop's NoSQL database
http://www.informationweek.com/news/software/info_management/232901601

http://www.clusterdb.com/mysql/dramatically-increased-mysql-cluster-join-performance-with-adaptive-query-localization/?utm_source=rss&utm_medium=rss&utm_campaign=dramatically-increased-mysql-cluster-join-performance-with-adaptive-query-localization
http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2.html

Wednesday, April 18, 2012

Steps to set up Hadoop Pig in a Hadoop cluster environment

wget http://www.linuxtourist.com/apache/pig/stable/pig-x.y.z.tar.gz
cd /home/hadoopuser/app/
mv ~/Download/pig-x.y.z.tar.gz ./
tar -xvf pig-x.y.z.tar.gz
ln -s /home/hadoopuser/app/pig-x.y.z /pig

Edit ~/.bash_profile to add PIG_HOME and add its bin into PATH.

#java home
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_17-sun

#hadoop home
export HADOOP_HOME=/hadoop

#hive home
export HIVE_HOME=/hive

#pig home
export PIG_HOME=/pig

PATH=$PATH:$HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin

export PATH

Run the pig command and check if it works.

2013-01-04 03:14:30,310 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cent63VM01:9000
2013-01-04 03:14:30,586 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: cent63VM01:9001
grunt>

Create a directory on HDFS

[hadoopuser@cent63VM01 pigTest]$ /hadoop/bin/hadoop fs -mkdir pig

Upload a file to HDFS

[hadoopuser@cent63VM01 pigTest]$ /hadoop/bin/hadoop fs -put /home/hadoopuser/pigTest/passwd /user/hadoopuser/pig

Run an extremely simple Pig example.

grunt> A = load '/user/hadoopuser/pig/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
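For readers who do not know Pig Latin yet, the three statements above roughly correspond to the following Python sketch: load colon-delimited lines, keep the first field of each row, and return the result. The two sample lines are hypothetical passwd entries, not taken from my cluster.

```python
def load_colon_delimited(lines):
    """Rough equivalent of: A = load ... using PigStorage(':')."""
    return [tuple(line.rstrip("\n").split(":")) for line in lines if line.strip()]

def first_field(rows):
    """Rough equivalent of: B = foreach A generate $0 as id."""
    return [(row[0],) for row in rows]

# Hypothetical passwd-style input lines.
sample = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "hadoopuser:x:500:500::/home/hadoopuser:/bin/bash\n",
]

ids = first_field(load_colon_delimited(sample))
```

The difference is that Pig compiles the same logic into map/reduce jobs, so it keeps working when the input no longer fits on one machine.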

Tuesday, April 10, 2012

NTP server and client configuration (without security feature)

Before installing and starting HBase, it is very important to make every node in the Hadoop cluster sync its time with the others, because timestamps play a vital role in HBase. The following are my steps to set up NTP in my Hadoop cluster.
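To show why clock skew matters, here is a toy Python model of a cell whose versions are keyed by timestamp and whose reads return the version with the highest timestamp. It is a simplified sketch of the idea and not the HBase API; the class, the function, and the numbers are all my own assumptions.

```python
class Cell:
    """Toy cell: keeps every version, reads return the newest timestamp."""

    def __init__(self):
        self.versions = {}

    def put(self, value, timestamp):
        self.versions[timestamp] = value

    def get(self):
        return self.versions[max(self.versions)]

def write_from_node(cell, value, real_time, clock_offset):
    """A node stamps the write with its own, possibly skewed, clock."""
    cell.put(value, real_time + clock_offset)

cell = Cell()
# Node 1 has a correct clock and writes first.
write_from_node(cell, "first", real_time=1000, clock_offset=0)
# Node 2 writes two seconds later, but its clock runs five seconds behind.
write_from_node(cell, "second", real_time=1002, clock_offset=-5)

latest = cell.get()
```

In this model the read returns the older value "first", because the later write carries the smaller timestamp. Keeping all nodes within a small offset of each other avoids this kind of surprise.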

install NTPD server

# yum install ntp
# chkconfig ntpd on

# vi /etc/ntp.conf
add the following lines to the file,

# default is to refuse all connections.
restrict default ignore

# allow hosts in LAN to sync time. 

restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org

# allow upstream servers to communicate with this server.
restrict 0.centos.pool.ntp.org nomodify notrap noquery
restrict 1.centos.pool.ntp.org nomodify notrap noquery
restrict 2.centos.pool.ntp.org nomodify notrap noquery

# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10
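The address and mask pair in the LAN restrict line can be checked with Python's standard ipaddress module. This sketch mirrors only the address/mask matching, not how ntpd evaluates its full restriction list; the helper name is my own.

```python
import ipaddress

# Same address and mask as in: restrict 192.168.1.0 mask 255.255.255.0 ...
LAN = ipaddress.ip_network("192.168.1.0/255.255.255.0")

def matches_lan_rule(host):
    """Return True if the host address falls inside the LAN range."""
    return ipaddress.ip_address(host) in LAN
```

For example, `matches_lan_rule("192.168.1.138")` is true, while an address such as 192.168.2.10 falls outside the mask and would be handled by the default rule instead.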

enable UDP at port number 123 for NTPD server
# system-config-firewall-tui
sync the clock once before starting ntpd
# ntpdate 0.centos.pool.ntp.org (or ip address without prefix number)
make sure ntpd server is running
# service ntpd start
check ntpd status
# ntpq -p
check network interface status and see if ntpd is listening on port 123
# netstat -tupln

install ntp client

# yum install ntp
# chkconfig --list | grep ntpd
# chkconfig --del ntpd
# cd /etc/cron.daily/
or
# cd /etc/cron.hourly/
create the file as below,
# vi ntp.sh
put the commands in ntp.sh, with the shebang on its own line
#!/bin/bash
/usr/sbin/ntpdate my.ntp.hostname
make it executable
# chmod 755 /etc/cron.daily/ntp.sh
Finally restart the cron daemon.
# service crond restart

However, it is better to use ntpd instead of ntpdate on "client" hosts.

Below is my example ntp.conf for "client" hosts.

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Hosts on local network are less restricted.
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.centos.pool.ntp.org
#server 1.centos.pool.ntp.org
#server 2.centos.pool.ntp.org
server 192.168.1.138

#broadcast 192.168.1.255 autokey        # broadcast server
#broadcastclient                        # broadcast client
#broadcast 224.0.1.1 autokey            # multicast server
#multicastclient 224.0.1.1              # multicast client
#manycastserver 239.255.254.254         # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
restrict 192.168.1.138 nomodify notrap noquery

# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10

# Enable public key cryptography.
#crypto

includefile /etc/ntp/crypto/pw

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys

# Specify the key identifiers which are trusted.
#trustedkey 4 8 42

# Specify the key identifier to use with the ntpdc utility.
#requestkey 8

# Specify the key identifier to use with the ntpq utility.
#controlkey 8

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats

Below is my server ntp.conf file

# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).

driftfile /var/lib/ntp/drift

restrict default ignore  # set default to refuse all access.

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Hosts on local network are less restricted.
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org

#broadcast 192.168.1.255 autokey        # broadcast server
#broadcastclient                        # broadcast client
#broadcast 224.0.1.1 autokey            # multicast server
#multicastclient 224.0.1.1              # multicast client
#manycastserver 239.255.254.254         # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
restrict 0.centos.pool.ntp.org notrap noquery
restrict 1.centos.pool.ntp.org notrap noquery
restrict 2.centos.pool.ntp.org notrap noquery


# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10

# Enable public key cryptography.
#crypto

includefile /etc/ntp/crypto/pw

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys

# Specify the key identifiers which are trusted.
#trustedkey 4 8 42

# Specify the key identifier to use with the ntpdc utility.
#requestkey 8

# Specify the key identifier to use with the ntpq utility.
#controlkey 8

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats

Saturday, April 7, 2012

configure passphraseless SSH login among CentOS servers

wget http://pkgs.repoforge.org/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
wget http://apt.sw.be/RPM-GPG-KEY.dag.txt
rpm --import RPM-GPG-KEY.dag.txt
rpm -K rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
rpm -ivh rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
vi /etc/yum.repos.d/rpmforge.repo
add "priority=3" or another suitable priority number after "enabled = 1"
yum repolist
yum install keychain
ssh-keygen -t dsa
chmod 755 .ssh
scp ~/.ssh/id_dsa.pub user@remotehost:.ssh/authorized_keys
Or, to append instead of overwriting the remote authorized_keys file:
cat ~/.ssh/id_dsa.pub | ssh hadoopuser@remotehost "cat - >> ~/.ssh/authorized_keys"
on the remote host, restrict the file permissions:
chmod 600 ~/.ssh/authorized_keys
vi .bash_profile
add the following commands:
/usr/bin/keychain -q $HOME/.ssh/id_dsa
source $HOME/.keychain/$HOSTNAME-sh
Log out and log in again, or run the command source .bash_profile. The passphrase for the key file is asked for at the first login only.
Now ssh among the servers works without being asked for a password.
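If key-based login still asks for a password, file permissions are a common cause: as far as I know, sshd with StrictModes enabled ignores keys when ~/.ssh or authorized_keys is writable by group or others. Below is a small Python sketch that reports such paths. The function names are my own and the check is deliberately simpler than what sshd really does.

```python
import os
import stat

def group_or_world_writable(path):
    """Return True if group or others have write permission on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))

def check_ssh_files(home):
    """List the usual SSH paths under home that are too permissive."""
    candidates = [
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
    return [p for p in candidates
            if os.path.exists(p) and group_or_world_writable(p)]
```

Calling `check_ssh_files(os.path.expanduser("~"))` on each server should return an empty list when the modes from the steps above (755 or stricter on .ssh, 600 on authorized_keys) are in place.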

http://www.cyberciti.biz/faq/ssh-password-less-login-with-dsa-publickey-authentication/
http://www.ehowstuff.com/how-to-configure-rpmforge-repository-on-centos-6-3/
http://www.cyberciti.biz/faq/ssh-passwordless-login-with-keychain-for-scripts/

Tuesday, April 3, 2012

load flat file into Hive table step by step.

This is a very simple example showing how to load a local flat file into a Hive table.

Prepare a local file named sample.csv with the content below.
```
35, Amy 
17, Ben 
5,Chris  
10,Don 
```
Create table. -- create a managed table that "knows" how the flat file is formatted.
```
hive> use testdb;
hive> create table test_hive(age int, name string) row format delimited fields terminated by ',';
```
Note: explicitly specifying the delimiter with "row format delimited fields terminated by ','" is the key part here, because the default field delimiter recognized by Hive is ^A (ASCII code 1).
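The effect of the delimiter can be imitated in Python by splitting one line of sample.csv with ^A and then with a comma. This only mimics the field splitting and is not how Hive is implemented; the names are my own.

```python
# ^A, the default field delimiter mentioned in the note above.
DEFAULT_HIVE_DELIMITER = "\x01"

def split_row(line, delimiter=DEFAULT_HIVE_DELIMITER):
    """Split one text row into fields on the given delimiter."""
    return line.rstrip("\n").split(delimiter)

row = "35, Amy \n"

with_default = split_row(row)       # no ^A in the line, so one single field
with_comma = split_row(row, ",")    # two fields: age and name
```

With the default delimiter the whole line stays in one field, so the age and name columns cannot be filled correctly. With the comma the row splits into two fields, and the spaces around the name stay part of the value.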
Load file. -- load the file from the local file system into Hive.
```
hive> use testdb;
hive> load data local inpath '/home/hadoopuser/hiveTest/sample.csv' overwrite into table test_hive;
```
Note: this example loads a file from the local Linux file system into Hive, which stores its data on HDFS. To load a file that is already on HDFS, we would probably create an external table for it.
Check data. -- select from the table.
```
hive> select * from test_hive;                                                                       
OK
35  Amy 
17  Ben 
5 Chris  
10 Don 
```
Note: You can see that "Amy" and "Ben" are not aligned with the other two rows. This is because Hive treats all white space before and after a string as part of the column value. So we need to trim the white space around strings if it is not wanted in the Hive table.
To verify this, run the queries below and compare the results.
```
select * from test_hive where name='Amy'; 
select * from test_hive where name=' Amy'; 
select * from test_hive where name=' Amy ';
```
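The same behaviour can be reproduced outside Hive. The Python sketch below parses the four sample rows the way a plain comma split would, then runs the three comparisons; it is an illustration under that assumption and the helper names are my own.

```python
# The four lines of sample.csv, including their original spacing.
rows = ["35, Amy ", "17, Ben ", "5,Chris  ", "10,Don "]

def parse(line):
    """Split on the first comma; the name keeps its surrounding spaces."""
    age, name = line.split(",", 1)
    return int(age), name

table = [parse(r) for r in rows]

def select_where_name(name):
    """Exact string match, like: where name='...'."""
    return [row for row in table if row[1] == name]

def select_where_trimmed_name(name):
    """Match after trimming white space on both sides of the stored value."""
    return [row for row in table if row[1].strip() == name]
```

Only the third form, `' Amy '` with a space on both sides, matches the stored value exactly. Comparing against the trimmed value finds the row with plain `'Amy'`, which is why trimming before or during the load is usually what we want.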

The above example is an extremely simple one. For all the options available when creating a table, see the Hive DDL documentation from Apache and Cloudera.