Yiyu Jia's technical Blog: 2012

Sunday, November 18, 2012

whoami and hadoop always send linux account name as hadoop user name

When we use the Hadoop Eclipse plugin, we find that it always sends the local Linux account name to the Hadoop cluster as the Hadoop user name. Who am I? The trick is that Hadoop uses the Linux command whoami to get the account info. The relevant code can be found in the class org.apache.hadoop.util.Shell:

  /** a Unix command to get the current user's name */
  public final static String USER_NAME_COMMAND = "whoami";
  /** a Unix command to get the current user's groups list */
  public static String[] getGroupsCommand() {
    return new String[]{"bash", "-c", "groups"};
  }
  /** a Unix command to get a given user's groups list */
  public static String[] getGroupsForUserCommand(final String user) {
    //'groups username' command return is non-consistent across different unixes
    return new String [] {"bash", "-c", "id -Gn " + user};
  }
  /** a Unix command to get a given netgroup's user list */
  public static String[] getUsersForNetgroupCommand(final String netgroup) {
    //'groups username' command return is non-consistent across different unixes
    return new String [] {"bash", "-c", "getent netgroup " + netgroup};
  }
  /** a Unix command to set permission */
  public static final String SET_PERMISSION_COMMAND = "chmod";
  /** a Unix command to set owner */
  public static final String SET_OWNER_COMMAND = "chown";
  public static final String SET_GROUP_COMMAND = "chgrp";
  /** Return a Unix command to get permission information. */
  public static String[] getGET_PERMISSION_COMMAND() {
    //force /bin/ls, except on windows.
    return new String[] {(WINDOWS ? "ls" : "/bin/ls"), "-ld"};
  }

So, here is a simple way to work around this.

Create a shell script file named iamwho.sh (under ~/local/bin in this example, and marked executable) whose content is as simple as below,
```
echo "hadoopuser"
```
edit .bash_profile
```
 vi ~/.bash_profile 
```
add a line as below,
```
 alias whoami="~/local/bin/iamwho.sh" 
```
enable the modification in the current shell console,
```
 source ~/.bash_profile 
```

Run the command to check that whoami has been overridden.

 
[yiyujia@localhost bin]$ whoami
hadoopuser
[yiyujia@localhost bin]$
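
One caveat: a bash alias is only expanded by interactive shells, so a program that starts whoami as a child process looks the command up through PATH and does not see the alias. A minimal sketch of a PATH-based variant is below. The directory ~/local/bin is taken from the alias above; the helper name install_fake_whoami and its arguments are made up for illustration.

```shell
#!/bin/bash
# Sketch: shadow the real whoami with a script placed early in PATH.
# install_fake_whoami is a hypothetical helper name, not part of Hadoop.
install_fake_whoami() {
  local bin_dir="$1"    # directory to hold the replacement, e.g. ~/local/bin
  local user_name="$2"  # name the replacement should print, e.g. hadoopuser
  mkdir -p "$bin_dir"
  printf '#!/bin/bash\necho "%s"\n' "$user_name" > "$bin_dir/whoami"
  chmod +x "$bin_dir/whoami"
}

# Usage (then start Eclipse from this same shell so it inherits PATH):
#   install_fake_whoami "$HOME/local/bin" hadoopuser
#   export PATH="$HOME/local/bin:$PATH"
```

Every program started from that shell will then see the fake name, so it is probably best to limit the PATH change to the session that launches Eclipse.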

Sunday, November 11, 2012

build hadoop eclipse plugin from the source code

1) install Eclipse

2) build hadoop from the source code.

3) edit build-contrib.xml to enable eclipse plugin building

vi $Hadoop_sr_home/src/contrib/build-contrib.xml

Check the version number of the built Hadoop and add two lines to the file. For example

4) go to directory $Hadoop_sr_home/src/contrib/eclipse-plugin/

5) run ant command

6) get hadoop-eclipse-plugin-1.1.3-SNAPSHOT.jar under $Hadoop_sr_home/build/contrib/eclipse-plugin

Thursday, November 8, 2012

List out all Configuration entries in the Hadoop instance.

This is an extremely simple piece of code that gives a more direct view of what a Hadoop Configuration object holds. The few lines below list the entries of your Hadoop instance's Configuration object. This tiny program helps me figure out the configuration of a new Hadoop instance. Probably, I should expand it into a real HadoopInfo.java that is similar to phpinfo.php.


import java.util.Map;

import org.apache.hadoop.conf.Configuration;

public class HadoopInfo {

    public static void main(String[] args) throws Exception {
        // Loads the default resources found on the classpath.
        Configuration conf = new Configuration();
        // Print the table header once, then one row per configuration entry.
        System.out.println("<table border=\"1\" width=\"760\" style=\"word-break:break-all;\">"
                + "<caption>Hadoop default Configuration keys and values</caption>"
                + "<tr><th>Key</th><th>Value</th></tr>");
        // Configuration is Iterable over Map.Entry<String, String>.
        for (Map.Entry<String, String> en : conf) {
            System.out.println("<tr><td width=\"350\">" + en.getKey() + "</td><td>" + en.getValue() + "</td></tr>");
        }
        System.out.println("</table>");
    }
}

Sample output for a fresh standalone Hadoop instance.

Hadoop default Configuration keys and values
Key	Value
io.seqfile.compress.blocksize	1000000
hadoop.http.authentication.signature.secret.file	${user.home}/hadoop-http-auth-signature-secret
io.skip.checksum.errors	false
fs.checkpoint.size	67108864
hadoop.http.authentication.kerberos.principal	HTTP/localhost@LOCALHOST
fs.s3n.impl	org.apache.hadoop.fs.s3native.NativeS3FileSystem
fs.s3.maxRetries	4
webinterface.private.actions	false
hadoop.http.authentication.simple.anonymous.allowed	true
fs.s3.impl	org.apache.hadoop.fs.s3.S3FileSystem
hadoop.native.lib	true
fs.checkpoint.edits.dir	${fs.checkpoint.dir}
ipc.server.listen.queue.size	128
fs.default.name	file:///
hadoop.http.authentication.kerberos.keytab	${user.home}/hadoop.keytab
ipc.client.idlethreshold	4000
hadoop.tmp.dir	/tmp/hadoop-${user.name}
fs.hsftp.impl	org.apache.hadoop.hdfs.HsftpFileSystem
fs.checkpoint.dir	${hadoop.tmp.dir}/dfs/namesecondary
fs.s3.block.size	67108864
hadoop.security.authorization	false
io.serializations	org.apache.hadoop.io.serializer.WritableSerialization
hadoop.util.hash.type	murmur
io.seqfile.lazydecompress	true
io.file.buffer.size	4096
io.mapfile.bloom.size	1048576
fs.s3.buffer.dir	${hadoop.tmp.dir}/s3
hadoop.logfile.size	10000000
fs.webhdfs.impl	org.apache.hadoop.hdfs.web.WebHdfsFileSystem
ipc.client.kill.max	10
io.compression.codecs	org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
topology.script.number.args	100
fs.har.impl	org.apache.hadoop.fs.HarFileSystem
io.seqfile.sorter.recordlimit	1000000
fs.trash.interval	0
hadoop.security.authentication	simple
local.cache.size	10737418240
hadoop.security.group.mapping	org.apache.hadoop.security.ShellBasedUnixGroupsMapping
ipc.server.tcpnodelay	false
hadoop.security.token.service.use_ip	true
fs.ramfs.impl	org.apache.hadoop.fs.InMemoryFileSystem
ipc.client.connect.max.retries	10
hadoop.rpc.socket.factory.class.default	org.apache.hadoop.net.StandardSocketFactory
fs.kfs.impl	org.apache.hadoop.fs.kfs.KosmosFileSystem
fs.checkpoint.period	3600
topology.node.switch.mapping.impl	org.apache.hadoop.net.ScriptBasedMapping
hadoop.http.authentication.token.validity	36000
hadoop.security.use-weak-http-crypto	false
hadoop.logfile.count	10
hadoop.security.uid.cache.secs	14400
fs.ftp.impl	org.apache.hadoop.fs.ftp.FTPFileSystem
fs.file.impl	org.apache.hadoop.fs.LocalFileSystem
fs.hdfs.impl	org.apache.hadoop.hdfs.DistributedFileSystem
ipc.client.connection.maxidletime	10000
io.mapfile.bloom.error.rate	0.005
io.bytes.per.checksum	512
fs.har.impl.disable.cache	true
ipc.client.tcpnodelay	false
fs.hftp.impl	org.apache.hadoop.hdfs.HftpFileSystem
hadoop.relaxed.worker.version.check	false
fs.s3.sleepTimeSeconds	10
hadoop.http.authentication.type	simple

Sunday, October 28, 2012

eclipse mapreduce plugin build for Hadoop 1.0.4

Here is an Eclipse plugin built with Eclipse Juno and Hadoop 1.0.4.

Download the JDK 1.7 compatible build from www.idatamining.org

Download the JDK 1.6 compatible build from www.idatamining.org

copy the hadoop-eclipse-plugin-1.0.4.jar into eclipse plugins directory and enjoy it.

[Edit] See how to build eclipse plugin from the source code here: build hadoop eclipse plugin from the source code

Saturday, October 6, 2012

Building hadoop 1.1 from the source code on centOS 6.3

1) prepare jdk

Download oracle jdk compressed version for linux
tar -xvf jdk-7u10-linux-x64.tar.gz into folder /usr/lib/jvm
alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0_10-sun/bin/java 99 --slave /usr/bin/keytool keytool /usr/lib/jvm/jdk1.7.0_10-sun/bin/keytool --slave /usr/bin/rmiregistry rmiregistry /usr/lib/jvm/jdk1.7.0_10-sun/bin/rmiregistry
alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0_10-sun/bin/javac 99 --slave /usr/bin/jar jar /usr/lib/jvm/jdk1.7.0_10-sun/bin/jar --slave /usr/bin/rmic rmic /usr/lib/jvm/jdk1.7.0_10-sun/bin/rmic
alternatives --config java
ln -s /usr/lib/jvm/jdk1.7.0_10-sun/ /usr/java/default (no need if we did not try oracle jdk rpm installer)

2) prepare ant and ivy

check ant installation: rpm -qa | grep ant
yum install ant
download ivy from apache
copy ivy jar files and dependent jars to /usr/share/ant/lib/

3) make sure automake is installed

yum install automake

4) make sure libtool is installed

yum install libtool

5) check out hadoop source code
svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1/

6) run ant command to build hadoop 1.1
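
Before running ant, it can save time to confirm that the tools from the steps above are actually on PATH. A minimal sketch, assuming a bash shell; the helper name missing_tools is made up for illustration and the tool list simply mirrors the steps above.

```shell
#!/bin/bash
# Sketch: print the name of every given tool that cannot be found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: no output means everything was found.
#   missing_tools java javac ant automake libtool svn
```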

Saturday, May 19, 2012

CentOS network interface cannot start because two network managers are enabled

I am not sure why this happened on my machine. It was probably caused by installing some software. Anyway, below are the error messages I saw when it happened, followed by the commands to switch off one of the two network managers.

[yiyujia@localhost ~]$ su
Password: 
[root@localhost yiyujia]# service network start
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:  Active connection state: activating
Active connection path: /org/freedesktop/NetworkManager/ActiveConnection/5
Error: Timeout 90 sec expired.
                                                           [FAILED]
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
[root@localhost yiyujia]# chkconfig --list | grep -i netw
NetworkManager  0:off 1:off 2:on 3:on 4:on 5:on 6:off
network         0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@localhost yiyujia]# chkconfig NetworkManager off
[root@localhost yiyujia]# chkconfig --list | grep -i netw
NetworkManager  0:off 1:off 2:off 3:off 4:off 5:off 6:off
network         0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@localhost yiyujia]# service NetworkManager stop
Stopping NetworkManager daemon:                            [  OK  ]
[root@localhost yiyujia]# chkcofnig network off
bash: chkcofnig: command not found
[root@localhost yiyujia]# chkconfig network off
[root@localhost yiyujia]# service network stop
Shutting down interface eth0:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
[root@localhost yiyujia]# chkconfig NetworkManager on
[root@localhost yiyujia]# service NetworkManager start
Setting network parameters...                              [  OK  ]
Starting NetworkManager daemon:                            [  OK  ]
[root@localhost yiyujia]# chkconfig --list | grep -i netw
NetworkManager  0:off 1:off 2:on 3:on 4:on 5:on 6:off
network         0:off 1:off 2:off 3:off 4:off 5:off 6:off
[root@localhost yiyujia]#
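
The check done by eye in the transcript above can also be scripted. The sketch below reads `chkconfig --list` output on stdin and succeeds only when both services are switched on in at least one runlevel. The helper name both_network_services_on is made up for illustration, and the parsing assumes the output layout shown above.

```shell
#!/bin/bash
# Sketch: exit status 0 when both NetworkManager and network have at least
# one runlevel marked ":on" in `chkconfig --list` output read from stdin.
# both_network_services_on is a hypothetical helper name.
both_network_services_on() {
  awk '
    $1 == "NetworkManager" && /:on/ { nm = 1 }
    $1 == "network"        && /:on/ { net = 1 }
    END { exit !(nm && net) }
  '
}

# Usage (as root):
#   chkconfig --list | both_network_services_on && echo "both are enabled"
```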

Saturday, May 12, 2012

How to ssh into your home machine through company's http proxy

Sometimes, we want to log in remotely to our home PC/server for fun. We can simply set up port forwarding on the home router to expose the home PC's SSH port to the outside. However, the office network may only allow access to the public Internet through an HTTP/HTTPS proxy. In this case, we need help from corkscrew, which tunnels the SSH connection through the proxy. Below are my steps to set up corkscrew and the SSH tunnel.

install corkscrew
1. wget http://www.agroman.net/corkscrew/corkscrew-2.0.tar.gz
2. tar -xvf corkscrew-2.0.tar.gz
3. enter the untarred corkscrew directory and run the following command
  ./configure
4. make
5. make install
6. corkscrew should now be installed under the /usr/local/bin directory.

Setup corkscrew in

vi ~/.ssh/config

Host pineHouse
Hostname my.home.ip.address
   KeepAlive yes
   ServerAliveInterval 30
   ForwardAgent yes
   ProxyCommand corkscrew proxy.example.com 8080 %h %p

ssh myName@pineHouse

For more advanced info, I read this blog post: "Build and Configure an HTTP-Proxy Application".

http://mtu.net/~engstrom/ssh-proxy.php

Saturday, May 5, 2012

using multiple ssh private key files

There are situations where I have to move my development environment from one machine to another. This includes transferring the ssh private key file to the new machine, so I had to look for a way to use multiple ssh private keys. It is very simple. All I need to do is edit the ~/.ssh/config file so that it has multiple IdentityFile lines, as below,

IdentityFile ~/.ssh/id_dsa
IdentityFile ~/.ssh/id_dsa.1
IdentityFile ~/.ssh/id_dsa.2

In fact, I just follow this link to get job done.
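
When the same setup has to be repeated on several machines, the IdentityFile lines can be added from a script instead of by hand. A minimal sketch is below; the helper name add_identity_file is made up for illustration, and it only appends a line when that exact line is not in the file yet.

```shell
#!/bin/bash
# Sketch: append an IdentityFile line to an ssh config file unless the same
# line is already present. add_identity_file is a hypothetical helper name.
add_identity_file() {
  local config="$1"   # e.g. ~/.ssh/config
  local key="$2"      # e.g. ~/.ssh/id_dsa.1
  touch "$config"
  grep -qxF "IdentityFile $key" "$config" || echo "IdentityFile $key" >> "$config"
  chmod 600 "$config" # keep the config private to the owner
}

# Usage:
#   add_identity_file ~/.ssh/config ~/.ssh/id_dsa.1
```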

Saturday, April 28, 2012

some notes on Big Data, NoSQL, and RDBMS

Recently, I read a blog post from Chris Swang. I pretty much agree with him. I particularly like his diagram that describes Big Data in a very straightforward way:

As the above diagram clearly shows, Big Data, especially from the point of view of NoSQL databases, is the kind of problem that has a very large data volume and needs relatively simple algorithms for analysis.

In other words, according to the above definition, a problem may not be a Big Data problem if it cannot be solved with simple algorithms (NoSQL and map/reduce?). To give myself a clearer picture, I made the diagram below,

I classify database techniques into four groups, which map to the "problem types" diagram.

Structural database: I think traditional RDBMSs should still be used for operational databases, which need good support for transaction processing. If the data volume is small, it is fine to use an RDBMS in data warehouse applications as well. After all, structural databases have been developed for decades and can deal even with tables that have to be partitioned.
Quant: If advanced mathematics knowledge is required, quant analysis is demanded.
NoSQL: NoSQL became popular as developers found that, for some problems, they can dramatically speed up data I/O operations if they model the data in a non-structural way, which goes against the rules of normalization in an RDBMS. Although traditional RDBMSs also have techniques like materialized tables, materialized views, and proxy tables, these all have limitations and all still sit behind the SQL optimization engine. So, why don't we go directly to the data storage engine if we do not need SQL at all? This is where NoSQL comes onto the stage.

Hybrid: In fact, besides popular NoSQL databases like MongoDB, CouchDB, HBase, etc., traditional RDBMS vendors are also developing their own NoSQL products. For example, MySQL Cluster has key/value memcached and sockethandler interfaces that avoid the SQL overhead. In a real project that needs to handle both big data volumes and transactions, I believe both structural and NoSQL databases are needed.

We can see that some NoSQL platforms are developing their own query languages, which look like a structured query language too. However, we can also see that the result of such a query should be unstructured data. Otherwise, you are using a NoSQL platform to do a SQL platform's job. It will be interesting to see how these emerging products can handle data better than traditional database vendors. If a distributed file system and map/reduce can help with this, I can foresee traditional database vendors adding NoSQL platforms to their products. In fact, we can see they are working on this now. I even hope they can integrate NoSQL and SQL databases in a more seamless way in terms of security, performance, cost, etc.

I always try to be careful not to be fooled by marketing language. NoSQL and RDBMS should each have their own position in a large system, where a hybrid solution is required. We need to choose the right tools for different tasks.

http://www.pythian.com/news/27367/hadoop-and-nosql-mythbusting/

HBase is Hadoop's NoSQL database
http://www.informationweek.com/news/software/info_management/232901601

http://www.clusterdb.com/mysql/dramatically-increased-mysql-cluster-join-performance-with-adaptive-query-localization/?utm_source=rss&utm_medium=rss&utm_campaign=dramatically-increased-mysql-cluster-join-performance-with-adaptive-query-localization
http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2.html

Wednesday, April 18, 2012

Steps to setup Hadoop Pig on Hadoop cluster environment

wget http://www.linuxtourist.com/apache/pig/stable/pig-x.y.z.tar.gz
cd /home/hadoopuser/app/
mv ~/Download/pig-x.y.z.tar.gz ./
tar -xvf pig-x.y.z.tar.gz
ln -s /home/hadoopuser/app/pig-x.y.z /pig

Edit ~/.bash_profile to add PIG_HOME and add its bin into PATH.

#java home
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_17-sun

#hadoop home
export HADOOP_HOME=/hadoop

#hive home
export HIVE_HOME=/hive

#pig home
export PIG_HOME=/pig

PATH=$PATH:$HOME/bin:$HIVE_HOME/bin:$PIG_HOME/bin

export PATH

Run the pig command and check if it works.

2013-01-04 03:14:30,310 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://cent63VM01:9000
2013-01-04 03:14:30,586 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: cent63VM01:9001
grunt>

Create a directory on HDFS

[hadoopuser@cent63VM01 pigTest]$ /hadoop/bin/hadoop fs -mkdir pig

Upload file on HDFS

[hadoopuser@cent63VM01 pigTest]$ /hadoop/bin/hadoop fs -put /home/hadoopuser/pigTest/passwd /user/hadoopuser/pig

Run an extremely simple Pig example.

grunt> A = load '/user/hadoopuser/pig/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
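
To sanity-check what the Pig script returns, the same projection can be done on the local copy of the file with plain shell tools. The sketch below prints the first colon-separated field of each line, which is the value Pig names id. The helper name first_field is made up for illustration; note that dump prints each value wrapped as a one-field tuple, such as (root), while this prints the bare value.

```shell
#!/bin/bash
# Sketch: local equivalent of "foreach A generate $0" for a colon-separated
# file read from stdin. first_field is a hypothetical helper name.
first_field() {
  cut -d: -f1
}

# Usage:
#   first_field < /home/hadoopuser/pigTest/passwd
```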

Tuesday, April 10, 2012

NTP server and client configuration (without security feature)

Before installing and starting HBase, it is very important to make all nodes in the Hadoop cluster sync their time with each other, as timestamps play a vital role in HBase. The following are my steps to set up NTP in my Hadoop cluster.

install NTPD server

#yum install ntp
# chkconfig ntpd on

# vi /etc/ntp.conf
add the following lines to the file,

# default is to refuse all connections.
restrict default ignore

# allow hosts in LAN to sync time. 

restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org

# allow the upstream servers to communicate with this server.
restrict 0.centos.pool.ntp.org nomodify notrap noquery
restrict 1.centos.pool.ntp.org nomodify notrap noquery
restrict 2.centos.pool.ntp.org nomodify notrap noquery

# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10

enable UDP port 123 in the firewall for the NTPD server
# system-config-firewall-tui
ntpdate 0.centos.pool.ntp.org (or ip address without prefix number)
make sure ntpd server is running
#service ntpd start
check ntpd status
ntpq -p
check network interface status and see if ntpd is listening on port 123
netstat -tupln
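
When reading the ntpq -p listing, the line to look for is the one that starts with an asterisk, which marks the peer that ntpd currently synchronizes to. A minimal sketch that extracts that peer is below; the helper name selected_peer is made up for illustration, and an empty result means no peer has been selected yet.

```shell
#!/bin/bash
# Sketch: print the peer that ntpd has selected for synchronization.
# In `ntpq -p` output, that peer's line starts with "*".
# selected_peer is a hypothetical helper name; it reads the listing on stdin.
selected_peer() {
  awk '/^\*/ { print substr($1, 2) }'
}

# Usage:
#   ntpq -p | selected_peer
```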

install ntp client

# yum install ntp
# chkconfig --list | grep ntpd
chkconfig --del ntpd
# cd /etc/cron.daily/
or
# cd /etc/cron.hourly/
create a file as below,
# vi ntp.sh
put the following commands in ntp.sh
#!/bin/bash
/usr/sbin/ntpdate my.ntp.hostname
make it executable
# chmod 755 /etc/cron.daily/ntp.sh
Finally restart the cron daemon.
# service crond restart

However, it is better to use ntpd instead of ntpdate on "client" hosts.

Below is my example ntp.conf for "client" hosts.

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Hosts on local network are less restricted.
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#server 0.centos.pool.ntp.org
#server 1.centos.pool.ntp.org
#server 2.centos.pool.ntp.org
server 192.168.1.138

#broadcast 192.168.1.255 autokey        # broadcast server
#broadcastclient                        # broadcast client
#broadcast 224.0.1.1 autokey            # multicast server
#multicastclient 224.0.1.1              # multicast client
#manycastserver 239.255.254.254         # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
restrict 192.168.1.138 nomodify notrap noquery

# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10

# Enable public key cryptography.
#crypto

includefile /etc/ntp/crypto/pw

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys

# Specify the key identifiers which are trusted.
#trustedkey 4 8 42

# Specify the key identifier to use with the ntpdc utility.
#requestkey 8

# Specify the key identifier to use with the ntpq utility.
#controlkey 8

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats

Below is my server ntp.conf file

# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).

driftfile /var/lib/ntp/drift

restrict default ignore  # set default to refuse all access.

# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface.  This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict -6 ::1

# Hosts on local network are less restricted.
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.centos.pool.ntp.org
server 1.centos.pool.ntp.org
server 2.centos.pool.ntp.org

#broadcast 192.168.1.255 autokey        # broadcast server
#broadcastclient                        # broadcast client
#broadcast 224.0.1.1 autokey            # multicast server
#multicastclient 224.0.1.1              # multicast client
#manycastserver 239.255.254.254         # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
restrict 0.centos.pool.ntp.org notrap noquery
restrict 1.centos.pool.ntp.org notrap noquery
restrict 2.centos.pool.ntp.org notrap noquery


# Undisciplined Local Clock. This is a fake driver intended for backup
# and when no outside source of synchronized time is available.
server  127.127.1.0     # local clock
fudge   127.127.1.0 stratum 10

# Enable public key cryptography.
#crypto

includefile /etc/ntp/crypto/pw

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys

# Specify the key identifiers which are trusted.
#trustedkey 4 8 42

# Specify the key identifier to use with the ntpdc utility.
#requestkey 8

# Specify the key identifier to use with the ntpq utility.
#controlkey 8

# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats

Saturday, April 7, 2012

configure passphraseless SSH login among CentOS servers

wget http://pkgs.repoforge.org/rpmforge-release-0.5.2-2.el5.rf.x86_64.rpm
wget http://apt.sw.be/RPM-GPG-KEY.dag.txt
rpm -K rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
rpm -ivh rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
vi /etc/yum.repos.d/rpmforge.repo
add "priority=3" or other right priority number after "enabled = 1"
yum repolist
yum install keychain
ssh-keygen -t dsa
chmod 755 .ssh
scp ~/.ssh/id_dsa.pub user@remotehost:.ssh/authorized_keys
Or:
cat ~/.ssh/id_dsa.pub | ssh hadoopuser@remotehost "cat - >> ~/.ssh/authorized_keys"
chmod 600 ~/.ssh/authorized_keys
vi .bash_profile
add following commands:
/usr/bin/keychain -q $HOME/.ssh/id_dsa
source $HOME/.keychain/$HOSTNAME-sh
Log out and log in again, or run the command source .bash_profile. The passphrase for the key files is asked only once, at the first login.
run ssh among servers without being asked for password
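
If a server still asks for a password after these steps, file permissions are a common cause: with its StrictModes check, sshd rejects key files or directories that are writable by group or others. The sketch below applies a stricter setting than the chmod 755 used above, which is my own choice rather than a requirement; the helper name fix_ssh_permissions is made up for illustration.

```shell
#!/bin/bash
# Sketch: make an ssh directory and its authorized_keys file private to the
# owner. fix_ssh_permissions is a hypothetical helper name.
fix_ssh_permissions() {
  local ssh_dir="${1:-$HOME/.ssh}"   # defaults to the current user's ~/.ssh
  chmod 700 "$ssh_dir"
  if [ -f "$ssh_dir/authorized_keys" ]; then
    chmod 600 "$ssh_dir/authorized_keys"
  fi
}

# Usage (run on each server, as the login user):
#   fix_ssh_permissions
```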

http://www.cyberciti.biz/faq/ssh-password-less-login-with-dsa-publickey-authentication/
http://www.ehowstuff.com/how-to-configure-rpmforge-repository-on-centos-6-3/
http://www.cyberciti.biz/faq/ssh-passwordless-login-with-keychain-for-scripts/

Tuesday, April 3, 2012

load flat file into Hive table step by step.

This is a very simple example that shows how to load a local flat file into a Hive table.

Prepare a local file named sample.csv with the content below.
```
35, Amy 
17, Ben 
5,Chris  
10,Don 
```
Create the table. -- creating a managed table that "knows" how the flat file is formatted.
```
hive> use testdb;
hive> create table test_hive(age int, name string) row format delimited fields terminated by ',';
```
Note: explicitly specifying the delimiter with "row format delimited fields terminated by ','" is the key part here, as the default field delimiter used by Hive is ^A (ASCII code 1).
Load the file. -- load the file from the local file system into Hive.
```
hive>use testDB;
hive> load data local inpath '/home/hadoopuser/hiveTest/sample.csv' overwrite into table test_hive;
```
Note: this example loads a file from the local Linux file system into Hive, which sits on HDFS. To use a file that is already on HDFS, we would probably create an external table for it.
Check the data. -- select from the table.
```
hive> select * from test_hive;                                                                       
OK
35  Amy 
17  Ben 
5 Chris  
10 Don 
```
Note: You can see that "Amy" and "Ben" are not aligned with the other two rows. This is because Hive treats all white space before and after a string as part of the column value. So, we need to trim the white space around strings if it is not expected in the Hive table.
To verify this, run the queries below and compare the results.
```
select * from test_hive where name='Amy'; 
select * from test_hive where name=' Amy'; 
select * from test_hive where name=' Amy ';
```
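
One way to avoid the padded values is to clean the flat file before loading it. The sketch below strips spaces and tabs around every comma-separated field; the helper name trim_csv_fields and the output file name are made up for illustration, and it assumes simple CSV with no quoted fields that contain commas.

```shell
#!/bin/bash
# Sketch: remove white space around each comma-separated field, reading CSV
# lines on stdin and writing cleaned lines to stdout.
# trim_csv_fields is a hypothetical helper name.
trim_csv_fields() {
  sed -e 's/[[:space:]]*,[[:space:]]*/,/g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//'
}

# Usage:
#   trim_csv_fields < sample.csv > sample_clean.csv
```

Loading the cleaned file instead of sample.csv should make the first of the three queries above match the Amy row.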

The above example is an extremely simple one. To check all possible options when a table is created, please check Hive DDL document from Apache and Cloudera.

Sunday, March 18, 2012

install eclipse on CentOS

1) Download eclipse

2) tar -xvf eclipse into directory /usr/lib/eclipse or /usr/share/eclipse

3) use alternatives to configure the eclipse command

alternatives --install /usr/bin/eclipse eclipse /usr/lib/eclipse/eclipse 99

4) vi /usr/share/applications/eclipse.desktop and add content

[Desktop Entry]
Name=Eclipse
Comment=Eclipse Juno
Exec=eclipse
Icon=/usr/lib/eclipse/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;KDE;Application;Development;Java;
StartupNotify=true

Saturday, March 10, 2012

install egroupware 1.8 on CentOS 6 or Fedora

Log in and switch to the root account (su)
Install EPEL: rpm -ivh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm
Note: version number may be changed
Make sure system is up to date: yum update
install PHP mcrypt package: yum install php-mcrypt
go to yum repository: cd /etc/yum.repos.d/
download eGroupware repo info: wget http://download.opensuse.org/repositories/server:eGroupWare/CentOS_6/server:eGroupWare.repo
Install eGroupware: yum install eGroupware. You will get an error message that reports a database connection problem.
Go to the directory /usr/share/egroupware/doc/rpm-build
Edit the file post_install.php to add the password for root@localhost of the local MySQL server instance.
gedit post_install.php
for example:
'db_root' => 'root', // mysql root user/pw to create database
'db_root_pw' => '123456',
Run the post install script: post_install.php.
script will show you user account name and password
access http://localhost/egroupware/ and done

Here is a list of competitors of eGroupware: competitors.

Saturday, March 3, 2012

Configure maven in netbeans and Eclipse to share the same local repository

Netbeans 7.1.1

On Windows machine, go to folder C:\Program Files\NetBeans 7.1.1\java\maven\conf
Edit file settings.xml

Eclipse Indigo

Open Eclipse, click menu Window --> Preferences --> Maven --> User Settings
Input path to your settings.xml

You can either make NetBeans and Eclipse share the same settings.xml file, or edit their settings.xml files separately and make them point to the same local repository. On my machine, I make them point to the same local repository by adding the following to both settings.xml files,

<localRepository>C:\yyjia\mavenLocalRepo</localRepository>

Saturday, February 25, 2012

install phpMyAdmin on Fedora or CentOS

Install HTTPD, PHP, MySQL first.
Download phpMyAdmin
unzip phpMyAdmin under directory /var/www/html/
make sure your mysql server and httpd are running: service mysqld start and service httpd start
If you have not set a root password for the newly installed MySQL, run the following command to set one and avoid the error message "Login without a password is forbidden by configuration (see AllowNoPassword)": $ mysqladmin -u root password NEWPASSWORD
Run the commands system-config-services and system-config-firewall to double check that httpd and mysqld are enabled and running.

Friday, February 10, 2012

Spring singleton and singleton design pattern vs Ext JS 4 singleton and JavaScript Singleton

I discussed "Implementing Singleton object in JavaScript" and "Implementing Singleton pattern in Java and PHP shows different "static" modifers between Java and PHP". Today, I am going to compare two differences: the one between the Java Singleton pattern and the Spring singleton, and the one between the JavaScript singleton pattern and the Ext JS 4 singleton object.

As described in the Spring documentation, Spring's concept of a singleton bean is different from the Singleton pattern, which could be called the GoF (Gang of Four) Singleton. As discussed in my post "Java Class loader and static variable and JVM memory management", the GoF Singleton ensures that one and only one instance of a particular class will ever be created per ClassLoader. Meanwhile, a singleton defined in the Spring framework is a single, unique object managed in a Spring container. So, a Spring singleton needs to be used carefully in a multithreaded environment. Normally, for a service that will be requested by multiple threads, we should use Spring prototype scope instead of singleton.

Now let's look at the Ext JS 4 singleton. I would say its implementation is similar to the Spring singleton. That is, Ext JS 4 provides a container that holds all Ext JS 4 classes/objects. Multi-task programming in JavaScript is not popular yet. However, how about testing this through two tabs in the same browser?

This discussion reveals how Ext JS 4 implements singleton. It is pretty similar to the way Spring does it. In Ext JS 4, a singleton is an object instead of a class. It is not the same as the GoF Singleton, just as the Spring singleton is not the same as the GoF Singleton. I think this is an interesting investigation.

http://stackoverflow.com/questions/3920689/plain-old-singleton-or-spring-singleton-bean
http://www.bleext.com/blog/configurations-statics-and-singleton-in-ext-js-4/
http://www.sencha.com/forum/showthread.php?128646-Singleton-vs-class-with-all-static-members

Thursday, February 2, 2012

reset linux password for a forgotten user account

It is funny that I forgot both the user name and the password for an OpenSuse virtual machine, which I had created and then not used. Fortunately, I remembered that I can boot Linux in single user mode to reset a password. I am writing it down here in case I forget the password for another copy of Linux again.

1) When booting Linux with GRUB, press the "e" key or move the cursor up/down to reach the boot option editing line.

2) Add "single init=/bin/bash" at the end of the boot option line.

3) Once Linux has booted, type "passwd root" to reset the root password. If the root file system is mounted read-only, type "mount -o remount,rw /" first.

4) In my case, I had even forgotten the user name I created before. So, I went to "/home" and listed the subdirectories there. I saw a folder called "yiyu", so I knew the user account I had created was "yiyu".

5) Type "passwd yiyu" to reset my user account password.

Saturday, January 28, 2012

Multi-threaded application (Java) or Multi-process application (PHP) for Hyper Threading enabled CPU

Many factors affect an application's performance. Today, hyper-threading enabled multi-core CPUs have become very popular. Naturally, we expect better performance on these modern CPUs. I am not yet qualified to discuss how to optimize an application for a hyper-threading multi-core CPU. But I do ask myself this question: which one, a multi-threaded application or a multi-process application, gets more benefit from a hyper-threaded CPU? In other words, for a computing-intensive task, should I design it as a multi-threaded application or a multi-process application? To be more specific, does Java, which supports multi-threaded programming, have an advantage over PHP on a hyper-threading enabled CPU, or does PHP actually have an advantage over Java? I had a discussion about "Can two processes simultaneously run on one CPU core?". Here, let me highlight some points I studied while answering my questions.

Software Thread
We know a software thread is a lightweight process. One process can contain multiple threads. Software threads are managed by the OS, which decides which CPU or core a thread will run on. An application programmer can also control where a thread runs by using an affinity library. Typically, we need a multi-threaded application when it has I/O latency and we do not want to hang other computing tasks. For example, we do not want a desktop GUI to stop responding to the user's input while it is running another computing or I/O task. Also, a service application normally needs a thread pool so that initialized threads are ready to serve requests.

Hardware Thread
A hardware thread is a pipeline for a software thread to reach the CPU's physical core. In an HT CPU, a physical core can have two hardware threads (logical cores) because it duplicates the architectural state, such as the registers, and can therefore store the state of two threads while they share the core's execution units.

What is shared among software threads and what is not
Multiple software threads (kernel threads) can live inside a process. In other words, they cannot share anything outside the process's resources. A kernel thread is the smallest unit of OS kernel scheduling. Kernel threads share the memory and other resources of their process, but each one has its own stack, its own copy of the registers (including the program counter), and its own thread-local storage.
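A minimal Java sketch of this split between shared and private state, assuming a JVM whose threads map to kernel threads. The class name `ThreadSharingDemo` and its fields are made up for illustration. The `shared` counter lives on the heap and is visible to all workers, while the `ThreadLocal` counter is private to each thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ThreadSharingDemo {
    // Lives on the heap: one copy, visible to every thread in the process.
    static final AtomicInteger shared = new AtomicInteger();
    // Thread-local storage: each thread sees its own independent counter.
    static final ThreadLocal<Integer> perThread = ThreadLocal.withInitial(() -> 0);

    // Runs `threads` workers, each incrementing both counters `steps` times.
    // Returns the final thread-local value observed by each worker, keyed by thread name.
    static Map<String, Integer> run(int threads, int steps) {
        shared.set(0);
        Map<String, Integer> localResults = new ConcurrentHashMap<>();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                for (int s = 0; s < steps; s++) {
                    shared.incrementAndGet();           // shared by all threads
                    perThread.set(perThread.get() + 1); // private to this thread
                }
                localResults.put(Thread.currentThread().getName(), perThread.get());
            }, "worker-" + i);
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return localResults;
    }

    public static void main(String[] args) {
        Map<String, Integer> locals = run(4, 1000);
        System.out.println("shared counter = " + shared.get());
        System.out.println("per-thread counters = " + locals);
    }
}
```

With 4 workers and 1000 steps, the shared counter ends at 4000, while every thread-local counter ends at 1000.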

A thread is called a "green thread" if it is implemented in user space. A green thread is not seen by the kernel. It is normally useful for debugging a multi-threaded application.

What is shared among processes
I do not know what is shared among processes from the CPU's point of view. A process is the biggest unit of kernel scheduling. It has its own resources, including memory, file handles, sockets, device handles, etc. Each process has its own address space, which is shared by the threads it contains. Of course, a programmer can explicitly share resources with other processes, for example through shared memory segments.

History of CPU evolution
To better understand how to get the most benefit from a modern multi-core HT CPU, I feel I need to study the history of CPU architecture evolution. But I do not have time to do this yet. Let me study it in depth later.

Now, let's come back to my original questions. A normal application contains lots of high-latency operations. For example, an application normally contains network I/O, file I/O, or GUI interactivity. In this case, which is the typical reason for us to use multi-threading techniques, hyper-threading can improve performance at little cost. However, hyper-threading is often turned off in high-performance systems, as it can hurt performance when the two threads on the same core start competing for resources such as the FPU, the Level 1 cache, or the CPU pipeline. Some motherboards actually disable hyper-threading by default.

Also, switching a thread's context is expensive. To switch threads, the system has to save the registers of the outgoing thread, which end up in the cache and eventually in main memory, then load the cache with the new thread's values and load its registers. So, we need to be careful not to create more threads than our application needs.

A good approach is probably to keep the number of running threads the same as the number of logical cores. However, in the real world, many threads in an application are blocked threads. For an application that has many more software threads than hardware threads, I would expect most of them to be blocked. For example, a web server may have a large number of threads serving HTTP requests. This is not a problem, because most of them are blocked on network I/O at any given moment.
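For a CPU-bound task, the "one running thread per logical core" idea can be sketched in Java as follows. This is only an illustration under assumptions: `CoreSizedPool` and the toy summing task are made up for this example, and `Runtime.availableProcessors()` reports the logical cores visible to the JVM, which may be fewer than the machine has.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CoreSizedPool {
    // CPU-bound toy task: sum of the integers in [from, to).
    static long sumRange(long from, long to) {
        long sum = 0;
        for (long i = from; i < to; i++) {
            sum += i;
        }
        return sum;
    }

    // Splits [0, n) into one chunk per logical core and sums the chunks in parallel.
    static long parallelSum(long n) {
        int cores = Runtime.getRuntime().availableProcessors(); // logical cores seen by the JVM
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            long chunk = (n + cores - 1) / cores; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < cores; i++) {
                final long from = Math.min(n, i * chunk);
                final long to = Math.min(n, from + chunk);
                Callable<Long> task = () -> sumRange(from, to);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // waits for that chunk to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("logical cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("sum of 0..999999 = " + parallelSum(1_000_000L));
    }
}
```

For workloads where threads mostly block on I/O, a pool larger than the core count is usually reasonable, as the web server example above suggests.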

BTW, the upcoming Reverse-Hyperthreading, which spreads one thread's computing task among different logical cores, may introduce new opportunities into the multi-threaded programming world. But who knows whether it will actually bring more challenges.

So, back to my original question: I still think Java has a chance to get more benefit from a hyper-threading enabled CPU than PHP does, as core PHP does not directly support multi-threaded programming. However, a badly designed multi-threaded application may even damage performance. Here is a good document for better understanding hyper-threading technology: Performance Insights to Intel® Hyper-Threading Technology

http://stackoverflow.com/questions/1888160/distinguish-java-threads-and-os-threads
http://stackoverflow.com/questions/8916723/can-two-processes-simultaneously-run-on-one-cpu-core
http://stackoverflow.com/questions/4771205/dual-core-hyperthreading-should-i-use-4-threads-or-3-or-2?rq=1
http://stackoverflow.com/questions/360307/multicore-hyperthreading-how-are-threads-distributed?rq=1
http://stackoverflow.com/questions/508301/on-which-operationg-system-is-threaded-programming-sufficient-to-utilize-multipl
http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
http://stackoverflow.com/questions/2238272/java-thread-affinity
http://java.dzone.com/articles/java-thread-affinity-support
http://www.codeguru.com/cpp/sample_chapter/article.php/c13533/Why-Too-Many-Threads-Hurts-Performance-and-What-to-do-About-It.htm
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx

Wednesday, January 18, 2012

Promotional Subspace Mining (PSM) --- an Overview

Promotional Subspace Mining (PSM) is a novel research topic. Its main focus is to find out outstanding subspaces for a given object among its competitors, and to discover meaningful rules from them.

PSM is distinguished from existing data mining problems by three unique characteristics: its data model, its study objective, and its fundamental data mining problems. From the aspect of the data model, PSM takes multidimensional data as the input data set, in which a categorical object field and a numerical score field are two important constituent elements. From the aspect of the study objective, a ranking mechanism has to be defined to produce top subspaces for the target object among its peer objects. From the aspect of the data mining problems, PSM consists of three major problems, each of which consists of sub-problems:

1. Applying prior knowledge into the system

modeling domain-specific knowledge;
modeling the feedback rules from the previous iteration;
designing and constructing machine learning model.

2. Finding out top subspaces

ranking mechanism definition;
efficiently producing top subspaces based on the predefined ranking mechanism;
feature selection to avoid high-rank but meaningless subspaces.

3. Rule mining, evaluation, and integration

evaluating top subspaces that have the same rank;
producing interesting rules from top subspaces;
analyzing and combining similar rules.

The following figure shows a high level view of this PSM research topic.

PSM may be applied in many application domains. Here, we give two simple examples.

Example 1 A product retailer wants to find out the strengths and weaknesses of a product. The sales manager finds that the sales of product A are ranked 10th among all products in the same category. However, when breaking down the market into subspaces along dimensions such as Area, Category-of-Trade, and Year, it may be found that product A ranks 2nd in sales in the subspace {Area = North New Jersey, Category-of-Trade = Restaurant, Year = 2009}, and ranks 15th in the subspace {Area = South Florida, Category-of-Trade = Supermarket, Year = 2008}.

Example 2 A pharmaceutical company wants to find out under what conditions a new drug has the best or worst effect. The researchers find that drug A’s overall effect score is ranked 10th among all drugs in the comparison. When examining subspaces, drug A ranks 2nd in effect score in the subspace {Temperature = Low, Moisture = Med, Patient Age = Young}, and ranks 15th in the subspace {Temperature = Med, Moisture = Med, Patient Age = Senior}.

These two examples indicate that a target object can be ranked not only in the global data dimensions, but also in various local subspaces. The global rank of an object indicates the overall position of this object, while the local ranks can show the outstanding subspaces this object is in. “Outstanding” here is measured by a predefined application-specific subspace ranking measure. To the product retailer, outstanding subspaces can be used to analyze the current position of the target product, adjust promotional campaign strategies, and reallocate marketing resources. To the pharmaceutical company, the outstanding subspaces can be used to evaluate the factors that affect target drug’s functioning.
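The idea of a global rank versus a local rank in a subspace can be sketched in Java. This is only a toy illustration, not the PSM algorithm itself: the class `SubspaceRankDemo`, the sample records, and the ranking measure (sum of scores per object, higher is better) are all assumptions made for this example, and the numbers are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubspaceRankDemo {
    // One record: a categorical object (product), its dimension values, and a numeric score.
    static final class Sale {
        final String product;
        final Map<String, String> dims;
        final double score;

        Sale(String product, Map<String, String> dims, double score) {
            this.product = product;
            this.dims = dims;
            this.score = score;
        }
    }

    // Builds a dimension map from alternating name/value arguments.
    static Map<String, String> dims(String... pairs) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            m.put(pairs[i], pairs[i + 1]);
        }
        return m;
    }

    // Rank of `target` (1 = best) by summed score over the records that match
    // every dimension value in `subspace`. An empty subspace gives the global rank.
    // Returns -1 when the target has no record in the subspace.
    static int rank(List<Sale> data, Map<String, String> subspace, String target) {
        Map<String, Double> totals = new HashMap<>();
        for (Sale r : data) {
            if (r.dims.entrySet().containsAll(subspace.entrySet())) {
                totals.merge(r.product, r.score, Double::sum);
            }
        }
        Double mine = totals.get(target);
        if (mine == null) {
            return -1;
        }
        int rank = 1;
        for (double other : totals.values()) {
            if (other > mine) {
                rank++;
            }
        }
        return rank;
    }

    // Invented sample data for three products in two areas.
    static List<Sale> sample() {
        List<Sale> data = new ArrayList<>();
        data.add(new Sale("A", dims("Area", "North New Jersey", "Trade", "Restaurant"), 90));
        data.add(new Sale("B", dims("Area", "North New Jersey", "Trade", "Restaurant"), 120));
        data.add(new Sale("C", dims("Area", "North New Jersey", "Trade", "Restaurant"), 40));
        data.add(new Sale("A", dims("Area", "South Florida", "Trade", "Supermarket"), 10));
        data.add(new Sale("B", dims("Area", "South Florida", "Trade", "Supermarket"), 200));
        data.add(new Sale("C", dims("Area", "South Florida", "Trade", "Supermarket"), 150));
        return data;
    }

    public static void main(String[] args) {
        List<Sale> data = sample();
        System.out.println("global rank of A: " + rank(data, dims(), "A"));
        System.out.println("rank of A in North New Jersey: "
                + rank(data, dims("Area", "North New Jersey"), "A"));
    }
}
```

In this invented data, product A is last of three globally but second in the North New Jersey subspace, which is the kind of local strength PSM looks for.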

Besides many unique features, PSM is also related to two other existing research directions: Interesting Subspace Mining (ISM) and instance selection/ranking. PSM is related to ISM as it also targets the multidimensional data, and one of its main focuses is to discover the potentially interesting subspaces. However, ISM has an entirely different objective than PSM. Specifically, it aims at detecting clusters that are hidden in any possible subspaces, but not showing up in the full attribute space.

PSM is also related to the research on prototype selection and instance ranking, because both involve a component that ranks potentially interesting instances. However, there are two essential distinctions between these two topics. First, the study objective and data model are different. The input data set of PSM contains a numerical attribute representing scores and a categorical attribute representing objects. Each object may be described by multiple record instances. The study objective of PSM is to find top subspaces for a target object, and the score attribute takes part in this process. On the other hand, instance selection/ranking is for classification or clustering studies. Each record instance in the input data set represents an individual object. Second, the things being ranked are different. PSM studies multidimensional data, which means that a subspace can contain either all the attributes in the data set or any subset of the attributes. Instance selection/ranking, however, always targets the full attribute space.

In addition, the problem that PSM solves is also related to top-k queries and reverse top-k queries. The top-k query problem aims to efficiently retrieve a ranked set of the k most interesting objects based on an individual user’s preferences. A lot of research effort has been carried out from different perspectives, including the query model, data & query certainty, the ranking function, etc. The data model of PSM is essentially different from this research, as it switches the input and output of the query model. In other words, the target object is taken as an input in PSM, while it belongs to the output in the classic top-k query model. As a result, the main theme of PSM is finding interesting subspaces, not objects.

Compared with the top-k query problem where the output objects will be consumed by potential customers or buyers, the reverse top-k query problem aims to find out the subspace parameters of the most popular products for manufacturers’ reference. The popular products are those appearing more frequently in customers’ top-k result set than other products.

As a novel research topic, Promotional Subspace Mining (PSM) opens a new research direction for the interdisciplinary research of data mining and very large database systems. Particularly, in this Big Data age, PSM introduces a new topic to NoSQL application research. A sophisticated PSM framework/algorithm could obviously be very useful for large retail companies such as Amazon, eBay, etc. Actually, it could potentially be applied to any domain where "promotional" analysis is needed, no matter whether the target object is a human being, a product, or a company.

To inquire about more detailed information, please contact Dr. Yan Zhang or myself. We are interested in hearing from you if you have a large data set and want somebody to help analyze it. We look forward to hearing from any individual or organization interested in cooperating with us.

Sunday, January 8, 2012

Using jconsole to monitor Tomcat running on i5/OS

Shipped with the JDK, there is an application monitoring tool called jconsole. Using jconsole, we can detect low memory, enable or disable verbose tracing of GC and class loading, detect deadlocks, control the log level of any logger in an application, etc. This simple tutorial shows how we can start up Tomcat 6.x on i5/OS V5R4 and monitor it remotely from our desktop. In fact, this is not a Tomcat feature. It is a JVM feature. We just use Tomcat as an example to see whether the JDK on i5/OS supports this feature.

1) Open catalina.sh and add the additional JVM properties after the CATALINA_BASE environment settings.

# Only set CATALINA_HOME if not already set
[ -z "$CATALINA_HOME" ] && CATALINA_HOME=`cd "$PRGDIR/.." >/dev/null; pwd`

# Copy CATALINA_BASE from CATALINA_HOME if not already set
[ -z "$CATALINA_BASE" ] && CATALINA_BASE="$CATALINA_HOME"

CATALINA_OPTS="-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=YourJMXPort 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=true 
-Djava.rmi.server.hostname=YourTomcatHostname
-Dcom.sun.management.jmxremote.password.file=$CATALINA_BASE/conf/jmx.psd 
-Dcom.sun.management.jmxremote.access.file=$CATALINA_BASE/conf/jmx.acl"

2) Create jmx.psd under the folder $CATALINA_BASE/conf/ and put the text content below in it. This creates a user named "controlRole" with the password "tomcat".

controlRole tomcat

3) Create jmx.acl under the folder $CATALINA_BASE/conf/ and put the text content below in it. This assigns the user controlRole the read and write privilege.

controlRole readwrite

4) Open qshell and modify the file attributes as below. This makes sure that only the owner of these access control files has read and write privileges on them. The owner should be the account that starts Tomcat, so adjust this for whichever account you use.

call qp2term
chmod 600 jmx.acl  
chmod 600 jmx.psd

5) On a Windows PC, run jconsole as below.

c:\Program Files\Java\jdk1.6.0_21\bin>jconsole

6) After jconsole is running, enter the host name, port number, user name, and password.

7) Now, we can see jconsole connect to the remote Tomcat running on i5/OS.
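The same connection can also be made from code through the standard JMX remote API, which is roughly what jconsole does with the values entered in step 6. The sketch below is a minimal illustration, not part of the original setup: the class name `JmxHeapProbe` is made up, and the host, port, user, and password are supplied on the command line. It assumes the default RMI connector that the `com.sun.management.jmxremote.port` property enables.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class JmxHeapProbe {
    // Address of the default RMI connector for a JVM started with
    // -Dcom.sun.management.jmxremote.port=<port>.
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Usage: java JmxHeapProbe <host> <port> <user> <password>
    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: java JmxHeapProbe <host> <port> <user> <password>");
            return;
        }
        String url = serviceUrl(args[0], Integer.parseInt(args[1]));

        // Credentials are checked against the password and access files on the server.
        Map<String, Object> env = new HashMap<>();
        env.put(JMXConnector.CREDENTIALS, new String[] { args[2], args[3] });

        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url), env)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // Proxy for the remote JVM's memory MXBean, the data behind jconsole's Memory tab.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println("Remote heap usage: " + memory.getHeapMemoryUsage());
        }
    }
}
```

With the files from steps 2 and 3, the user would be controlRole and the password tomcat.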

Enjoy monitoring and tuning your Tomcat and servlet applications.