iDatamining.org

I am looking for projects to work on
Please contact with me at yiyu.jia@BostonInfoPro.com!

Saturday, November 9, 2013

why I can not see all EBS size after spin a EC2 instance

Suppose you newly spin an EC2 instance from an AMI image that 8GB ESB volume and you set up new EC2 instance to have larger volume as 100GB. However, you will see only 8GB size when you use command like df or parted. Below are example commands to determine.

[root@ ~]# file -s /dev/xvde
/dev/xvde: Linux rev 1.0 ext4 filesystem data (needs journal recovery) (extents) (large files) (huge files)
[root@0 ~]# file -s /dev/sda
/dev/sda: symbolic link to `xvde'

[root@ ~]# df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/xvde        8256952 1749732   6087792  23% /
tmpfs            7619396       0   7619396   0% /dev/shm
[root@ip-10-75-130-120 ~]# df -BG
Filesystem     1G-blocks  Used Available Use% Mounted on
/dev/xvde             8G    2G        6G  23% /
tmpfs                 8G    0G        8G   0% /dev/shm

(parted) unit kb                                                          
(parted) print free                                                       
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvde: 107374182kB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start   End          Size         File system  Flags
 1      0.00kB  107374182kB  107374182kB  ext4

(parted) unit gb                                                          
(parted) print free                                                       
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvde: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start   End    Size   File system  Flags
 1      0.00GB  107GB  107GB  ext4


[root@ip-10-75-130-120 ~]# cat /etc/fstab 
/dev/xvde  /        ext4      defaults         0 0
devpts     /dev/pts  devpts  gid=5,mode=620   0 0
tmpfs      /dev/shm  tmpfs   defaults         0 0
proc       /proc     proc    defaults         0 0
sysfs      /sys      sysfs   defaults         0 0

Solution is very simple: run command resize2fs!
[root@ip-10-33-53-115 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvde       7.9G  1.7G  5.9G  23% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
[root@ip-10-33-53-115 ~]# resize2fs /dev/xvde
resize2fs 1.41.12 (17-May-2010)
Filesystem at /dev/xvde is mounted on /; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 7
Performing an on-line resize of /dev/xvde to 26214400 (4k) blocks.
The filesystem on /dev/xvde is now 26214400 blocks long.

[root@ip-10-33-53-115 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvde        99G  1.7G   92G   2% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
[root@ip-10-33-53-115 ~]# 

Saturday, November 2, 2013

Big Data tools and relational BI analysis tools used in different stage of data analysis

I was thinking about the relationship between traditional BI platforms and Big Data analytic tools. I think they should coexist and service for different stage of data analyzing. That is, with the Data first approach, people use Big Data analytic tools to explore data and find out right question. Then, with right and interesting question, people drill data and try to give out answers in relational database based BI platform.

I happened to watch a presentation given by Facebook. It looks they are use both Big data tools and relational DB tools for different stages of data analysis. Below is cited from the presentation and video on youtube.



Wednesday, October 30, 2013

setup ODBC access in R or RStudio

  1. make sure unixODBC development lib is installed.
    [root@localhost yiyujia]# yum list | grep unixODBC
    unixODBC.x86_64                         2.2.14-12.el6_3                @updates 
    freeradius-unixODBC.x86_64              2.1.12-4.el6_3                 base     
    unixODBC.i686                           2.2.14-12.el6_3                base     
    unixODBC-devel.i686                     2.2.14-12.el6_3                base     
    unixODBC-devel.x86_64                   2.2.14-12.el6_3                base     
    unixODBC-kde.i686                       2.2.14-12.el6_3                base     
    unixODBC-kde.x86_64                     2.2.14-12.el6_3                base 
    
    [root@localhost yiyujia]# yum install unixODBC-devel.x86_64
    
    
  2. Inside R or Rstudio, run the following command to install RODBC package.
    install.packages("RODBC", dependencies = TRUE)
    

Thursday, September 19, 2013

time series analysis using R



  1. download and install MRO from https://mran.revolutionanalytics.com/download/ 
  2. install.packages("TTR")
  3. library("TTR")
  4. usindices $lt;- read.table("http://www2.stat.duke.edu/~mw/data-sets/ts_data/industrial_production", skip=20, header = F)
  5. colnames(usindices) $lt;- c("YR","MN", "IP", "MFG", "MFGD", "MFGN", "MIN", "UTIL", "P", "MAT")
  6. head(usindices)
  7. install.packages("dplyr") 
  8. library(dplyr)
  9. avg1 = summarise(group_by(usindices,YR),avg = mean(MFG))
    
  10. plot(avg1)
  11. install.packages("reshape")
  12. library("reshape")
  13. 
    
tutorial: http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html

Thursday, September 12, 2013

result of cloudera Inspect hosts for my experiement env in basement




Cluster Installation

Inspect hosts for correctness

Validations

Inspector failed on the following hosts...
Individual hosts resolved their own hostnames correctly.
No errors were found while looking for conflicting init scripts.
No errors were found while checking /etc/hosts.
All hosts resolved localhost to 127.0.0.1.
All hosts checked resolved each other's hostnames correctly.
Host clocks are approximately in sync (within ten minutes).
Host time zones are consistent across the cluster.
No users or groups are missing.
No kernel versions that are known to be bad are running.
No performance concerns with Transparent Huge Pages settings.
0 hosts are running CDH3 and 4 hosts are running CDH4.
All checked hosts are running the same version of components.
All managed hosts have consistent versions of Java.
All checked Cloudera Management Daemons versions are consistent with the server.
All checked Cloudera Management Agents versions are consistent with the server.

Version Summary

Group 1 (CDH4)
Hosts
cent63VM01, cent63VM02, cent63VM03, cent63VM04
Component Version CDH Version
Impala 1.1.1 Not applicable
Lily HBase Indexer (CDH4 only) 1.2+2 Not applicable
Solr (CDH4 only) 4.4.0+69 Not applicable
Flume NG 1.4.0+23 CDH4
MapReduce 1 (CDH4 only) 2.0.0+1475 CDH4
HDFS (CDH4 only) 2.0.0+1475 CDH4
HttpFS (CDH4 only) 2.0.0+1475 CDH4
MapReduce 2 (CDH4 only) 2.0.0+1475 CDH4
Yarn (CDH4 only) 2.0.0+1475 CDH4
Hadoop 2.0.0+1475 CDH4
HBase 0.94.6+132 CDH4
HCatalog (CDH4 only) 0.5.0+13 CDH4
Hive 0.10.0+198 CDH4
Mahout 0.7+21 CDH4
Oozie 3.3.2+92 CDH4
Pig 0.11.0+33 CDH4
Sqoop 1.4.3+62 CDH4
Sqoop2 (CDH4 only) 1.99.2+85 CDH4
Whirr 0.8.2+15 CDH4
Zookeeper 3.4.5+23 CDH4
Hue 2.5.0+139 CDH4
Java java version "1.6.0_31" Java(TM) SE Runtime Environment (build 1.6.0_31-b04) Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode) Not applicable
Cloudera Manager Agent 4.7.1 Not applicable

Wednesday, September 11, 2013

Collection of lecture on Markov Chains & Hidden Markov Models

The first set of lectures introduce markov model, markov chains, and extend the markov chain to HMMs

(ML 14.1) Markov models - motivating examples
http://www.youtube.com/watch?v=7KGdE2AK_MQ

(ML 14.2) Markov chains (discrete-time) (part 1)
http://www.youtube.com/watch?v=WUjt98HcHlk

(ML 14.3) Markov chains (discrete-time) (part 2)
http://www.youtube.com/watch?v=j6OUj9tleVM

(ML 14.4) Hidden Markov models (HMMs) (part 1)  
http://www.youtube.com/watch?v=TPRoLreU9lA

===========================================
The following lecture is a more detailed, intuitive explaination about HMMs

Very intuitive
http://www.youtube.com/watch?v=jY2E6ExLxaw

=============================================
 HMM & stock prediction
http://www.slideshare.net/ChiuYW/hidden-markov-model-stock-prediction




  • TOOL KIT
  • R Package– HMM– RHMM
    JAVA– JHMM
    Python– Scikit Learn
  • DEMO
  • GET DATASET
  • library(quantmod)
    getSymbols("^TWII")
    chartSeries(TWII)
    TWII_Subset<- p="" start="as.Date(" window="">TWII_Train <- -="" cbind="" lose="" olume="" p="" pen="" twii_subset="" ubset="">



  • BUILD HMM MODEL
  • # Include RHMM Library
    library(RHmm)

    # Baum-Welch Algorithm
    hm_model <- hmmfit="" nstates="5)</p" obs="TWII_Train">
    # Viterbi Algorithm
    VitPath <- hm_model="" p="" twii_train="" viterbi="">


  • SCATTER PLOT
  • TWII_Predict <- cbind="" lose="" p="" states="" ubset="" vitpath=""> chartSeries(TWII_Predict[,1])
    addTA(TWII_Predict[TWII_Predict[,2]==1,1],on=1,type="p",col=5,pch=25)
    addTA(TWII_Predict[TWII_Predict[,2]==2,1],on=1,type="p",col=6,pch=24)
    addTA(TWII_Predict[TWII_Predict[,2]==3,1],on=1,type="p",col=7,pch=23)
    addTA(TWII_Predict[TWII_Predict[,2]==4,1],on=1,type="p",col=8,pch=22)
    addTA(TWII_Predict[TWII_Predict[,2]==5,1],on=1,type="p",col=10,pch=21)


    Sunday, September 8, 2013

    HBase install and performance test

    1. download hbase.
      [hadoopuser@cent63VM01 app]$ wget http://apache.osuosl.org/hbase/stable/hbase-0.94.2.tar.gz
      
    2. untar the file
      [hadoopuser@cent63VM01 app]$ tar xvf hbase-0.94.2.tar.gz
      
    3. create a soft link for HBase.
      [hadoopuser@cent63VM01 app]$ ln -s /hadoop/hbase-0.94.2/ /hbase
      
    4. copy sample configuration file.
      [hadoopuser@cent63VM01 app]$ cp /hbase/sr/resources/hbase-default.xml /hbase/conf/hbase-site.xml
      
    5. Edit hbase-site.xml as below.
      sfds
      
    6. copy configuration files to all nodes..
      sfdsrsync -avz ./hbase-0.94.2 hadoopuser@cent63V4.corp.ybusa.net:/hadoop/
      
    7. Zookeeper's port numbers are troublesome. To be simple, I disable the firewall.
      service iptables status
      service save iptables
      service stop iptables
      chkconfig iptables off
      

    system-config-firewall open HBase REST port 8080; open port 60000 and 60010 for master. for eegional server open port 60020 and port 60030; for zookeeper, open port 2888, 3888, 2181

    check file /etc/hosts


    rm -Rf /tmp/hadoop-username  clean data fo

    service iptables status
    service save iptables
    service stop iptables
    chkconfig iptables off

    HBase performance testing wity ycsb 0.1.4 http://johnjianfang.blogspot.com/2012/09/hbase-performance-testing-wity-ycsb-014.html

    Hbase 错误记录及修改方法
    http://blog.csdn.net/kntao/article/details/7642547

    yum install java-1.6.0-openjdk java-1.6.0-openjdk-devel

    Sunday, August 18, 2013

    FlywayFile, a maven plugin for generating migration script version number for flywayDB

    I thought I do not need DB migration tool as long as I am good at handling version control and I use OR mapping in the java world. But, after using Ruby DB migration tool for a while in a Big Data project, I feel it will make use peaceful mind if we use an agile DB migration tool. In Ruby, we have nice db migration tool from active record migrations. I use it and it is pretty handy in an agile environment. In the java world, we have agile DB migration tools too. What I find recently is FlywayDB. It looks neat and well documented. Also, I like its feature comparison on its homepage.

    But, it seems like FlywayDB does not have a tool for generate a migration files fitting its name convention for migration files. This post is about introducing my quick work that creatign a maven plugin to help user generate script file name having proper version info as prefix.

    The code has be shared on Github. It named as FlywayFile as I am thinking it could be expanded to handle file operation stuff for FlywayDB in the future. Actually, I will ask if FlywayDB will accept this as one of their Maven plugin.

    There are two eclipse maven projects in the repository. One the the main part, which is about maven plugin. The second is an extremely simple servlet, which can generate version number based on system time and random number.

    The reason for me to add this servlet here is that I think there might be case where we have developer from different time zone. But they share one version control system like subversion or git. So, in order to avoid confusion, we can use a central server to generate the migration script file name. However, this version code does not support complicate network environment yet. It is just for a prototype. I can improve it later if working in different time zones is really popular case. Below is a picture that can interprets this idea quickly.


    To use this maven plugin, we need to do the following steps
    1. Installing this maven plugin into your local maven repository as I have not registered it with central maven plugin repository
    2. Adding "org.idatamining" plugin group into your project pom.xml file or global settings.xml file
    3. Adding property "flyway.SQL.directory" in your project pom.xml. This property specifies where the SQL migration script file will be created.
    4. Adding property "flyway.filename.generator" in your project pom.xml. This property is optional. It specifies the URL, where VersionNumber servlet listens.
    5. call the generate goal: mvn flywayFile:generate -DmyFilename=jia
    Below is a sample pom.xml file I used to test this maven plugin. The property "flyway.filename.generator" is optional. We need to set its value only when we need to get version number from a central server.
    
      4.0.0
      org.idatamining
      testFly
      jar
      1.0-SNAPSHOT
      testFly
      http://maven.apache.org
      
          /home/yiyujia/workingDir/eclipseWorkspace/testFly
          http://localhost:8080/versionGenerator/VersionNumber
      
      
        
          junit
          junit
          3.8.1
          test
        
      
    
    
    
    mvn package mvn install:install-file -Dfile=./flyway-version-maven-plugin-1.0.jar -DgroupId=com.nokia.flyway -DartifactId=flyway-version-maven-plugin -Dversion=1.0 -Dpackaging=jar mvn flyway-version:flyway-version mvn com.nokia.flyway:flyway-version:1.0:create -e -DuniqueVersion=false -DmyFile=testFile mvn archetype:generate -DgroupId=idatamining.org -DartifactId=maven-flyway-generator-plugin -DarchetypeGroupId=org.apache.maven.archetypes -DarchetypeArtifactId=maven-archetype-mojo mvn -o flywayGenerator:create -DmyFile=yiyuFun -e -X mvn archetype:generate -DgroupId=org.idatamining -DartifactId=testFly -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false mvn -o flywayGenerator:create -DmyFilename="yiyuFunny" -e -X mvn -o flywayGenerator:create -DmyFilename="yiyuFunny" -Dprefix=A mvn archetype:generate -DgroupId=org.idatamining -DartifactId=versionGenerator -Dversion=1.0 -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

    Saturday, August 17, 2013

    create an EC2 instance with additional storage (EBS)

    1. Create EBS volume.
    2. Attach EBS volume to /dev/sdf (EC2's external name for this particular device number).
    3. Format file system /dev/xvdf (Ubuntu's internal name for this particular device number):
      sudo mkfs.ext4 /dev/xvdf
    4. Mount file system (with update to /etc/fstab so it stays mounted on reboot):
      sudo mkdir -m 000 /vol
      echo "/dev/xvdf /vol auto noatime 0 0" | sudo tee -a /etc/fstab
      sudo mount /vol
      
    Here is AWS relevant doc.
    groups visudo uncomment grant privilidge to wheel group ln -s /source/dir ./softlinkName ssh-keygen 1) I will send you an private key file in separate email. Say, its name is damodar.pem. you download and run “chmod 600 damodar.pem” 2) Ssh into ec2 instance: “ssh -i ./damodar.pem damodar@ec2-54-216-247-239.eu-west-1.compute.amazonaws.com “ 3) After you login, you will see a soft link called workingDir under your home directory. It linked to folder /vol/damodarSandbox. /vol is mount with an EBS volume, which gives you larger storage size. You can run df command to check it. 4) Your account damodar is in sudoer list. So, you can run sudo command. 5) To connect to redshift, you can run command “ psql -h ddw-stats.cjbo3ahdtzbo.eu-west-1.redshift.amazonaws.com -p 5439 -U YourDBname -d ddwstore” wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz"

    Wednesday, August 14, 2013

    setup Mahout environment

    
    svn co http://svn.apache.org/repos/asf/mahout/trunk
    svn co http://svn.apache.org/repos/asf/mahout/branches/mahout-0.8
    
    mv trunk/ mahoutTrunk/
    
    wget http://apache.mesi.com.ar/mahout/0.8/mahout-distribution-0.8.tar.gz
    
    ln -s /home/yiyujia/workingDir/mahoutTrunk mahout
    
    vi .bash_profile
    source .bash_profile
    
    export HADOOP_HOME=/home/yiyujia/workingDir/hadoop-1.1.1
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
    export MAHOUT_HOME=/mahout
    export PATH=$PATH:$MAHOUT_HOME
    
    
    
    
    $HADOOP_HOME/bin/hadoop fs -mkdir testdata
    
    $HADOOP_HOME/bin/hadoop fs -put synthetic_control.data testdata
    
    mvn dependency::tree
    
    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
    
    $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
    
    $MAHOUT_HOME/bin/mahout clusterdump --input output/clusters-10 --pointsDir output/clusteredPoints --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
    
    
    http://harish11g.blogspot.com/2012/02/configuring-mahout-clustering-hadoop.html https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/ http://stackoverflow.com/questions/9072223/how-to-add-maven-dependencies-for-mahout-and-hadoop

    Monday, August 5, 2013

    install hive

    
    wget http://mirrors.gigenet.com/apache/hive/stable/hive-0.8.1-bin.tar.gz
    
    tar -xvf hive-0.8.1-bin.tar.gz 
    
    ln -s /hadoop/hive-0.8.1-bin /hive
    
    
    .bash_profile
    
    #set HADOOP_HOME env
    export HADOOP_HOME=/hadoop/hadoop
    
    #set HIVE_HOME env
    export HIVE_HOME=/hive
    
    PATH=$PATH:$HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin
     
    export PATH
    
    
    yum grouplist | grep -i mysql
    yum groupinfo "MySQL Database server"
    yum groupinstall "MySQL Database server"
    
    
    
    service mysqld start chkconfig mysqld on && service mysqld restart && chkconfig --list | grep mysqld mysql -uroot -p create database hive; create user 'hive'@'%' identified by '123456'; GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' WITH GRANT OPTION; /etc/my.cf system-configure-firewall open 3066 hive-site.xml hive.metastore.local true controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM javax.jdo.option.ConnectionURL jdbc:mysql://yiyuCent1.corp.ybusa.net:3306/hive?createDatabaseIfNotExist=true JDBC connect string for a JDBC metastore javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver Driver class name for a JDBC metastore javax.jdo.option.ConnectionUserName hive username to use against metastore database javax.jdo.option.ConnectionPassword 123456 password to use against metastore database
    http://blog.milford.io/2010/06/installing-apache-hive-with-a-mysql-metastore-in-centos/

    Thursday, August 1, 2013

    setup linux VM in Azure cloud



    1. log into Microsoft Azure control port https://manage.windowsazure.com/
    2. click the New button as shown in the picture. 

    3. Input necessary info into fields as shown in the picture.  To lauch CentOS VM,I use OpenLogic CentoOS 7 image.  
    4. Click side bar Virtual Machine icon. And, click Dashboard. You can find the DNS name at the place where red arrow pointing at. 

    5. Using SSH client to login created VM:  ssh UserName@DNSName.cloudapp.net
    6. Using command to reset root account password:   a) sudo -s    b) passwd 
    7. If ssh key is not uploaded when the VM is created and you want to have passphraseless login, please refer to another blog post "configure passphraseless SSH login among CentOS servers" (no keychain installation needed).

    Monday, July 29, 2013

    Getting started with amazon Redshift

    I am recently starting learning Amazon Redshift. AWS Redsift is actually a revised version of ParAccel. There are voices in the industry to yell that the Big Data is hyped. Now, it looks like the MPP (Massive Parallel Processing) is the answer to the hyped Big Data. Of course, it depends on definition of Big Data. But, it is true that not every company need Big Data solutions as Google or Facebook needs. AWS Redsift is type of MPP DBMS, which is support SQL interface, which enterprise data warehosue developer are familiar with. But, there are some limitation as well. Redshift is often compared with Hadoop Hive (For example, the slide here. ). But, I think it is not fair as Hive is designed to be a batch processing platform from the begin. It will be more interesting to see the comparison between Hadoop HBase and Redshift or Cassandra. They are all column oriented database. Furthermore, we have other near real time interactive analyzing platform in the industry. For examples, we have Google Bigquery (commercial), Google Dremel (open source), Apache Drill (open source) etc. The common thing among them is that they are use different parallel computing architecture in stead of Map/Reduce framework. There are many other choice of MPP databases, for examples, Netezza, Dataextreme etc. Compare all of these is beyond just technical stuff. I am reading a book published by PacktPub as a beginner of Redshift. The book name is "Getting Started with Amazon Redshift". I think this is a good book for beginner of Redshift. It has less than 200 pages. It is good handbook to quickly setup AWS cluster, loading data into Redshift and practise on it. Especially, it highlighted out some features that Redshift DOES NOT support. I particularly like those parts as I am interested in knowing more about what a new product can not do instead of hearing tons of marketing language to promote it. It is good learning material in additional to Amazaon's online doc. Below is the link to the book,
    Also, here is a link from Amazon to show how to load data into Redshift.
    http://oksoft.blogspot.com/2013/03/redshift-tips.html Redshift Tips # List all tables: select db_id, id, name, sum(rows) as mysum from stv_tbl_perm where db_id = 100546 group by db_id, id, name order by mysum desc; # list all running processes: select pid, query from stv_recents where status = 'Running'; # describe table select * from PG_TABLE_DEF where tablename='audit_trail'; select * from pg_tables where schemaname = 'public' # Disk space used: select sum(used-tossed) as used, sum(capacity) as capacity from stv_partitions # Query log select query, starttime , substring from svl_qlog where substring like '%tbl_name%' order by starttime desc limit 50; # command history select * from stl_ddltext where text like '%ox_data_summary_hourly_depot%' limit 10 # last load errors select starttime, filename, err_reason from stl_load_errors order by starttime desc limit 100 select filename, count(*) as cnt from stl_load_errors group by filename # create table from another table select * into newevent from event; # Check how columns are compressed ANALYZE COMPRESSION # ANALYZE and VACUUM If you insert, update, or delete a significant number of rows in a table, run the ANALYZE and VACUUM commands against the table. "analyze compression tbl_name" command produce a report with the suggested column encoding. # To find and diagnose load errors for table 'event' create view loadview as (select distinct tbl, trim(name) as table_name, query, starttime, trim(filename) as input, line_number, field, err_code, trim(err_reason) as reason from stl_load_errors sl, stv_tbl_perm sp where sl.tbl = sp.id); select * from loadview where table_name='event'; # Query to find blocks used select stv_tbl_perm.name, count(*) from stv_blocklist, stv_tbl_perm where stv_blocklist.tbl = stv_tbl_perm.id and stv_blocklist.slice = stv_tbl_perm.slice group by stv_tbl_perm.name order by stv_tbl_perm.name; Load tips: # While loading data you can specify "empty as null", "blanks as null" allow "max error 5", "ignore blank lines", "remove quotes", "use zip". Use the keywords: emptyasnull blanksasnull maxerror 5 ignoreblanklines removequotes gzip # use NULL AS '\000' to fix the import from specific files # use BLANKASNULL in the original COPY statement so that no empty strings are loaded into VARCHAR fields which might ultimately be converted to numeric fields. # Use the NOLOAD keyword with a COPY command to validate the data in the input files before actually loading the data. # use COMPUPDATE to enable automatic compression # FILLRECORD to fill missing columns at the end with blanks or NULLs # TRIMBLANKS Removes the trailing whitespace characters from a VARCHAR string. # ESCAPE the backslash character (\) in input data is treated as an escape character. (useful for delimiters and embedded newlines) # ROUNDEC a value of 20.259 is loaded into a DECIMAL(8,2) column is changed to 20.26. or else 20.25 # TRUNCATECOLUMNS Truncates data in columns to the appropriate number. # IGNOREHEADER to ignore first row _____ If you are using JDBC, can you try adding the keepalive option to your connect string. E.g., jdbc:postgresql://instance.amazonaws.com:8192/database?tcpkeepalive=true You can have AUTOCOMMIT set in your Workbench client. _____ In order to avoid timeout error while using workbench on Windows, use the following setting: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime 30000 HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval 1000 HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxDataRetransmission 10 _____ # Consider using DISTKEY and SORTKEY - There can be multiple sortkeys but only one primary key. # wlm_query_slot_count - This will set aside more memory for query, which may avoid operations spilling to disk # the isolation level for Redshift is SERIALIZABLE _____ // There is no equivalent of "show create table tbl_name" select from the PG_TABLE_DEF table to gather all the necessary schema information // convert to and from unixtime select extract (epoch from timestamp '2011-08-08 11:11:58'); select TIMESTAMP 'epoch' + starttime * INTERVAL '1 second' starting from tbl_name; // Update a joined table: update abcd set ser_area_code=abcd_update.ser_area_code, preferences=abcd_update.preferences, opstype=abcd_update.opstype, phone_type=abcd_update.phone_type from abcd_update join abcd nc on nc.phone_number = abcd_update.phone_number http://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html#concept_upsert _____ // install postgresql yum install postgresql postgresql-server chkconfig postgresql on // You will now create a file where the redshift password will be stored. vi ~/.pgpass c.us-east-1.redshift.amazonaws.com:5439:mydb:root:Passwd chmod 0600 ~/.pgpass // load data to redshift cat to_psql.txt | psql -hc.us-east-1.redshift.amazonaws.com -Uroot -p5439 mydb > to_save.csv // send the file as an attachment echo "report file attached. " | mutt -s "result data " -a to_save.csv -- some_address@gmail.com // mysqldump command that will generate the required statements to be used in redshift mysqldump db_name tbl_name --where='1=1 limit 10' --compact --no-create-info --skip-quote-names > to_psql.txt _____ Amazon data types are different than of MySQL. For e.g. literals can be saved only as varchar type and upto 65000 bytes. http://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html Here is a script that will do this conversion automatically. https://gist.github.com/shantanuo/5115366 _____ If postgresql client is installed, we can connect to redshift using something like this... # PGPASSWORD=Fly8946392085 psql -U fsb_user_85_22719249 -h flydata-sandbox-cluster.clroanynhqjo.us-east-1.redshift.amazonaws.com -p 5439 -d flydatasandboxdb Welcome to psql 8.1.23 (server 8.0.2), the PostgreSQL interactive terminal.

    Wednesday, July 24, 2013

    setup a Maven project in Eclipse for Mahout application

    1. mvn archetype:generate -DgroupId=jia.blog -DartifactId=mahoutStart -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
    2. cd mahoutStart/
      mvn eclipse:eclipse
    3. edit pom.xml
      
                  org.apache.hadoop
                  hadoop-core
                  1.1.1 
              
         
              org.apache.mahout
              mahout-core
              0.8
          
      
      

    Saturday, June 8, 2013

    expanding virtual machine's hard drive size

    Here is my steps to expand my virtual machine's hard drive size. I use KVM. The guest OS is CentOS. My virtual machine does not use LVM.

    1. My VM does not have LVM group. great!
      lvdisplay.
    2. shutdown virtual machine and run command in host machine to resize virtual disc:
      qemu-img resize vmdisk.img +20G .
      This is probably only difference for different vitalization platform. Using different ways to resize image size.
    3. boot up virtual machine and run command to check hard disc size.
      fdisk -l .
    4. turn off swap:
      swapoff -a
    5. we use parted as partition tool. so call command
      parted
    6. check virtual hard drive's partition info as below.
      (parted) print free                                                       
      Model: ATA QEMU HARDDISK (scsi)
      Disk /dev/sda: 42.9GB
      Sector size (logical/physical): 512B/512B
      Partition Table: msdos
      
      Number  Start   End     Size    Type     File system     Flags
              32.3kB  1049kB  1016kB           Free Space
       1      1049kB  316MB   315MB   primary  ext4            boot
       2      316MB   19.4GB  19.0GB  primary  ext4
       3      19.4GB  21.5GB  2114MB  primary  linux-swap(v1)
      
              21.5GB  42.9GB  21.5GB           Free Space
      
    7. remove swap partition.
      (parted) rm 3
    8. set unit to be KB to avoid wasting too much space.
      (parted) unit kb
    9. make a primary partition in the unallocated space.
      (parted) mkpart primary 19360907kB 38949673kB
    10. Make all rest unallocated space as swap partition.
      mkpart primary 38949673kB 42949670kB
    11. reboot virtual machine
    12. turn off swap
      swapoff -a
    13. Format new created partition as ext4
      mkfs.ext4 /dev/sda3
    14. Format swap partition
      mkswap /dev/sda4
    15. I will mount /home direction to new partion (sda3). So, I backup my home directory first and create new /home directory.
      mv /home /homebackup
      mkdir home
    16. double check partition info again.
      cat /proc/partitions
    17. check the auto mount configurion.
      cat /etc/fstab
    18. Format swap partition
      mkswap /dev/sda4
    19. find out sda3 UUID
      blkid /dev/sda3
    20. add sda3 into auto mount
      vi /etc/fstab
      add line: UUID=e4805e9c-869a-48d5-8783-0bfcb5b65a3e /home ext4 defaults 1 3
    21. reboot
    22. move old home directory content into new home directory
      cp -rf /homebackup/* /home/
    23. change the ownership of content to different user account. for example,
      chown -R yiyujia:yiyujia ./yiyujia/
    24. check disc usage info,
      df

    Also, adding a new disck to existing virtual guest could be good option too.
    1. cd into folder to find guest's configuration file as xml and backup it.
      /etc/libvirt/qemu
    2. create a new raw disk image.
      qemu-img create -f raw newDisk.img 20G
    3. Attach newly created disk image
      #virsh attach-disk
    4. login guest machine and partition new disck.

    Saturday, May 11, 2013

    how to get Maxium or Minium value from Salesforce object in pentaho Kettle

    We may get funny result if we apply MIN() or MAX() function in Salesforce SOQL query. It may be because the field past into MIN()/MAX() function is a picklist type. "If you use the MIN() or MAX() functions on a picklist field, the function uses the sort order of the picklist values instead of alphabetical order".

    So, how can we get Maxium or Minium values through SOQL in kettle ETL process? I have achieve this by using Kettle in memory group by step. Below shows how to step by step.


    1.  As discussed in previous post, we can get distinct fields values from Salesforce.
    2. After getting the result, the trick and simple step is using Javascript step to insert a dummy field with dummy value.
    3. Now, using Kettle's in memory group by step to get Max or Min value of Attr1__c field from Salesforce object.

    Thursday, May 9, 2013

    Implement SELECT DISTINCT in Salesforce SOQL

    As far as I know, Salesforce SOQL does not support SELECT DISTINCT. In stead of implementing distinct function in your own application, here is a tip to do SELECT DISTINCT in SOQL.

    Suppose we have a Customized salesforce object called Dummy__c, which have attributes Attr1__c, Attr2__c, Attr3__c etc, We want to get distinct values of Attr1__c. To achieve this, we can easily get distinct values of Attr1__c by using GROUP BY with COUNT_DISTINCT function:


    SELECT Attr1__c, COUNT_DISTINCT(Attr__c) FROM Dummy__c GROUP BY Attr1__c
    

    That's all. We do not need to load large number of records into our own application's memory and count unique values.

    Monday, April 29, 2013

    step by step to setup Oracle instant client and Oracle express 11

    First, go to Oracle web site to download rpm files used below.
    Below is just a raw copy of steps for installing Oracle instant client and Oracle Express 11 on a CentOS 64 machine.
    19:45 yiyujia@localhost:~ $ hostname
    localhost.localdomain
    19:45 yiyujia@localhost:~ $ su
    Password: 
    [root@localhost yiyujia]# rpm -ivh /home/yiyujia/Downloads/oracle-instantclient11.2-basic-11.2.0.4.0-1.x86_64.rpm 
    Preparing...                ########################################### [100%]
       1:oracle-instantclient11.########################################### [100%]
    [root@localhost yiyujia]# rpm -ivh /home/yiyujia/Downloads/oracle-instantclient11.2-devel-11.2.0.4.0-1.x86_64.rpm 
    Preparing...                ########################################### [100%]
       1:oracle-instantclient11.########################################### [100%]
    [root@localhost yiyujia]# rpm -ivh /home/yiyujia/Downloads/oracle-instantclient11.2-odbc-11.2.0.4.0-1.x86_64.rpm 
    Preparing...                ########################################### [100%]
       1:oracle-instantclient11.########################################### [100%]
    [root@localhost yiyujia]# rpm -ivh /home/yiyujia/Downloads/Disk1/oracle-xe-11.2.0-1.0.x86_64.rpm 
    Preparing...                ########################################### [100%]
       1:oracle-xe              ########################################### [100%]
    Executing post-install steps...
    You must run '/etc/init.d/oracle-xe configure' as the root user to configure the database.
    
    [root@localhost yiyujia]# /etc/init.d/oracle-xe configure
    
    Oracle Database 11g Express Edition Configuration
    -------------------------------------------------
    This will configure on-boot properties of Oracle Database 11g Express 
    Edition.  The following questions will determine whether the database should 
    be starting upon system boot, the ports it will use, and the passwords that 
    will be used for database accounts.  Press  to accept the defaults. 
    Ctrl-C will abort.
    
    Specify the HTTP port that will be used for Oracle Application Express [8080]:8888
    
    Specify a port that will be used for the database listener [1521]:
    
    Specify a password to be used for database accounts.  Note that the same
    password will be used for SYS and SYSTEM.  Oracle recommends the use of 
    different passwords for each database account.  This can be done after 
    initial configuration:
    Confirm the password:
    
    Do you want Oracle Database 11g Express Edition to be started on boot (y/n) [y]:n
    
    Starting Oracle Net Listener...Done
    Configuring database...Done
    Starting Oracle Database 11g Express Edition instance...Done
    Installation completed successfully.
    [root@localhost yiyujia]# rpm -ivh /home/yiyujia/Downloads/sqldeveloper-3.2.20.09.87-1.noarch.rpm 
    Preparing...                ########################################### [100%]
       1:sqldeveloper           ########################################### [100%]
    [root@localhost yiyujia]# source /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh 
    [root@localhost yiyujia]# vi /etc/bashrc
    #export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib:/usr/local/lib
    #source /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
    [root@localhost yiyujia]# usermod -a -G oracle yiyujia
    usermod: group 'oracle' does not exist
    [root@localhost yiyujia]# usermod -a -G dba yiyujia
    [root@localhost yiyujia]# usermod -a -G dba root
    [root@localhost yiyujia]# exit
    exit
    21:11 yiyujia@localhost:~ $ sqlplus / as sysdba
    bash: sqlplus: command not found
    21:11 yiyujia@localhost:~ $ source /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh
    
    21:11 yiyujia@localhost:~ $ sqlplus / as sysdba
    
    SQL*Plus: Release 11.2.0.2.0 Production on Tue Oct 29 21:11:55 2013
    
    Copyright (c) 1982, 2011, Oracle.  All rights reserved.
    
    
    Connected to:
    Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
    
    SQL> select * from v$database;                                     
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    2711238071 XE      29-OCT-13      353178 29-OCT-13
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
            1 28-AUG-11    NOARCHIVELOG      357224
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
           0 CURRENT 29-OCT-13        430
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
          357257 29-OCT-13        NOT ALLOWED 29-OCT-13
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    READ WRITE      MAXIMUM PERFORMANCE  UNPROTECTED        ENABLED
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
     2711229879  2711229879 PRIMARY      0 DISABLED
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    NOT ALLOWED      DISABLED NONE    NO       NO  NO  NO     13
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    Linux x86 64-bit
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
          2     2  358239
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    NO     NO  NO  XE
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
        0 DISABLED
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
          0
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    NO       NO
    
          DBID NAME      CREATED  RESETLOGS_CHANGE# RESETLOGS_TIME
    ---------- --------- ------------------ ----------------- ------------------
    PRIOR_RESETLOGS_CHANGE# PRIOR_RESETLOGS_TI LOG_MODE CHECKPOINT_CHANGE#
    ----------------------- ------------------ ------------ ------------------
    ARCHIVE_CHANGE# CONTROL CONTROLFILE_CREATE CONTROLFILE_SEQUENCE#
    --------------- ------- ------------------ ---------------------
    CONTROLFILE_CHANGE# CONTROLFILE_TIME   OPEN_RESETL VERSION_TIME
    ------------------- ------------------ ----------- ------------------
    OPEN_MODE      PROTECTION_MODE   PROTECTION_LEVEL     REMOTE_A
    -------------------- -------------------- -------------------- --------
    ACTIVATION# SWITCHOVER# DATABASE_ROLE  ARCHIVELOG_CHANGE# ARCHIVEL
    ----------- ----------- ---------------- ------------------ --------
    SWITCHOVER_STATUS    DATAGUAR GUARD_S SUPPLEME SUP SUP FOR PLATFORM_ID
    -------------------- -------- ------- -------- --- --- --- -----------
    PLATFORM_NAME
    --------------------------------------------------------------------------------
    RECOVERY_TARGET_INCARNATION# LAST_OPEN_INCARNATION# CURRENT_SCN
    ---------------------------- ---------------------- -----------
    FLASHBACK_ON    SUP SUP DB_UNIQUE_NAME
    ------------------ --- --- ------------------------------
    STANDBY_BECAME_PRIMARY_SCN FS_FAILOVER_STATUS   FS_FAILOVER_CURRENT_TARGET
    -------------------------- ---------------------- ------------------------------
    FS_FAILOVER_THRESHOLD FS_FAIL
    --------------------- -------
    FS_FAILOVER_OBSERVER_HOST
    --------------------------------------------------------------------------------
    CON PRIMARY_DB_UNIQUE_NAME    SUP MIN_REQUIRED_CAPTURE_CHANGE#
    --- ------------------------------ --- ----------------------------
    
    
    SQL> 
    
    
    

    Add new user for SQL Developer to use for connecting.
    
    23:45 yiyujia@localhost:~ $ sqlplus / as sysdba
    
    SQL*Plus: Release 11.2.0.2.0 Production on Tue Oct 29 23:51:40 2013
    
    Copyright (c) 1982, 2011, Oracle.  All rights reserved.
    
    
    Connected to:
    Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production
    
    SQL> startup
    ORA-01081: cannot start already-running ORACLE - shut it down first
    SQL> shutdown immediate
    Database closed.
    Database dismounted.
    ORACLE instance shut down.
    SQL> startup
    ORACLE instance started.
    
    Total System Global Area 1068937216 bytes
    Fixed Size      2233344 bytes
    Variable Size    616565760 bytes
    Database Buffers   444596224 bytes
    Redo Buffers      5541888 bytes
    Database mounted.
    Database opened.
    SQL> create user yiyujia identified by 123456;
    
    User created.
    
    SQL> grant dba to yiyujia;
    
    Grant succeeded.
    
    
    

    Saturday, April 20, 2013

    How to reconfigure Oracle Express listeners after changing hostname

    After the computer has hostname changed, the Oracle Express 11 may not work as its configuration still point to old hostname. Hostname may be changed when the machine's network work status is changed. For example, I used NAT mode for my virtual machine. After I change its NIC to have Bridge mode, the hostname may be changed. Here is my steps to reconfigure Oracle Express listeners to bind on new hostname.

    1. Run “hostname” to get your server's new hostname from DNS server.
    2. Run “cd /u01/app/oracle/product/11.2.0/xe/network/admin”
    3. Replace “localhost.localdomain” in file listener.ora with your new hostname.
    4. Replace “localhost.localdomain” in file tnsnames.ora with your new hostname.
    5. tnsping yurNewHostName and see XE instance is running.

    Saturday, March 30, 2013

    OAuth 1.0 and OAuth 2.0, three-legged authenticaton and two-legged authentication

    It is easy for a Web developer to understand why we need OAuth1.0. It is a three legs scenario: 1) Web browser user, 2) third party service provider that want to access user's info from 3) OAuth provider that hosts user's valuable info.

    So, for the reason of security, web browser user should not just gives third party web site his/her user name and password to get info on behalf of himself/herself. You know, not every web site could be trusted. It is not only because of immoral web sites may use your personal info in wrong way but also because of their capability of securing your privacy. Therefore, URL redirecting comes into OAuth framework. Also, web browser use may not want to allow the web site to access all info hosted in OAuth service provider's site. Therefore, the OAuther plays as authorization framework too. Furthermore, for the reason of light encryption and signature, HMAC was employed. But, OAuth2.0 drop the signatures and asks the authorrization request to be send over SSL/STL, in which securet key is delivered and directly validated. So, in OAuth2.0, there is no requirements for the order of parameters. However, the benefit is not cost free as the server and client have to spend more resource to handle SSL communication.

    To be used in native app, like movbile app, OAuth2.0 standardize the extension about two legs authorization, which I think there is security problem. You may think a server or user's own server could be trusted to have user's cridential info. But, it will be highly risk to implement two leg flow in a mobile app, which may be used on large amount device and different people. Here is an article about OAuth2.0 and the road to hell. But, in case of users want to make their own application to get info through OAuth service provider. It doesnot matter to use password flow. And put this flow under the name of OAuth (OAuth 2.0 actually).


    Below is three different flows from Salesforce as examples. It is important for developer to understand which flow they need to start with according to their use case. It is interesting to see they declare that their flow is for authenticating purpose. What is the difference between authenticationa nd authorization if all resources are expose once authentication is past.

    OAuth 2.0 User-Agent Flow


    OAuth 2.0 Web server


    OAuth 2.0 Username-Password Flow
    http://stackoverflow.com/questions/7561631/oauth-2-0-benefits-and-use-cases-why http://hueniverse.com/2012/07/oauth-2-0-and-the-road-to-hell/ http://www.wolfe.id.au/2012/10/20/what-is-hmac-and-why-is-it-useful/

    Sunday, March 17, 2013

    anatomy of hadoop Map/Reduce process


    In Hadoop MapReduce framework, besides map and reduce phases, the shuffling phases is actually more important as lots of optimizations jobs are done in shuffling phase. Below diagram shows what may happens in the Shuffling step. Each steps there are chance for speeding up processing or slowing down processing.


    Above sub steps happens in spill thread. Spill thread read Map output from buffer (100MB by default) and do these sub steps based on if those respondent functions are defined. Eventually, all small split files will be merged into a big one as Map output. 


    http://blog.sina.com.cn/s/blog_62186b4601012fty.html Shuffle是指从Map产生输出开始,包括系统执行排序以及传送Map输出到Reducer作为输入的过程。 首先从Map 端开始分析。当Map开始产生输出时,它并不是简单的把数据写到磁盘,因为频繁的磁盘操作会导致性能严重下降。它将数据首先写到内存中的一个缓冲区,并做了一些预排序,以提升效率。这个缓冲区默认大小是100MB。当缓冲区中的数据量达到一个特定阀值(默认是0.80)时,系统将会把缓冲区中的内容spill到磁盘。在spill过程中,Map 的输出将会继续写入到缓冲区,但如果缓冲区已满,Map 就会被阻塞直到spill完成。spill 线程在把缓冲区的数据写到磁盘前,会首先根据数据所属的partition 排序,(partition用于确定每部分数据之后交由哪个reduce继续处理),然后每个partition中的数据再按Key 排序。如果设定了Combiner,将在排序输出的基础上运行Combiner。Combiner 就是一个本地Reducer,使得Map的输出更紧凑,更少的数据会被写入磁盘和传送到Reducer。 每当内存中的数据达到spill 阀值的时候,都会产生一个新的spill 文件,所以在Map任务写完它的最后一个输出记录时,可能会有多个spill文件。在Map任务完成前,所有的spill文件将会被归并为一个已分区和已排序的输出文件。 当spill 文件归并完毕后,Map将删除所有的临时spill 文件,并告知TaskTracker 任务已完成。Map 的输出文件放置在运行Map 任务的TaskTracker 的本地磁盘上,它是运行Reduce 任务的TaskTracker 所需要的输入数据。Reducers 通过HTTP 来拷贝获取对应的输入数据。当所有的Map 输出都被拷贝后,Reduce任务进入排序阶段(更恰当的说应该是归并阶段,因为排序在Map 端就已经完成),这个阶段会对所有的Map 输出进行归并排序。 之后在Reduce 阶段,Reduce 函数会作用在排序输出的每一个key/value对上。这个阶段的输出被直接写到输出文件系统,一般是HDFS。到此,MapReduce 的Shuffle 和Sort 分析完毕。

    Wednesday, March 13, 2013

    Blocking Third-Party Cookies may not be such a big deal for online adv industry

    It is announced that Firefox will block third-party cookie by default as what Safari have been done so for years. (See blog posts Firefox getting smarter about third-party cookies and the New Firefox Cookie Policy).

    This is big news in the online advertisement industry. There are many reports on the internet. As post "Imagining a World Without Cookies" points out, there will be winners and losers if there is no third-party cookies. I pretty much agree with their points. But, before publishers start cheering for this, they need to seriously think about what will be their solution to replace the third-party cookies from giants in the industry. Obviously, publisher should be able to produce solution to collect, store, analyze, secure, and share user behavior data.

    But, if publishers are not able to implement such solution. What could happen? I think there are two ways to go at least. Publishers can either license solution from third-party software vendor or host javascript code distributed by traditional cookie collectors. I will call this javascript code as cookie proxy.


    Figure 1. Javascript proxy convert third-party cookie to first-party cookie


    As shown in Figure 1, current third-party cookie can be easily converted into first-party cookie. The difference/questions are,
    1.  Publishers need to host a third-party cookie Javascript proxy code on their domain. The proxy code expose API for traditional third party cookie generators to call for generate cookies.
    2. In other words, now, it is publisher's right to say if they allow third-party Javascript to be hosted on their domain. 
    3. One interesting questions is, to allow hosting third-party Javascript code, who will take the responsibility to promise the safety of third-party Javascript code. We can see normally, third-party cookie generators have deep dynamical Javascript loading, which make tracing code to be more difficult.
    4. The resource allocated by Web browser for certain domain is limited. So, how much do these resource cost for traditional third-party cookie generator? Will they pay for it? 
    5. Some experiments need to be done in the future to figure out if sub-domain can walk around the limitation of resource allocated for cookies. According to RFC, this may work. 
    Anyway, from the technical side, this third-party cookie issue belongs to cross domain script (XSS). I listed several ways for cross domain script in post "seven ways for cross domain script programming". I dont know how Firefox will handle techniques listed there against blocking third-party cookie by default. yet. It will be interesting to explore it. But, one thing should be sure that Javascript loaded from third-party domain is allowed. Otherwise, WWW distributed computing model would be damaged and something wonderful is stopped, for example, CDN etc.

    This event clearly determines that our Web is still evolving quickly. WWW obviously has more and more job to do after define HTML5. For example, one thing I observed and post as title "A war to occupy Internet user's Web browser " could be interesting stuff to be solved from the RFC level.

    Monday, February 18, 2013

    Removing Sencha Eclipse plugin license file from my linux platform


    I am using spket eclipse plugin for editing ExtJS code. But, I tried Sencha Eclipse plugin before. I need to delete some directories and files if I want to completely remove Sencha Eclipse plugin.

    For example, I need to remove its license file as below,

    rm $HOME/.local/share/Sencha/.ftp


    Btw, I am listing two books about ExtJS 4.x here. I am happy to serve as technical reviewer for these two books. They will save your time that costs more money than these books.

    Saturday, February 9, 2013

    shell script to block main script exit before all other spined backend script exit.

    Sometimes, we need to write shell script to spin multiple process running parallel. Also, we want to wait all parallel process finish before we can process to next step. How to achieve this? Below is a simple Bash shell sample script to do this.

    n=10
    for ((i=1; i < $n; i++));
     do
     #compose the command
     realCmd="nohup yourCmdWithArguments >/dev/null 2>&1 &"
     #excute the command
     eval $realCmd
     #store command pid in an array.
      pidArray[$i]=$!
    done
    
    #wait for each process to be done
    for ((i=1;i<$n;i++));
     do
       wait ${pidArray[$i]}
    done
    
    #continue to do next step that depends on results of all nohup spin processes above.
    
    

    Sunday, January 27, 2013

    HIVE Select count performance

    I read a discussion about performance of Hive Select, that explains why Select count(*) slower than slect *. I note it here as one of hints for tuning hive query.
    1. A hive query can be a metadata only request.
    2. A hive query can be an hdfs get request.
    3. A hive query can be a Map Reduce job.

    Also, we can control number of map/reducer for better performance by rewriting query. For example, writing query in hive like this:

     SELECT COUNT(DISTINCT id) ....
    
    
    It will always result in using only one reducer that slow down query. We can rewrite query as sub query to better performance:
    1. use this command to set desired number of reducers:
      set mapred.reduce.tasks=50
    2. rewrite query as following:
      SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;

    Wednesday, January 23, 2013

    setup hadoop development environment in eclipse

    1. Finding the version of Maven from maven homepage and running the wget command from the dir you want to extract maven too.
    2. untar the file: tar xvf apache-maven-3.0.4-bin.tar.gz
    3. Next add the env variables to your ~/.bashrc file
      export M2_HOME=/usr/local/apache-maven/apache-maven-3.0.4 
      export M2=$M2_HOME/bin 
      export PATH=$M2:$PATH 
      
    4. Verify everything is working with the following command
       mvn -version
      
    5. make sure g++ is installed
    6. download and install protocol buffer compiler and install protocol buffer java library in mvn repository
      export MAVEN_OPTS='-Xms384M -Xmx512M -XX:MaxPermSize=256M'
      
      /home/yiyujia/.m2/repository/com/google/protobuf/protobuf-java/{versionNum} 
      
    7. Fix classpath variable error in Eclipse if needed.
      Open the Eclipse Preferences: Window -> Preferences
      Go to [Java - Build Path - Classpath Variables]
      Click New and set its name as M2_REPO
      Click Folder and select your Maven repository folder. For example, /home/yiyujia/.m2/repository
      A little bit more about maven environment settting.
      
    8. Treat hadoop programming as normal maven project. Adding hadoop dependencies, coding and testing.
    Note: "mvn eclipse:eclipse" is not needed for new version eclipse.

    Friday, January 18, 2013

    Video collection on machine learning lecture

    I like this video series presented by Andrew Ng of Standford University very much.

    Following link to the free Stanford machine learning course which you can sign up. https://www.coursera.org/course/ml