iDatamining.org

I am looking for projects to work on
Please contact with me at yiyu.jia@BostonInfoPro.com!

Saturday, November 1, 2014

prepare Azure management certification for Vagrant


On Linux


  • Creating certification
    1. openssl genrsa -out azureMgt.key 2048
    2. openssl req -new -key azureMgt.key -out azureMgt.csr
    3. openssl x509 -req -days 730 -in azureMgt.csr -signkey azureMgt.key -out tempAzureMgt.pem
    4. cat azureMgt.key tempAazureMgt.pem > azureMgt.pem
    5. openssl x509 -inform pem -in azureMgt.pem -outform der -out azureMgt.cer
    6. openssl pkcs12 -export -out azureMgt.pfx -in azureMgt.pem -inkey azureMgt.key -name "Azure Certification"
  • Uploading file azureMgt.cer to Azure management portal
  • Using file azureMgt.pem on your azure client side. for example, using this file in Vagrant (the pfx format has problem based on my testing).
  • Generating SSH key for vagarant to launch VM on azure.
    1. openssl req -x509 -nodes -days 1095 -newkey rsa:2048 -keyout azuresshkey.key -out azuresshcert.pem
    2. Change the permissions on the private key and certificate for security. chmod 600 azuresshcert.pem ; chmod 600 azuresshkey.key
    3. azuresshcert.pem is vm's ssh certifiction. azuresshkey.key is the ssh key that we use to access VM

On Windows

work on this later. 

Tuesday, October 21, 2014

executing sequence of JOIN, WHERE, GROUP BY HAVING clauses in Hive

Understanding the sequence of executing sequence of clauses in Hive is very helpful for optimizing query. It will be idea if we can make every inter steps to generate data set as small as possible. The order of executing part of a query:


  1. FROM & JOINs determine & filter rows
  2. WHERE more filters on the rows
  3. GROUP BY combines those rows into groups
  4. HAVING filters groups
  5. ORDER BY arranges the remaining rows/groups


Monday, October 6, 2014

install vagrant on CentOS


[root@vmhost01 vagrant]# wget https://dl.bintray.com/mitchellh/vagrant/vagrant_1.6.3_x86_64.rpm

[root@vmhost01 vagrant]# yum install vagrant_1.6.3_x86_64.rpm 

[root@vmhost01 vagrant]# vagrant plugin install vagrant-mutate

[root@vmhost01 vagrant]# mkdir /vm1/vm10

[root@vmhost01 vagrant]# cd /vm1/vm10

[root@vmhost01 vm10]# vagrant box add centos65 http://puppet-vagrant-boxes.puppetlabs.com/centos-65-x64-virtualbox-puppet.box
==> box: Adding box 'centos65' (v0) for provider: 
    box: Downloading: http://puppet-vagrant-boxes.puppetlabs.com/centos-65-x64-virtualbox-puppet.box
==> box: Successfully added box 'centos65' (v0) for 'virtualbox'!

[root@vmhost01 vm10]# vagrant mutate centos65 libvirt
You have qemu 0.12.1 installed. This version cannot read some virtualbox boxes. If conversion fails, see below for recommendations. https://github.com/sciurus/vagrant-mutate/wiki/QEMU-Version-Compatibility
Converting centos65 from virtualbox to libvirt.
    (98.57/100%)
The box centos65 (libvirt) is now ready to use.
[root@vmhost01 vm10]# pwd
/opt/vm1/vm10
[root@vmhost01 vm10]# vagrant init centos65
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.

[root@vmhost01 vm10]# vagrant plugin install vagrant-libvirt  --plugin-version 0.0.19


[root@vmhost01 vm10]#  vagrant up --no-parallel

http://spin.atomicobject.com/2013/06/03/ovf-virtual-machine/ https://ttboj.wordpress.com/2014/05/13/vagrant-on-fedora-with-libvirt-reprise/ https://ttboj.wordpress.com/2013/12/09/vagrant-on-fedora-with-libvirt/

Tuesday, September 23, 2014

OAuth2.0 refresh token and access token


When we implement OAuth2, we have access token and refresh token. Why do we need both refresh token and access token? Using plain English, below are reasons in simplified version.


  • For security reason, OAuth2 has both refresh token and access token. access token is something close to one-time password, which is ideally secure. access token may be expired shortly. Refresh token may last for long time and even will not expire until it will be revoked. 


  • For performance and scalability reason, it better to verify HTTP request on the resource server instead of on central authorization server for every HTTP request.


  • Basically, access_token is kind of temporary password and refresh_token is pass to get temporary password from central authentication server. "temporary" password, access_token, is verified on resource server.

    Thursday, September 11, 2014

    install phpMyAdmin on CentOS 7

        1  yum update
        2  df
        3  yum -y install mariadb-server mariadb
        4  systemctl start mariadb.service
        5  systemctl enable mariadb.service
        6  mysql_secure_installation
        7  yum -y install httpd
        8  systemctl start httpd.service
        9  systemctl enable httpd.service
       10  firewall-cmd --permanent --zone=public --add-service=http 
       11  firewall-cmd --permanent --zone=public --add-service=https
       12  firewall-cmd --reload
       13  ifconfig 
       14  yum -y install php
       15  systemctl restart httpd.service
       16  cd /var/www/html/
       17  vi info.php
       18  yum -y install php-mysql
       19  yum -y install php-gd php-ldap php-odbc php-pear php-xml php-xmlrpc php-mbstring php-snmp php-soap curl curl-devel
       20  cd /tmp/ 
       21  wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
       22  yum install epel-release-latest-7.noarch.rpm 
       23  yum install phpMyAdmin.noarch
    
    

    Wednesday, September 3, 2014

    install Oracle jdk on CentOS

    yum groupinstall -y development
    
    https://www.if-not-true-then-false.com/2014/install-oracle-java-8-on-fedora-centos-rhel/
    
    
    
    alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0_65/jre/bin/java 200000
     alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.7.0_65/jre/bin/javaws 200000
    alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/lib/jvm/jdk1.7.0_65/jre/lib/amd64/libnpjp2.so 200000
    alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0_65/bin/javac 200000
    alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.7.0_65/bin/jar 200000
    
    
    
    /usr/lib/jvm/jdk1.8.0_25
    
    alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_25/jre/bin/java 300000
    alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.8.0_25/jre/bin/javaws 300000
    alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/lib/jvm/jdk1.8.0_25/jre/lib/amd64/libnpjp2.so 300000
    alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_25/bin/javac 300000
    alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.8.0_25/bin/jar 300000
    alternatives --install /usr/bin/jvisualvm jvisualvm /usr/lib/jvm/jdk1.8.0_25/bin/jvisualvm 300000
    
    
    su -i 
    
    wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
    
    
    /usr/lib/jvm/jdk1.8.0_131
    
    alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_131/jre/bin/java 300000
    alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.8.0_131/jre/bin/javaws 300000
    alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/lib/jvm/jdk1.8.0_131/jre/lib/amd64/libnpjp2.so 300000
    alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_131/bin/javac 300000
    alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.8.0_131/bin/jar 300000
    alternatives --install /usr/bin/jvisualvm jvisualvm /usr/lib/jvm/jdk1.8.0_131/bin/jvisualvm 300000
    
    
    

    Thursday, July 24, 2014

    Asgard 1.5 Config.groovy file format

    I tried to setup Asgard 1.5 today and find its initializing process has been changed a little bit. I remember I will be asked for aws credential information when I launched asGard first time. But, for Asgard 1.5 (ASGard 1.4.2 has same situation), I got error message like "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain". After searching online, I make it work by setting up environment variable AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

    In fact, we can just directly create the asgard configuration file ~/.asgard/Config.groovy. The format of the file is as below.

    
    grails {
            awsAccounts=['YourAccountIDasDigitalNumber']
            awsAccountNames=['YourAccountIDasDigitalNumber':'YourAccountName']
    }
    secret {
            accessId='YourAccessID'
            secretKey='YourAccessSecretKey'
    }
    cloud {
            accountName='YourAccountName'
            publicResourceAccounts=['amazon']
    }
    
    
    Hope above tip will save others time. In fact, I solve the problem after reading this discussion on google asgard group.

    Friday, July 11, 2014

    can we leave missing data as null when load source data into data warehouse?

    No! 


    because

    NULL != NULL

    Think about when you do a outer join and some NULL was created after outer join. Then, SQL programmers, who will create SQL query on aggregate tables, will not be able to identify the source of NULL.

    Tuesday, June 10, 2014

    set up Grail on Groovy environment on CentOS

    
    yiyujia@hadoopDev:~ $ curl -s get.gvmtool.net | bash
    yiyujia@hadoopDev:~ $ source "/root/.gvm/bin/gvm-init.sh"
    yiyujia@hadoopDev:~ $ gvm help
    yiyujia@hadoopDev:~ $ gvm install groovy
    yiyujia@hadoopDev:~ $ gvm install grails
    yiyujia@hadoopDev:~ $ gvm install grails 2.2.4
    yiyujia@hadoopDev:~ $ grails
    | Enter a script name to run. Use TAB for completion: 
    grails> quit
    | Goodbye
    yiyujia@hadoopDev:~ $ 
    
    https://github.com/Netflix/asgard/blob/master/application.properties #Grails Metadata file #Wed Jun 05 11:45:00 PST 2014 app.grails.version=2.2.4 app.name=asgard app.servlet.version=2.4 app.version=1.5

    Monday, June 2, 2014

    Tuning redshift query performance

    We can tune Redshift query by the following four steps literately

    1. Identify slow queries

    select trim(t3.usename) AS username , t1.query AS query_ID, t1.starttime, t1.elapsed, t2.querytxt from svl_qlog t1 join STL_QUERY t2 ON t1.query = t2.query JOIN pg_user t3 ON t1.userid = t3.usesysid  order by t1.elapsed desc limit 50;

    2. Table Design

    select correct sort key, distribution key, and column compression algorithms (see blog post "Redshift table optimizing with SORTKEY and DISTKEY.")

    3. Query Design



    4. WLM Design


    http://docs.aws.amazon.com/redshift/latest/dg/t_determine_if_query_runs_on_disk.html http://docs.aws.amazon.com/redshift/latest/dg/r_SVL_QLOG.html http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html select query, elapsed, substring from svl_qlog order by elapsed desc limit 5; select query, step, rows, workmem, label, is_diskbased from svl_query_summary where query = 4839858 order by workmem desc; select owner as node, diskno, used, capacity from stv_partitions order by 1, 2, 3, 4; select owner, host, diskno, used, capacity, (used-tossed)/capacity::numeric *100 as pctused from stv_partitions order by owner; SELECT userid, xid, pid, query, segment, locus, datediff(ms, starttime, endtime) as duration, compile from svl_compile where query = 35878 or query = 35879 order by query, segment; Redshift Tips http://oksoft.blogspot.com/2013/03/redshift-tips.html http://www.slideshare.net/AmazonWebServices/amazon-redshift-best-practices

    Tuesday, May 20, 2014

    shell script to move files under one directory to multiple other directory.

    dte=$2
    #rm -rf "$HOME/temp/files/raw/$dte"
    rawDir="$HOME/temp/files/raw/"
    mkdir -p "$rawDir"
    #s3cmd sync s3://bucketName/import/nokia/store/downloads_completed/"$dte" "$rawDir"
    cd "$rawDir$dte"
    for a in *.gz; do gunzip $a; done
    cd "$HOME/temp/backfillTemp/"
    a=`find "$rawDir$dte" -type f | wc -l`
    #echo "$a"
    n=$1
    b=$(((a+n-1)/n)) 
    #echo "$b"
    for ((i=1;i<$n;i++));
    do 
      #echo "$i"
      targetDir="$HOME/temp/files/part$i"
      #echo "$targetDir"  
      mkdir -p "$targetDir"  
      finalCmd="find $rawDir$dte -type f | head -n $b | xargs  -r sh -c 'mv \"\$0\" \"\$@\" $targetDir'"
      #echo "$finalCmd"
      eval $finalCmd
      realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' >/dev/null 2>&1 &"
      #realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' > $i.txt &"
      #echo "$realCmd"
      eval $realCmd
      pidArray[$i]=$!
    done
    ### moving rest files
    #echo "$i"
    targetDir="$HOME/temp/files/part$i"
      #echo "$targetDir"
      mkdir -p "$targetDir"  
      finalCmd="find $rawDir$dte -type f | xargs  -r sh -c 'mv \"\$0\" \"\$@\" $targetDir'"
      #echo "$finalCmd"
      eval $finalCmd
      realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' >/dev/null 2>&1 &"
      #realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' > $i.txt &"
      #echo "$realCmd"
      eval $realCmd
      pidArray[$i]=$!
    for ((i=1;i<$n;i++));
    do
     wait ${pidArray[$i]} 
    done
    

    Thursday, May 15, 2014

    setup scala version manager on CentOS

    
    [root@new-host yiyujia]#  yum install coreutils 
    
    
    [root@new-host yiyujia]# wget https://raw.githubusercontent.com/yuroyoro/svm/master/svm
    
    [root@new-host yiyujia]# vi /usr/local/bin/svm
    [root@new-host yiyujia]# chmod +x /usr/local/bin/svm 
    [root@new-host yiyujia]# chmod 755 /usr/local/bin/svm 
    [root@new-host yiyujia]# vi ~/.bash_profile
    
    export SCALA_HOME=~/.svm/current/rt
    export PATH=$SCALA_HOME/bin:$PATH
    
    source ~/.bash_profile
    01:26 yiyujia@new-host:~ $ svm install 2.11.2
    
    
    
    http://scala.masanorihashimoto.com/2014/04/how-to-install-the-scala-version-manager-on-your-mac/

    Monday, April 7, 2014

    several methods collecting for time series analysis

    It is normal for data analysis processing have missing data for reasons. Once the data is missing. It is nature for people ask how the gap on the plot can be "fixed/smoothed". Also, it is very often to be asked to predict next several points ased on previous 7 to 8 points. For example, given history stock level, I am asked to predict stock level for beverage market data analysis.
    With the asspumption of whatever has occured in the past will continue to happen in the future, we can treat these questions as time series analysis. Below are list of four methods to do forcecasting for a short time series.

    1. Moving averages
      install.packages("zoo")
      library(zoo)
      ma = rollmean(ts, k)
      
    2. For data without any systematic trend or seasonal components, we can use exponential smoothing
    3. Accounting for data seasonality using Winter's smoothing
    4. Accounting for data trend using Holt's smoothing
    5. install.packages("forecast")
      library(forecast)
      x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 )
      x = ts(x, frequency = 5)
      plot(hw(x, 5), byt = "1")
      
    Below a few collected videos about exponential smoothing and Holt-Winter forecasting method

    1.11 Time Series- exponential smoothing


    1.12 Time Series- moving averages



    Saturday, March 29, 2014

    setup android ADT environment on CentOS 6.5 64 bit.


    1. download ADT from http://developer.android.com/sdk/index.html
    2. mkdir ~/local/tools/android and copy bundle zip file into and unzip it.
    3. your/adt/unziped/folder/eclipse/eclipse.
    4. Create Hello World app and get error like "Error executing aapt. Please check aapt is present at ~/local/tools/android/adt-bundle-linux-x86_64-20140321/sdk/build-tools/android-4.4.2/aapt appcompat_v7 Unknown Android Packaging Problem" etc.
    5. yum groupinstall "Compatibility libraries"
      yum -y install --skip-broken glibc.i686 arts.i686 audiofile.i686 bzip2-libs.i686 cairo.i686 cyrus-sasl-lib.i686 dbus-libs.i686 directfb.i686 esound-libs.i686 fltk.i686 freeglut.i686 gtk2.i686 hal-libs.i686 imlib.i686 lcms-libs.i686 lesstif.i686 libacl.i686 libao.i686 libattr.i686 libcap.i686 libdrm.i686 libexif.i686 libgnomecanvas.i686 libICE.i686 libieee1284.i686 libsigc++20.i686 libSM.i686 libtool-ltdl.i686 libusb.i686 libwmf.i686 libwmf-lite.i686 libX11.i686 libXau.i686 libXaw.i686 libXcomposite.i686 libXdamage.i686 libXdmcp.i686 libXext.i686 libXfixes.i686 libxkbfile.i686 libxml2.i686 libXmu.i686 libXp.i686 libXpm.i686 libXScrnSaver.i686 libxslt.i686 libXt.i686 libXtst.i686 libXv.i686 libXxf86vm.i686 lzo.i686 mesa-libGL.i686 mesa-libGLU.i686 nas-libs.i686 nss_ldap.i686 cdk.i686 openldap.i686 pam.i686 popt.i686 pulseaudio-libs.i686 sane-backends-libs-gphoto2.i686 sane-backends-libs.i686 SDL.i686 svgalib.i686 unixODBC.i686 zlib.i686 compat-expat1.i686 compat-libstdc++-33.i686 openal-soft.i686 alsa-oss-libs.i686 redhat-lsb.i686 alsa-plugins-pulseaudio.i686 alsa-plugins-oss.i686 alsa-lib.i686 nspluginwrapper.i686 libXv.i686 libXScrnSaver.i686 qt.i686 qt-x11.i686 pulseaudio-libs.i686 pulseaudio-libs-glib2.i686 alsa-plugins-pulseaudio.i686 for Debian or Ubuntu: sudo apt-get install ia32-libs


    Monday, February 10, 2014

    Redshift table optimizing with SORTKEY and DISTKEY.

    People ask if it is good idea to split a big table into multiple tables based on values of column, for example date. This is an interesting question as Redshift supplies DISTKEY option for table definition already.

    Keeping in mind that Redshift is a MPP database management system, which distributes data among its cluster. "The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed." (See detail from Redshift doc). So, "partition" a huge table in Redshift should be taken cared by DISTKEY.

    Another key for optimizing Redshift table design is SORTKEY. I will think DISTKEY and SORTKEY together as "index" in Redshift. Also, it is good to keep in mind the order of sort keys. For example, if sort key is defined as SORTKEY(col1, col2), it should be good practice to know col1 has less variant values than col2 does. Also, making joined columns as sort key will help on performance of table join as a merge join may happen (see detail from Redshift doc).

    Redshift have awesome online document to describe how to tune table. BTW, Redshift is a relational database. But, it is most likely not good idea to use it as your transaction database. Better to use Redshift for data warehouse instead of data mart platform.

    http://stackoverflow.com/questions/17115069/what-does-it-mean-to-have-multiple-sortkey-columns/17510565#17510565

    Monday, January 6, 2014

    Clustering algorithms types: partitional clustering and hierarchical clustering

    1. Flat or Partitional clustering:
    (K-means, Gaussian mixture models, etc.)
    Partitions are independent of each other

    2. Hierarchical clustering:
    (e.g., agglomerative clustering, divisive clustering)
    - Partitions can be visualized using a tree structure (a dendr
    ogram)
    - Does not need the number of clusters as input
    - Possible to view partitions at different levels of granularities
    (i.e., can refine/coarsen clusters) using different K


     K-means variants:
    -Hartigan’s k-means algorithm
    -Lloyd’s k-means algorithm
    -Forgy’s k-means algorithm
    -McQueen’s k-means algorithm


    a good article about cluster analysis in R.
    ============================================
    Read up on Gower's Distance measures (available in the ecodist
    package) which can combine numeric and categorical data
    =======
    What do you mean by representing the categorical fields by 1:k?

    becomes


    That guarantees your results are worthless unless your categories
    have an inherent order (e.g. tiny, small, medium, big, giant).
    Otherwise it should be four (k-1) indicator/dummy variables (e.g.):


    Then you can use Euclidean distance.

    -------------------------------------
    David L Carlson
    Associate Professor of Anthropology
    Texas A&M University
    College Station, TX 77840-4352
    ===============
    Do also note that a generalized Gower's distance (+ weighting of
    variables) is available from the ('recommended' hence always
    installed) package 'cluster' :

      require("cluster")
      ?daisy
      ## notably  daisy(*,  metric="gower")

    Note that daisy() is more sophisticated than most users know, using the 'type = *' specification allowing, notably for binary variables (as your a. dummies above) allowing asymmetric behavior which maybe quite important in "rare event" and similar cases.

    Martin
    ===============================
    The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the difference between observation i and observation j. There are two functions that can be used to calculate distance matrices in R; the dist function, which is included in every version of R, and the daisy function, which is part of the cluster library.
    ==================================
    The daisy function in the cluster library will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function.
    source: http://www.stat.berkeley.edu/classes/s133/Cluster2a.html
    ===================================

    Saturday, January 4, 2014

    Redshift convert integer to Timestamp type and convert epoch time to timestamp on the fly

    Currently, Redshift does not support to_timestamp function, which is convenient for converting String or Integer into Timestamp format. Below is a quick notes on how to convert String or integer to be timestamp in Redshift.
    Simple, using substring function to get parts for Year, Month, Date, Hour, Minute, Second from String or Integer and concatenate them into Timestamp format and cast to timestamp. Below are examples,
    Suppose Table t1 has a column (col1) have integer value in pattern YYMMDDhhmm. For example, it value is 0403051030. It means Year 2014, March fifth, and thirty minutes past ten o'clock. We can use below two methods to convert it to be Timestamp value.
    SELECT 
    cast(('20' || substring(col1,1, 2) || substring(col1,3, 2) || 
          substring(col1,5, 2) || ' ' || substring(col1,7, 2) || ':' 
          || substring(col1,9, 2)) as timestamp) 
    FROM t1;
    
    Or

    SELECT 
    cast(to_date(col1, 'YYMMDD') || ' ' 
        || substring(col1,7,2) || ':' 
    || substring(col1,9,2) as timestamp) 
    FROM t1;
    
    Also, get idea from a blog about how to convert epoch time to timestamp on fly in Redshift.

    SELECT  
        (TIMESTAMP 'epoch' + myunixtime * INTERVAL '1 Second ') 
    AS   mytimestamp 
    FROM  
        example_table