Saturday, November 1, 2014

prepare Azure management certification for Vagrant

On Linux

Creating certification
1. openssl genrsa -out azureMgt.key 2048
2. openssl req -new -key azureMgt.key -out azureMgt.csr
3. openssl x509 -req -days 730 -in azureMgt.csr -signkey azureMgt.key -out tempAzureMgt.pem
4. cat azureMgt.key tempAazureMgt.pem > azureMgt.pem
5. openssl x509 -inform pem -in azureMgt.pem -outform der -out azureMgt.cer
6. openssl pkcs12 -export -out azureMgt.pfx -in azureMgt.pem -inkey azureMgt.key -name "Azure Certification"
Uploading file azureMgt.cer to Azure management portal
Using file azureMgt.pem on your azure client side. for example, using this file in Vagrant (the pfx format has problem based on my testing).
Generating SSH key for vagarant to launch VM on azure.
1. openssl req -x509 -nodes -days 1095 -newkey rsa:2048 -keyout azuresshkey.key -out azuresshcert.pem
2. Change the permissions on the private key and certificate for security. chmod 600 azuresshcert.pem ; chmod 600 azuresshkey.key
3. azuresshcert.pem is vm's ssh certifiction. azuresshkey.key is the ssh key that we use to access VM

On Windows

work on this later.

Tuesday, October 21, 2014

executing sequence of JOIN, WHERE, GROUP BY HAVING clauses in Hive

Understanding the sequence of executing sequence of clauses in Hive is very helpful for optimizing query. It will be idea if we can make every inter steps to generate data set as small as possible. The order of executing part of a query:

FROM & JOINs determine & filter rows
WHERE more filters on the rows
GROUP BY combines those rows into groups
HAVING filters groups
ORDER BY arranges the remaining rows/groups

Monday, October 6, 2014

install vagrant on CentOS


[root@vmhost01 vagrant]# wget https://dl.bintray.com/mitchellh/vagrant/vagrant_1.6.3_x86_64.rpm

[root@vmhost01 vagrant]# yum install vagrant_1.6.3_x86_64.rpm 

[root@vmhost01 vagrant]# vagrant plugin install vagrant-mutate

[root@vmhost01 vagrant]# mkdir /vm1/vm10

[root@vmhost01 vagrant]# cd /vm1/vm10

[root@vmhost01 vm10]# vagrant box add centos65 http://puppet-vagrant-boxes.puppetlabs.com/centos-65-x64-virtualbox-puppet.box
==> box: Adding box 'centos65' (v0) for provider: 
    box: Downloading: http://puppet-vagrant-boxes.puppetlabs.com/centos-65-x64-virtualbox-puppet.box
==> box: Successfully added box 'centos65' (v0) for 'virtualbox'!

[root@vmhost01 vm10]# vagrant mutate centos65 libvirt
You have qemu 0.12.1 installed. This version cannot read some virtualbox boxes. If conversion fails, see below for recommendations. https://github.com/sciurus/vagrant-mutate/wiki/QEMU-Version-Compatibility
Converting centos65 from virtualbox to libvirt.
    (98.57/100%)
The box centos65 (libvirt) is now ready to use.
[root@vmhost01 vm10]# pwd
/opt/vm1/vm10
[root@vmhost01 vm10]# vagrant init centos65
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.

[root@vmhost01 vm10]# vagrant plugin install vagrant-libvirt  --plugin-version 0.0.19


[root@vmhost01 vm10]#  vagrant up --no-parallel

Tuesday, September 23, 2014

OAuth2.0 refresh token and access token

When we implement OAuth2, we have access token and refresh token. Why do we need both refresh token and access token? Using plain English, below are reasons in simplified version.

For security reason, OAuth2 has both refresh token and access token. access token is something close to one-time password, which is ideally secure. access token may be expired shortly. Refresh token may last for long time and even will not expire until it will be revoked.

For performance and scalability reason, it better to verify HTTP request on the resource server instead of on central authorization server for every HTTP request.

Basically, access_token is kind of temporary password and refresh_token is pass to get temporary password from central authentication server. "temporary" password, access_token, is verified on resource server.

Thursday, September 11, 2014

install phpMyAdmin on CentOS 7

    1  yum update
    2  df
    3  yum -y install mariadb-server mariadb
    4  systemctl start mariadb.service
    5  systemctl enable mariadb.service
    6  mysql_secure_installation
    7  yum -y install httpd
    8  systemctl start httpd.service
    9  systemctl enable httpd.service
   10  firewall-cmd --permanent --zone=public --add-service=http 
   11  firewall-cmd --permanent --zone=public --add-service=https
   12  firewall-cmd --reload
   13  ifconfig 
   14  yum -y install php
   15  systemctl restart httpd.service
   16  cd /var/www/html/
   17  vi info.php
   18  yum -y install php-mysql
   19  yum -y install php-gd php-ldap php-odbc php-pear php-xml php-xmlrpc php-mbstring php-snmp php-soap curl curl-devel
   20  cd /tmp/ 
   21  wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
   22  yum install epel-release-latest-7.noarch.rpm 
   23  yum install phpMyAdmin.noarch

Wednesday, September 3, 2014

install Oracle jdk on CentOS

yum groupinstall -y development

https://www.if-not-true-then-false.com/2014/install-oracle-java-8-on-fedora-centos-rhel/



alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0_65/jre/bin/java 200000
 alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.7.0_65/jre/bin/javaws 200000
alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/lib/jvm/jdk1.7.0_65/jre/lib/amd64/libnpjp2.so 200000
alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0_65/bin/javac 200000
alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.7.0_65/bin/jar 200000



/usr/lib/jvm/jdk1.8.0_25

alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_25/jre/bin/java 300000
alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.8.0_25/jre/bin/javaws 300000
alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/lib/jvm/jdk1.8.0_25/jre/lib/amd64/libnpjp2.so 300000
alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_25/bin/javac 300000
alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.8.0_25/bin/jar 300000
alternatives --install /usr/bin/jvisualvm jvisualvm /usr/lib/jvm/jdk1.8.0_25/bin/jvisualvm 300000


su -i 

wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz


/usr/lib/jvm/jdk1.8.0_131

alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_131/jre/bin/java 300000
alternatives --install /usr/bin/javaws javaws /usr/lib/jvm/jdk1.8.0_131/jre/bin/javaws 300000
alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/lib/jvm/jdk1.8.0_131/jre/lib/amd64/libnpjp2.so 300000
alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_131/bin/javac 300000
alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.8.0_131/bin/jar 300000
alternatives --install /usr/bin/jvisualvm jvisualvm /usr/lib/jvm/jdk1.8.0_131/bin/jvisualvm 300000

Thursday, July 24, 2014

Asgard 1.5 Config.groovy file format

I tried to setup Asgard 1.5 today and find its initializing process has been changed a little bit. I remember I will be asked for aws credential information when I launched asGard first time. But, for Asgard 1.5 (ASGard 1.4.2 has same situation), I got error message like "com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain". After searching online, I make it work by setting up environment variable AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

In fact, we can just directly create the asgard configuration file ~/.asgard/Config.groovy. The format of the file is as below.


grails {
        awsAccounts=['YourAccountIDasDigitalNumber']
        awsAccountNames=['YourAccountIDasDigitalNumber':'YourAccountName']
}
secret {
        accessId='YourAccessID'
        secretKey='YourAccessSecretKey'
}
cloud {
        accountName='YourAccountName'
        publicResourceAccounts=['amazon']
}

Hope above tip will save others time. In fact, I solve the problem after reading this discussion on google asgard group.

Friday, July 11, 2014

can we leave missing data as null when load source data into data warehouse?

No!

because

NULL != NULL

Think about when you do a outer join and some NULL was created after outer join. Then, SQL programmers, who will create SQL query on aggregate tables, will not be able to identify the source of NULL.

Tuesday, June 10, 2014

set up Grail on Groovy environment on CentOS


yiyujia@hadoopDev:~ $ curl -s get.gvmtool.net | bash
yiyujia@hadoopDev:~ $ source "/root/.gvm/bin/gvm-init.sh"
yiyujia@hadoopDev:~ $ gvm help
yiyujia@hadoopDev:~ $ gvm install groovy
yiyujia@hadoopDev:~ $ gvm install grails
yiyujia@hadoopDev:~ $ gvm install grails 2.2.4
yiyujia@hadoopDev:~ $ grails
| Enter a script name to run. Use TAB for completion: 
grails> quit
| Goodbye
yiyujia@hadoopDev:~ $

Monday, June 2, 2014

Tuning redshift query performance

We can tune Redshift query by the following four steps literately

1. Identify slow queries

select trim(t3.usename) AS username , t1.query AS query_ID, t1.starttime, t1.elapsed, t2.querytxt from svl_qlog t1 join STL_QUERY t2 ON t1.query = t2.query JOIN pg_user t3 ON t1.userid = t3.usesysid order by t1.elapsed desc limit 50;

2. Table Design

select correct sort key, distribution key, and column compression algorithms (see blog post "Redshift table optimizing with SORTKEY and DISTKEY.")

3. Query Design

4. WLM Design

Tuesday, May 20, 2014

shell script to move files under one directory to multiple other directory.

dte=$2
#rm -rf "$HOME/temp/files/raw/$dte"
rawDir="$HOME/temp/files/raw/"
mkdir -p "$rawDir"
#s3cmd sync s3://bucketName/import/nokia/store/downloads_completed/"$dte" "$rawDir"
cd "$rawDir$dte"
for a in *.gz; do gunzip $a; done
cd "$HOME/temp/backfillTemp/"
a=`find "$rawDir$dte" -type f | wc -l`
#echo "$a"
n=$1
b=$(((a+n-1)/n)) 
#echo "$b"
for ((i=1;i<$n;i++));
do 
  #echo "$i"
  targetDir="$HOME/temp/files/part$i"
  #echo "$targetDir"  
  mkdir -p "$targetDir"  
  finalCmd="find $rawDir$dte -type f | head -n $b | xargs  -r sh -c 'mv \"\$0\" \"\$@\" $targetDir'"
  #echo "$finalCmd"
  eval $finalCmd
  realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' >/dev/null 2>&1 &"
  #realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' > $i.txt &"
  #echo "$realCmd"
  eval $realCmd
  pidArray[$i]=$!
done
### moving rest files
#echo "$i"
targetDir="$HOME/temp/files/part$i"
  #echo "$targetDir"
  mkdir -p "$targetDir"  
  finalCmd="find $rawDir$dte -type f | xargs  -r sh -c 'mv \"\$0\" \"\$@\" $targetDir'"
  #echo "$finalCmd"
  eval $finalCmd
  realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' >/dev/null 2>&1 &"
  #realCmd="nohup $HOME/etl/transformScripts/ETLBackfill/doit.sh '$i' '$dte' > $i.txt &"
  #echo "$realCmd"
  eval $realCmd
  pidArray[$i]=$!
for ((i=1;i<$n;i++));
do
 wait ${pidArray[$i]} 
done

Thursday, May 15, 2014

setup scala version manager on CentOS


[root@new-host yiyujia]#  yum install coreutils 


[root@new-host yiyujia]# wget https://raw.githubusercontent.com/yuroyoro/svm/master/svm

[root@new-host yiyujia]# vi /usr/local/bin/svm
[root@new-host yiyujia]# chmod +x /usr/local/bin/svm 
[root@new-host yiyujia]# chmod 755 /usr/local/bin/svm 
[root@new-host yiyujia]# vi ~/.bash_profile

export SCALA_HOME=~/.svm/current/rt
export PATH=$SCALA_HOME/bin:$PATH

source ~/.bash_profile
01:26 yiyujia@new-host:~ $ svm install 2.11.2

Monday, April 7, 2014

several methods collecting for time series analysis

It is normal for data analysis processing have missing data for reasons. Once the data is missing. It is nature for people ask how the gap on the plot can be "fixed/smoothed". Also, it is very often to be asked to predict next several points ased on previous 7 to 8 points. For example, given history stock level, I am asked to predict stock level for beverage market data analysis.
With the asspumption of whatever has occured in the past will continue to happen in the future, we can treat these questions as time series analysis. Below are list of four methods to do forcecasting for a short time series.

Moving averages

install.packages("zoo")
library(zoo)
ma = rollmean(ts, k)

For data without any systematic trend or seasonal components, we can use exponential smoothing
Accounting for data seasonality using Winter's smoothing
Accounting for data trend using Holt's smoothing

install.packages("forecast")
library(forecast)
x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 )
x = ts(x, frequency = 5)
plot(hw(x, 5), byt = "1")

Below a few collected videos about exponential smoothing and Holt-Winter forecasting method

1.11 Time Series- exponential smoothing

1.12 Time Series- moving averages

Saturday, March 29, 2014

setup android ADT environment on CentOS 6.5 64 bit.

download ADT from http://developer.android.com/sdk/index.html
mkdir ~/local/tools/android and copy bundle zip file into and unzip it.
your/adt/unziped/folder/eclipse/eclipse.
Create Hello World app and get error like "Error executing aapt. Please check aapt is present at ~/local/tools/android/adt-bundle-linux-x86_64-20140321/sdk/build-tools/android-4.4.2/aapt appcompat_v7 Unknown Android Packaging Problem" etc.
yum groupinstall "Compatibility libraries"
yum -y install --skip-broken glibc.i686 arts.i686 audiofile.i686 bzip2-libs.i686 cairo.i686 cyrus-sasl-lib.i686 dbus-libs.i686 directfb.i686 esound-libs.i686 fltk.i686 freeglut.i686 gtk2.i686 hal-libs.i686 imlib.i686 lcms-libs.i686 lesstif.i686 libacl.i686 libao.i686 libattr.i686 libcap.i686 libdrm.i686 libexif.i686 libgnomecanvas.i686 libICE.i686 libieee1284.i686 libsigc++20.i686 libSM.i686 libtool-ltdl.i686 libusb.i686 libwmf.i686 libwmf-lite.i686 libX11.i686 libXau.i686 libXaw.i686 libXcomposite.i686 libXdamage.i686 libXdmcp.i686 libXext.i686 libXfixes.i686 libxkbfile.i686 libxml2.i686 libXmu.i686 libXp.i686 libXpm.i686 libXScrnSaver.i686 libxslt.i686 libXt.i686 libXtst.i686 libXv.i686 libXxf86vm.i686 lzo.i686 mesa-libGL.i686 mesa-libGLU.i686 nas-libs.i686 nss_ldap.i686 cdk.i686 openldap.i686 pam.i686 popt.i686 pulseaudio-libs.i686 sane-backends-libs-gphoto2.i686 sane-backends-libs.i686 SDL.i686 svgalib.i686 unixODBC.i686 zlib.i686 compat-expat1.i686 compat-libstdc++-33.i686 openal-soft.i686 alsa-oss-libs.i686 redhat-lsb.i686 alsa-plugins-pulseaudio.i686 alsa-plugins-oss.i686 alsa-lib.i686 nspluginwrapper.i686 libXv.i686 libXScrnSaver.i686 qt.i686 qt-x11.i686 pulseaudio-libs.i686 pulseaudio-libs-glib2.i686 alsa-plugins-pulseaudio.i686 for Debian or Ubuntu: sudo apt-get install ia32-libs

Monday, February 10, 2014

Redshift table optimizing with SORTKEY and DISTKEY.

People ask if it is good idea to split a big table into multiple tables based on values of column, for example date. This is an interesting question as Redshift supplies DISTKEY option for table definition already.

Keeping in mind that Redshift is a MPP database management system, which distributes data among its cluster. "The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed." (See detail from Redshift doc). So, "partition" a huge table in Redshift should be taken cared by DISTKEY.

Another key for optimizing Redshift table design is SORTKEY. I will think DISTKEY and SORTKEY together as "index" in Redshift. Also, it is good to keep in mind the order of sort keys. For example, if sort key is defined as SORTKEY(col1, col2), it should be good practice to know col1 has less variant values than col2 does. Also, making joined columns as sort key will help on performance of table join as a merge join may happen (see detail from Redshift doc).

Redshift have awesome online document to describe how to tune table. BTW, Redshift is a relational database. But, it is most likely not good idea to use it as your transaction database. Better to use Redshift for data warehouse instead of data mart platform.

Monday, January 6, 2014

Clustering algorithms types: partitional clustering and hierarchical clustering

1. Flat or Partitional clustering:
(K-means, Gaussian mixture models, etc.)
Partitions are independent of each other

2. Hierarchical clustering:
(e.g., agglomerative clustering, divisive clustering)
- Partitions can be visualized using a tree structure (a dendr
ogram)
- Does not need the number of clusters as input
- Possible to view partitions at different levels of granularities
(i.e., can refine/coarsen clusters) using different K

K-means variants:
-Hartigan’s k-means algorithm
-Lloyd’s k-means algorithm
-Forgy’s k-means algorithm
-McQueen’s k-means algorithm

a good article about cluster analysis in R.

Saturday, January 4, 2014

Redshift convert integer to Timestamp type and convert epoch time to timestamp on the fly

Currently, Redshift does not support to_timestamp function, which is convenient for converting String or Integer into Timestamp format. Below is a quick notes on how to convert String or integer to be timestamp in Redshift.
Simple, using substring function to get parts for Year, Month, Date, Hour, Minute, Second from String or Integer and concatenate them into Timestamp format and cast to timestamp. Below are examples,
Suppose Table t1 has a column (col1) have integer value in pattern YYMMDDhhmm. For example, it value is 0403051030. It means Year 2014, March fifth, and thirty minutes past ten o'clock. We can use below two methods to convert it to be Timestamp value.

SELECT 
cast(('20' || substring(col1,1, 2) || substring(col1,3, 2) || 
      substring(col1,5, 2) || ' ' || substring(col1,7, 2) || ':' 
      || substring(col1,9, 2)) as timestamp) 
FROM t1;

SELECT 
cast(to_date(col1, 'YYMMDD') || ' ' 
    || substring(col1,7,2) || ':' 
|| substring(col1,9,2) as timestamp) 
FROM t1;

Also, get idea from a blog about how to convert epoch time to timestamp on fly in Redshift.

SELECT  
    (TIMESTAMP 'epoch' + myunixtime * INTERVAL '1 Second ') 
AS   mytimestamp 
FROM  
    example_table