Sunday, January 27, 2013

HIVE Select count performance

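I read a discussion about Hive SELECT performance that explains why SELECT COUNT(*) is slower than SELECT *. I note it here as a hint for tuning Hive queries. A Hive query can be executed in one of three ways:
  1. A hive query can be a metadata only request.
  2. A hive query can be an hdfs get request.
  3. A hive query can be a Map Reduce job.
A plain SELECT * needs no computation, so Hive can serve it by reading the table's files straight from HDFS (case 2), while SELECT COUNT(*) has to launch a MapReduce job (case 3) and pays the job startup cost.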

We can also control the number of mappers and reducers, and so improve performance, by rewriting the query. For example, take a Hive query written like this:

 SELECT COUNT(DISTINCT id) ....

This always uses a single reducer, which slows the query down. We can rewrite the query with a subquery for better performance:
  1. Use this command to set the desired number of reducers:
    set mapred.reduce.tasks=50;
  2. Rewrite the query as follows:
    SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
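
To see why the rewritten form can use many reducers, here is a toy sketch in plain shell (it is not Hive, and it only models the idea). It assumes numeric ids, one per line, and a hypothetical function name count_distinct_partitioned. Each id is routed to one "reducer" bucket by id modulo the reducer count, so every bucket can deduplicate on its own, and the final step only has to add up the per-bucket counts.

```shell
#!/usr/bin/env bash
# Toy model of SELECT COUNT(*) FROM (SELECT DISTINCT id ...) t; not Hive itself.
# Assumption: ids are non-negative integers, one per line on stdin.
count_distinct_partitioned() {
  local reducers="$1"
  awk -v n="$reducers" '
    {
      b = $1 % n                 # which "reducer" bucket this id goes to
      key = b SUBSEP $1
      if (!(key in seen)) {      # each bucket deduplicates its own ids
        seen[key] = 1
        partial[b]++
      }
    }
    END {
      total = 0
      for (b in partial) total += partial[b]   # outer COUNT(*) sums the partials
      print total
    }'
}

# Seven rows, four distinct ids, spread over 4 buckets.
printf '%s\n' 1 2 2 3 3 3 10 | count_distinct_partitioned 4   # prints 4
```

Because the same id always lands in the same bucket, no id is counted twice, which is what lets the DISTINCT step run in parallel.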

Wednesday, January 23, 2013

Set up a Hadoop development environment in Eclipse

  1. Find the Maven version you want on the Maven homepage and run the wget command from the directory where you want to extract Maven.
  2. Untar the file: tar xvf apache-maven-3.0.4-bin.tar.gz
  3. Next, add the environment variables to your ~/.bashrc file:
    export M2_HOME=/usr/local/apache-maven/apache-maven-3.0.4 
    export M2=$M2_HOME/bin 
    export PATH=$M2:$PATH 
    
  4. Verify that everything is working with the following command:
     mvn -version
    
  5. Make sure g++ is installed.
  6. Download and install the Protocol Buffers compiler, then install the Protocol Buffers Java library into the local Maven repository:
    export MAVEN_OPTS='-Xms384M -Xmx512M -XX:MaxPermSize=256M'
    
    /home/yiyujia/.m2/repository/com/google/protobuf/protobuf-java/{versionNum} 
    
  7. Fix classpath variable error in Eclipse if needed.
    Open the Eclipse Preferences: Window -> Preferences
    Go to [Java - Build Path - Classpath Variables]
    Click New and set its name as M2_REPO
    Click Folder and select your Maven repository folder. For example, /home/yiyujia/.m2/repository
    A little bit more about Maven environment settings.
    
  8. Treat Hadoop programming as a normal Maven project: add the Hadoop dependencies, then code and test.
Note: "mvn eclipse:eclipse" is not needed with newer versions of Eclipse.
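
As a quick sanity check on steps 3 and 4, a small shell helper like the one below can confirm that the variables from ~/.bashrc are visible in the current shell. This is only a sketch: check_maven_env is a hypothetical name, and it checks just M2_HOME and PATH, not that Maven itself runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports whether the Maven variables from step 3 look sane.
check_maven_env() {
  # M2_HOME must be set and non-empty.
  if [ -z "${M2_HOME:-}" ]; then
    echo "M2_HOME is not set"
    return 1
  fi
  # $M2_HOME/bin must appear as a whole entry in PATH.
  case ":$PATH:" in
    *":$M2_HOME/bin:"*) ;;
    *) echo "$M2_HOME/bin is not on PATH"; return 1 ;;
  esac
  echo "Maven environment looks OK: $M2_HOME"
}
```

After running "source ~/.bashrc", call check_maven_env and then "mvn -version" to confirm that the mvn binary itself is found.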

Friday, January 18, 2013

Video collection on machine learning lecture

I like this video series presented by Andrew Ng of Stanford University very much.

The following link goes to the free Stanford machine learning course, which you can sign up for: https://www.coursera.org/course/ml