Sunday, January 27, 2013

HIVE Select count performance

I read a discussion about performance of Hive Select, that explains why Select count(*) slower than slect *. I note it here as one of hints for tuning hive query.
  1. A hive query can be a metadata only request.
  2. A hive query can be an hdfs get request.
  3. A hive query can be a Map Reduce job.

Also, we can control number of map/reducer for better performance by rewriting query. For example, writing query in hive like this:

 SELECT COUNT(DISTINCT id) ....

It will always result in using only one reducer that slow down query. We can rewrite query as sub query to better performance:
  1. use this command to set desired number of reducers:
    set mapred.reduce.tasks=50
  2. rewrite query as following:
    SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;

No comments:

Post a Comment