iDatamining.org

I am looking for projects to work on
Please contact with me at yiyu.jia@BostonInfoPro.com!

Saturday, January 28, 2012

Multi-threaded application (Java) or Multi-process application (PHP) for Hyper Threading enabled CPU

There are many factors affect an application's performance. Today, hyper threading enabled multi-core CPU becomes so popular. Naturally, we are expecting to have better performance on these modern CPU. I am not eligible to discuss about question about how to optimize application for a hyper threading multi-core CPU yet. But, I do ask myself this question: which one, a muli-threaded application or a multi-process application, can get more benefit from a hyperthreaded CPU? In other words, for a computing intensive task, should I design it as a multi-threaded application or a multi-process application? To be more specific, does Java, which support multi-threaded programming, has advantage over the PHP on a hyper threading enabled CPU or PHP actually has advantage over the Java? I had a discussion about "Can two processes simultaneously run on one CPU core?". Here, let me highlight some points I studied for answering my questions.

Software Thread
We know software thread is a lightweight process. Once process can contains multiple thread. Software thread is managed by OS. OS decide which CPU/CORE/ the thread will run in. Application programmer can also control where the thread can run by using affinity library. Typically, we need multiple thread application when it has I/O latency and we do not want to hang other computing task. For example, We do not want a desktop having GUI stop response user's input when it is running other computing or I/O task. Also, we normally need thread pool to have initialized ready to serve threads for a service application.

Hardware Thread
A hardware thread is pipeline for a software thread to reach CPU's physical core. In a HT CPU, a physical core can have two hardware threads (logical cores) as it has extra registers and execution units and it therefore stores the state of two threads.

What is shared among software threads and what is not
Multiple software threads (kernel thread) can live inside a process. In other words, they can not share anything out of process' resource. A kernel thread is the lightest unit for OS' kernel scheduling. Kernel threads do not share their stack, a copy of the registers including the program counter, and thread-local storage.

It is called "green thread" if the thread is implemented in "user space". green thread is not seen by kernel. It is normally useful for debugging a multi-threaded application.

What is shared among processes
I do not know what is shared among processes from point view of CPU. A process is the biggest unit of kernel scheduling. It has its own resources including memory, file handles, sockets, device handles etc. Processes has its own address spaces, shared by its containing threads. Of course, programmer can explicitly call methods to share resources with other process such as shared memory segments.

History about CPU evolution To better understand how to get most benefit from modern multi-core, HT CPU, I feel I need to study the history of CPU architecture evolution. But, I do not have time to do this yet. Let me deeply study it later.

Now, let's come back to my original questions. A normal application contains lots latency operations. For examples, a application normally contains network I/O, file I/O, or GUI interactivity. In this case, which is typical reason for us to use multi-threaded techniques, hyper threading can improve performance at little cost. However, hyper-threading is often turned off in high performance systems as hyper threading can impact performance when the two threads on the same core start competing for resources, such as the FPU, Level 1 cache, or CPU pipeline. We can see some motherboard actually disable hyper-threading by default.

Also, swap thread's context is expensive too. To switch threads system has to empty the registers into the cache, write that back to the main memory, then load up the cache with the new values and load up the registers. So, we need to be careful not to make overhead threads in our application.

The good way is probably keep same number of running threads as number of logical core. However, in the real world, many threads in an application are blocked thread. For a application having much more software thread than hardware thread, I will expect most of them are blocked thread. For example, a web server may have larger number of thread serving HTTP request. It has no problem as they are all blocked thread as it may be blocked at network I/O.

BTW, the coming up Reverse-Hyperthreading, which spread a thread's computing task among different logical core, may introduce new opportunity into multi-threaded programming world. But, who knows if it will actually bring in more challenges.

So, back to my original question, I still think java has chance to get more benefits from a hyper-threading enabled CPU than PHP does as core PHP does not directly support multi-threading programming. However, a bad designed multi-threading may even damage the performance. Here is a good document for me to better understand hyper threading technology: Performance Insights to Intel® Hyper-Threading Technology
http://stackoverflow.com/questions/1888160/distinguish-java-threads-and-os-threads http://stackoverflow.com/questions/8916723/can-two-processes-simultaneously-run-on-one-cpu-core http://stackoverflow.com/questions/4771205/dual-core-hyperthreading-should-i-use-4-threads-or-3-or-2?rq=1 http://stackoverflow.com/questions/360307/multicore-hyperthreading-how-are-threads-distributed?rq=1 http://stackoverflow.com/questions/508301/on-which-operationg-system-is-threaded-programming-sufficient-to-utilize-multipl http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/ http://stackoverflow.com/questions/2238272/java-thread-affinity http://java.dzone.com/articles/java-thread-affinity-support http://www.codeguru.com/cpp/sample_chapter/article.php/c13533/Why-Too-Many-Threads-Hurts-Performance-and-What-to-do-About-It.htm http://msdn.microsoft.com/en-us/magazine/cc872851.aspx

Wednesday, January 18, 2012

Promotional Subspace Mining (PSM) --- an Overview

Promotional Subspace Mining (PSM) is a novel research topic. Its main focus is to find out outstanding subspaces for a given object among its competitors, and to discover meaningful rules from them.

PSM is distinguished from existing data mining problems because of its three unique characteristics: data model, study objective, and fundamental data mining problems. From the aspect of the data model, PSM takes multidimensional data as the input data set, in which a categorical object field and a numerical score field are two important constituent elements. From the aspect of the study objective, an ranking mechanism has to be defined to produce top subspaces for target object among its
peer objects. From the aspect of the data mining problems, PSM consists of three major problems, each of which consists of sub-problems:

1. Applying prior knowledge into the system
  • modeling domain-specific knowledge;
  • modeling the feedback rules from the previous iteration;
  • designing and constructing machine learning model.

2. Finding out top subspaces
  • ranking mechanism definition;
  • efficiently producing top subspaces based on the predefined ranking mechanism;
  • feature selection to avoid high-rank but meaningless subspaces.

3. Rule mining, evaluation, and integration
  • evaluating top subspaces that have the same rank;
  • producing interesting rules from top subspaces;
  • analyzing and combining similar rules.

The following figure shows a high level view of this PSM research topic.

PSM may be applied in many application domains. Here, we give two simple examples.

Example 1 A product retailer wants to find out the strength and weakness of a product. The sales manager finds that the sales of product A is ranked the 10th among all products in the same category. However, when breaking down the market into subspaces, such as Area, Category-of-Trade, and Year, it may be found that product A has the rank 2 sales in the subspace {Area = North New Jersey , Category-of-Trade = Restaurant, and Year = 2009}, and has the rank 15 sales in subspace {Area = South Florida, Category-of-Trade = Supermarket, and Year = 2008}.

Example 2 A pharmaceutical company wants to find out under what conditions a new drug has the best or worst effect. The researchers find out that drug A’s overall effect score is ranked 10th among all drugs in comparison. When examining into subspaces, drug A has the rank 2 effect score in the subspace {Temperature = Low , Moisture = Med, and Patient Age = young}, and has the rank 15 score in subspace {Temperature = Med, Moisture = Med, and Patient Age = Senior}.

These two examples indicate that a target object can be ranked not only in the global data dimensions, but also in various local subspaces. The global rank of an object indicates the overall position of this object, while the local ranks can show the outstanding subspaces this object is in. “Outstanding” here is measured by a predefined application-specific subspace ranking measure. To the product retailer, outstanding subspaces can be used to analyze the current position of the target product, adjust promotional campaign strategies, and reallocate marketing resources. To the pharmaceutical company, the outstanding subspaces can be used to evaluate the factors that affect target drug’s functioning.

Besides many unique features, PSM is also related to two other existing research directions: Interesting Subspace Mining (ISM) and instance selection/ranking. PSM is related to ISM as it also targets the multidimensional data, and one of its main focuses is to discover the potentially interesting subspaces. However, ISM has an entirely different objective than PSM. Specifically, it aims at detecting clusters that are hidden in any possible subspaces, but not showing up in the full attribute space.

PSM is also related to the research on prototype selection and instance ranking, because they both involve the component to rank the potentially interesting subspaces' instances. However, there are two essential distinctions between these two topics. First, the study objective and data model are different. The input data set of PSM contains a numerical attribute representing scores, and a categorical attribute representing objects. Each object may be described by multiple record instances. The study objective of PSM is to find top subspaces for a target object, and the score attribute takes part in this process. On the other hand, instance selection/ranking is for classification or clustering study. Each record instance in the input data set represents an individual object. Second, the objects that are being ranked are different. PSM studies on multidimensional data, which indicates that a subspace can either contain all the attributes in the data set, or any subset of the attributes. Instance selection's ranking however, always targets the full attribute spaces.

In addition, the problem that PSM solves is also related to top-k queries and reverse top-k queries. The top-k query problem aims to efficiently retrieve a ranked set of the k most interesting objects based on individual user’s preferences. A lot of research efforts have been carried out from different prospective, including query model, data & query certainty, ranking function, etc. The data model of PSM is essentially different from these research, as it switches the input and output of the query model. In other words, the target object is taken as an input in PSM, while it belongs to the output in the classic top-k query model. As a result, the main theme of PSM is on finding interesting subspaces, not the objects.

Compared with the top-k query problem where the output objects will be consumed by potential customers or buyers, the reverse top-k query problem aims to find out the subspace parameters of the most popular products for manufacturers’ reference. The popular products are those appearing more frequently in customers’ top-k result set than other products.

As a novel research topic, Promotional Subspace Mining (PSM) opens a new research direction for the interdisciplinary research of data mining and very large database systems. Particularly, in this Big Data age, PSM introduces a new topic to noSQL application research. A sophisticated PSM framework/algorithm could be, in an obvious way, very useful for those large retail companies such as Amazon, eBay, etc. Actually, it could be potentially applied to any domain, where is needed "promotional" analysis no matter the target object is a human being, a product, or a company.

To inquiry more detailed information, please contact Dr. Yan Zhang or myself. We are interested in hearing from you if you have large data set and want somebody to help on analyzing it. We are looking forward to hearing from any individual or organization who is interested in cooperating with us.

Sunday, January 8, 2012

Using jconsole to monitor Tomcat running on i5/OS

Shipped with JDK, there is an application monitoring tool called jconsole. Using jconsole, we can detect low memory, enable or disable GC and class loading verbose tracing, detect deadlocks, control the log level of any loggers in an application etc. This simple tutorial shows how we can start up Tomcat 6.x on i5/OS V5R4 and monitor it remotely through our desktop. In fact, this is not Tomcat's feature. It is JVM feature. We just use Tomcat as an example and see if JDK on i5/OS supports this feature or not.

1) Open the catalina.sh and add additional JVM properties after CATALINA_BASE env settings.
# Only set CATALINA_HOME if not already set
[ -z "$CATALINA_HOME" ] && CATALINA_HOME=`cd "$PRGDIR/.." >/dev/null; pwd`

# Copy CATALINA_BASE from CATALINA_HOME if not already set
[ -z "$CATALINA_BASE" ] && CATALINA_BASE="$CATALINA_HOME"

CATALINA_OPTS="-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=YourJMXPort 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=true 
-Djava.rmi.server.hostname=YourTomcatHostname
-Dcom.sun.management.jmxremote.password.file=$CATALINA_BASE/conf/jmx.psd 
-Dcom.sun.management.jmxremote.access.file=$CATALINA_BASE/conf/jmx.acl"

2) creating jmx.psd under folder $CATALINA_BASE/conf/ and put text content in it as below. So, we create a user name "controlRole" with password as "tomcat"
controlRole tomcat

3) creating jmx.acl under folder $CATALINA_BASE/conf/ and put text content in it as below. So, we assign user controlRole privilege of Read and Write.
controlRole readwrite

4) Open qshell and modify the file attribute as below. This will make sure that your account has read and write privilege to these access control files. We change this according to different account that start Tomcat.
call qp2term
chmod 600 jmx.acl  
chmod 600 jmx.psd  

5) On a Windows PC, run jconsole as below.
c:\Program Files\Java\jdk1.6.0_21\bin>jconsole

6) After jconsole is running, we need to input host name, port number, user name, and password.

7)Now, we can see jconsole connects to remote tomcat running on i5/OS.

Enjoy it for monitoring and tuning your Tomcat and servlet application.