PSM is distinguished from existing data mining problems by three characteristics: its data model, its study objective, and its fundamental data mining problems. In terms of the data model, PSM takes a multidimensional data set as input, in which a categorical object field and a numerical score field are the two essential elements. In terms of the study objective, a ranking mechanism has to be defined to produce the top subspaces for a target object among its peer objects. In terms of the data mining problems, PSM consists of three major problems, each of which breaks down into sub-problems:
1. Applying prior knowledge into the system
- modeling domain-specific knowledge;
- modeling the feedback rules from the previous iteration;
- designing and constructing a machine learning model.
2. Finding the top subspaces
- ranking mechanism definition;
- efficiently producing top subspaces based on the predefined ranking mechanism;
- feature selection to avoid high-rank but meaningless subspaces.
3. Rule mining, evaluation, and integration
- evaluating top subspaces that have the same rank;
- producing interesting rules from top subspaces;
- analyzing and combining similar rules.
The following figure shows a high-level view of the PSM research topic.
PSM may be applied in many application domains. Here, we give two simple examples.
Example 1 A product retailer wants to find out the strengths and weaknesses of a product. The sales manager finds that the sales of product A are ranked 10th among all products in the same category. However, when the market is broken down into subspaces along dimensions such as Area, Category-of-Trade, and Year, it may turn out that product A has rank 2 in sales in the subspace {Area = North New Jersey, Category-of-Trade = Restaurant, Year = 2009}, and rank 15 in sales in the subspace {Area = South Florida, Category-of-Trade = Supermarket, Year = 2008}.
Example 2 A pharmaceutical company wants to find out under what conditions a new drug has the best or worst effect. The researchers find that drug A’s overall effect score is ranked 10th among all drugs in the comparison. When the data is examined by subspace, drug A has rank 2 in effect score in the subspace {Temperature = Low, Moisture = Med, Patient Age = Young}, and rank 15 in the subspace {Temperature = Med, Moisture = Med, Patient Age = Senior}.
These two examples indicate that a target object can be ranked not only over the global data dimensions, but also in various local subspaces. The global rank of an object indicates its overall position, while the local ranks reveal the subspaces in which the object stands out. “Outstanding” here is measured by a predefined, application-specific subspace ranking measure. For the product retailer, outstanding subspaces can be used to analyze the current position of the target product, adjust promotional campaign strategies, and reallocate marketing resources. For the pharmaceutical company, outstanding subspaces can be used to evaluate the factors that affect how the target drug performs.
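The difference between a global rank and a local rank can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the field names, and the `rank_of` helper are hypothetical, and the ranking measure used here (total score per object, higher is better) is just one possible choice of the application-specific measure.

```python
from collections import defaultdict

# Hypothetical toy records: (product, area, trade, year, sales).
FIELDS = ("product", "area", "trade", "year", "sales")
RECORDS = [
    ("A", "North NJ", "Restaurant", 2009, 90),
    ("B", "North NJ", "Restaurant", 2009, 120),
    ("C", "North NJ", "Restaurant", 2009, 40),
    ("A", "South FL", "Supermarket", 2008, 10),
    ("B", "South FL", "Supermarket", 2008, 30),
    ("C", "South FL", "Supermarket", 2008, 200),
]


def rank_of(target, rows, subspace=None):
    """Rank of `target` by total sales among its peers (1 = best).

    `subspace` maps dimension names to required values; an empty or
    missing subspace means the global view. Returns None when the
    target has no record in the subspace.
    """
    subspace = subspace or {}
    totals = defaultdict(int)
    for row in rows:
        rec = dict(zip(FIELDS, row))
        # Keep only the records that fall inside the subspace.
        if all(rec[dim] == value for dim, value in subspace.items()):
            totals[rec["product"]] += rec["sales"]
    if target not in totals:
        return None
    # Rank = 1 + number of peers with a strictly higher total.
    return 1 + sum(1 for s in totals.values() if s > totals[target])


global_rank = rank_of("A", RECORDS)
local_rank = rank_of("A", RECORDS, {"area": "North NJ", "trade": "Restaurant"})
```

With this toy data, product A ranks last of three globally but second in the North NJ restaurant subspace, which is the kind of contrast the two examples describe.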
Despite its unique features, PSM is related to two existing research directions: Interesting Subspace Mining (ISM) and instance selection/ranking. PSM is related to ISM because both target multidimensional data, and one of the main goals of each is to discover potentially interesting subspaces. However, ISM has an entirely different objective from PSM. Specifically, ISM aims to detect clusters that are hidden in some subspace but do not show up in the full attribute space.
PSM is also related to the research on prototype selection and instance ranking, because both involve a ranking component. However, there are two essential distinctions between the two topics. First, the study objective and the data model are different. The input data set of PSM contains a numerical attribute representing scores and a categorical attribute representing objects, and each object may be described by multiple record instances. The study objective of PSM is to find the top subspaces for a target object, and the score attribute takes part in this process. Instance selection/ranking, on the other hand, serves classification or clustering studies, and each record instance in the input data set represents an individual object. Second, the things being ranked are different. PSM works on multidimensional data, so a subspace may involve all of the attributes in the data set or any subset of them. Instance selection/ranking, however, always targets the full attribute space.
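To make the notion of "any subset of the attributes" concrete, the following Python sketch enumerates the candidate subspaces that cover a single record. The dimension names and the `candidate_subspaces` helper are assumptions made for illustration, not part of any published PSM algorithm.

```python
from itertools import combinations

# Hypothetical dimension names for a sales data set.
DIMENSIONS = ("area", "trade", "year")


def candidate_subspaces(record):
    """Yield each subspace covering `record`, one per subset of dimensions.

    A subspace is represented as a dict of dimension -> value. The empty
    dict places no constraint and corresponds to the global view.
    """
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            yield {d: record[d] for d in dims}


record = {"area": "North NJ", "trade": "Restaurant", "year": 2009}
subspaces = list(candidate_subspaces(record))
```

A record with d dimensions is covered by 2^d subspaces, so three dimensions already give eight candidates. This growth is why producing top subspaces efficiently is listed as a problem in its own right.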
In addition, the problem that PSM solves is related to top-k queries and reverse top-k queries. The top-k query problem aims to efficiently retrieve a ranked set of the k most interesting objects based on an individual user’s preferences. Much research has been carried out on it from different perspectives, including query model, data and query certainty, ranking function, etc. The data model of PSM is essentially different from this research, as it switches the input and output of the query model. In other words, the target object is an input in PSM, while it belongs to the output in the classic top-k query model. As a result, the main theme of PSM is finding interesting subspaces, not objects.
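The switch of input and output can be sketched as two small Python functions over the same toy data. Everything here is a hypothetical illustration: the rows, the function names, and the ranking mechanism (total score per object, subspaces ordered by the target's rank) are assumptions, and the brute-force enumeration is not meant as an efficient algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (object, {dimension: value}, score).
ROWS = [
    ("A", {"area": "North", "year": 2009}, 90),
    ("B", {"area": "North", "year": 2009}, 120),
    ("A", {"area": "South", "year": 2008}, 50),
    ("B", {"area": "South", "year": 2008}, 30),
]
DIMS = ("area", "year")


def totals_in(subspace):
    """Total score per object over the rows that match `subspace`."""
    totals = defaultdict(int)
    for obj, dims, score in ROWS:
        if all(dims[d] == v for d, v in subspace.items()):
            totals[obj] += score
    return totals


def top_k_objects(subspace, k):
    """Classic top-k direction: a subspace goes in, objects come out."""
    totals = totals_in(subspace)
    return sorted(totals, key=lambda obj: -totals[obj])[:k]


def top_subspaces(target, k):
    """PSM direction: a target object goes in, subspaces come out."""
    seen, scored = set(), []
    for obj, dims, _ in ROWS:
        if obj != target:
            continue
        # Enumerate every subspace that covers this record of the target.
        for size in range(len(DIMS) + 1):
            for chosen in combinations(DIMS, size):
                key = tuple((d, dims[d]) for d in chosen)
                if key in seen:
                    continue
                seen.add(key)
                totals = totals_in(dict(key))
                rank = 1 + sum(s > totals[target] for s in totals.values())
                scored.append((rank, key))
    # Best (lowest) rank first; ties keep their enumeration order.
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]
```

Calling `top_k_objects({"area": "North"}, 1)` answers "which object is best here?", while `top_subspaces("A", 3)` answers "where is A at its best?". Subspaces that tie on rank are left in enumeration order, which is where the evaluation of same-rank subspaces would come in.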
Compared with the top-k query problem where the output objects will be consumed by potential customers or buyers, the reverse top-k query problem aims to find out the subspace parameters of the most popular products for manufacturers’ reference. The popular products are those appearing more frequently in customers’ top-k result set than other products.
As a novel research topic, Promotional Subspace Mining (PSM) opens a new direction for interdisciplinary research across data mining and very large database systems. In particular, in this Big Data age, PSM introduces a new topic to NoSQL application research. A sophisticated PSM framework or algorithm could clearly be very useful for large retail companies such as Amazon, eBay, etc. In fact, it could potentially be applied to any domain where "promotional" analysis is needed, whether the target object is a human being, a product, or a company.
For more detailed information, please contact Dr. Yan Zhang or me. We are interested in hearing from you if you have a large data set and would like help analyzing it. We look forward to hearing from any individual or organization interested in cooperating with us.