OR/MS Today - October 2002
Data Mining Components
Numerical Algorithms Group product offers a range of business applications
By Auroop R. Ganguly
Large- and mid-sized companies have expended significant resources in acquiring and maintaining databases and data warehouses for information storage, representation and retrieval. Managers and business planners need tools to extract meaningful knowledge about their business and their customers from these data repositories. Business applications ranging from prediction of customer churn rates for phone companies to demand management in the presence of promotions benefit from tools like data mining (DM) and traditional statistics (TS). The Data Mining Components (DMC) product from Numerical Algorithms Group (NAG) targets these areas, and is meant primarily for developers of enterprise-scale business applications.
While the distinction between TS and DM is neither rigid nor well defined, one could argue that DM extends TS by discovering hidden patterns in data and by focusing on deployment to large data sets. The operations research and management science community (and indeed the scientific community in general) is beginning to acknowledge that well-designed DM tools can complement, and in certain situations outperform, TS. Recent research (Cooper and Giuffrida, 2000; Ganguly and Gupta, 2002; Reyes-Aldasoro et al., 1999) indicates that well-designed combinations of TS and DM often work better than either tool alone. While TS methods are theoretically rigorous, DM methods can often find "interesting" patterns where TS methods cannot (Fayyad et al., 1996). The advantage of NAG's software is that the relatively new DM and the existing TS components can be used in conjunction by both application developers and researchers and professionals in OR/MS.
All About NAG
NAG is a leading provider of software that solves complex mathematical problems. Founded in 1970 as a University of Nottingham project in the United Kingdom, it now has offices in the U.K., Germany, Japan and the United States. In 1971, NAG developed the first mathematical software library, which now has a significant worldwide customer base and has expanded into numerous mathematical and statistical functions. The range of products and services that NAG offers has continually expanded into statistical, symbolic, visualization and numerical simulation software, compilers and application development tools and wide-ranging consulting.
The NAG-DMC software is a collection of data cleaning, preparation, model building and utility functions for DM applications. The functions are designed to be called from an application program (typically written in C, C++, Visual Basic [VB] or Java) and are aimed at the professional developer, either for building customized applications or for use in commercial software. NAG provides C++ and VB Application Programming Interfaces (APIs) for the components (which are written in C) and has developed some Java interfaces as well. The current (and first) version of the product, released in early December 2001, runs on Windows 95/98/NT/2000/XP, Linux and Sun Solaris.
Functionality and Technical Considerations
The functionality available to an application developer using NAG-DMC can be categorized into three broad groups: information acquisition and submission; data pre- and post-processing; and DM techniques (see Table 1). Sample NAG-DMC function calls for some of the DM tools are shown in Table 2. Detailed descriptions and exhaustive lists are provided in the documentation and in the example files available with the product.
Generic guidelines can be issued to the application developer based on considerations like business requirements, problem complexity, the nature and type of information as well as the CPU clockspeed and memory requirements. However, statistical and DM solutions often tend to be problem and data specific. One approach is to develop focused applications that use analytic tools in pre-defined ways, for use by the business experts and managers. Another possibility is to embed analytic toolboxes within applications, which can then be utilized by TS or DM experts.
The first step in the utilization of analytic tools for customized or general-purpose enterprise applications is to understand the business requirements, as well as the ability of these tools to address those requirements. The DM functionality available from NAG-DMC can solve four types of commonly encountered business problems, which are described in this section using examples from the retail, telecommunications and service sectors.
Clustering or the reduction and categorization of cases. One business problem addressed by NAG-DMC is a reduction of the total number of "cases" without significant loss of information, or categorization of these cases into meaningful groups. Examples of cases are: (1) all the stock keeping units (SKUs) for a retail company; (2) all current, previous and potential customers of a phone company; and (3) a list of all customer issues that need to be addressed by a service provider maintaining a call center. Reduction of the number of cases can help in information analyses and processing. Categorization into meaningful groups can help the user understand, predict and, at times, favorably influence the behavior of these groups.
The call center service provider might want to categorize the resolved and anticipated customer issues into a few groups based on the nature of the problem and the solution. The retail company might want to group the SKUs based on similarities in sales patterns or based on revenue contributions. The phone company might want to group customers based on how loyal they are expected to be, i.e., the anticipated "churn rates."
The reduction or categorization of all available cases into fewer groups is called "clustering" in the DM literature. As the examples suggest, clustering is often a prerequisite for classification and prediction tasks. While association of causality might not always be appropriate for the results of clustering, the reduction of the number of cases can help reduce the intrinsic problem's dimensionality, thus making it more tractable.
NAG-DMC offers two sets of functionality to solve these problems. K-means clustering is applicable to high, medium or low data volumes (i.e., number of cases), provided the number of clusters or groups is known in advance. Hierarchical clustering is applicable when the number of cases is low, but no advance knowledge is available on the number of groups. While iterative or trial-and-error schemes could be designed to estimate the number of groups for k-means clustering, these are not computationally efficient. As NAG's documentation points out, there is often a trade-off for DM applications between computational speed and accuracy.
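To make the k-means idea concrete, here is a minimal sketch in plain Python. It does not use or imitate the NAG-DMC API; the function names, the farthest-first initialization and the SKU data are all illustrative assumptions. It shows the two alternating steps of the standard (Lloyd's) algorithm and why the number of groups k has to be supplied up front.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two cases."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Group points into k clusters; k must be chosen in advance."""
    # Farthest-first initialization (an illustrative, deterministic choice):
    # start from the first case, then add the case farthest from all
    # centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each case to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical SKUs described by two made-up, pre-scaled sales attributes.
skus = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(skus, k=2)
print(labels)
```

Each pass over the data costs roughly (number of cases) x k distance evaluations, which is why the method scales to high data volumes, and also why rerunning it for many trial values of k becomes expensive.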
Reducing the number of variables. Certain variables might not be relevant in the context of the analyses or prediction problem. Multiple variables or fields might not always contain additional information and can often be combined into fewer variables. Additional variables that convey no useful information can burden the analytic engine, increase the likelihood of spurious results and reduce the possibility of discovering meaningful patterns. A business requirement addressed by NAG-DMC is to reduce the number of fields or variables (and hence the dimensionality of the problem) while retaining much of the original information. The functionality provided for this is Principal Component Analysis (PCA).
As in clustering, association of causality to the results of PCA might not always be appropriate but could be useful to illustrate the business requirement. For example, the phone company might have certain data regarding their customers (e.g., demographics, location, income, marital status, credit history, previous history of churn) available. Of these, variables pertaining to location alone might not be relevant to the analysis of churn rates, and there might not be any additional information in income or marital status other than what is already contained in credit history.
Similarly, if a certain set of product attributes is available, not all of them might be useful for analyzing sales patterns, and a few might be more useful than others. For the service provider, certain details of the calls, like customers' phone numbers, might not be relevant to the solutions provided.
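The idea that two fields may carry almost the same information can be illustrated with a small PCA sketch. This is plain Python under stated assumptions, not NAG-DMC code: the data are invented, the function name is hypothetical, and power iteration is used only because it keeps the example self-contained (it finds just the first principal component).

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the leading principal direction and its share of total variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. Assumes the start vector is not orthogonal to the answer.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical customer fields: the second is roughly twice the first,
# so it adds almost no new information.
customers = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.9)]
direction, share = first_principal_component(customers)
print(direction, share)
```

When one component explains nearly all of the variance, as it does for this made-up data, the two original fields can be replaced by a single combined variable with little loss of information.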
Classification or assigning cases to known groups. Another class of business problems that NAG-DMC addresses is classification of cases into groups. The phone company needs to be able to understand whether a new or an existing customer belongs to a category that is more likely to churn; the retail company needs to understand whether a newly introduced product would behave like existing products in terms of future sales; and the service provider needs to know if a new customer issue is similar to a known issue. These requirements entail analyzing new or existing cases, and assigning them to known groups.
If the majority of the variables that need to be used for classification are discrete (e.g., customer's gender) then NAG-DMC offers logistic regression for classification into two groups, and decision trees for multiple groups. On the other hand, if a majority of the variables are continuous (e.g., customer's income) then NAG-DMC offers logistic regression for classification into two groups, and k-nearest neighbors for multiple groups. Multi-layer perceptrons are also available for continuous variables, and are useful if the number of cases is low and each class approximately contains an equal number of cases.
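As an illustration of classification into more than two known groups with continuous variables, the following is a minimal k-nearest-neighbor sketch in plain Python. The customer records, the group labels and the function name are invented for this example and do not reflect the NAG-DMC interface.

```python
from collections import Counter


def knn_classify(train, query, k=3):
    """Assign query to the majority group among its k nearest training cases."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, pre-scaled customer records: (usage, tenure) -> known group.
train = [
    ((1.0, 0.5), "churn"), ((1.2, 0.4), "churn"), ((0.8, 0.7), "churn"),
    ((5.0, 4.0), "stay"), ((5.5, 4.5), "stay"), ((4.8, 3.9), "stay"),
    ((9.0, 1.0), "upgrade"), ((9.5, 1.2), "upgrade"), ((8.8, 0.9), "upgrade"),
]

print(knn_classify(train, (1.1, 0.6)))
print(knn_classify(train, (9.1, 1.1)))
```

Note that this sketch keeps every training case in memory and scans all of them for each query, which is consistent with the high memory demands the review attributes to the nearest-neighbor approach.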
Regression and prediction or assigning value to cases. Prediction and quantification are the end goals of many business analyses and planning endeavors. These problems are known to be difficult. However, most companies like to get an advance understanding of the future state of their business for tactical, operational and strategic planning needs. They might also want to quantify the revenue generated from product or customer categories. These are regression problems, and are referred to by NAG as "assigning values to cases."
In the earlier examples, once the phone company has identified the number of groups that the customers need to belong to, extracted the relevant variables for estimating churn and classified the customers into groups, it might want to generate predictions for customer behavior based on anticipated corporate or competitive policies, and design campaigns accordingly. The retail company might want to group SKUs by sales patterns and profitability, and then forecast future demand. That might help to plan promotions and discounts, optimize and allocate inventory, ensure adequate supply of raw materials, and take longer-term financial and human resource decisions. The service provider might need to forecast the type of calls that they are likely to get on various issues, to enable them to allocate resources appropriately.
NAG classifies the regression problem as "simple" or "complicated," depending on whether the underlying problem is linear or non-linear. Note that certain kinds of non-linearity might be adequately dealt with by transformation of variables (e.g., by taking a logarithm or an exponent), while more involved treatment might be useful for others. NAG-DMC offers linear regression for simple problems, which places low demands on CPU and memory. The k-nearest neighbor and the multi-layer perceptron can handle complicated problems. While the former can work for low, medium or high data volumes (number of cases), the latter is better suited to low volumes. The memory requirements are high for both, while the CPU clockspeed requirements are medium for the k-nearest neighbor approach and high for the multi-layer perceptron.
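The remark about transforming variables can be sketched as follows. This is plain Python with synthetic data, not NAG-DMC code: demand is assumed to grow exponentially, so taking its logarithm turns the problem into one that ordinary linear regression handles. The growth rate, base level and function name are assumptions made up for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Synthetic weekly demand following y = 50 * exp(0.3 * week) (an assumption).
weeks = [1, 2, 3, 4, 5, 6]
demand = [50.0 * math.exp(0.3 * w) for w in weeks]

# Regress log(demand) on week: the relationship is linear after the transform.
slope, intercept = fit_line(weeks, [math.log(d) for d in demand])
growth_rate, base_level = slope, math.exp(intercept)
print(growth_rate, base_level)
```

The fitted slope recovers the growth rate and the exponentiated intercept recovers the base level, so a "complicated" problem of this particular kind can be handled with the inexpensive linear tool.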
Performance and Scalability Considerations
The ability to handle large data sets is a key requirement for DM functionality designed for commercial developers. Where possible, the algorithms in DMC have been designed by NAG to handle large sets of data. There are at least two issues in handling large data sets: speed and storage (memory). The hierarchical clustering algorithm uses modified criteria to allow the analysis of more data than the traditional algorithm would allow in the same time. The nearest neighbor routine uses a tree search algorithm to reduce computation time from order O(n^2) to O(n log n) for n data points.
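The review does not say which tree search NAG uses. As one well-known example of the general technique, the sketch below builds a k-d tree in plain Python and uses it to find a nearest neighbor without comparing the query against every point. All names and the random data are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning subtrees when possible."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], query) < sq_dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (
        (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    )
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the
    # best match found so far; otherwise that whole subtree is skipped.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best


rng = random.Random(42)
data = [(rng.random(), rng.random()) for _ in range(500)]
tree = build_kdtree(data)
query = (0.5, 0.5)
print(nearest(tree, query))
```

For reasonably spread-out, low-dimensional data a single lookup visits on the order of log(n) nodes on average, so finding neighbors for all n points costs about n log(n) rather than the n*n of brute-force comparison. The advantage shrinks as the number of variables grows.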
The regression functions (linear and logistic) can fit a model by holding only a chunk of data in memory at any one time. This allows the regression routines to handle larger data sets than the traditional algorithms (by reducing the burden on "virtual memory," which in turn increases speed). However, NAG does not yet have any performance benchmarks for DMC, because the product is in the very early stages of its development life cycle. This information is crucial to developers and should be made available as soon as possible for future versions of the product.
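One way a linear regression can be fitted while holding only a chunk of data in memory is to accumulate running sums and discard each chunk once it has been processed. The sketch below shows this for a single predictor in plain Python. It is an assumption about the general approach, not a description of NAG's implementation, and the data source is a hypothetical generator. Logistic regression needs an iterative fit and is not covered here.

```python
def fit_line_in_chunks(chunks):
    """Least-squares line from running sums; only one chunk is in memory at a time."""
    n = sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


def read_chunks(n_rows, chunk_size):
    """Hypothetical data source yielding chunk_size rows at a time of y = 3x + 2."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield [(float(x), 3.0 * x + 2.0) for x in range(start, stop)]


slope, intercept = fit_line_in_chunks(read_chunks(1000, 100))
print(slope, intercept)
```

Memory use depends on the chunk size rather than on the total number of cases, which is the property the review highlights. A production routine would likely use a numerically more stable update than raw sums.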
Ease of Use, Documentation and Technical Support
NAG provides C++ and VB APIs for DMC. Given that their focus has been (and remains) on providing the greatest capabilities to custom developers, the lack of a traditional graphical user interface is understandable. The examples provided by NAG clearly demonstrate how the functions can be used in an application, and the documentation for NAG-DMC has additional details. Various existing components of NAG, like the ones on statistics, operations research and optimization, can also be used in conjunction with DMC to solve different business requirements.
The current set of documentation includes Install and User Guides and Frequently Asked Questions (FAQ). The installation process is relatively straightforward, and the FAQ and User Guides should be easy to understand for a technical user who is an expert on DM. The documentation is occasionally terse but always informative. When a customer buys a perpetual developer's license for NAG-DMC, one year of technical support via phone, fax or e-mail is automatically included with the purchase. Any updates or enhancements to the product that may come out during the first year period will be sent at no charge. Support for subsequent years is 15 percent of the purchase price and includes the same support and update components. There is only one level of support offered with the product. According to NAG, when working with software partners and/or organizations that may require specific levels of support for their clients and/or systems, NAG has worked out specific support metrics that meet or exceed the requirements. The author of this review was particularly impressed with the help and support he received from several members of the NAG team during the review process. Questions ranging from management vision to specific product and deployment issues were answered promptly.
Conclusions and Recommendations
The applications enabled by NAG's DMC could be custom solutions to a specific business problem or a generic solution that targets multiple companies across industry verticals. Tools like DMC are not designed to be complete business solutions, although they can be used to build these solutions. Also, DMC was not intended to be a sophisticated toolbox for research purposes, and does not provide elaborate features or programming capabilities optimized for the advanced TS or DM expert. The focus is on eventual use (including pre-processing and extraction of data) rather than on the ability to address DM research problems.
TS and DM vendors are judged by criteria such as the depth and quality of functionality, vendor reputation, ease of implementation, and the quality of deliverables like documentation and training. NAG-DMC scores high points on vendor reputation and prior experience, as well as on most other criteria. However, DM tools and methodologies have continued to evolve over the years (Fayyad et al., 1996; Kiang and Kumar, 2001), and additional functionality would make DMC more useful for the expert DM analyst.
Simple examples of custom applications with real or simulated data are needed for demonstration purposes and to reduce the learning curve for first-time developers. Business users would benefit from seeing specific examples from enterprise resource planning, advanced planning and scheduling, and customer relationship management. Best practices documents could also be included to illustrate the advantages of DMC for business managers and provide guidance for its use. And to ensure DMC's market acceptance, NAG needs to work with early adopters to develop customer references, deployment tips and statistics, and performance benchmarks, and to identify software glitches and potential design and implementation issues.
While this first release of NAG-DMC is not exhaustive in terms of DM functionality, it provides significant depth for the development of customized or generic enterprise applications. A developer of analytic applications should carefully consider NAG's DMC, and (depending on the nature of the problem) the use of DMC in conjunction with NAG's offerings in statistics, optimization and operations research. The features offered by DMC have been carefully selected by NAG to handle a variety of business requirements, and the implementation issues of speed, storage, scalability and ease of use have been considered in the design. The DMC development team and NAG's management appear committed to react quickly to customer needs, and to come up with significant new functionality in their upcoming releases.
Auroop R. Ganguly is a research associate at the MIT Sloan School of Management and the MIT School of Engineering. In addition, he is the product manager for Oracle's Demand Planning, which is a component of their e-business applications suite. He thanks Professor Amar Gupta of the MIT Sloan School of Management for his encouragement and support on this review.
OR/MS Today copyright © 2002 by the Institute for Operations Research and the Management Sciences. All rights reserved.