OR/MS Today - October 2002



Software Review


Data Mining Components

Numerical Algorithms Group product offers a range of business applications

By Auroop R. Ganguly


Large- and mid-sized companies have expended significant resources in acquiring and maintaining databases and data warehouses for information storage, representation and retrieval. Managers and business planners need tools to extract meaningful knowledge about their business and their customers from these data repositories. Business applications ranging from prediction of customer churn rates for phone companies to demand management in the presence of promotions benefit from tools like data mining (DM) and traditional statistics (TS). The Data Mining Components (DMC) product from Numerical Algorithms Group (NAG) targets these areas, and is meant primarily for developers of enterprise-scale business applications.

While the distinction between TS and DM is neither rigid nor well defined, one could argue that DM extends TS by discovering hidden patterns in data and by focusing on deployment to large data sets. The operations research and management science community (and indeed the scientific community in general) is beginning to acknowledge that well-designed DM tools can complement, and in certain situations outperform, TS. Recent research (Cooper and Giuffrida, 2000; Ganguly and Gupta, 2002; Reyes-Aldasoro et al., 1999) indicates that well-designed combinations of TS and DM often work better than either tool alone. While TS methods are theoretically rigorous, DM methods can often find "interesting" patterns where TS methods cannot (Fayyad et al., 1996). The advantage of NAG's software is that the relatively new DM components and the existing TS components can be used in conjunction, both by application developers and by researchers and professionals in OR/MS.

All About NAG


NAG is a leading provider of software that solves complex mathematical problems. Founded in 1970 as a University of Nottingham project in the United Kingdom, it now has offices in the U.K., Germany, Japan and the United States. In 1971, NAG developed the first mathematical software library, which now has a significant worldwide customer base and covers numerous mathematical and statistical functions. The range of products and services that NAG offers has continually expanded to include statistical, symbolic, visualization and numerical simulation software, compilers and application development tools, and wide-ranging consulting.

The NAG-DMC software is a collection of data cleaning, preparation, model building and utility functions for DM applications. These functions are designed to be called from an application program (typically written in C, C++, Visual Basic [VB] or Java) and are aimed at the professional developer, either for building customized applications or for use in commercial software. NAG provides C++ and VB Application Programming Interfaces (APIs) for the components (which are written in C), and has developed some Java interfaces as well. The current (and first) version of the product, released in early December 2001, runs on Windows 95/98/NT/2000/XP, Linux and Sun Solaris.

Functionality and Technical Considerations


The functionality available to an application developer using NAG-DMC can be categorized into three broad groups: information acquisition and submission; data pre- and post-processing; and DM techniques (see Table 1). Sample NAG-DMC function calls for some of the DM tools are shown in Table 2. Detailed descriptions and exhaustive lists are provided in the documentation and in the example files available with the product.

Table 1: NAG-DMC Features
Information Acquisition and Submission Capabilities
• Reads data into memory
• Calculates the number of rows and variables in an ASCII file
• Extracts information on discrete fields in a data set
• Cleans a file of missing values and replaces character strings with numeric values
• Connects to an Open Database Connectivity (ODBC) database and extracts data from a table
• Connects to an ODBC database and exports to a table

Data Pre- and Post-Processing Functionality
• Calculates dummy variables for data fields
• Calculates mean and standard deviation of a data set
• Generates random numbers
• Scales data to have zero mean and unit variance
• Computes a principal components analysis
• Cross-tabulates classification data
• Computes sum of squares within a cluster

Data Mining Techniques
• Linear regression
• Logistic regression
• Hierarchical clustering
• K-Means clustering
• K-Nearest neighbor algorithm for classification and regression
• Decision tree construction and pruning for classification
• Multi-layer perceptron training and testing for classification and regression


Table 2: Example Function Calls
Data cleaning
nag_dmc_data_clean(const char fnamestr[], const char fnameout[], long n_rec, long n_var, const char *missing_flags[], long n_missing_flags, const char separators[], long buffer_length, long str_fields[], long n_str_fields, double **data, long *excluded, long n_excluded[], int info[])

K-means clustering
nag_dmc_kmeans(long rec1, long n_var, long n_rec, long dblk, double **data, void(NAG_DMC_CALL *dataFunction) (long , long , double ** , int[] ), long iwts, long k, long ic[], double **c, long maxit, int info[1])

Principal components analysis
nag_dmc_pca(long rec1, long nvar, long n_rec, long dblk, double **data, void (NAG_DMC_CALL *dataFunction) (long , long , double ** , int[]), long chunk_size, long iwts, long pca_type, double *xbar, double *s, double loadings[], double results[], int info[1])

K-nearest neighbor classification
nag_dmc_knn_class(long rec1, long n_var, long n_rec, long dblk, double **data, void (NAG_DMC_CALL *dataFunction) (long , long , double ** , int []), long rec1B, long n_varB, long n_recB, long n_independent, long independent[], long dependent, long n_groups, long n_in_groups[], double priors[], double thresh, long norm, long k, long ropt, long **nns, double **dists, long *results, int info[])

Linear regression
nag_dmc_regr(long rec1, long nvar, long n_rec, long dblk, double **user_data, void(NAG_DMC_CALL *dataFunction) (long , long , double ** , int[]), long chunk_size, long iwts, long yvar, double *rms, long *df, double *R2, double b[], double se[], double cov[], double eps, double *model, int info[1])

Multi-layer perceptrons
nag_dmc_mlp(long rec1, long n_var, long n_rec, long dblk, double **data, void (NAG_DMC_CALL *dataFunction) (long , long , double **, int []), long n_independent, long independent[], long n_dependent, long dependent[], long n_hidden, double alpha, double eta[], double rho, double mu, double prime_const, long act_fun[2], double gain, long max_num_epochs, long algo, long validate, long n_val, long s_val, long i_val, double **train_results, double train_sse[], double validate_sse[], MLPData *MLPD, int info[])


Generic guidelines can be issued to the application developer based on considerations like business requirements, problem complexity, the nature and type of information, and the CPU clock speed and memory requirements. However, statistical and DM solutions often tend to be problem- and data-specific. One approach is to develop focused applications that use analytic tools in pre-defined ways, for use by business experts and managers. Another possibility is to embed analytic toolboxes within applications, which can then be utilized by TS or DM experts.

The first step in the utilization of analytic tools for customized or general-purpose enterprise applications is to understand the business requirements, as well as the ability of these tools to address those requirements. The DM functionality available from NAG-DMC can solve four types of commonly encountered business problems, which are described in this section using examples from the retail, telecommunications and service sectors.

Clustering or the reduction and categorization of cases. One business problem addressed by NAG-DMC is a reduction of the total number of "cases" without significant loss of information, or categorization of these cases into meaningful groups. Examples of cases are: (1) all the stock keeping units (SKUs) for a retail company; (2) all current, previous and potential customers of a phone company; and (3) a list of all customer issues that need to be addressed by a service provider maintaining a call center. Reduction of the number of cases can help in information analyses and processing. Categorization into meaningful groups can help the user understand, predict and, at times, favorably influence the behavior of these groups.

The call center service provider might want to categorize the resolved and anticipated customer issues into a few groups based on the nature of the problem and the solution. The retail company might want to group the SKUs based on similarities in sales patterns or based on revenue contributions. The phone company might want to group customers based on how loyal they are expected to be, i.e., the anticipated "churn rates."

The reduction or categorization of all available cases into fewer groups is called "clustering" in the DM literature. As the examples suggest, clustering is often a prerequisite for classification and prediction tasks. While attributing causality to the results of clustering might not always be appropriate, the reduction of the number of cases can help reduce the problem's intrinsic dimensionality, thus making it more tractable.

NAG-DMC offers two sets of functionality to solve these problems. K-means clustering is applicable to high, medium or low data volumes (i.e., number of cases), provided the number of clusters or groups is known in advance. Hierarchical clustering is applicable when the number of cases is low, but no advance knowledge is available on the number of groups. While iterative or trial-and-error schemes could be designed to estimate the number of groups for k-means clustering, these are not computationally efficient. As NAG's documentation points out, there is often a trade-off for DM applications between computational speed and accuracy.
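The k-means idea described above can be sketched in a few lines of C. This is a minimal illustration that assumes the standard Lloyd iteration, row-major data and starting centers supplied by the caller. The names kmeans_sketch and sq_dist are hypothetical; this is not the nag_dmc_kmeans interface listed in Table 2, and NAG's actual algorithm may differ.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of NAG-DMC): squared Euclidean distance. */
static double sq_dist(const double *a, const double *b, size_t n_var)
{
    double d = 0.0;
    for (size_t j = 0; j < n_var; ++j) {
        double diff = a[j] - b[j];
        d += diff * diff;
    }
    return d;
}

/*
 * Lloyd's k-means iteration on row-major data (n_rec rows, n_var columns).
 * centers holds k starting centers on entry and the fitted centers on exit.
 * assign receives the cluster index of every record.
 * Returns the number of passes performed, or -1 if allocation fails.
 */
int kmeans_sketch(const double *data, size_t n_rec, size_t n_var, size_t k,
                  double *centers, size_t *assign, int max_iter)
{
    assert(data && centers && assign);
    assert(k > 0 && n_rec >= k && n_var > 0);

    double *sums = malloc(k * n_var * sizeof *sums);
    size_t *counts = malloc(k * sizeof *counts);
    if (!sums || !counts) {
        free(sums);
        free(counts);
        return -1;
    }

    int passes = 0;
    while (passes < max_iter) {
        int changed = 0;
        for (size_t c = 0; c < k; ++c) counts[c] = 0;
        for (size_t i = 0; i < k * n_var; ++i) sums[i] = 0.0;

        /* Assignment step: attach each record to its nearest center. */
        for (size_t i = 0; i < n_rec; ++i) {
            const double *row = data + i * n_var;
            size_t best = 0;
            double best_d = DBL_MAX;
            for (size_t c = 0; c < k; ++c) {
                double d = sq_dist(row, centers + c * n_var, n_var);
                if (d < best_d) { best_d = d; best = c; }
            }
            if (passes == 0 || assign[i] != best) changed = 1;
            assign[i] = best;
            counts[best] += 1;
            for (size_t j = 0; j < n_var; ++j) sums[best * n_var + j] += row[j];
        }

        /* Update step: move each center to the mean of its members. */
        for (size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  /* empty cluster keeps its center */
            for (size_t j = 0; j < n_var; ++j)
                centers[c * n_var + j] = sums[c * n_var + j] / (double)counts[c];
        }

        ++passes;
        if (!changed) break;  /* assignments are stable: converged */
    }

    free(sums);
    free(counts);
    return passes;
}
```

Note that k must be fixed before the call, which is the limitation discussed above: estimating the number of groups by rerunning such a routine for several values of k multiplies the cost accordingly.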

Reducing the number of variables. Certain variables might not be relevant in the context of the analysis or prediction problem. Multiple variables or fields might not always contain additional information and can often be combined into fewer variables. Additional variables that convey no useful information can burden the analytic engine, increase the likelihood of spurious results and reduce the possibility of discovering meaningful patterns. A business requirement addressed by NAG-DMC is to reduce the number of fields or variables (and hence the dimensionality of the problem) while retaining much of the original information. The functionality provided for this is Principal Component Analysis (PCA).

As in clustering, association of causality to the results of PCA might not always be appropriate but could be useful to illustrate the business requirement. For example, the phone company might have certain data regarding their customers (e.g., demographics, location, income, marital status, credit history, previous history of churn) available. Of these, variables pertaining to location alone might not be relevant to the analysis of churn rates, and there might not be any additional information in income or marital status other than what is already contained in credit history.

Similarly, if a certain set of product attributes is available, not all of them might be useful for analyzing sales patterns, and a few of these might be more useful than others. For the service provider, certain details of the calls, like customers' phone numbers, might not be relevant to the solutions provided.

Classification or assigning cases to known groups. Another class of business problems that NAG-DMC addresses is classification of cases into groups. The phone company needs to be able to understand whether a new or an existing customer belongs to a category that is more likely to churn; the retail company needs to understand whether a newly introduced product would behave like existing products in terms of future sales; and the service provider needs to know if a new customer issue is similar to a known issue. These requirements entail analyzing new or existing cases, and assigning them to known groups.

If the majority of the variables that need to be used for classification are discrete (e.g., customer's gender), then NAG-DMC offers logistic regression for classification into two groups, and decision trees for multiple groups. On the other hand, if the majority of the variables are continuous (e.g., customer's income), then NAG-DMC offers logistic regression for classification into two groups, and k-nearest neighbors for multiple groups. Multi-layer perceptrons are also available for continuous variables, and are useful if the number of cases is low and each class contains approximately the same number of cases.
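To make the k-nearest neighbor option concrete, the following is a minimal sketch of majority-vote classification on continuous variables. It assumes squared Euclidean distance, a brute-force scan of the training records and small fixed limits on k and the number of groups. The name knn_classify_sketch and the KNN_MAX_* limits are hypothetical; the real nag_dmc_knn_class routine in Table 2 takes many more arguments (priors, thresholds, norms) that are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

#define KNN_MAX_K 16        /* illustrative limits chosen for this sketch */
#define KNN_MAX_GROUPS 16

/*
 * Classify one query point by majority vote among its k nearest
 * training records. train is row-major (n_rec rows of n_var values)
 * and labels[i] must lie in [0, n_groups).
 * Returns the winning group index; ties go to the lowest index.
 */
size_t knn_classify_sketch(const double *train, const size_t *labels,
                           size_t n_rec, size_t n_var, size_t n_groups,
                           const double *query, size_t k)
{
    assert(train && labels && query);
    assert(k >= 1 && k <= KNN_MAX_K && k <= n_rec);
    assert(n_groups >= 1 && n_groups <= KNN_MAX_GROUPS);

    double best_d[KNN_MAX_K];      /* k smallest distances, ascending */
    size_t best_label[KNN_MAX_K];  /* labels of those records */
    size_t found = 0;

    for (size_t i = 0; i < n_rec; ++i) {
        double d = 0.0;
        for (size_t j = 0; j < n_var; ++j) {
            double diff = train[i * n_var + j] - query[j];
            d += diff * diff;
        }
        /* Keep the record only if it belongs among the k closest so far. */
        if (found < k || d < best_d[found - 1]) {
            size_t pos = (found < k) ? found++ : k - 1;
            while (pos > 0 && best_d[pos - 1] > d) {
                best_d[pos] = best_d[pos - 1];
                best_label[pos] = best_label[pos - 1];
                --pos;
            }
            best_d[pos] = d;
            best_label[pos] = labels[i];
        }
    }

    /* Majority vote over the retained neighbors. */
    size_t votes[KNN_MAX_GROUPS] = {0};
    for (size_t i = 0; i < found; ++i) {
        assert(best_label[i] < n_groups);
        votes[best_label[i]] += 1;
    }
    size_t winner = 0;
    for (size_t g = 1; g < n_groups; ++g)
        if (votes[g] > votes[winner]) winner = g;
    return winner;
}
```

Because the vote works for any number of groups, the same routine covers the multiple-group case; the brute-force scan costs one pass over the training data per query, which is the cost a tree-based search aims to reduce.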

Regression and prediction or assigning values to cases. Prediction and quantification are the end goals of many business analyses and planning endeavors. These problems are known to be difficult. However, most companies like to get an advance understanding of the future state of their business for tactical, operational and strategic planning needs. They might also want to quantify the revenue generated from product or customer categories. These are regression problems, and are referred to by NAG as "assigning values to cases."

In the earlier examples, once the phone company has identified the number of groups that the customers need to belong to, extracted the relevant variables for estimating churn and classified the customers into groups, it might want to generate predictions for customer behavior based on anticipated corporate or competitive policies, and design campaigns accordingly. The retail company might want to group SKUs by sales patterns and profitability, and then forecast future demand. That might help to plan promotions and discounts, optimize and allocate inventory, ensure adequate supply of raw materials, and take longer-term financial and human resource decisions. The service provider might need to forecast the type of calls that they are likely to get on various issues, to enable them to allocate resources appropriately.

NAG classifies the regression problem as "simple" or "complicated," depending on whether the underlying problem is linear or non-linear. Note that certain kinds of non-linearity might be adequately dealt with by transformation of variables (e.g., by taking a logarithm or an exponent), while more involved treatment might be useful for others. NAG-DMC offers linear regression for simple problems, which places little demand on CPU and memory. The k-nearest neighbor and the multi-layer perceptron methods can handle complicated problems. While the former can work for low, medium or high data volumes (number of cases), the latter is better suited to low volumes. The memory requirements are high for both, while the CPU clock speed requirements are medium for the k-nearest neighbor approach and high for the multi-layer perceptron.
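As a worked illustration of the variable transformation mentioned above, consider an assumed exponential relationship between a response y and a predictor x (the model and the symbols a, b, z and \beta_0 are chosen here for illustration and do not come from NAG's documentation). Taking logarithms turns it into a problem that linear regression can handle:

```latex
y = a\,e^{b x}, \qquad a > 0
\quad\Longrightarrow\quad
\ln y = \ln a + b\,x
\quad\Longrightarrow\quad
z = \beta_0 + b\,x, \qquad z = \ln y, \;\; \beta_0 = \ln a .
```

Fitting z against x by linear regression yields estimates of \beta_0 and b, and a is recovered as e^{\beta_0}. Relationships that no such transformation can linearize are the "complicated" cases.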

Performance and Scalability Considerations


The ability to handle large data sets is a key requirement for DM functionality designed for commercial developers. Where possible, the algorithms in DMC have been designed by NAG to handle large sets of data. There are at least two issues in handling large data sets: speed and storage (memory). The hierarchical clustering algorithm uses modified criteria to allow the analysis of more data than the traditional algorithm would allow in the same time. The nearest neighbor routine uses a tree search algorithm to reduce computation time from order O(n*n) to O(n*log(n)) for n data points.

The regression functions (linear and logistic) can fit a model by holding only a chunk of data in memory at any one time. This allows the regression routines to handle larger data sets than the traditional algorithms (by reducing the burden on "virtual memory," which in turn increases speed). However, NAG does not yet have any performance benchmarks for DMC, because the product is in the very early stages of its development life cycle. This information is crucial to developers and should be made available as soon as possible for future versions of the product.
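The chunk-at-a-time idea can be illustrated with a minimal sketch for the simplest case: a straight-line fit with one predictor. The sketch assumes the model is summarized by a handful of running sums, so each chunk can be discarded once it has been added. The names RegrSums, regr_add_chunk and regr_solve are hypothetical, and this is not a description of how nag_dmc_regr is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Running sums that summarize every record seen so far. */
typedef struct {
    size_t n;     /* number of records */
    double sx;    /* sum of x */
    double sy;    /* sum of y */
    double sxx;   /* sum of x*x */
    double sxy;   /* sum of x*y */
} RegrSums;

/* Fold one chunk of records into the running sums. */
void regr_add_chunk(RegrSums *s, const double *x, const double *y, size_t len)
{
    assert(s && x && y);
    for (size_t i = 0; i < len; ++i) {
        s->n += 1;
        s->sx += x[i];
        s->sy += y[i];
        s->sxx += x[i] * x[i];
        s->sxy += x[i] * y[i];
    }
}

/*
 * Solve the least-squares normal equations for y = intercept + slope * x.
 * Returns 0 on success, or -1 if the slope cannot be determined
 * (fewer than two records, or all x values identical).
 */
int regr_solve(const RegrSums *s, double *intercept, double *slope)
{
    assert(s && intercept && slope);
    double n = (double)s->n;
    double denom = n * s->sxx - s->sx * s->sx;
    if (s->n < 2 || denom == 0.0) return -1;
    *slope = (n * s->sxy - s->sx * s->sy) / denom;
    *intercept = (s->sy - *slope * s->sx) / n;
    return 0;
}
```

Memory use depends on the chunk size rather than on the total number of records. Raw sums like these can lose precision when values are large, so a production routine would likely use a more numerically careful update.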

Ease of Use, Documentation and Technical Support


NAG provides C++ and VB APIs for DMC. Given that their focus has been (and remains) on providing the greatest capabilities to custom developers, the lack of a traditional graphical user interface is understandable. The examples provided by NAG clearly demonstrate how the functions can be used in an application, and the documentation for NAG-DMC has additional details. Various existing components of NAG, like the ones on statistics, operations research and optimization, can also be used in conjunction with DMC to solve different business requirements.

The current set of documentation includes Install and User Guides and Frequently Asked Questions (FAQ). The installation process is relatively straightforward, and the FAQ and User Guides should be easy to understand for a technical user who is an expert on DM. The documentation is occasionally terse but always informative.

When a customer buys a perpetual developer's license for NAG-DMC, one year of technical support via phone, fax or e-mail is automatically included with the purchase. Any updates or enhancements to the product that come out during the first year will be sent at no charge. Support for subsequent years costs 15 percent of the purchase price and includes the same support and update components. Only one level of support is offered with the product. According to NAG, when working with software partners and/or organizations that may require specific levels of support for their clients and/or systems, NAG has worked out specific support metrics that meet or exceed the requirements.

The author of this review was particularly impressed with the help and support he received from several members of the NAG team during the review process. Questions ranging from management vision to specific product and deployment issues were answered promptly.

Conclusions and Recommendations


The applications enabled by NAG's DMC could be custom solutions to a specific business problem or a generic solution that targets multiple companies across industry verticals. Tools like DMC are not designed to be complete business solutions, although they can be used to build these solutions. Also, DMC was not intended to be a sophisticated toolbox for research purposes, and does not provide elaborate features or programming capabilities optimized for the advanced TS or DM expert. The focus is on eventual use (including pre-processing and extraction of data) rather than on the ability to address DM research problems.

TS and DM vendors are judged by criteria such as the depth and quality of functionality, vendor reputation, ease of implementation, and the quality of deliverables like documentation and training. NAG-DMC scores high points on vendor reputation and prior experience, as well as on most other criteria. However, DM tools and methodologies have continued to evolve over the years (Fayyad et al., 1996; Kiang and Kumar, 2001), and additional functionality would make DMC more useful for the expert DM analyst.

Simple examples of custom applications with real or simulated data are needed for demonstration purposes and to reduce the learning curve for first-time developers. Business users would benefit from seeing specific examples from enterprise resource planning, advanced planning and scheduling, and customer relationship management. Best practices documents could also be included to illustrate the advantages of DMC for business managers and provide guidance for its use. And to ensure DMC's market acceptance, NAG needs to work with early adopters to develop customer references, deployment tips and statistics, and performance benchmarks, and to identify software glitches and potential design and implementation issues.

While this first release of NAG-DMC is not exhaustive in terms of DM functionality, it provides significant depth for the development of customized or generic enterprise applications. A developer of analytic applications should carefully consider NAG's DMC, and (depending on the nature of the problem) the use of DMC in conjunction with NAG's offerings in statistics, optimization and operations research. The features offered by DMC have been carefully selected by NAG to handle a variety of business requirements, and the implementation issues of speed, storage, scalability and ease of use have been considered in the design. The DMC development team and NAG's management appear committed to react quickly to customer needs, and to come up with significant new functionality in their upcoming releases.

Product Summary

Pricing and detailed product information for NAG can be found on their Web sites, which also include implementation availability, descriptions of products, downloadable software, product documentation and technical reports. NAG's products can be purchased for a PC or single-user (non-server) workstation from NAG's Web store, or by downloading an order form from one of their Web sites (see below) and faxing it to NAG.

NAG provides several different licensing options, but the most frequently used are the perpetual developer licenses. For commercial and governmental organizations, this license is $8,000 for Windows or Linux (1-2 CPU workstation) and $12,000 for Unix. Multi-quantity discounts are available. Under the same license, additional runtime licenses are $1,600 per workstation for Windows or Linux and $2,400 per workstation for Unix. For academic institutions, the development licenses are $4,800 and $3,200, respectively. For commercial software developers, NAG offers special terms and services. Site licenses can also be quoted on request.

NAG Web sites
North America
Numerical Algorithms Group
Tel: (630) 971-2337
Fax: (630) 971-2706
E-mail: infodesk@nag.com
Web: www.nag.com

Europe, Africa, Asia, South America
NAG Ltd, UK
Tel: +44 (0)1865 311744
Fax: +44 (0)1865 311755
E-mail: infodesk@nag.co.uk
Web: www.nag.co.uk

Japan
Nihon NAG KK
Tel: +81 (0)3 5542 6311
Fax: +81 (0)3 5542 6312
E-mail: help@nag-j.co.jp
Web: www.nag-j.co.jp


Vendor Comments

Editor's note: It is the policy of OR/MS Today to allow developers of reviewed software an opportunity to clarify and/or comment on the review article. Following are comments from Robert W. Meyer, president of the Numerical Algorithms Group.

We would like to thank and compliment Mr. Auroop Ganguly for his thorough and fair review of the Data Mining Components (DMC) from Numerical Algorithms Group (NAG).

The NAG DMC are a logical extension of work by NAG over the past 30 years to bring the most robust, accurate and best performing math and statistical routines to all major computing platforms. Our various numerical libraries now include more than 2,500 component routines for modeling, simulation and advanced statistical analysis; these include various equation solvers (including PDE & ODE), optimization routines, OR methods and advanced statistical methods among many others.

This growing body of code is used by developers building company-wide (enterprise) applications, as well as by developers of commercial software products. Our intent with this first version of the DMC product was to create a good selection of modeling components coupled with data cleaning, data preparation and utility routines. We recognize that developers need more than modeling methods to create a successful application.

We consciously chose to create DMC as a library of components, in C, with interfaces for various other languages (C++ and Java among them). As a result, developers can readily embed these techniques within their own applications rather than have to deal with a proprietary graphical user interface (GUI). We also chose to expose more of the application programming interface so that skilled developers can "steer" the routines for their particular needs. While this requires a higher level of effort at the front-end, it enables high performance with large datasets and much more flexibility for integration with existing data warehouses and applications.

Finally, while we recognize that the current form of the product is an excellent set of tools, we are well into planning extensions and improvements to this first product. NAG has a 30-year history of excellence in functionality, documentation, performance, continuous product improvements and technical support. We hope that OR/MS Today readers will take the opportunity to download a free trial of NAG DMC and related products at our Web site (www.nag.com).

References


  1. Cooper, L. G. and G. Giuffrida, 2000, "Turning Datamining into a Management Science Tool: New Algorithms and Empirical Results," Management Science, Vol. 46, No. 2, pgs. 249-264.
  2. Fayyad, U., G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996, "Advances in Knowledge Discovery and Data Mining," AAAI Press/MIT Press.
  3. Ganguly, A. R. and A. Gupta, 2002, "Hybrid statistical and data mining tools for business planning and management science - Algorithms and case studies," accepted for the INFORMS Annual Meeting, November.
  4. Kiang, M. Y. and A. Kumar, 2001, "An evaluation of self-organizing map networks as a robust alternative to factor analysis in data mining applications," Information Systems Research, Vol. 12, No. 2, pgs. 177-194.
  5. Reyes-Aldasoro, C. C., A. R. Ganguly, G. Lemus and A. Gupta, 1999, "A hybrid model based on dynamic programming, neural networks, and surrogate value for inventory optimization applications," Journal of the Operational Research Society, Vol. 50, No. 1, pgs. 85-94.



Auroop R. Ganguly is a research associate at the MIT Sloan School of Management and the MIT School of Engineering. In addition, he is the product manager for Oracle's Demand Planning, which is a component of their e-business applications suite. He thanks Professor Amar Gupta of the MIT Sloan School of Management for his encouragement and support on this review.







    OR/MS Today copyright 2002 by the Institute for Operations Research and the Management Sciences. All rights reserved.


    Lionheart Publishing, Inc.
    506 Roswell Rd., Suite 220, Marietta, GA 30060 USA
    Phone: 770-431-0867 | Fax: 770-432-6969
    E-mail: lpi@lionhrtpub.com
    URL: http://www.lionhrtpub.com

