ModelQuest Expert 1.0
Reliable and fast data mining using statistical networks
By Ashutosh V. Deshmukh
Data mining is a growing field driven primarily by the database explosion and availability of powerful computers. Data mining discovers relationships, patterns, trends and predictive information from large and complex databases. Generally, data mining is data driven, not user driven or verification driven. Users do not form and test hypotheses; data mining tools discover previously unknown relationships by working bottom-up on the data set. Data mining provides a convenient way to extract information from large databases generated by corporations and other entities, and has been successfully employed by many businesses.
Data mining tools use different types of processing algorithms. The four primary methods are associations, clustering, classification and sequential patterns. These methods can be further grouped; for example, classification can be achieved by decision trees, rule discovery, neural networks, rough sets, etc. The Internet lists numerous companies producing "siftware" (software to sift through databases) using the above approaches.
Introduction to ModelQuest Expert
ModelQuest Expert (MQ Expert) is one data mining tool among a family of PC-based data mining tools (MQ MarketMiner and MQ Enterprise) developed by AbTech Corporation. MQ MarketMiner (currently in Beta) and MQ Enterprise are client/server applications that run on Windows NT and select UNIX platforms. They differ primarily by their ability to handle the number of variables, size of data and also their reports and analyses. The details of these products are available on the company's web site. MQ Expert is AbTech's stand-alone product capable of handling 100 variables and 32,000 examples. MQ Expert uses a unique approach to data mining employing multiple strategies.
There are seven different types of nodes (or elements) which are differentiated by the type of algebraic form of mathematical functions represented by that node. The MQ Expert manual claims that statistical networks are effective because mathematical functions can compactly capture numeric knowledge and that networks can deal with complex problems by subdividing them into smaller problems. The manual also states that a system of one million production rules can be easily represented by a function of only six variables.
MQ Expert trains inductively by using a set of examples. MQ Expert handles numeric parameters such as expert judgments, probabilities, fuzzy values, prices, costs, sensor readings, failure rates, etc. I used a complex binary data set (values of 1 and 0) in one analysis, and MQ Expert handled it comfortably. Models are formulated and evaluated automatically by the program and the resulting network is implemented as a layered network of feed-forward functional elements.
The functional element coefficients, number of network elements, types of node and connectivity are developed automatically from the data. A variety of statistical evaluations are available with the trained model, which I found very useful especially when comparing MQ Expert with traditional statistical methods.
An array of other reports and user-defined reports is also available. AbTech has an impressive roster of clients, and MQ tools have been applied to a variety of problems in military and industry. The illustrative applications of MQ tools are direct marketing, fraud detection, stock market prediction, risk analysis, fault detection and analysis of clinical data.
System requirements for MQ Expert are relatively modest. The minimal configuration is an IBM PC or compatible (486 or above; a math coprocessor is necessary), MS Windows or NT, 8MB of RAM, an MS-Windows compatible mouse and a CD-ROM drive. MQ Expert requires at least 20MB of space on a hard drive. As the data sets to be analyzed become larger, the hard disk space required also becomes larger. The software will of course run faster on a higher performance computer. I evaluated MQ Expert on P-133 and P-266 machines with 32MB and 64MB of RAM, respectively, each running Windows 95.
MQ Expert is a Windows product, and the installation is a routine Windows application installation. MQ Expert comes on CD, though alternative media are available on request. To install, one inserts the CD into a CD-ROM drive and, using Explorer, My Computer or File Manager, runs the file D:\install\MQXSetup.EXE. The installation wizard will then guide the user through the process. Generally the software is installed in the C:\MQEXPERT directory and the program is available from the Start menu in Windows 95. In the case of other versions of Windows (3.1, 3.11, or NT), the usual Windows procedures should be followed.
Data mining using MQ Expert
MQ Expert uses files called "Projects" to keep data, models, and evaluation results for a particular analysis. You can choose New Project from the File menu to create a new project window. The New Project window is shown in Figure 1.
Building a model with MQ Expert then involves four steps: importing the data (Data tab), training the model (Transform, Model, and Strategies tabs), evaluating the model performance (Analyses tab) and implementing the model. MQ Expert can import data from standard ASCII text files, Excel, or desktop databases such as Access, dBASE IV, FoxPro or Paradox. MQ Expert requires that data be in sequential order and separated by spaces, tabs or commas. An index number is automatically added to the data, and the data can be viewed in a spreadsheet format.
Before the user is able to train the model, the data needs to be transformed. The user has to define the data source for the transformation, which should be the data file imported in the first step. Once the user defines the data source, MQ Expert will populate the table in the Transform function. The user can then pre-process the data. For example, the user can add Uniform or Gaussian noise to the data, add a temporal shift to the data, or perform moving window operations such as Window Average, Window Minimum or Window Standard Deviation. An extensive collection of functions is available to pre-process the data. I did not use the pre-processing functions; however, these functions should be very useful in building different types of models from the same data. Once you define (or pre-process) the data you can split it for training and evaluation (holdout sample) purposes. The default split is 75 percent training and 25 percent evaluation data. The split can be random (default) or sequential. MQ Expert automatically stores the newly created data files.
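For readers who want to see what these two preparation steps amount to, here is a minimal Python sketch of a 75/25 training and evaluation split (random or sequential) and a trailing window average. The function names, the seed parameter and the treatment of the first few window positions are my assumptions for illustration; they do not describe how MQ Expert implements its Transform functions.

```python
import random


def split_examples(rows, train_fraction=0.75, mode="random", seed=0):
    """Split rows into (training, evaluation) lists.

    mode="sequential" keeps the original order, so the first 75 percent
    of the rows become training data; mode="random" shuffles first.
    """
    if mode not in ("random", "sequential"):
        raise ValueError("mode must be 'random' or 'sequential'")
    indices = list(range(len(rows)))
    if mode == "random":
        # A fixed seed (an assumption here) makes the split repeatable.
        random.Random(seed).shuffle(indices)
    cut = int(round(len(rows) * train_fraction))
    training = [rows[i] for i in indices[:cut]]
    evaluation = [rows[i] for i in indices[cut:]]
    return training, evaluation


def window_average(values, width):
    """Trailing moving average over `width` points.

    The first width-1 positions average over the shorter window that is
    available (an illustrative choice, not taken from the product).
    """
    averaged = []
    for i in range(len(values)):
        window = values[max(0, i - width + 1): i + 1]
        averaged.append(sum(window) / len(window))
    return averaged
```

A sequential split is the natural choice for time-ordered data, where shuffling would let the model see the future during training; a random split suits data with no meaningful order.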
The Model tab is then used to train the model; the resulting screen is shown in Figure 2.
The first step is to specify inputs and an output using the Define tab. Then you can train the model using the Train tab. The initial parameters such as complexity penalty multiplier, number of layers, and layer size are at default values. The term complexity penalty multiplier (CPM) needs some explanation. The CPM controls the trade-off between complexity of the model and accuracy of the model. A high value of CPM will encourage formation of simpler networks that generalize more readily. A low value will result in complex networks that may use many input variables and may overfit the data. MQ Expert also provides a utility that can be used to calculate an optimum value of CPM, which is discussed later. The user can view the network and change the polynomial equations. The training and model building, which uses statistical networks, are automatic. When training is in progress, the screen is split in three sections as shown in Figure 3.
The left window shows the current best network found by MQ Expert, and the top right window shows the hypothesized network (that will be compared with the current best network). At the bottom of the right window, statistics describing the current best network are available. The training process is extremely fast. I finished several analyses in a few minutes.
The trained network can be displayed by using the View tab (Figure 2). The relationship between the inputs and output is shown in the trained network, and if you click on a node you can view relevant statistics or the equation for that part of the polynomial.
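The effect of the CPM can be illustrated with a generic penalized criterion. The sketch below is not AbTech's formula, which the review does not have; it assumes a score equal to the training error plus a penalty proportional to the CPM and to the number of coefficients in the network. The candidate networks and all numbers are hypothetical.

```python
def penalized_score(avg_squared_error, n_coefficients, n_examples,
                    cpm, error_variance):
    """Training error plus a complexity penalty that grows with the CPM.

    The penalty form (2 * k / n * variance, scaled by cpm) is an assumed,
    textbook-style criterion used only for illustration.
    """
    penalty = cpm * (2.0 * n_coefficients / n_examples) * error_variance
    return avg_squared_error + penalty


def pick_network(candidates, n_examples, cpm, error_variance):
    """Return the name of the candidate with the lowest penalized score.

    Each candidate is (name, avg_squared_error, n_coefficients).
    """
    best = min(
        candidates,
        key=lambda c: penalized_score(c[1], c[2], n_examples,
                                      cpm, error_variance),
    )
    return best[0]


# Hypothetical candidates: more coefficients buy a lower training error.
CANDIDATES = [
    ("simple", 0.040, 3),
    ("medium", 0.030, 8),
    ("complex", 0.025, 20),
]
```

With these made-up candidates, 100 examples and an error variance of 0.05, a CPM of 0.1 selects the complex network, a CPM of 1 selects the medium one and a CPM of 5 selects the simple one, which is the behavior described above.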
The network can then be tested against the evaluation data using the Apply tab (Figure 2). Specify the data source (the holdout sample, which was created earlier) and click the Apply button. The trained network is applied to the new data set. The results appear in two windows: one provides evaluation statistics and the other provides a histogram of errors. The evaluation statistics window shows descriptive statistics for evaluation and training data, maximum absolute error, average absolute error, average squared error, and R². The manual states that average absolute error and maximum absolute error are important in determining whether you have a good model. If the average absolute error is low compared to the range between the output maximum and minimum, then you have a good model. The maximum absolute error indicates the worst performance of the model.
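These statistics are straightforward to reproduce outside the package when comparing against other methods. The following Python sketch computes them from actual and predicted outputs using the standard textbook definitions; the review does not state the exact formulas MQ Expert uses (for R² in particular), so treat the definitions and names here as assumptions.

```python
def evaluation_statistics(actual, predicted):
    """Compute holdout-sample error statistics from two equal-length lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_residual = sum(e * e for e in errors)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "max_abs_error": max(abs(e) for e in errors),
        "avg_abs_error": sum(abs(e) for e in errors) / n,
        "avg_squared_error": ss_residual / n,
        # R^2 is undefined when the actual output never varies.
        "r_squared": 1.0 - ss_residual / ss_total if ss_total > 0 else float("nan"),
        # Range of the output, for judging whether the average error is "low".
        "output_range": max(actual) - min(actual),
    }
```

Dividing `avg_abs_error` by `output_range` gives a unit-free figure for the manual's rule of thumb that the average absolute error should be small relative to the spread of the output.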
However, the determination of whether your model is good or not is more of an art. I used binary data in the management fraud project and most of the statistics were not very helpful. I compared the results of MQ Expert with neural networks and traditional statistical methods (logit regression and generalized qualitative response model) and found the results very comparable. The statistics provided by MQ Expert also make it superior to neural networks in some aspects. The histogram of errors provides a visual display of results. If the model is good then the histogram will look like a Gaussian curve with a mean of zero. MQ Expert can also provide the relative importance of each variable in the model, which can be used to perform sensitivity analysis.
Once the model is trained and evaluated, it can be applied to other input data using the Query function. The Query function provides a spreadsheet-like interface where you can enter input values for the significant variables found by MQ Expert. MQ Expert allows input values that fall within the range of the training sample; if a value falls outside this range then an automatic adjustment takes place. Once you input the values the Query function will calculate the output value. This function is useful in using the trained model for decision making. The trained network can be encoded into a C code module and can be embedded in other applications. In the case of my management fraud project, the Query function can be used to assess the risk of management fraud during the audit engagement.
MQ Expert also provides many expert strategies to simplify the task of model building. The four expert strategies provided by MQ Expert are: Batch Mode, CPM Optimization, Error Network of Networks Strategy and Input Network of Networks Strategy. The Batch Strategy permits the user to predefine a series of data transforms, model training, and model evaluation tasks. MQ Expert runs these tasks without supervision. This strategy can be useful when large numbers of models need to be built and tested.
The CPM Optimization Strategy can automatically find the best CPM value for a given model. The user can specify an evaluation criterion, such as average absolute error or maximum absolute error, and the search method. I found this strategy very useful since optimizing CPM manually is an extremely time-consuming task. Different evaluation criteria and search methods will provide different optimum CPM values. The user needs to identify the proper evaluation criterion and search method. This requires a good understanding of the problem and the data being analyzed.
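The dependence of the optimum CPM on the chosen criterion is easy to see in a small sketch. The Python below assumes the simplest possible search method, a scan over a list of candidate CPM values, and takes a caller-supplied `train_and_evaluate` function (hypothetical) that trains a model at a given CPM and returns its evaluation statistics. The numbers in `FAKE_RESULTS` are made-up placeholders, not MQ Expert output.

```python
def optimize_cpm(candidates, train_and_evaluate, criterion="avg_abs_error"):
    """Return (best_cpm, best_value) for the chosen evaluation criterion.

    train_and_evaluate(cpm) must return a dict of statistics that
    contains `criterion`; lower values are treated as better.
    """
    best_cpm, best_value = None, float("inf")
    for cpm in candidates:
        value = train_and_evaluate(cpm)[criterion]
        if value < best_value:
            best_cpm, best_value = cpm, value
    return best_cpm, best_value


# Placeholder statistics standing in for real training runs.
FAKE_RESULTS = {
    0.5: {"avg_abs_error": 0.12, "max_abs_error": 0.90},
    1.0: {"avg_abs_error": 0.10, "max_abs_error": 0.75},
    2.0: {"avg_abs_error": 0.11, "max_abs_error": 0.60},
    4.0: {"avg_abs_error": 0.15, "max_abs_error": 0.65},
}


def fake_train_and_evaluate(cpm):
    """Stand-in for training a network at the given CPM."""
    return FAKE_RESULTS[cpm]
```

With these placeholder values, minimizing average absolute error picks a CPM of 1.0, while minimizing maximum absolute error picks 2.0, so the choice of criterion has to come from the problem: average error when overall fit matters, maximum error when a single bad prediction is costly.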
The Error Network of Networks Strategy allows MQ Expert to learn where the model of input features worked well and where it did not work well and then helps to compensate in the areas where the model did not work well. In the Input Network of Networks Strategy, MQ Expert models the problem in stages: first, the inputs that are highly correlated are used, then the output is modeled again using input variables that have some information, and the process is repeated until there are no inputs left to consider. The output of these various networks is then used to create a final network that models the output. I used the Error Network of Networks Strategy and Input Network of Networks Strategy and found the results very confusing. The resulting model includes terms that cannot be intuitively understood and the model appears to be a black box. However, I attribute this confusion to my relative inexperience with this complex package.
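One way to read the Error Network of Networks description is as residual correction: a second model is trained on the errors of the first, and the two outputs are added. The Python sketch below illustrates only that general idea, using a constant as the first model and a fitted straight line as the error model. It is my interpretation of the description above, not AbTech's algorithm, and every name in it is hypothetical.

```python
def fit_constant(ys):
    """First-stage model: always predicts the mean of the training output."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys), returned as a function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_with_error_model(xs, ys):
    """Train a first model, then a second model on the first model's errors."""
    first = fit_constant(ys)
    residuals = [y - first(x) for x, y in zip(xs, ys)]
    error_model = fit_line(xs, residuals)
    # Final prediction = first model's output + predicted error.
    return lambda x: first(x) + error_model(x)
```

The combined model is harder to interpret than either stage alone because the final output is a sum of two fitted functions, which may be part of why the resulting terms look like a black box.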
MQ Expert provides a host of utilities for graphing data, comparing models, and summarizing projects. I discuss only the utilities that I found interesting. MQ Expert provides a utility, "Comparative Performance Evaluation," that can help compare the performance of two or more models. It essentially displays the evaluation statistics side by side. This utility is very helpful if you have developed many models and have lost track of which one worked best. Another utility, "Information Content Analysis," assigns a numerical measure of importance to each input variable.
The important input variables identified by this utility frequently differ from the variables included in the network models built by MQ Expert. The manual states that this analysis provides additional insight in the problem. I was unable to integrate the results of this utility with the earlier analysis. MQ Expert also provides a utility that summarizes performance evaluation results for each model. This utility provides a file name, description of network, date of analysis, and evaluation statistics. This is an excellent summary reference. The majority of the other utilities would be helpful in analyzing, rearranging, and summarizing the model results in various ways.
Documentation for MQ Expert
MQ Expert runs proprietary algorithms for data mining. It is impossible to comment on the technical soundness of the software. AbTech Corporation has been in this business for many years and its MQ tools have been used for various data mining applications. Experience indicates that the fundamental calculus behind these algorithms is technically solid. MQ Expert seems particularly preferable to neural networks since it provides a variety of statistics and automates many technical tasks.
The proprietary nature of the algorithms makes it imperative that the documentation for the software be excellent. MQ Expert manuals come with many easy-to-follow tutorials. The examples are varied, and once you complete them it is very easy to formulate your own models. I found that I was able to use MQ Expert within a day or so and had very little trouble. The manual also contains a very good discussion about recognizing good models, which is useful to novice model builders. The on-line help follows the printed manual and supplements it. I called Tech Support several times and always received prompt and correct answers.
MQ Expert is easy to use, and if one is patient enough to go through the manuals it is easy to learn. I am happy with the speed and accuracy of the software package. I encountered some minor problems with the package and documentation, which I summarize below.
Data mining is poised to grow in the coming years. The size and complexity of industrial and government databases are increasing exponentially. Traditional methods of data analysis are inadequate and in some cases the required expertise is scarce. Data mining provides a welcome alternative for non-technical users to exploit accumulated data. MQ Expert is an excellent tool for data mining. It has an easy, intuitive user interface, with the drag-and-drop simplicity of Windows, and strong analytical capabilities. The data analysis is done at high speed and the resulting networks can be embedded in other applications as C code.
A word of caution to all model builders is in order. Data mining software such as MQ Expert makes it easy to analyze data and identify important variables. However, no package can substitute for common sense, understanding of the problem, and judgment concerning the plausibility of the significance of variables. It is now easier to crunch data and arrive at wrong conclusions. Excellent tools like MQ Expert supplement but do not supplant human judgment.
Ashutosh V. Deshmukh is an Assistant Professor of Accounting at Pennsylvania State University Erie. He can be reached at firstname.lastname@example.org or at http://www.personal.psu.edu/faculty/avd1.
OR/MS Today copyright © 1997 by the Institute for Operations Research and the Management Sciences. All rights reserved.