OR/MS Today - October 2001
Data mining system targets likely respondents
By Carl Cozine
The best way to review a predictive modeling tool is to discuss its application to a particular data analysis task. Denver-based Cactus Strategies helps broadband companies achieve business success by creating and using heuristic predictive models for database marketing. This helps companies identify the best markets and acquire new business customers. For this review of the PolyAnalyst data mining system from Megaputer Intelligence, Inc., I will describe its use in a response prediction project carried out for a national broadband company introducing a new product to the market.
The importance of the project can be demonstrated by the following figures. According to The Yankee Group, the number of U.S. subscribers to high-speed Internet access services will grow to 16.6 million by 2004, up from 1.4 million at the end of 1999. The Strategies Group's study on Global Broadband Markets projects that broadband services will become a $160 billion industry in four years. The broadband industry encompasses cable, ILEC (incumbent local exchange carrier), CLEC (competitive local exchange carrier), satellite and wireless companies. The industry is driven by strong demand from business and residential users for faster Internet access.
The business problem solved in this application was to use information about existing broadband customers to create a statistically accurate predictive model of a customer. Then this model was applied to databases of U.S. businesses across the country to simultaneously analyze market potential and locate the best sales prospects. This helps the company significantly increase the traditionally low (about 1 percent to 2 percent for random contacting) response rate of a direct marketing campaign, and thus save significant marketing resources.
A synergy of two things was necessary for the success of the project. First, one required training data: a database of existing broadband customers that could be used for building a predictive model. Cactus Strategies formed a relationship with one of the hardware suppliers to the broadband industry to acquire a list of "known" buyers. The data included information on some 50,000 companies that expressed interest in products similar to the newly promoted one. Since no information about non-buyers of the new product was available, we added to the data 50,000 records of randomly selected companies, purchased from Claritas, an independent vendor of business data. Several pieces of information were provided about each company, including the size, location, business classification, operation type and time period the company was tracked, for a total of 53 attributes.
Second, one needed a data mining software tool that would uncover classification rules predicting buyers among prospects. These rules are applied to the main database of prospects to predict their propensity to buy. We selected Megaputer's PolyAnalyst system for data mining and knowledge discovery because it offered broad analytic functionality and ease of use beyond any other product. PolyAnalyst provides 11 machine-learning algorithms, each furnishing a separate valuable report, as well as a large selection of data and result visualization and manipulation functions.
The system is very easy to install, learn and operate. An intuitive graphical user interface makes data exploration simple, and hiding all the statistical complexities of the performed data analysis behind the scenes allows non-statisticians to run successful data analysis projects. PolyAnalyst comes supplied with a step-by-step Tutorial containing eight lessons from different application fields. The system ranges in cost from $4,000 to $12,000 depending on the set of machine-learning algorithms provided.
Customer support is available by telephone and over the Internet. On-site coaching can also be arranged to help clients develop specific applications. The support of the Megaputer team, who were available at all times to facilitate our learning and use of the product, was invaluable. The documentation provided was clear and easy to understand, and the step-by-step tutorials allowed our staff to begin production almost immediately.
PolyAnalyst is an integrated data mining tool that allows the user to address all issues involved in a data analysis project: data import or linking to an external database, transformation, analysis, visualization, results reporting, and model application to the bulk of the data. The system assists in clustering, classifying, segmenting, predicting and explaining data, as well as finding association rules. PolyAnalyst offers a broad selection of data visualization tools including histograms, two-dimensional line and scatter plots, rule graphs, thermal charts and interactively rotating three-dimensional charts. In addition, the system provides Snake Charts (based on parallel axes technology), which visually compare different data sets on all three attributes at once (Figure 1). Lift and Gain charts, which are very popular among database marketers, help the user immediately evaluate how the derived predictive model affects the profitability of the contemplated direct marketing campaign. Another very important feature of PolyAnalyst is its ability to use the resulting predictive model for scoring data in any external database through a standard protocol, OLE DB. This precisely serves the needs of a modern database marketer, since a customer list sorted by propensity to respond can be integrated seamlessly into a marketing automation solution.
Figure 1. PolyAnalyst main window with several reports and graphs open.
Using PolyAnalyst, we analyzed the database of known customers to create a predictive model of companies that purchase broadband products. As a first step, all obtained data records went through a comprehensive quality analysis. The system checked for duplicate records, which were then removed. Missing fields were appended, and additional data of potential importance was added (for example, we added geo-coding information to create an exact "wire distance" field that pinpoints how far a business is located from a telecommunications central office). This data preparation took about two days of work.
The usual strategy for data analysis is the following: data understanding, data transformation and aggregation, important attribute selection, and then comprehensive predictive modeling. PolyAnalyst offers capabilities for addressing all these tasks. The Summary Statistics exploration engine delivers a first quick insight into the data structure, helps the analyst understand many peculiar features of the data, and helps work out the best plan of attack on the problem. The next step of the project was to transform the data into a form suitable for analysis by aggregating values of some variables and replacing some original variables with more predictive combinations of them. PolyAnalyst provides a Visual Rule Assistant that dramatically simplified these manipulations. Then the data was analyzed with the help of machine learning algorithms. A joint application of several PolyAnalyst modeling algorithms provided the best results for this particular application.
A preliminary analysis involving statistical preprocessing of data with the Find Dependencies exploration engine resulted in selecting the 12 attributes deemed most predictive out of the original 53. Then models capable of accurately predicting future purchase decisions were obtained by running Clustering and Classification analysis powered by the Neural Network algorithm, and by running the Decision Tree algorithm.
The Classify exploration engine utilizes fuzzy logic to develop a continuous function modeling the probability that a record represents a buyer or a non-buyer, and selects a threshold that minimizes the number of incorrect classifications. PolyNet Predictor, a Neural Network algorithm chosen to power classification in this case, allowed for the fast production of a viable model. After about an hour of perfecting the model, the system found a classification rule that predicts, with 81 percent accuracy, whether a potential customer will be a buyer, based on seven independent attributes. This model was successful in predicting buyers on testing data, which were not used for training, with about the same accuracy. The only notable drawback of the classification carried out with the help of the Neural Network is that this algorithm does not output the model in a format that can be comprehended by a human analyst. In other words, the model predicts which prospects are the most probable buyers of the product, but it does not provide the "why" answer.
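The threshold selection described above can be illustrated with a minimal sketch. It is not PolyAnalyst's implementation, which the vendor does not describe here; the function name `best_threshold` and the sample scores and labels are hypothetical. It simply tries each observed score as a cutoff and keeps the one that yields the fewest misclassified records.

```python
def best_threshold(scores, labels):
    """Return (threshold, errors) that minimizes misclassifications.

    A record is predicted to be a buyer when its score >= threshold.
    This is an illustrative brute-force search, not the vendor's method.
    """
    # Every observed score is a candidate cutoff; infinity means "predict no buyers".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_err = None, len(labels) + 1
    for t in candidates:
        errors = sum((s >= t) != bool(y) for s, y in zip(scores, labels))
        if errors < best_err:  # strict comparison keeps the lowest tied cutoff
            best_t, best_err = t, errors
    return best_t, best_err


# Hypothetical model outputs (probability-like scores) and true labels (1 = buyer).
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

threshold, errors = best_threshold(scores, labels)
accuracy = 1 - errors / len(labels)
print(threshold, errors, round(accuracy, 3))
```

On this toy data the search settles on a cutoff of 0.40 with one misclassified record. In practice the cutoff would be chosen on training data and then checked on held-out testing data, as in the project described here.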
The Decision Tree algorithm helped uncover the combination of features that distinguished a buyer. The PolyAnalyst Decision Tree algorithm is based on the Information Gain criterion of Shannon's Information Theory. While providing results with only a slightly lower accuracy (about 79 percent) on the same data, the Decision Tree algorithm created an explicit model within 30 minutes, twice as fast as the Classify algorithm. The model suggested that one could reliably predict buyers using just six attributes. When applied to the testing data set, the resulting model produced about the same accuracy as the Classify algorithm model. The PolyAnalyst Decision Tree engine delivers a convenient report with a visual representation of the elaborate classification rule. Less populated nodes of the tree are shown in a fainter color, which allows the database marketer to immediately identify the most important branches of the decision tree (Figure 2).
Figure 2. Decision Tree report specifies a combination of features that distinguishes buyers from non-buyers.
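For readers unfamiliar with the Information Gain criterion that decision trees of this kind rely on, the following sketch shows the textbook calculation: the entropy of the buyer/non-buyer label before a split, minus the weighted entropy after splitting on one attribute. The attribute names, the records and the helper functions are hypothetical, and the code is not taken from PolyAnalyst.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction in `target` obtained by splitting `rows` on `attribute`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the child nodes produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - remainder


# Hypothetical business records: company size separates buyers perfectly, region does not.
rows = [
    {"size": "large", "region": "west", "buyer": 1},
    {"size": "large", "region": "east", "buyer": 1},
    {"size": "small", "region": "west", "buyer": 0},
    {"size": "small", "region": "east", "buyer": 0},
]

gains = {a: information_gain(rows, a, "buyer") for a in ("size", "region")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

A tree builder would split on the attribute with the highest gain (here, the hypothetical `size` field) and repeat the calculation within each branch.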
The business value of the discovered model can be readily visualized with the help of Lift and Gain charts (Figure 3). The Lift chart evaluates the benefits of performing a model-based versus random marketing campaign. It demonstrates what percentage of potential responders would be reached by contacting only a portion of the target population according to the derived predictive model against random mailing. The Gain chart illustrates the dependence of dollar-based profit on the number of model-suggested prospects contacted. It allows the company to optimize the number of prospects contacted to achieve a balance between the maximum profit and exposure. For a Gain chart, the cost per contact, profit per response and maximum number of prospects for a marketing campaign have to be specified by the user. For the selected parameters, the predicted profit peaks when roughly 50 percent of the best prospects are targeted, as can be seen on the graph.
Figure 3. Lift and Gain charts clearly demonstrate where profit is maximized.
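The quantities behind charts of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than the PolyAnalyst calculation: prospects are ranked by a hypothetical model score, and for each campaign size the code records the share of all responders reached and the profit given a user-supplied cost per contact and profit per response. All numbers are invented for the example.

```python
def gain_curve(scores, responded, cost_per_contact, profit_per_response):
    """For each campaign size n, return (n, share of responders reached, profit)
    when prospects are contacted in order of decreasing model score."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    total_responders = sum(responded)
    curve = []
    hits = 0
    for n, (_, r) in enumerate(ranked, start=1):
        hits += r
        captured = hits / total_responders
        profit = hits * profit_per_response - n * cost_per_contact
        curve.append((n, captured, profit))
    return curve


# Hypothetical scored prospects (1 = responded) and campaign economics.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
responded = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

curve = gain_curve(scores, responded, cost_per_contact=4, profit_per_response=10)
best = max(curve, key=lambda row: row[2])  # campaign size with the highest profit

n, captured, profit = best
lift = captured / (n / len(scores))  # responders reached vs. a random campaign of the same size
print(best, round(lift, 3))
```

With these invented figures, profit peaks when the four best-scored prospects are contacted: that campaign reaches 75 percent of the responders, where a random selection of the same size would be expected to reach 40 percent.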
The real business value of the obtained model is revealed when the model is applied to score the main pool of prospect data, often containing millions of records, and the Predicted Responders field is used to better target the direct marketing campaign. While not related to data mining per se, this step is an unavoidable headache for the user: in most data mining tools it requires dedicated model export and external integration work. PolyAnalyst offers an elegant solution to this problem by allowing the user to directly score data in an external database through a standard protocol, OLE DB. This feature is available because PolyAnalyst supports the newest standard in analytical software, OLE DB for Data Mining. Also, some users might be interested in exporting the resulting predictive models themselves in XML and then utilizing these models in the form of business rules for their decision support or CRM applications (Figure 4). PolyAnalyst supports exporting models to PMML, the Predictive Model Markup Language, a flavor of XML used by data miners.
Figure 4. Fragment of Decision Tree exported by PolyAnalyst in XML format.
The final models were utilized to score the bulk of data about prospects in the main database. Only those prospects that had been predicted as the most probable responders were contacted with a promotion for the new product. The number of prospects contacted was chosen so that the profit predicted by the Gain chart would be maximized. This targeted marketing campaign resulted in a response rate that was significantly higher than the standard 1 percent to 2 percent. The accuracy of the analysis and the speed at which records can be analyzed provided us with a service that other data providers are unable to offer. To put a specific ROI on the implementation: in just one market, which the client had originally dismissed as unfruitful, changes to the business plan based upon the results of response prediction analysis resulted in a projected increase in revenue opportunity of $9.2 million over 5 years. Broadband companies that use data mining services could conceivably realize an increase of between 10 percent and 40 percent in results from sales and marketing campaigns because they can target the best prospects with high precision.
Building on the success of the project, we created a generic integrated solution for building predictive heuristic models. The solution involves an RDBMS for storing customer data, Oracle 8, and a data mining tool, PolyAnalyst 4, for building explicit predictive models. The solution has been in full production since May 1999. Technically it is implemented on two servers. One, a Sun Netra T with a 440 MHz processor, 512 MB RAM and a 60 GB mirrored disk, runs an Oracle database of business records that are matched for analysis. The other, a Compaq 1850 R with a 550 MHz Pentium III and an 18 GB mirrored disk, runs PolyAnalyst software for predictive modeling under Windows NT. (Information on other platforms available for PolyAnalyst can be found on the company Web site at http://www.megaputer.com.) The system is currently designed for two concurrent users and may be expanded to handle multiple project requests simultaneously, with batch processing of the analysis. It took six months to implement the complete solution, and the project team consists of a mathematician, a systems design engineer, a client requirements manager, a lead developer and two Oracle programmers.
PolyAnalyst is the key element of the solution because it helps to find patterns and relationships hidden in data and to identify the best and most profitable prospects, those that should be contacted first with promotions. Contacting only a fraction of prospects (those most likely to purchase) results in lower direct marketing expenses and a better response rate, thereby increasing profit.
Carl Cozine (firstname.lastname@example.org), CEO of Cactus Strategies, has 20 years of automation engineering and telecommunications experience. Cozine holds degrees from Georgia Institute of Technology and Tulane University.
OR/MS Today copyright © 2001 by the Institute for Operations Research and the Management Sciences. All rights reserved.