February 1996 • Volume 23 • No. 1

Number Crunching: 1996 Statistics Survey

Explosive growth in computing power fuels increased use of methodology

By James J. Swain

Editor's note: The following article updates a similar statistics software survey published in the October 1994 issue of OR/MS Today. Data for the current survey was collected and compiled by OR/MS Today Managing Editor David Greenfield from December 1995 to January 1996.

Ours is a quantitative, problem-solving field in which statistical methods are an important part of many projects. Statistics is used to summarize data, detect and estimate relations among variables, and test hypotheses. From the very beginning, operations researchers collected and analyzed data using statistics to understand processes, to build and to validate their models, and to develop appropriate inputs for use in optimization and simulation models. Periodic surveys have consistently shown that data analysis is a perennial activity among OR/MS professionals, and statistics has long been an integral part of the curriculum. As a recent instance, the Committee for the Review of the OR/MS Master's Degree Curriculum (OR/MS Today, Feb 1993) included two semesters of probability and statistics in the proposed curriculum. In fact, service courses for statistics are often offered by the operations research, management science, quantitative methods and industrial engineering departments in which our other courses are taught.

The explosive growth of our field is largely paralleled by the increase in use of statistical methodology, and both were substantially assisted by the growth in computing power and the availability of software to perform routine computation. Computers have not only made computations easier, so that such commonplace techniques as regression and analysis of variance (ANOVA) can be conveniently performed, but have also taken a major role in the generation, collection and management of the data itself. In the earliest days, data was collected and processed by hand or via punched cards. Data can now be obtained by the computer from other stored sources, captured via sensing equipment connected to the computer (e.g., hand-held bar code readers), or collated from remote computers at a central source. Commercial operations in transportation, telecommunications, marketing and retail may generate tens of thousands of observations to draw upon. Availability of computing power has also led to the increased use of simulation models and process improvement tools, such as SPC, TQM, Taguchi methods, and the design of experiments, and these have, in turn, further increased the need for statistical analysis.

Statistical software to aid the OR/MS professional is widely available. As this OR/MS Today survey of statistical software demonstrates, there are many products for the PC, Macintosh and workstations in a range of prices and capabilities. Product information for the survey has been supplied by vendors from a list compiled from reader suggestions, advertisers and prior surveys. While not an exhaustive list, these products are representative of the wide range of choices available today. These programs permit analysts to visually examine the data and pursue different approaches in analysis through the use of on-screen graphics and interactive analysis. Typically, these programs can construct histograms and other descriptive plots, and perform basic statistical tasks such as tests on means, one- and two-way tables, analysis of variance (ANOVA), and linear regression.

Given the large number of general-purpose programs now available, is a statistics package needed for basic analysis? Most OR/MS users will already have a word processor, spreadsheet and communications program, and very likely presentation and database software, plus special-purpose software for simulation, math programming and so on. For basic statistical analysis, spreadsheet users may not even require a separate statistical analysis program. Spreadsheets increasingly include graphical and statistical features, such as confidence intervals, statistical tests, ANOVA, and linear regression. When more detailed or specialized analysis is required, many statistical programs can import data directly from the spreadsheet or copy the data through the "clipboard."

Improvements in programs and operating systems also mean that graphics from statistical programs can readily be imported into word processing and presentation software. Likewise, symbolic algebraic programs (such as Mathematica) and numerical processing programs (such as Gauss and Matlab) have extensive statistical capabilities, sometimes in the form of application modules. While not always as easy to use as a statistical program, they generally have more flexibility and often more versatile graphics available.

In this article we take the view that a statistics program will be used by analysts or students to supplement other activities. Like all software, statistical software should be easy to install and use. Because data arises from a variety of sources and is stored in varying formats, statistics programs should be able to import data from as many formats as possible, and edit and transform data once it is acquired. Good products should provide: a variety of ways to display or view data, classical and some nonparametric procedures, linear regression and ANOVA. Other features that are often useful include sampling from distributions (for Monte Carlo sampling), statistical process control, multivariate statistics, and design of experiments. The vendor should have a development staff that includes statisticians as well as programmers.

Statistical software is a mature field; for instance, the SPSS and BMDP programs have roots that date back several decades. Improvements in operating systems have meant that command line or batch processing has been replaced by interactive interfaces. Using and learning these programs has never been easier. Not only do many products offer on-line help and tutorials, but many of the programs have readable documentation, and a number of the programs are widely featured as illustrations in statistics texts (e.g., SAS and Minitab) and in third party texts devoted to these products. Many of the latter feature student versions of the software and data sets on diskettes included with the text. The data sets are discussed in the text, and can be accessed by students as they use the software.

Formatting Woes
These days, data arises from many sources, such as databases, spreadsheets, or CD-ROMs, and experimental data may be monitored and collected directly by computer. Data sets can be very large and stored in various formats. Having data that cannot be readily analyzed is extremely frustrating, so it is important that these data can be imported into the analysis package with a minimum of additional effort. Most statistical packages support ASCII (or plain text) input and import from spreadsheet and database formats. In addition, programs such as DBMS/COPY from Conceptual Software and Stat/Transfer from Circle Consulting provide data conversion between formats used by various statistical software programs, spreadsheets and databases. Once the data is captured, it should be easy to edit and manipulate as part of the analysis. The program should also be able to handle missing data.
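
As a minimal sketch of what such an import involves, the Python fragment below reads comma-delimited ASCII text and marks empty or coded fields as missing before computing a mean. The sample text, the column names and the missing-value codes (a blank field and "NA") are illustrative assumptions, not taken from any product in the survey.

```python
import csv
import io

# Hypothetical ASCII export; the blank field and "NA" stand for missing values.
RAW_TEXT = """batch,temperature,yield
1,150,72.5
2,,68.1
3,160,NA
4,155,74.0
"""

MISSING_CODES = {"", "NA"}  # assumed conventions for missing data


def read_ascii_table(text):
    """Parse delimited text into a dict of columns, using None for missing values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, field in row.items():
            field = field.strip()
            columns[name].append(None if field in MISSING_CODES else float(field))
    return columns


def mean_of_observed(values):
    """Average the non-missing entries and report how many were used."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed), len(observed)


table = read_ascii_table(RAW_TEXT)
avg_yield, n_used = mean_of_observed(table["yield"])
print(f"yield: mean of {n_used} observed values = {avg_yield:.2f}")
```

Keeping missing entries as explicit markers, rather than silently dropping rows, lets each analysis decide how to treat them.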

While many people associate statistics with the procedures they learned in their introductory courses, statistics is the search for meaning within observations over time or condition or for relations between variables. To aid in this search, the modern trend (as typified by Exploratory Data Analysis or EDA) is to graphically examine data from a variety of perspectives. The stem-and-leaf plot (a quick version of a histogram) and box plot (or box-and-whisker plot) are suitable for a quick summary of one or more variables. For instance, several different treatments can be compared using side-by-side box plots. These plots would provide, at a glance, the likely outcome of an ANOVA and allow a rough confirmation that assumptions are being met.
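
The quantities behind a side-by-side box plot can be sketched in a few lines of Python. The three treatment samples below are invented for illustration, and the quartile convention (the "inclusive" method of the standard `statistics` module) is one assumption among several in common use; statistical packages may compute quartiles slightly differently.

```python
import statistics

# Hypothetical response measurements under three treatments (illustrative only).
treatments = {
    "A": [12.1, 13.4, 11.8, 12.9, 13.0, 12.5, 14.2],
    "B": [15.0, 14.1, 16.3, 15.8, 14.9, 15.5, 16.0],
    "C": [12.8, 13.1, 12.2, 19.5, 12.9, 13.3, 12.6],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


summaries = {name: five_number_summary(v) for name, v in treatments.items()}

for name, (low, q1, median, q3, high) in summaries.items():
    print(f"{name}: min={low:.1f} Q1={q1:.2f} median={median:.2f} "
          f"Q3={q3:.2f} max={high:.1f} IQR={q3 - q1:.2f}")
```

In this made-up data the boxes for A and B do not overlap, which is the kind of visual cue that anticipates an ANOVA result, while the maximum for C lies far above its third quartile, the kind of outlier that would call the usual assumptions into question.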

For multivariate data, scatter plots and their higher dimensional equivalents can aid the search for likely relationships. The famous Anscombe data sets ("Graphics in statistical analysis," American Statistician 27, 1973: pp. 17-21) illustrate the danger of relying on summary statistics alone. These four data sets have nearly identical summary statistics (means, variances, correlation and fitted regression line), but widely differing interpretations that are readily apparent from simple scatter plots. Likewise, Tufte's "The Visual Display of Quantitative Information" has further reinforced the power of graphical representations of data.
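
The point can be checked numerically. The sketch below uses the four data sets from the Anscombe paper cited above and computes, with hand-coded least squares, the mean of y, the fitted line and the correlation for each; the function and variable names are illustrative choices.

```python
import math

# Anscombe's four data sets, from the 1973 paper cited above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean of y, least-squares slope and intercept, and correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return mean_y, slope, intercept, r


results = {label: summarize(x, y) for label, (x, y) in QUARTET.items()}

for label, (mean_y, slope, intercept, r) in results.items():
    print(f"{label:>3}: mean(y)={mean_y:.2f}  y = {intercept:.2f} + {slope:.3f} x  r={r:.3f}")
```

All four rows agree to about two decimal places, yet a scatter plot of each set shows a different picture: a roughly linear cloud, a smooth curve, a line with one outlier, and a vertical stack with a single influential point.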

Good statistical software should include nonparametric procedures to augment the standard techniques (t-tests and ANOVA, for instance) that are based on parametric families such as the normal distribution. Nonparametric procedures are often based upon variations of the sign test or on ranks. They generally require fewer assumptions than parametric procedures and are often less sensitive to misspecification of assumptions. Parametric assumptions can be evaluated through the use of probability plots, which are useful diagnostic tools.
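
As one illustration of a procedure built on signs, here is a minimal sketch of the exact two-sided sign test for paired data. The before and after measurements are invented, and dropping tied pairs is a common convention assumed here rather than the only possible treatment.

```python
from math import comb

# Hypothetical paired measurements; the values are illustrative only.
before = [72, 68, 75, 80, 66, 71, 77, 69, 74, 70]
after = [75, 70, 79, 80, 69, 70, 81, 73, 78, 72]


def sign_test(x, y):
    """Exact two-sided sign test for paired samples; tied pairs are dropped."""
    diffs = [b - a for a, b in zip(x, y) if b != a]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    k = min(n_pos, n - n_pos)
    # Under the null hypothesis the count of positive signs is Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, n, min(1.0, 2 * tail)


n_pos, n, p_value = sign_test(before, after)
print(f"{n_pos} of {n} non-tied differences are positive; two-sided p = {p_value:.4f}")
```

Only the direction of each difference enters the calculation, which is why the test needs no assumption about the shape of the underlying distribution.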

One of the most common activities of data analysis is curve fitting by linear regression. This is best known as the fitting of straight lines between two variables (so-called simple linear regression), but through transformations, polynomial and transcendental functions can be fitted, and more than one predictor can be included. This is a tool of immense power, which can be used for empirical summarization, as an approach for determining relations between variables in a process, or as a method of eliminating trends which obscure an analysis, as Fisher did when he regressed out fertility trends in the multiyear data taken at Rothamsted to sharpen comparisons between methods of crop treatments. These procedures were always limited by hand computation, and statistical software has made regression available for common use. Software not only makes it possible to perform these analyses, but also provides graphical and diagnostic statistics that can be used to quickly and interactively guide model building.
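
To show how a transformation brings a transcendental function within reach of a straight-line fit, the sketch below fits the assumed model y = a·exp(b·x) by regressing ln y on x. The data are generated without noise from a = 2 and b = 0.3, values chosen purely so that the recovery can be checked; real data would of course scatter around the curve.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical noise-free data following y = 2 * exp(0.3 * x).
x = [0, 1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.3 * xi) for xi in x]

# Taking logs turns the exponential curve into the line ln y = ln a + b * x.
log_y = [math.log(yi) for yi in y]
intercept, slope = fit_line(x, log_y)
a_hat, b_hat = math.exp(intercept), slope

print(f"fitted model: y = {a_hat:.3f} * exp({b_hat:.3f} * x)")
```

The same helper fits a power law if both variables are logged. Note that least squares on the transformed scale weights errors differently than a direct fit on the original scale would.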

Other Features
Statistical programs vary in the number of statistical procedures they contain. For many users, the basic features described already should be sufficient for most applications. Additional features that are useful include Monte Carlo sampling, statistical process control (SPC), forecasting, and multivariate statistics.

With the increasing interest in SPC, having those features broadens the usefulness of the software, and students can certainly use the software in more than one course. Likewise, software for performing time series (forecasting) analysis is often useful.

Statistical software is increasingly including options for multivariate statistics -- the study of relations among several variables or among different attributes within records. For instance, a common marketing and demography problem is to characterize subgroups from the general population. To target advertising, the marketer needs to determine attributes, such as age, income, geographic region, or interests associated with a particular product or service. Multivariate clustering and discrimination methods are used for this purpose. Graphics that make it possible to visualize multivariate relations are also helpful.

Monte Carlo sampling is particularly valuable when software is being used in conjunction with a statistics course, since it provides a way for students to observe the range of variability that they will encounter in practice. Monte Carlo sampling is also useful to test the sensitivity of a procedure to assumptions about distributions or to build insight into how various statistics might perform under different assumptions. The Resampling Statistics program is particularly suited to this kind of analysis, either by resampling from a particular set of data (also called bootstrapping) or from standard statistical distributions.
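
Resampling from a particular data set can be sketched with nothing more than a random number generator. The fragment below is a generic percentile bootstrap for the mean, not the method of any particular product; the sample values, the number of resamples, the 90 percent level and the seed are all illustrative assumptions.

```python
import random
import statistics

# Hypothetical sample of service times in minutes (illustrative only).
sample = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 7.2, 3.9, 5.0, 4.7, 6.3]


def bootstrap_mean_interval(data, n_resamples=2000, level=0.90, seed=1996):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Each resample draws len(data) values from the data with replacement.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


lower, upper = bootstrap_mean_interval(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"approximate 90% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

Replacing `statistics.fmean` with the median or another statistic gives students a direct view of how variable that statistic is, without any distributional formula.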

NOTE: A detailed listing of numerous statistical software packages is printed in an easily cross-referenced table in the February issue of OR/MS Today. If you are interested in obtaining a copy, contact Nora Craver at Lionheart Publishing Inc. -- Phone: (770) 431-0867 ext. 201; Fax: (770) 432-6969; E-mail: nora@lionhrtpub.com.

James J. Swain is associate professor of ISE at the University of Alabama in Huntsville. His technical interests include applied statistics and simulation.

OR/MS Today copyright 1997, 1998 by the Institute for Operations Research and the Management Sciences. All rights reserved.
