OR/MS Today - June 2002
Fitting continuous and discrete distributions to data
By Lawrence M. Leemis
Stat::Fit, Version 2 is a PC-based software package that fits discrete and continuous univariate distributions to data. The package is intuitive and requires little or no documentation for a user familiar with probability and statistics, although online and print documentation comes with the software. Its main application is in fitting parametric distributions of various stochastic elements (e.g., inter-arrival and service times in a single-server queue) in a discrete-event simulation model. Its simulation orientation is particularly apparent in its Export capability, where its parametric distribution fits can be exported in a format that is consistent with the parameterizations in variate generation algorithms for 26 different simulation packages.
Other potential users include reliability engineers, biostatisticians interested in survival analysis, operations researchers analyzing queueing or inventory models, financial analysts and statisticians. Stat::Fit was originally released in 1995. Version 2 was released in November of 2001. The developer (John Mauer) intends to support a Java version of the software within the next two years.
System Requirements, Documentation and Installation
According to the vendor, running Stat::Fit on a PC requires Windows 95 or a more recent version of the operating system. The program requires 4 megabytes of RAM to run and an additional 12 megabytes on a hard drive to store the manual. I loaded Stat::Fit onto a Windows 98 and a Windows 2000 machine without any difficulty. The software is not presently available on Linux or Mac platforms. The 177-page spiral-bound manual is well written but contains a few minor typographical errors.
The graphical user interface (GUI) has a toolbar at the top of the screen that contains icons associated with popular functions. I preferred using the drop-down menu options, which perform the same functions. The status bar at the bottom of the screen shows an expanded description of the action associated with the current cursor position. Both the toolbar and the status bar can be hidden by using the View drop-down menu, which will save about 5 percent of the workspace in the main window.
The titles of the nine drop-down menus from left to right are: File, Edit, Input, Statistics, Fit, Utilities, View, Window, Help. The File menu allows the user to read data sets from files, write data sets to files, export fitted distributions for a particular discrete-event simulation package and print. The Edit menu contains the standard cut, copy and paste utilities. The Input menu is used to transform the current data values or to generate data sets via Monte Carlo simulation. The Statistics menu displays descriptive statistics, plots the autocorrelation function, and runs statistical tests to assess independence. The Fit menu contains options to estimate parameters associated with 25 common distributions via maximum likelihood or the method of moments. This menu also displays graphics (e.g., a histogram with the fitted probability density function overlaid, a Box plot, or a PP plot) associated with the fit and performs goodness-of-fit tests. The Utilities menu contains a facility that determines the number of simulation replications necessary to achieve a particular accuracy and a facility that plots probability density functions and probability mass functions for 32 common distributions. The View menu hides or displays the toolbar and status bar. The Window menu organizes various windows that are open, and the Help menu enters the online documentation.
I will organize my description of Stat::Fit's key features in an order that is likely to be used by a modeler.
Reading in data. There are four ways to read data into Stat::Fit. First, the data can be keyed into the Data Table window that pops up at the beginning of every session. This is perfectly adequate for small data sets. Second, data can be read from a file. Third, data can be brought into Stat::Fit through the clipboard. Finally, data can be generated using the Input menu, which is a convenient way for a novice to get a sample data set in order to experiment with Stat::Fit.
Data manipulation. There is a reasonable list of options for transforming and filtering the data. These include adding a constant, multiplying by a constant, taking logarithms, etc. If a transformation is desired that is not on the list (e.g., arcsin), it must be performed prior to being read into Stat::Fit. Histogram cell values can be determined automatically using known formulas or can be set manually.
Descriptive statistics. The standard suite of descriptive statistics, such as the sample mean, minimum, maximum, mode, skewness and kurtosis, is calculated and displayed. In addition, a scatter plot of adjacent observations and the autocorrelation function can be displayed, and runs tests can be performed in order to assess independence.
Fitting distributions. Fitting distributions begins in a Setup Calculations dialog box, where distributions are selected for fitting. In addition, the choice between the method of moments and maximum likelihood estimation procedures for parametric distributions must be made in this dialog box. Finally, the user specifies whether the lower bound on the distribution is unknown (and hence computed by the fitter), or a fixed constant (default 0). Since there seems to be no standard example from the discrete-event simulation literature, I keyed in the n = 23 oft-analyzed ball bearing failure times (in millions of revolutions) from Lieblein and Zelen (1956): 17.88, 28.92, 33.00, 41.52, 42.12, 45.60, 48.48, 51.84, 51.96, 54.12, 55.56, 67.80, 68.64, 68.64, 68.88, 84.12, 93.12, 98.64, 105.12, 105.84, 127.92, 128.04, 173.40.
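As a rough illustration of what a maximum likelihood fit involves for this data set, the following Python sketch computes the exponential and two-parameter Weibull MLEs with the lower bound fixed at 0. It is not Stat::Fit's algorithm, which the review does not describe; the function names and the bisection bracket of 0.1 to 10 for the shape parameter are assumptions made for the example.

```python
import math

# Ball bearing failure times (millions of revolutions) from Lieblein and Zelen (1956).
DATA = [17.88, 28.92, 33.00, 41.52, 42.12, 45.60, 48.48, 51.84, 51.96,
        54.12, 55.56, 67.80, 68.64, 68.64, 68.88, 84.12, 93.12, 98.64,
        105.12, 105.84, 127.92, 128.04, 173.40]


def exponential_mle(xs):
    """The MLE of the exponential mean is the sample mean."""
    return sum(xs) / len(xs)


def weibull_mle(xs, lo=0.1, hi=10.0, tol=1e-10):
    """Weibull MLE (lower bound fixed at 0) via bisection on the shape parameter."""
    n = len(xs)
    mean_log = sum(math.log(x) for x in xs) / n

    def score(k):
        # Likelihood equation for the shape k; it is increasing in k
        # and its root is the maximum likelihood estimate.
        num = sum(x ** k * math.log(x) for x in xs)
        den = sum(x ** k for x in xs)
        return num / den - 1.0 / k - mean_log

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    # Given the shape, the scale has a closed form.
    scale = (sum(x ** shape for x in xs) / n) ** (1.0 / shape)
    return shape, scale


EXP_MEAN = exponential_mle(DATA)
SHAPE, SCALE = weibull_mle(DATA)
print(f"exponential mean: {EXP_MEAN:.2f}")
print(f"Weibull shape: {SHAPE:.2f}, scale: {SCALE:.1f}")
```

A fitted shape parameter above 1 corresponds to an increasing failure rate, which is what one would expect for items that wear out.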
The graphical output associated with using Stat::Fit to fit an exponential and a Weibull distribution to this data set is given in Figure 1. The empirical and fitted cumulative distribution functions show the superiority of the Weibull distribution over the exponential, consistent with the fact that the ball bearings are wearing out. This is confirmed in the Box plots for the empirical and fitted distributions. The fitting program runs very quickly: all of the distributions were fit to the data in a fraction of a second.
Figure 1: Fitted distribution and Box plots.
Assessing model adequacy. The adequacy of the model can be assessed using the Goodness of Fit option from the Fit menu. The distributions are displayed alphabetically, along with their Kolmogorov-Smirnov statistics (and, optionally, chi-square and Anderson-Darling statistics). This output format differs from that of the Auto::Fit option, which ranks all of the fits based on a goodness-of-fit criterion.
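For readers who want to see how a Kolmogorov-Smirnov statistic is formed, here is a minimal Python sketch for the ball bearing data. The Weibull parameters (scale 81.9, shape 2.1) are the rounded values quoted in this review, and the exponential mean of 72.22 is the sample mean of the data; the function names are hypothetical, and the code is not taken from Stat::Fit.

```python
import math

# Ball bearing failure times (millions of revolutions) from Lieblein and Zelen (1956).
DATA = [17.88, 28.92, 33.00, 41.52, 42.12, 45.60, 48.48, 51.84, 51.96,
        54.12, 55.56, 67.80, 68.64, 68.64, 68.88, 84.12, 93.12, 98.64,
        105.12, 105.84, 127.92, 128.04, 173.40]


def ks_statistic(xs, cdf):
    """Largest vertical gap between the empirical CDF and a fitted CDF."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF steps from i/n to (i + 1)/n at x, so check both sides.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d


def exponential_cdf(x, mean=72.22):
    # Assumed fit: exponential with mean equal to the sample mean.
    return 1.0 - math.exp(-x / mean)


def weibull_cdf(x, scale=81.9, shape=2.1):
    # Assumed fit: rounded Weibull parameters quoted in the review.
    return 1.0 - math.exp(-((x / scale) ** shape))


D_EXP = ks_statistic(DATA, exponential_cdf)
D_WEIB = ks_statistic(DATA, weibull_cdf)
print(f"K-S statistic, exponential: {D_EXP:.3f}")
print(f"K-S statistic, Weibull:     {D_WEIB:.3f}")
```

A smaller statistic indicates a fitted distribution function that tracks the empirical one more closely.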
Exporting. Stat::Fit's ability to export distributions to various simulation packages is a major convenience factor for discrete-event simulation applications. The Weibull fit to the ball bearing failure times above, for example, is translated to WEIB(81.9, 2.1) for use in ARENA and to W(2.1, 81.9) for use in ProModel. This same fit is translated to [[VSWeibull RVStream new] initializeWithSeed:1234567* [replication number]location:0. scale:81.9 shape:2.1]; in ORCA/VSE. Keeping track of the format names, parameterizations and order of parameters in all 26 of these discrete-event simulation packages is a time-consuming task that is made transparent by the Export feature.
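The parameter-order problem is easy to see in a small sketch. The two Python functions below are hypothetical helpers, not part of Stat::Fit; they reproduce only the ARENA and ProModel strings quoted above, with the scale listed first in one and the shape listed first in the other.

```python
def arena_weibull(scale, shape):
    """ARENA-style string as quoted in the review: scale first, then shape."""
    return f"WEIB({scale:.1f}, {shape:.1f})"


def promodel_weibull(scale, shape):
    """ProModel-style string as quoted in the review: shape first, then scale."""
    return f"W({shape:.1f}, {scale:.1f})"


ARENA = arena_weibull(scale=81.9, shape=2.1)
PROMODEL = promodel_weibull(scale=81.9, shape=2.1)
print(ARENA)
print(PROMODEL)
```

Passing the parameters by keyword, as in the usage lines, is one simple guard against swapping the two values by hand.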
Replications. This was the only portion of the software that required me to work a bit to determine what the software was doing. A call to support cleared up my confusion. The replications utility is used to determine the appropriate number of simulation replications that should be run in order to achieve a desired precision of a point estimate. This facility is unrelated to the data set being analyzed. I tried the parameters given in Example 11.17 in Banks, Carson, Nelson and Nicol (2001), where the measure of performance of interest is the steady-state mean queue length of an M/G/1 queue. Ten initial pilot runs give an initial estimate of the standard deviation of 5.03 customers. I used the utility to determine the total number of replications needed to achieve a confidence interval width of four customers with 90 percent confidence, and it returned 19 replications, consistent with the textbook example.
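A common way to approach this calculation, sketched below in Python, is to start from a normal-quantile estimate of the required number of replications. The sketch assumes that a confidence interval width of four customers means a half-width of two; the function name is hypothetical, and the review does not document the method Stat::Fit uses internally.

```python
import math
from statistics import NormalDist


def initial_replications(std_dev, half_width, confidence):
    """Normal-approximation starting value for the number of replications."""
    alpha = 1.0 - confidence
    # Two-sided interval: use the upper alpha/2 quantile of the standard normal.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * std_dev / half_width) ** 2)


# Pilot-run standard deviation of 5.03 customers, half-width of 2, 90 percent confidence.
R_START = initial_replications(std_dev=5.03, half_width=2.0, confidence=0.90)
print(R_START)
```

This starting value is a lower bound only. Replacing the normal quantile with a Student's t quantile on R - 1 degrees of freedom, and increasing R until the inequality holds, gives a somewhat larger answer, which is in line with the 19 replications reported above.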
Distribution Viewer. This feature, shown in Figure 2 for a beta distribution with support on (0, 1) with parameters 1.5 and 3, should be a hit with academics assigned to teach a first course in probability and statistics. A plot of the probability density function for a continuous distribution or the probability mass function for a discrete random variable is displayed with slider bars for altering the values of the parameters. This allows an analyst or a student to experiment with the parameters to see their effects on the shape of the distribution. An instructor might consider purchasing the package for this one feature alone.
Figure 2: Distribution Viewer.
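The curve in the viewer can be approximated outside the package with a few lines of Python. The sketch below evaluates the beta probability density function on (0, 1) with parameters 1.5 and 3 at a handful of points; the function names and the choice of nine evaluation points are assumptions made for the example.

```python
import math


def beta_pdf(x, a, b):
    """Density of a beta(a, b) random variable with support on (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)


def tabulate(a, b, points=9):
    """Evaluate the density at equally spaced interior points of (0, 1)."""
    grid = [i / (points + 1) for i in range(1, points + 1)]
    return [(x, beta_pdf(x, a, b)) for x in grid]


TABLE = tabulate(1.5, 3.0)
for x, density in TABLE:
    print(f"{x:.1f}  {density:.3f}")
```

Changing the two arguments to tabulate plays the same role as moving the slider bars: for these parameter values the density peaks at (a - 1)/(a + b - 2) = 0.2 and is skewed to the right.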
Extensions to Stat::Fit
Stat::Fit is a package that could expand in capability in several different directions. Some of these extensions involve extended capabilities in discrete-event simulation applications, while others would expand the use of the software to other applications. I will divide my comments into major and minor modifications to the software.
Confidence regions. Stat::Fit presently calculates point estimates for parameters in common univariate distributions. Confidence intervals for one-parameter distributions and confidence regions for two-parameter distributions (Cox and Oakes, 1984, page 42 for the Weibull distribution) would give modelers an indication of the accuracy of their point estimates. These regions could be used in a sensitivity analysis after the discrete-event simulation model has been verified and validated.
Right-censored data. If the developers want to move into the reliability and biostatistics market, extending their algorithms to accommodate right-censored data sets is crucial.
Point processes. The ability to fit parametric models for point processes would allow analysts to model nonstationary probabilistic elements of a discrete-event model such as time-varying arrival streams. This major enhancement would, for example, allow an analyst to fit a nonhomogeneous Poisson process with a power law intensity function (e.g., Rigdon and Basu, 2000) or some of the more general intensity functions (e.g., Kuhl, Wilson, and Johnson, 1997).
Upper and lower limits on data set size. The software presently limits the data set size to 8,000 data values. In an application where data is collected automatically (e.g., database transactions), it is possible that this limit may be inadequate. On the other end of the spectrum, the package requires at least 10 data values in order to operate. I can easily envision situations where only limited data exists and this could irritate a user.
Descriptive statistics. I was surprised to see some of the standard descriptive statistics differing from what I expected to see. The coefficient of variation is not presently defined as the ratio of the sample standard deviation to the sample mean. Also, the mode value for the ball bearing data set given above was not equal to one of the data values. I would have preferred to see the autocorrelation function displayed as spikes at the lag values, as opposed to a piecewise linear function. The standard 95 percent confidence limits associated with the autocorrelation function would have been helpful as well.
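For reference, the conventional definitions mentioned here can be sketched in Python as follows. The coefficient of variation is the sample standard deviation divided by the sample mean, and the approximate 95 percent limits for the autocorrelation function are plus or minus 1.96 divided by the square root of n. The function names are hypothetical. The data are used in the sorted order in which they are listed in this review, so the lag-1 value only illustrates the arithmetic and says nothing about independence in the original experiment.

```python
import math
from statistics import mean, stdev

# Ball bearing failure times (millions of revolutions), in the sorted order listed above.
DATA = [17.88, 28.92, 33.00, 41.52, 42.12, 45.60, 48.48, 51.84, 51.96,
        54.12, 55.56, 67.80, 68.64, 68.64, 68.88, 84.12, 93.12, 98.64,
        105.12, 105.84, 127.92, 128.04, 173.40]


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the sample mean."""
    return stdev(xs) / mean(xs)


def autocorrelation(xs, lag):
    """Sample autocorrelation at the given lag."""
    n = len(xs)
    m = mean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag))
    return num / denom


CV = coefficient_of_variation(DATA)
LIMIT = 1.96 / math.sqrt(len(DATA))  # approximate 95 percent limits are +/- LIMIT
print(f"coefficient of variation: {CV:.3f}")
print(f"lag-1 autocorrelation: {autocorrelation(DATA, 1):.3f} (limits +/- {LIMIT:.3f})")
```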
Algorithms. Allowing a user to see the algorithms used would be helpful in some situations. The algorithm to find the Maximum Likelihood Estimators (MLEs) of the triangular distribution, for example, is not clear from the documentation. Computing the MLEs for the triangular distribution is a rather difficult computational problem, and placing the algorithm in the documentation would be helpful for more sophisticated users. For the ball bearing failure times given above, for example, the MLEs of the minimum, mode, and maximum are 9.754, 42.19 and 185.31 million revolutions.
Goodness of fit p-values. The p-values associated with goodness-of-fit tests are based on the all-parameters-known case, and are hence very approximate. Some accommodation for the fact that the parameters have been estimated from data would enhance the value of the software. Law and Kelton (2000) outline techniques for approximating the p-values for several distributions in the case where all parameters are estimated from data.
Selecting distributions for fitting. In most applications, it will be clear to the data analyst whether the data set is best modeled by a discrete or continuous distribution. The GUI in the Setup Calculations option presently allows a modeler to select distributions one at a time or all at once. Two more buttons for "Select all discrete distributions" and "Select all continuous distributions" in the fit option would be a convenient and useful addition. The Auto::Fit option presently divides the distributions into discrete and continuous classes, so an update of this nature should not be too difficult.
I was very pleased with the ease of use and capability of the Stat::Fit software. The documentation is adequate. The questions that I had were answered over the phone by the developer. Since a package of this magnitude could be worked on for a lifetime, the enhancement suggestions that I have made above are not intended to be a negative reflection on the current state of the package. It performs well for fitting parametric distributions to data at a very reasonable price. Its intuitive design and simple GUI allow an input model for a discrete-event simulation to be developed very rapidly.
Lawrence Leemis (email@example.com) is a professor and chair of the Mathematics Department at the College of William & Mary, where he teaches classes in his research areas of reliability and discrete-event simulation in the graduate COR (Computational Operations Research) specialization.
OR/MS Today copyright © 2002 by the Institute for Operations Research and the Management Sciences. All rights reserved.