OR/MS Today - October 2005



Software Review


WordStat 5.0

Content analysis software offers operations researchers valuable tool for exploring vast, untapped amounts of textual data.

By Byung-Gak Son


  
Product Information

WordStat is available for purchase through Provalis Research's Web site (www.provalisresearch.com). A fully functional trial version (30 days) of WordStat is available on the same Web page. To test the full functions of the software, you need to download the SimStat trial version. This Web page has links to a number of studies done using WordStat.

Pricing:
                                                     Retail    Academic
WordStat with Simstat 2.5                            $955      $475
WordStat with QDA Miner 1.0                          $1,095    $555
WordStat with Simstat & QDA Miner 1.0                $1,375    $725
WordStat with Simstat 2.5 & MVSP                     $1,195    $645
WordStat with Simstat 2.5 & MVSP & QDA Miner 1.0     $1,625    $885

 
Some years ago, I witnessed a social science Ph.D. student trying to determine whether the political campaign by the Labour Party in the 1997 British election imitated the Clinton campaign in the U.S. presidential election the year before. She counted the frequency and the co-occurrence of such words as "compassion" and "charisma" to describe leadership attributes by all the election TV advertisements for both Blair and Clinton. I was struck by the rigor of the analysis of such a vast amount of textual data.

Content analysis, according to Holsti (1969), is "any technique for making inferences by objectively and systematically identifying specified characteristics of messages" [1].

Is content analysis relevant to O.R. professionals, who are more familiar with traditional analytical methods such as simulation or linear programming? The answer is yes, since most of the information a company has is textual data in the form of e-mails, documents, reports, etc. Typically such textual information is unstructured. Therefore, extracting meaningful information for decision-making from data of such nature can be quite time-consuming and difficult. Exploring such untapped textual data could complement existing O.R. tools for operational improvement.

Many well-known companies use text analysis tools such as WordStat to assess how their products are perceived by the public or by clients. WordStat analyzes databases of customer feedback and e-mail messages sent to customers or technical support by looking at words that are closely associated with their products. Companies also try to identify different types of customers, their consumption habits, their needs, their complaints, etc. Another example of using content analysis appeared in an article by Sodhi and Son (2005) in the August 2005 issue of OR/MS Today. The authors did a basic content analysis to explore what kind of skills employers want from O.R. graduates. The analysis provides useful insights about the key skills employers want from O.R. graduates and provides the kind of quantitative output O.R. people are used to producing.

Content analysis is new territory for O.R. professionals, but we can get help from a tool like WordStat 5.0 from Provalis Research. WordStat is an add-on module for the statistical analysis package SimStat, which provides a statistical back end that O.R. professionals should find familiar. According to Provalis, WordStat is specially designed to study textual information such as responses to open-ended questions, interviews, titles, journal articles, electronic communications, etc. In this review, I focus on the basic features and the potential of this software.

Summary of Features


WordStat can perform analyses on text fields in various formats as well as on long documents. It can also preprocess text by reducing words to a canonical form (e.g., "Dogs" and "Doggy" to "Dog").
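
As a rough illustration of what reducing words to a canonical form means, here is a minimal Python sketch based on a substitution table. It is not WordStat's implementation; the table LEMMAS and the function to_canonical are hypothetical names chosen for this example.

```python
# Hypothetical substitution table; WordStat's own lemmatization rules
# are not reproduced here.
LEMMAS = {"dogs": "dog", "doggy": "dog"}


def to_canonical(word):
    """Return the canonical form of a word, or the lowercased word if none is listed."""
    key = word.lower()
    return LEMMAS.get(key, key)


print([to_canonical(w) for w in ["Dogs", "Doggy", "Dog", "Cat"]])
```

Words that are not in the table pass through unchanged (apart from lowercasing), so the quality of the result depends entirely on the table.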

WordStat can perform univariate frequency analysis (keyword count and occurrence) and presents results in matrix form (Figure 7). The phrase finder helps users to identify recurring phrases and their counts.

WordStat can perform bivariate comparison between any textual field (for example, the personal ads in the tutorial in the next section) and any nominal and ordinal variables (such as gender or age group of the respondents). There are many association measures in WordStat to assess the relationship between the keyword occurrence and nominal/ordinal variables, e.g. the difference between keyword occurrence among the personal ads placed by men and by women.

Keyword-in-context (KWIC) is a useful feature in WordStat that allows one to see the occurrence of either a specific word or all words related to a category in an actual text arranged in a table format. It is handy when one needs to assess the consistency (or lack of consistency) of meanings associated with a word (Figure 1).

Software Review - WordStat 5.0 - Content Analysis

Figure 1: The keyword-in-context (KWIC) feature is handy when assessing the consistency of meanings associated with a word.
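
For readers who have not seen a KWIC table before, the idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions (words are runs of letters and apostrophes, and the context is a fixed number of words on each side); the function kwic and the sample ads are hypothetical and are not taken from WordStat or its sample data.

```python
import re


def kwic(texts, keyword, width=3):
    """Return (left context, keyword, right context) rows for every hit."""
    rows = []
    for text in texts:
        tokens = re.findall(r"[A-Za-z']+", text)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, token, right))
    return rows


# Hypothetical ad snippets, not taken from SEEKING.DBF.
ads = [
    "Seeking a kind woman who is kind of adventurous",
    "Kind, funny man seeks partner",
]

for left, word, right in kwic(ads, "kind"):
    print(f"{left:>20} | {word} | {right}")
```

Lining up the hits this way makes it easy to see that "kind" is used in two different senses in the first ad, which is exactly the sort of inconsistency a KWIC table is meant to expose.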

In addition to the above features, WordStat provides various other features such as automated text classification, analysis of case or document similarity, etc. For details on these and other features, visit www.provalisresearch.com/wordstat/WordstatFeatures.html.

Mini Tutorial


In this mini-tutorial, I follow the quick tour included in the manual of WordStat 5.0 exemplifying some features of WordStat. The scope of this tutorial is limited to core features such as univariate analysis and exploring relationships between some keywords and other categorical variables. This example analyzes personal ads. We run a content analysis of 68 personal ads published in a Montreal-based cultural newspaper to find out if there is any relationship between words used in the ads and the gender and the age of the person who placed the ads. Then we can investigate if such stereotypes as "boys only care about appearance" are really true. The data are stored in a data file, in this case, with three fields: the text of the ad itself and two categorical variables (the gender and the age group of the person placing the ad; the latter two may be hard to infer from the ad itself and hence are manually coded).

Step 1: Creating a data file. To create a data file, you can use the base program SimStat and enter data just as in other statistics packages. I found SimStat a bit cumbersome for data entry and manipulation because its data entry interface is rather different from the ones I am used to. However, WordStat (via SimStat) can directly and smoothly import several types of data files, such as MS Access, MS Excel and dBase. It also has a number of tools that assist in importing data from plain text or word-processed files.

For the textual field, you can simply copy and paste the text into the spreadsheet or database of your choice and import it into SimStat. Categorical and other variables related to the text, such as gender and age group, obviously need to be coded by the user. For example, for the data file of our job ads analysis [2], we used MS Access to create our data set by copying and pasting the job ads from Monster.com and from the HTML files provided by OR/MS Today. We manually coded the industry and other fields for further analysis.

In this tutorial, I use the sample data file (SEEKING.DBF), which comes with the software. Once you open the file, you can see three variables: the nominal variable GENDER (1 = Men, 2 = Women), the ordinal variable AGEGROUP (1 = 18-24, 2 = 25-29, 3 = 30-39, 4 = 40+) and the text variable AD_TEXT (Figure 2). The variable AD_TEXT contains the text of the 68 actual personal ads copied and pasted from newspapers; this variable is the focus of our analysis. The other two variables — GENDER and AGEGROUP — have been manually coded by browsing the personal ads.

Figure 2: Three variables from sample data file: nominal, ordinal and text.
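
The layout of such a data file can be pictured as one record per ad. The Python sketch below mirrors the three variables and the codes described above; the class name PersonalAd and the sample record are hypothetical, and the sketch says nothing about how the DBF file itself is stored.

```python
from dataclasses import dataclass

# Code labels as described for the sample file.
GENDER = {1: "Men", 2: "Women"}
AGEGROUP = {1: "18-24", 2: "25-29", 3: "30-39", 4: "40+"}


@dataclass
class PersonalAd:
    gender: int      # nominal variable GENDER
    agegroup: int    # ordinal variable AGEGROUP
    ad_text: str     # text variable AD_TEXT

    def labels(self):
        """Translate the numeric codes into readable labels."""
        return GENDER[self.gender], AGEGROUP[self.agegroup]


# Hypothetical record; the text is not from SEEKING.DBF.
ad = PersonalAd(gender=2, agegroup=3, ad_text="Hypothetical ad text")
print(ad.labels())
```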

Step 2: Select variables. Once you open the SEEKING.DBF file in SimStat, go to the STATISTICS menu and execute the CHOOSE X-Y command. Here, we need to move the variables to the appropriate locations. Let's first move the AD_TEXT variable to the DEPENDENT list box. Then the two categorical variables (GENDER and AGEGROUP) need to be placed in the INDEPENDENT list box (Figure 2). Note that this is similar to what one might do for an analysis of variance with quantitative data. Note also that so far we have been using SimStat, which is a statistical package.

Step 3: Run WordStat. Go to the STATISTICS menu and execute the CONTENT ANALYSIS command. A new window with six tabs pops up, and we are now ready to do the content analysis.

Step 4: Choose the proper dictionaries. The backbone of WordStat usage is a "dictionary." A dictionary is a specification of words and phrases under various named categories that allows WordStat to either exclude certain words from the analysis or, more to the point, create counts under each "category" when a word or phrase under that category is found in a record.

WordStat allows users to choose, view and edit the dictionaries used for a specific content analysis. In this tutorial, we skip two of the settings: 1) pre-processing for the custom transformation of text, and 2) "lemmatization," a process by which various forms of words are reduced to a more limited number of canonical forms, for example, transforming plural into singular. The third setting, "exclusion," is a dictionary that contains words to be removed during the analysis. For example, words with little semantic value such as pronouns, articles and conjunctions are automatically removed by the rules set in the exclusion dictionary. On the other hand, "categorization" allows one to specify words, word patterns and phrases to be included in the analysis (Figure 3).

Figure 3: "Categorization" specifies words, word patterns and phrases to be included in the analysis.
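
The interplay between the exclusion and categorization dictionaries can be sketched as follows. This is a minimal Python illustration, not WordStat's algorithm: the word lists are hypothetical stand-ins for files such as DEFAULT.EXC and SEEKING.CAT, whose actual formats and contents are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical dictionaries for illustration only.
EXCLUSION = {"a", "an", "and", "the", "i", "who", "is", "for", "with"}
CATEGORIES = {
    "APPEARANCE": {"beautiful", "muscular", "slim"},
    "FINANCE": {"wealthy", "secure"},
}


def categorize(text):
    """Count category hits in one record after removing excluded words."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    kept = [t for t in tokens if t not in EXCLUSION]
    counts = Counter()
    for token in kept:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    return counts


print(categorize("Slim and muscular man seeks a beautiful, secure partner"))
```

Words that survive the exclusion step but belong to no category (here "man," "seeks" and "partner") are simply not counted in this sketch.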

All of these dictionaries can be edited in the program or with any text-editing tool (e.g., Notepad). For this tutorial, we select the default exclusion dictionary (DEFAULT.EXC) and a tailor-made categorization dictionary (SEEKING.CAT) that contains words and phrases that frequently appear in personal ads. Keywords can be arranged in a hierarchical manner so users can run the analysis at different levels (Figure 4). The level-one categories are the major attributes partners may be looking for. Under the category "appearance," for example, one would find various words describing physical appearance.

Figure 4: Keywords can be arranged in a hierarchical manner for different levels of analysis.
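
One way to picture a hierarchical dictionary is as a nested mapping that can be read at either level. The Python sketch below is an assumption-laden illustration: the category and subcategory names, the word lists and the function build_lookup are all hypothetical and do not reflect the contents of SEEKING.CAT.

```python
# Hypothetical two-level dictionary:
# level-one category -> subcategory -> words.
DICTIONARY = {
    "APPEARANCE": {
        "BODY": ["muscular", "slim", "tall"],
        "LOOKS": ["beautiful", "handsome"],
    },
    "FINANCE": {
        "INCOME": ["wealthy", "secure"],
    },
}


def build_lookup(dictionary, level):
    """Map each word to its category name at the requested level (1 or 2)."""
    lookup = {}
    for top, subcategories in dictionary.items():
        for sub, words in subcategories.items():
            for word in words:
                lookup[word] = top if level == 1 else f"{top}/{sub}"
    return lookup


print(build_lookup(DICTIONARY, 1)["tall"])
print(build_lookup(DICTIONARY, 2)["tall"])
```

Counting with the level-one lookup gives the coarse picture; switching to level two splits the same hits into finer groups without changing the dictionary itself.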

You can download a large number of pre-made dictionaries from the Web page (http://www.provalisresearch.com/wordstat/RID.html), depending on the subject of interest. Most O.R. users will want to construct their own dictionaries from the raw data using WordStat. For this tutorial, the category dictionary SEEKING.CAT was given. Let us assume that we did not have it and had to create our own dictionary for this analysis. In this case, we could construct a categorization dictionary by running the frequency analysis of words and the phrase finder in WordStat to identify the words and phrases most commonly used. On the basis of the results, we can construct our own category dictionary by selecting the most frequently occurring words and phrases (Figures 5 and 6). However, these two functions also extract irrelevant words and phrases such as "LEAVE A MESSAGE"; we need to go through the lists to weed these out.
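
The two steps described above, counting words and finding recurring phrases, can be approximated with a short Python sketch. It is not WordStat's phrase finder: here a "phrase" is simply any sequence of n consecutive words, sentence boundaries are ignored, and the sample ads and function names are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase words made of letters and apostrophes."""
    return [t.lower() for t in re.findall(r"[A-Za-z']+", text)]


def word_frequencies(texts):
    """Count every word across all records."""
    counts = Counter()
    for text in texts:
        counts.update(tokens(text))
    return counts


def recurring_phrases(texts, n=3, min_count=2):
    """Count n-word sequences and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        words = tokens(text)
        counts.update(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


# Hypothetical ad snippets, not taken from SEEKING.DBF.
ads = [
    "Sense of humor a must. Leave a message.",
    "Good sense of humor. Please leave a message.",
]

print(word_frequencies(ads).most_common(3))
print(recurring_phrases(ads))
```

Even on this tiny sample the sketch surfaces both a useful phrase ("sense of humor") and an irrelevant one ("leave a message"), which is why a manual pass over the candidate list is still needed.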

Figure 5 (above) and Figure 6 (below): Users can construct their own category dictionary by selecting the most frequently occurring words and phrases.

Once you have chosen or created an appropriate dictionary, you can select advanced options. For this tutorial, we disabled all options.

Step 5: Perform a frequency analysis of the personal ads. Finally, we are ready to identify the most important attributes of Mr. or Miss "Perfect" according to the personal ads. We click the third tab (Frequencies) to count how often each word category occurs. Words under the "appearance" category turn out to be the most frequently mentioned criteria: 41 of the 68 ads contain words related to appearance (Figure 7). Note that the "appearance" category contains various words such as "beautiful" and "muscular" (Figure 4). The "finance" category, on the other hand, appeared the least. You can display words that are not included in the category dictionary by changing the display option.

Figure 7: "Appearance" category was popular in frequency analysis of the personal ads.
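
The figure of 41 out of 68 ads (about 60 percent) is a count of cases, that is, ads containing at least one word from the category, which is different from the total number of keyword hits. The Python sketch below shows the distinction on made-up data; the word list, the ads and the function frequency_and_cases are hypothetical.

```python
import re

# Hypothetical category words, not the contents of SEEKING.CAT.
APPEARANCE = {"beautiful", "muscular", "slim"}


def frequency_and_cases(texts, words):
    """Return (total keyword hits, number of records with at least one hit)."""
    total_hits = 0
    cases = 0
    for text in texts:
        found = re.findall(r"[A-Za-z']+", text.lower())
        hits = sum(1 for token in found if token in words)
        total_hits += hits
        if hits > 0:
            cases += 1
    return total_hits, cases


# Hypothetical ad snippets.
ads = ["Slim, muscular man", "Beautiful mind wanted", "Loves hiking"]
hits, cases = frequency_and_cases(ads, APPEARANCE)
print(hits, cases, f"{100 * cases / len(ads):.1f}% of cases")
```

In this toy sample there are three keyword hits but only two cases, because the first ad contains two appearance words.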

Step 6: Examining the relationship between included categories and the gender of the author. The frequency analysis we have just done shows the frequency of words regardless of gender. It can also be very interesting to see whether men and women differ in what they look for in an ideal partner. We go to the fourth tab, "Crosstab menu," and WordStat runs two separate frequency analyses for men and women and provides a nice table (Figure 8). The results suggest that the most important criterion for men appears to be "appearance," while women value "communication" and "family" the most. From the same menu we can also estimate the strength of these relationships by selecting an association measure such as the chi-square or Pearson's R statistic.

Figure 8: "Crosstab menu" provides, among other things, a nice table.
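
To show what an association measure computes from such a table, here is a minimal Python sketch of the Pearson chi-square statistic. It is a textbook calculation rather than WordStat's code, and the counts in the table are invented for illustration; they are not the values in Figure 8.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts. Rows: men, women.
# Columns: ads with / without an "appearance" word.
table = [[20, 10], [10, 20]]
print(round(chi_square(table), 3))
```

For a two-by-two table (one degree of freedom), a statistic above roughly 3.84 is significant at the 5 percent level, so counts as lopsided as these would suggest a real gender difference.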

You can do various other tasks in "Crosstab page" such as correspondence analysis. You can also create "heatmaps" that help clarify the relationship between words and categories (Figure 9).

Figure 9: "Heatmaps" help clarify the relationship between words and categories.


My Experience with WordStat


When my co-author Mohan Sodhi and I were initially planning the article "What Industry Employers Want from OR/MS Graduates: Preliminary Results from an Analysis of Job Ads" [2], we were overwhelmed by the sheer amount of textual information (more than 650 job ads). Our original plan was to go through the ads one by one and manually code each one. The estimated timeline for the article was three to five months. When we discovered WordStat on the Web, we were excited by the potential of the methodology and its features to explore a vast number of job ads in a fraction of the time.

We downloaded the demo version and became familiar with the software without the benefit of the printed manual. Thanks to its straightforward interface and easy-to-follow online manual, WordStat was relatively easy to use. In addition, WordStat proved quite versatile in terms of importing data from popular applications and easily exporting outcomes to various formats. We were able to import the data in MS Access format containing 650+ job ads in a few seconds without any difficulties.

Two features we particularly liked were "frequency analysis for a single word" and "phrase extractor." As we did not have any category dictionary for "discipline," "degree," "skill" and "nature of work," we had to create our own category dictionary. Although we had to go through more than a thousand key words and phrases automatically identified by these features to single out irrelevant words and phrases, the two features helped us to identify relevant key words and phrases in a quick and more accurate manner. While writing the article, we repeated the above process as we added more ads over time. Therefore, we had to update our category dictionary a number of times, and editing the category dictionary in WordStat was not complicated.

We found "keyword-in-context" (KWIC) useful when we were trying to assess the relevance of certain terms. For example, we discovered that Monster's search engine, given our phrase "operations research" (within quotes), also returned ads in which the words "operations" and "research" were separated by a punctuation mark. So, we used the KWIC feature to go through individual ads to spot "operations, research." WordStat automatically found all the ads containing "operations, research" and highlighted the matches in different colors, so we could easily spot and remove those ads.
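
The same screening could be approximated outside WordStat with a regular expression. The Python sketch below is only an illustration of the idea, assuming that an ad is a false hit when the two words are separated by punctuation and the intact phrase appears nowhere else in it; the pattern names and the function is_false_hit are hypothetical.

```python
import re

# "operations" and "research" separated by at least one punctuation mark.
SPLIT_PHRASE = re.compile(
    r"\boperations\s*[^\w\s]+\s*research\b", re.IGNORECASE
)
# The intact phrase, with only whitespace between the two words.
INTACT_PHRASE = re.compile(r"\boperations\s+research\b", re.IGNORECASE)


def is_false_hit(ad_text):
    """True if the ad has the split form but never the intact phrase."""
    has_split = SPLIT_PHRASE.search(ad_text) is not None
    has_intact = INTACT_PHRASE.search(ad_text) is not None
    return has_split and not has_intact


# Hypothetical ad snippets.
print(is_false_hit("Experience in operations research required"))
print(is_false_hit("Roles in operations, research and marketing"))
```

A pattern like this treats every punctuation mark the same way, so a form such as "operations/research" would also be flagged; flagged ads are best reviewed in context before they are discarded.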

Processing speed was more than adequate. My computer is a 2.4 GHz Intel Celeron with 512 MB of RAM. It took approximately three seconds to run the frequency analysis of words and one minute to run the key phrase extraction over the 650-plus ads.

WordStat helped us analyze the vast textual information quickly and in a rigorous manner. However, manipulating the data in the base statistical program, SimStat, was rather difficult and cumbersome relative to MS Excel and MS Access from which data can be imported directly.

While writing this review, I found various academic and industry articles reporting results obtained using WordStat. I was amazed at how creative the users of WordStat are in applying this software to various situations. For example, Péladeau and Stovall analyzed a database of pilot reports on collision risks, commonly known as TCAS (Traffic Collision Avoidance System) reports. Using WordStat they were able to identify the specific risks at different airports, the hour of day when those errors occurred, the flight phase in which those collision risks occurred, as well as some properties of those collision incidents (timing of events, multiplicity of events, pilot actions, etc.) [3].

Literally any sort of textual information can be analyzed with WordStat with dictionaries of your own choice. Imagine being able to analyze vast amounts of field operations documents, reports, e-mails, databases and other text fields that were untapped because they were simply too cumbersome or time-consuming to analyze manually. In O.R. classrooms, introducing a content analysis tool like WordStat can help students become aware of extending number-based statistics to text in order to mine information.

Overall, WordStat is an easy-to-use, affordable and feature-rich software package that provides O.R. professionals with yet another analytical technique.

Postscript: For the record, the Ph.D. student's conclusion was that a large part of Blair's campaign was benchmarked from the Clinton campaign, and the evidence presented by the content analysis was quite convincing.

Vendor Comments

Editor's note: It is the policy of OR/MS Today to allow developers of reviewed software an opportunity to clarify and/or comment on the review article. Following are comments by Normand Péladeau, president of Provalis Research.

The reviewer provided an excellent introduction to the basic features of WordStat. His description of how they built their own categorization dictionary to analyze job ads gives an accurate picture of the very first task a new user encounters when working with WordStat. The creation of taxonomies or categorization dictionaries is often an essential condition for well-grounded analysis and valid conclusions, and WordStat offers many tools to assist the user in such a task.

The reviewer mentioned the ability to easily assign words or phrases from lists. Additional features include a drag-and-drop dictionary editor and various lexical resources that can suggest additional items to be added to existing content categories. The ability to look at how words and phrases co-occur, using hierarchical clustering, multidimensional scaling or proximity plots, represents another way to identify themes in a collection of documents or perform knowledge-discovery tasks.

As with all advanced analysis tools that could only be mentioned in this short introduction, I encourage readers to look at some of the published studies available from our Web site or download either the electronic version of the manuals or the fully functional demos of WordStat and Simstat.

Statistical software is designed to handle numerical data and may not be the most appropriate tool to handle collections of documents. We fully understand the reviewer's difficulty in entering those ads directly into Simstat. This is one reason why we implemented import routines for various database and spreadsheet file formats and created a document conversion wizard to import various types of documents. This is also one of the reasons why we released, last year, a new application called QDA Miner that can be used in place of Simstat as the base module for WordStat. This software shares the same file format as our statistical program, but it was designed to provide more user-friendly document management features. It also introduces a new set of tools borrowed from social sciences computer-assisted qualitative analysis, which relies on manual and semi-automatic coding of text segments and on text retrieval.

WordStat, Simstat and QDA Miner are desktop applications. However, Provalis Research is planning to release, before the end of the year, a software developer's kit (SDK) that may be used with numerous programming languages and database programming environments. Such a library will allow integration of WordStat categorization and classification technologies into enterprise document management and decision-support systems.





Byung-Gak Son (b.g.son@city.ac.uk) is a Research Fellow at the Cass Business School, City University London, who recently finished his Ph.D. in supply chain management.

References


  1. Holsti, O.R., 1969, "Content Analysis for the Social Sciences and Humanities," Reading, Mass.: Addison-Wesley.
  2. Sodhi, M., and Son, B., 2005, "What Industry Employers Want from OR/MS Graduates," OR/MS Today (August 2005), Vol. 32, No. 4, pp. 32-38.
  3. Péladeau, N., and Stovall, C., 2005, "Application of Provalis Research Corp.'s Statistical Content Analysis Text Mining to Airline Safety Reports," Global Aviation Information Network.






    OR/MS Today copyright © 2005 by the Institute for Operations Research and the Management Sciences. All rights reserved.


    Lionheart Publishing, Inc.
    506 Roswell Rd., Suite 220, Marietta, GA 30060 USA
    Phone: 770-431-0867 | Fax: 770-432-6969
    E-mail: lpi@lionhrtpub.com
    URL: http://www.lionhrtpub.com


    Web Site © Copyright 2005 by Lionheart Publishing, Inc. All rights reserved.