Evaluation of Methods for Classifying Epidemiological Data on Choropleth Maps in Series
Cynthia A. Brewer, Department of Geography, Pennsylvania State University
Annals of the Association of American Geographers, 92(4), 2002, pp. 662-681
Table 1. Summary of Classification Methods Tested
EI. The hybrid equal-interval classification that we developed used the upper whisker of a box plot to define the highest category of outliers; see the box-plot discussion below (BP) for an explanation of whiskers. The remaining range of the data below the upper whisker was divided into equal intervals (e.g., equal steps of 7 deaths per 100,000). This approach was intended to be an improvement on the standard equal-interval method, which divides the overall data range into classes of equal range, regardless of the magnitude of extreme values. Such extreme outliers are often present in epidemiological data, and they interfere with use of a regular equal-interval classification, making it an impractical or “straw man” method for mapping real data.
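As a rough illustration of the hybrid equal-interval idea, the Python sketch below derives class breaks from a list of rates. It is not the authors' procedure: the function names, the use of the data minimum as the lower bound, the "inclusive" quartile rule, the handling of values that fall exactly on a break, and the sample rates are all assumptions made for this example.

```python
from bisect import bisect_left
from statistics import quantiles


def hybrid_equal_interval_breaks(values, n_classes=5):
    """Return the upper bounds of every class except the outlier class."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    # Upper whisker: the last data value within 1.5 x IQR above the upper hinge.
    upper_whisker = max(v for v in values if v <= fence)
    low = min(values)  # assumption: equal intervals start at the data minimum
    width = (upper_whisker - low) / (n_classes - 1)
    return [low + i * width for i in range(1, n_classes)]


def hybrid_equal_interval_class(value, breaks):
    """Class index 0..len(breaks); the last index holds the high outliers."""
    # Assumption: a value equal to a break belongs to the lower class.
    return bisect_left(breaks, value)


# Hypothetical rates per 100,000, with one high outlier (90).
rates = [10, 14, 20, 24, 26, 30, 38, 90]
breaks = hybrid_equal_interval_breaks(rates)  # [17.0, 24.0, 31.0, 38.0]
classes = [hybrid_equal_interval_class(r, breaks) for r in rates]
```

With these sample rates the upper whisker is 38, so the four lower classes step by 7 and only the value 90 lands in the outlier class.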
QN. The quantile method placed equal numbers of enumeration units into each class. With five classes, 20 percent of the units were in each class. Quantile classification is also known as percentile classification. With five classes, the test maps were quintile maps.
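A minimal sketch of quantile classification in Python follows. It ranks the enumeration units and deals them into classes of equal size. The function name and sample values are hypothetical, and the sketch ignores tied values, which a production implementation would need to handle.

```python
def quantile_classes(values, n_classes=5):
    """Assign each unit a class index so that classes hold equal counts."""
    n = len(values)
    # Indices of the units, ordered from lowest to highest data value.
    order = sorted(range(n), key=values.__getitem__)
    classes = [0] * n
    for rank, unit in enumerate(order):
        # Integer arithmetic splits the ranked list into n_classes equal runs.
        classes[unit] = rank * n_classes // n
    return classes


# Hypothetical rates for ten enumeration units.
rates = [31, 12, 45, 27, 60, 18, 39, 22, 51, 35]
quintiles = quantile_classes(rates)  # two units (20 percent) per class
```

When the number of units is not a multiple of the number of classes, this sketch lets class sizes differ by at most one unit.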
BP. The box-plot-based method had a middle class containing data in the interquartile range (the middle 50 percent of the data straddling the median). The adjacent classes extended from the hinges of the box plot to the whiskers, and the extreme classes contained outside and extreme values beyond the whiskers. Generally, the hinges of a box plot mark the top and bottom of the interquartile range, and the whiskers mark the last data values within 1.5 times the distance of the interquartile range above and below the hinges. For example, with an interquartile range of 10, from 33 to 43, the upper whisker could be as high as 58 (43 + 15) and the lower whisker as low as 18 (33 - 15). Data values higher than 58 and lower than 18 would be in the extreme classes for this example. See the example map and corresponding box plot and histogram in Figure 2C. Box-plot-based classes were intended to be more suitable for skewed or asymmetric data distributions than a mean-based classification (see SD, below).
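The arithmetic in the BP example (hinges at 33 and 43, limits at 18 and 58) can be checked with a short Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the "inclusive" quartile rule, the placement of hinge values in the middle class, and the sample data are all hypothetical. The sketch uses the 1.5 x IQR limits as breaks. This yields the same class membership as breaking at the whiskers, because no data value lies between a whisker and its limit.

```python
from statistics import quantiles


def box_plot_breaks(values):
    """Return (lower_limit, lower_hinge, upper_hinge, upper_limit)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)  # 1.5 times the interquartile range
    return q1 - step, q1, q3, q3 + step


def box_plot_class(value, breaks):
    """Class 0 = low extremes, 2 = interquartile range, 4 = high extremes."""
    lower_limit, lower_hinge, upper_hinge, upper_limit = breaks
    if value < lower_limit:
        return 0
    if value < lower_hinge:
        return 1
    if value <= upper_hinge:  # assumption: hinge values join the middle class
        return 2
    if value <= upper_limit:
        return 3
    return 4


# Hypothetical data chosen so that the hinges fall at 33 and 43.
rates = [10, 33, 33, 38, 43, 43, 70]
breaks = box_plot_breaks(rates)  # (18.0, 33.0, 43.0, 58.0)
classes = [box_plot_class(r, breaks) for r in rates]
```

Here the values 10 and 70 fall in the extreme classes, while a value of exactly 58 or 18 would remain in an adjacent class.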
SD. The standard deviation classification had a middle class centered on the mean with a range of 1 standard deviation (0.5 standard deviation to either side of the mean). Classes above and below this mean class were also one standard deviation in range, from ±(0.5 to 1.5) standard deviations. The high and low classes contained remaining data that fell outside ±1.5 standard deviations.
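The SD class limits described above can be sketched in a few lines of Python. The function names and sample data are hypothetical. The choice of the population standard deviation and the rule that a value equal to a break joins the higher class are assumptions, since the table does not specify either.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def sd_breaks(values):
    """Breaks at the mean -1.5, -0.5, +0.5, and +1.5 standard deviations."""
    center = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return [center + k * sd for k in (-1.5, -0.5, 0.5, 1.5)]


def sd_class(value, breaks):
    """Class 0 = below -1.5 SD, 2 = mean class, 4 = above +1.5 SD."""
    return bisect_right(breaks, value)


# Hypothetical data with mean 5 and population standard deviation 2.
rates = [2, 4, 4, 4, 5, 5, 7, 9]
breaks = sd_breaks(rates)  # 2, 4, 6, 8
classes = [sd_class(r, breaks) for r in rates]
```

Because the classes are anchored on the mean, a strongly skewed distribution can leave one of the extreme classes empty, which is the weakness the box-plot method was intended to address.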
NB. The natural-breaks method used was the implementation of the Jenks optimization procedure that was made available in ESRI’s ArcView GIS software. In general, the optimization minimized within-class variance and maximized between-class variance in an iterative series of calculations. ESRI’s documentation did not explain the specifics of their algorithm, but the ArcView natural-breaks method produced the same class breaks as did the Jenks algorithm that minimizes the sum of absolute deviations from class means (Terry Slocum, personal communication, e-mail, May 2000). See Slocum (1999) for a recent description of the Jenks algorithms.
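To make the optimization criterion concrete, the Python sketch below finds the grouping that minimizes the sum of absolute deviations from class means by exhaustive search. This is neither the Jenks iterative procedure nor ESRI's implementation. It is a brute-force stand-in that is practical only for small data sets, and the function names and sample data are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def absolute_deviation(group):
    """Sum of absolute deviations of a class from its own mean."""
    center = fmean(group)
    return sum(abs(v - center) for v in group)


def natural_breaks_exhaustive(values, n_classes):
    """Try every way of cutting the sorted data into contiguous classes."""
    data = sorted(values)
    best_cost, best_groups = None, None
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(absolute_deviation(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


# Hypothetical data with three obvious clusters.
rates = [30, 1, 11, 32, 2, 10, 3, 31, 12]
groups = natural_breaks_exhaustive(rates, 3)
```

The number of candidate groupings grows combinatorially with the number of units, which is why practical implementations rely on iterative or dynamic-programming optimization instead.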
BE. The minimum-boundary-error method used was also an iterative optimizing method (Cromley 1996). It was the only method tested that considered the spatial distribution or topology of the enumeration units (rather than their statistical distribution). In general, differences in data values between adjacent polygons were minimized in the same class and differences across boundaries were maximized between different classes (different colors). Larger differences in the data were, therefore, represented by color changes on the maps.
SA. The shared-area method used an ordered list of polygons ranked by data value to accumulate specific land areas in each class. With five classes, the extreme classes each covered 10 percent of the map area. The middle class contained 40 percent of the area, and the remaining classes each contained 20 percent of the area. This method was based on the work of Carr, Olsen, and White (1992) and was intended to be a more sophisticated version of the constant-blackness (equal-area) method tested by Lloyd and Steinke in earlier work (1976, 1977). All maps in our series “share” the 10-20-40-20-10-percent area assignments, so we have labeled the method “shared area.” We did not choose the previously used “constant area” or “equal area” terminology because classes within maps did not have equal areas.
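The area accumulation described for the SA method can be sketched in Python as follows. This is an illustration, not the procedure of Carr, Olsen, and White or of the authors: the function name and sample units are hypothetical, and assigning each polygon by the midpoint of its span of cumulative area is an assumption about how to treat polygons that straddle a class limit.

```python
from bisect import bisect_right
from itertools import accumulate


def shared_area_classes(units, shares=(10, 20, 40, 20, 10)):
    """units: list of (data_value, land_area). Returns a class index per unit."""
    total_area = sum(area for _value, area in units)
    # Cumulative-area upper limit of every class except the last.
    limits = [pct * total_area / 100 for pct in accumulate(shares)][:-1]
    # Rank polygons from lowest to highest data value.
    order = sorted(range(len(units)), key=lambda i: units[i][0])
    classes = [0] * len(units)
    covered = 0.0
    for i in order:
        area = units[i][1]
        # Assumption: a polygon is classed by the midpoint of its area span.
        classes[i] = bisect_right(limits, covered + area / 2)
        covered += area
    return classes


# Ten hypothetical polygons of equal area with different rates.
units = [(rate, 10.0) for rate in (30, 12, 47, 25, 18, 60, 41, 35, 22, 53)]
classes = shared_area_classes(units)
```

With polygons of unequal area, the realized shares only approximate 10-20-40-20-10 percent, because whole polygons cannot be split between classes.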