Updated Dec. 3
We present an approach to the process of constructing knowledge through structured exploration of large spatiotemporal data sets. First, we introduce our problem context and define both Geographic Visualization (geovisualization) and Knowledge Discovery in Databases (KDD), the source domains for methods being integrated. Next, we review and compare recent geovisualization and KDD developments and consider the potential for their integration, emphasizing that an iterative process with user interaction is a central focus for uncovering interesting and meaningful patterns through each. We then introduce an approach to design of an integrated geovisualization-KDD environment directed to exploration and discovery in the context of spatiotemporal environmental data. The approach emphasizes a matching of geovisualization and KDD meta-operations. Following description of the geovisualization and KDD methods that are linked in our prototype system, we present a demonstration of the prototype applied to a typical spatiotemporal dataset. We conclude by outlining, briefly, research goals directed toward more complete integration of geovisualization and KDD methods and their connection to temporal GIS.
Large environmental data sets represent a major challenge for both domain and information sciences. The domain sciences, most of which developed under data-poor conditions, must now adapt to a world that is data rich, so data rich that large volumes of data often remain unexplored while the media they are stored upon deteriorate or become obsolete. The information sciences, most of which developed in a pre-computer era or when batch processing by computer was the norm, must now adapt to a world that is not only digital but highly dynamic, one in which there is a potential for computers to produce answers in real time as an analyst explores data and poses "what if" questions. Much of the environmental data being generated today (e.g., from the Earth Observing System, from monitoring efforts in endangered ecosystems, and from meteorological stations) includes georeferencing. The spatial aspects of these data are, in fact, often a primary focus of analysis, as in studies of pollutant dispersal, forest fragmentation, and other applications. Repeated observation is critical to answering the most important environmental science questions (those related to environmental process); thus, environmental data sets typically have temporal as well as spatial components.
Particularly when applied to scientific data, geovisualization and KDD have similar goals. They differ, however, in the extent to which they rely upon human vision or computational methods to process data. In this section, we review, separately, the underlying principles and key developments of the past decade in both fields (each builds on longer traditions, but has been identified as a distinct research stream for about a decade).
A coherent theoretical framework for geovisualization is just beginning to emerge. That framework integrates the formalism of semiotics as an approach for understanding and modeling representational relationships with a cognitive perspective on the process of using visualization methods to facilitate scientific understanding (MacEachren, 1995). The goal for this integrated perspective is to develop a conceptualization of geovisualization as a process that involves humans achieving insight by interacting with data through use of manipulable visual displays that provide representations of these data and of the operations that can be applied to them. From semiotics, we gain tools for understanding abstract representations of phenomena and processes (i.e., representations in digital, visual, and other forms) as well as methods for explaining how meaning is brought to the representations by their creators and users. From the study of cognition we gain a perspective on the ways in which human information users conceptualize problem domains, process visual displays, and link mental schemata to actions through interface tools (figure 1). This integrated cognitive-semiotic perspective serves as a base from which to consider three categories of visualization meta-operations that are at the heart of the data exploration components of the Apoala Project: feature identification, feature comparison, and feature interpretation. These three operation forms are defined below, briefly.
The development of KDD coincides with an exponential increase in data generated by and available to science, government, and industry, particularly data generated in digital form. The term "knowledge discovery in databases" was coined in 1989 in an effort to distinguish between the application of particular algorithms designed to extract patterns from data (the subprocess of "data mining") and the overall process within which data mining is one step in "extracting knowledge" from these patterns (Fayyad et al., 1996).
The KDD literature contains frequent mention of the importance of visualization (e.g., Brachman and Anand, 1996; Uthurusamy, 1996). In most cases, however, visualization is considered only as a tool to facilitate the interpretation-evaluation stage of KDD (Simoudis et al., 1996, and Stolorz et al., 1995, are two exceptions). Our goal, in contrast, is a more complete integration of geovisualization and KDD methods. To achieve this goal, we approach system integration at three levels: conceptual, operational, and implementational (Howard and MacEachren, 1997).
Figure 1. Extended feature ID model for map-based visualization. This diagram extends an earlier pattern ID model of geographic visualization (MacEachren and Ganter, 1990) to integrate what we know about human perception and cognition in the context of visual information displays. Emphasis is given to the iterative nature of human experts examining a representation, attempting to interpret that representation, and using it as a prompt to insight (a process that cycles between seeing or noticing "features" in the display and interpreting those features by matching what is seen with what is known). The model suggests that knowledge is stored in the form of cognitive representations that are drawn upon to generate knowledge schemata (methods for matching what is sensed with prior knowledge) [reproduced from MacEachren, 1995, Guilford Press].
Figure 2. Simple concept hierarchy for climate analysis – one (a) appropriate for study of northern hemisphere climate, the other (b) appropriate to the southern hemisphere.
Figure 3. Integration of geovisualization and KDD methods. With each pairing of operation categories, we cite possible outcomes of the merger as it relates to knowledge construction goals. The possibilities are intended as examples from a larger potential list. Examples cited are derived from a perspective of what we gain by adding geovisualization to KDD operations. A similar matrix could be derived taking the alternative perspective.
Figure 4. A geoview is a three-dimensional window in which geographic space is mapped to display space in at least two of the dimensions. The third dimension is used to represent either the third spatial dimension (elevation/depth) or to represent time (with time treated as a linear dimension). In this case, the base of the view represents geographic space and political boundaries to help provide context. The z-axis represents time. Here, time is represented as linear and discrete starting with day 1 at the surface. Glyphs highlighted in red represent precipitation events at those times and places.
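The space-time mapping described in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the prototype's code: the function name `to_geoview`, the bounding box, and the sample observation are all hypothetical, and the display space is assumed to be a unit cube.

```python
def to_geoview(lon, lat, day, bounds, n_days):
    """Map a georeferenced, dated observation into a unit display cube.

    x and y carry geographic position; z carries time, treated as linear
    and discrete, with day 1 at the base of the view (z = 0).
    """
    west, south, east, north = bounds
    x = (lon - west) / (east - west)
    y = (lat - south) / (north - south)
    z = (day - 1) / (n_days - 1) if n_days > 1 else 0.0
    return (x, y, z)


# Hypothetical bounding box (west, south, east, north) and observation.
bounds = (-110.0, 20.0, -90.0, 35.0)
print(to_geoview(-100.0, 27.5, 1, bounds, n_days=365))
```

Swapping the role of the z-axis from time to elevation or depth would change only the third expression; the two geographic dimensions stay fixed, which is what keeps the view a geoview.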
Figure 5. 3D scatterplots represent relationships between three variables, with one variable plotted on each axis of a cube. 3D scatterplots are a simple form of "spatialization" in which non-spatial data "dimensions" are mapped to the two or three dimensions of a display space. Spatialization suggests, and takes advantage of, the metaphor of near=similar | far=different. Using linked views, the "mapping" of non-spatial aspects of georeferenced data to the spaces of a display can be grounded in more intuitive representations that map geographic space to display space (our geoviews). In this scatterplot, a sample of cases included in a data mining run is depicted in a space defined by three attributes: sequential date (number of days from the beginning of the data set), precipitation, and sea level pressure. Color represents a fourth attribute (three classes of surface humidity) and the size of the "glyphs" represents a fifth (humidity at 700 mb).
Figure 6. A Parallel Coordinate Plot (PCP) is a data representation that contains several parallel axes, one for each variable in the data set. The data are distributed along each axis and a line connects individual records from one axis to the next, producing a "signature" for each data record. Parallel coordinate plots (PCPs) are particularly effective in uncovering relationships among many variables (Inselberg, 1997). Most previous implementations, however, provided a limited perspective on data relationships because the assignment of variables to axes was fixed. In our implementation, the user can interactively adjust that assignment (see Figure 8). This typical PCP depicts each case from a sample of data used in a data mining run by connecting the position of that case on each variable axis. The pattern of lines that results represents relationships among the classes generated in a data mining run (the initial axes of the PCP) and each of the variables included in the run (including spatial, temporal, and attribute variables). Our variation on standard parallel coordinate plots includes several additions related to user interaction (detailed below) as well as the use of color to represent categories for one of the variable axes (in this case, the surface humidity axis is grouped into three equal value range classes depicted with shades of green). Not surprisingly, the sample cases depicted that have low surface humidity (light green) all also have zero precipitation. Considering precipitation, we see that most of these zero precipitation events occur when surface atmospheric pressure is high and the high precipitation events occur when pressure is low.
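The construction of PCP "signatures" and the effect of reassigning variables to axes can be sketched as follows. This is an illustrative sketch, not the prototype's implementation: the function name `pcp_signatures`, the min-max rescaling of each axis, and the three sample records are assumptions made for the example.

```python
def pcp_signatures(records, axis_order):
    """Return one polyline per record for a parallel coordinate plot.

    Each polyline is a list of (axis_index, position) pairs, where position
    is the record's value rescaled to [0, 1] along that variable's axis.
    """
    ranges = {}
    for var in axis_order:
        values = [rec[var] for rec in records]
        ranges[var] = (min(values), max(values))

    lines = []
    for rec in records:
        line = []
        for i, var in enumerate(axis_order):
            lo, hi = ranges[var]
            # A constant variable has no spread; place it mid-axis.
            pos = (rec[var] - lo) / (hi - lo) if hi > lo else 0.5
            line.append((i, pos))
        lines.append(line)
    return lines


# Made-up records for illustration only; not values from the climate data set.
records = [
    {"class": 0, "precip": 0.0, "slp": 1020.0},
    {"class": 24, "precip": 0.0, "slp": 1010.0},
    {"class": 1, "precip": 10.0, "slp": 1000.0},
]

# "Assignment": the same data drawn with two different axis orders.
first = pcp_signatures(records, ["class", "precip", "slp"])
second = pcp_signatures(records, ["slp", "class", "precip"])
print(first[0])
print(second[0])
```

Because only adjacent axes are joined by line segments, changing `axis_order` changes which pairwise relationships are directly visible, which is why interactive assignment matters.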
Figure 7. Integration of representation and interaction forms. Each of the three representation forms implemented (as well as others that we may implement in the future) is controlled through multiple interaction forms that allow users to manipulate various parameters of the data-to-display mapping. In most cases, linking among the representation forms results in an action applied by a user to one representation being reflected across the set of representations displayed. Each of our representation forms can be independently or simultaneously manipulated through applications of one or more interaction forms. Each interaction form can be considered the implementation of a visual analysis operation. The interaction forms we have implemented include assignment, brushing, focusing, colormap manipulation, viewpoint perspective manipulation, and sequencing.
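The linking among representation forms described in this caption can be pictured as a shared selection that every view observes. The sketch below is one hypothetical way to structure that idea; the class names `LinkedViews` and `View` are invented for illustration and do not describe how the prototype is actually built.

```python
class LinkedViews:
    """Shares one selection among several representation forms."""

    def __init__(self):
        self.views = []
        self.selected = set()

    def register(self, view):
        self.views.append(view)

    def brush(self, case_ids):
        # A brush applied in any one view updates the shared selection,
        # and every registered view is told to highlight the same cases.
        self.selected = set(case_ids)
        for view in self.views:
            view.highlight(self.selected)


class View:
    """Stand-in for a geoview, 3D scatterplot, or PCP."""

    def __init__(self, name):
        self.name = name
        self.highlighted = set()

    def highlight(self, case_ids):
        self.highlighted = set(case_ids)


link = LinkedViews()
geoview, scatter, pcp = View("geoview"), View("3D scatterplot"), View("PCP")
for v in (geoview, scatter, pcp):
    link.register(v)

link.brush({3, 7})  # e.g. the user brushes two cases in the PCP
print(sorted(geoview.highlighted), sorted(scatter.highlighted))
```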
Figure 8. Assignment.
Figure 9. Brushing.
Figure 10. Focusing.
Figure 11. Colormap manipulation.
The success of this research depended upon our ability to create a functional link between Tcl and IBM's Data Explorer. This sidebar provides a detailed explanation of this work, including portions of code and conceptual diagrams.
Figure 13. Sequencing.
Figure 14. Typical interpretation/evaluation display that includes the three representation forms.
In this section, we illustrate the potential of our integrated geovisualization-KDD approach to knowledge construction with an application of methods to a sample gridded regional climate data set for northern Mexico and the southern U.S. The target audience for our demonstration consists of environmental scientists (particularly climatologists). Sample data examined represent climate phenomena that are continuous in both space and time. At a conceptual level, the analysis goal is to find both individual features and classes of features in spatiotemporal climate data sets. A secondary conceptual-level goal is to explicate the data mining algorithm applied to the data (so that we can make more informed decisions about setting model parameters and so that scientists can better interpret the meaning of entity classes derived). At an operational level, these goals are instantiated as a series of operations or data processing tasks. Emphasis is on row two, cells one and two, of the meta-operations matrix introduced above (figure 3), that is, on the application of feature identification and comparison methods to the KDD operation of categories extraction and classification.
Figure 15. Geovisualization tool integration.
Figure 16. Here, we use focusing to highlight only cases occurring in 1987. Focusing on the PCP observations from 1987 creates a particularly informative perspective on these data. The eight sample observations identified are members of three classes: class 0, class 1, and class 24. By brushing (in the PCP) the lines representing each class, the query is narrowed and the prototypical "signatures" of each class represented in 1987 can be traced. As shown in this figure, the four observations in 1987 that were most likely to be in class 24 (highlighted in green) have nearly identical signatures. Each is not only a zero precipitation event but is also characterized by moderate to low surface humidity, low midlevel humidity, and relatively low sea level pressure.
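The two-step query in this caption (focus on a year, then brush a class) amounts to a pair of successive filters. The sketch below is a simplified reading of those interaction forms, assuming that focusing restricts cases to a value range on one variable and brushing picks out cases matching a chosen value; the function names and the four sample cases are hypothetical.

```python
def focus(records, variable, low, high):
    """Keep only the cases whose value on `variable` lies in [low, high]."""
    return [r for r in records if low <= r[variable] <= high]


def brush(records, variable, value):
    """Return the cases that match a brushed value, e.g. one class."""
    return [r for r in records if r[variable] == value]


# Made-up cases for illustration; not the paper's observations.
cases = [
    {"id": 1, "year": 1987, "class": 24, "precip": 0.0},
    {"id": 2, "year": 1987, "class": 0, "precip": 0.0},
    {"id": 3, "year": 1992, "class": 0, "precip": 0.0},
    {"id": 4, "year": 1987, "class": 24, "precip": 0.0},
]

in_1987 = focus(cases, "year", 1987, 1987)
class_24 = brush(in_1987, "class", 24)
print([c["id"] for c in class_24])
```

Moving the focus to another year, or to another variable such as the class number, only changes the arguments to `focus`; the brushing step is unchanged.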
Figure 17. Here, cases in class 0 that occur in 1987 are highlighted (in green), with a different class signature apparent. In contrast to Figure 16, these do not have a "signature" that tracks as consistently through all climate variables: the lines diverge at the midlevel humidity axis, but reconverge to one particular position on the sea level pressure axis (relatively high).
Figure 18. Focusing is moved to 1992 in order to compare class 0 cases from that year (again, highlighted in green) to those in 1987 (see figure 17). Focusing on 1992 shows the same "trace" for class 0: no precipitation, low surface humidity, high sea level pressure, and a wide variation of 700 mb humidity values. Class 0, then, is clearly more dependent upon sea level pressure than on midlevel humidity.
Figure 19. In this figure, the variable to which focusing is applied has been moved from "year" to "class#" to examine the characteristics of class 0 across all years (i.e., the full set of prototype class members). As was true for the specific years described above, observations in class 0 overall are characterized by zero precipitation, low surface humidity, high sea level pressure, and a wide range of midlevel humidity. This relationship can be confirmed using the 3D scatterplot, which shows the clustering of large glyphs (size scaled to sea level pressure) along the x-axis, which is, in this case, 700 mb humidity. A more dramatic result of the visualization of prototypical class 0 cases is the spatiality of this class, as displayed in the geoview: events in this class happen exclusively over land.
Figure 20. In this figure we focus on another class, class 24, and find it characterized by zero precipitation, relatively high (and more varied) surface humidity, and moderately low midlevel humidity and sea level pressure. This clustering is shown not only in the traces of the PCP but also in the 3D scatterplot. The spatiality of this class is also a dramatic inverse of that of class 0: all of the instances of this class occur over the Gulf of Mexico. A climatologist would expect this result: high surface humidity and lower midlevel humidity are more common over open water than over land.
Figure 21. In this figure, exploration of class 26 finds cases that have generally high surface humidity, zero precipitation, and a distinct temporal pattern (with two event clusters in time). The general pattern in the full data set is that the classes with high strength are distinguished more by temporal than spatial characteristics, particularly by patterns with similar events that are proximal in time (e.g., in a particular season of a particular year). The 3D geoview can incorporate time on the z-axis, and can be rotated so that these temporal associations are emphasized. Class 26 is a good example of a class that is characterized by temporal rather than spatial factors. In terms of climate variables, cases in class 26 have moderate surface humidity, low precipitation, and high midlevel humidity. Events with this combination of characteristics are clustered temporally (several locations with similar attributes on the same day) as opposed to spatially (several days with similar attributes in the same region). This clustering is apparent if the geoview is rotated: events belonging to class 26 occur only on certain days of the data set.
Figure 22. Here, color hue is used to distinguish the 7 classes being explored. The distinct space, time, and attribute characteristics of the classes are clearly visible. The results of data mining can thus be visualized and interpreted effectively using our combined representation and interaction forms. As a confirmation of the above analysis, each of the seven classes in the sample data set (classes 0, 1, 3, 7, 12, 24, and 26) can be assigned a different color (using the spectral scheme, seven class choice in the PCP Classify menu). The clustering of these colors in the 3D scatterplot dramatically illustrates that the observations are classified according to the values of a combination of climate variables, in tandem with spatial and temporal characteristics.
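Assigning a distinct hue to each class, as described in this caption, can be approximated with evenly spaced hues. The sketch below is only a rough stand-in for the spectral scheme offered in the PCP Classify menu, whose actual colors are not specified here; the function name `class_colors` and the red-to-violet hue range are assumptions.

```python
import colorsys


def class_colors(class_ids):
    """Give each class a distinct hue, spaced evenly from red toward violet."""
    ordered = sorted(class_ids)
    n = len(ordered)
    colors = {}
    for i, cid in enumerate(ordered):
        # Hue runs from 0.0 (red) to 0.75 (violet), a rough spectral ramp.
        hue = 0.75 * i / (n - 1) if n > 1 else 0.0
        colors[cid] = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return colors


# The seven classes named in the caption.
palette = class_colors([0, 1, 3, 7, 12, 24, 26])
print(palette[0])
```

Because the same class-to-color mapping would be shared by the geoview, scatterplot, and PCP, a class can be followed across all three displays by hue alone.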
The objectives of this paper have been to make the case for integration of geovisualization and KDD methods, to propose a conceptual framework for that integration emphasizing a merger of meta-operations fundamental to each set of methods, and to describe an initial prototype knowledge construction environment and its application to a test data set. At this stage in our long-term strategy for geovisualization-KDD integration, we have applied our knowledge construction methods to an isolated data set stored as flat files. We plan a subsequent coupling of the geovisualization-KDD methods with a temporal GIS. The goal here is to make the early stages in the knowledge construction process more flexible and to facilitate iterative exploration of user-selected subsets of data (the choice of which is prompted by prior analysis steps). In developing and implementing geovisualization methods, we have focused on methods that are particularly useful in the later stages of the KDD process (after data have been selected, preprocessed, and transformed into a format suited to applications of a particular data mining tool). We expect, however, that many of the geovisualization methods developed will also be useful for applications at the earlier KDD stages.
This research was supported by the U.S. Environmental Protection Agency (EPA), under Grant R825195-01-0 (Donna J. Peuquet and Alan M. MacEachren, Co-PIs). Support has also been provided by the Penn State Center for Academic Computing where MacEachren is a Faculty Fellow. We thank Tereza Cavazos for providing the Mexico data used in our demonstration and for help in interpretation of the data mining output as well as Mark Harrower for his work on both print graphics and on design and production of the web supplement to this paper.
Copyright © 1998 Pennsylvania State University