
Gordon German
School of Computing, Curtin University of Technology, Kent Street, Perth, Western Australia
Email: gordon@cs.curtin.edu.au
Artificial neural networks have several advantages when used as classifiers of complex geographic and remotelysensed datasets. They normally require no assumptions on the data distribution and can be trained with relatively small sample sets. Further, they are robust classifiers that require little data preparation prior to use. The author has previously shown a robust methodology for selecting an architecture and reducing the time required for training (German et al., 1998), where each hiddenlayer node is responsible for the creation and positioning of a separating hyperplane in attributespace.
In an extension of the prior methodology, hyperplanes that already produce adequate separation in attribute space are 'frozen' in position prior to training. This allows the network to focus on areas of poor separability and spawn additional hyperplanes if required. Results obtained on a complex realworld dataset shows a significant improvement in classification performance, as well as a reduction in training time.
The pixellevel classification of complex geographic datasets is an arduous task, in which the increase in the amount and variety of satelliteacquired data (Wilkinson et al., 1995), has made it more difficult for human classifiers to deduce the relationships within the data. Automated supervised classifiers, based on both statistical and machinelearning approaches, often are used to provide a discrete chloropleth map in which a given pixel is assigned a particular class label.
A classifier’s function can be formulated in terms of a mapping of its input variables to its (given) output conditions. We can write:
.............. (1)
where p is the number of attributes, q is the number of classes and n is the number of samples. The goal of classification is to select an output class from a different phenomenological domain (P , the classification scheme) to that of the input attributes (Â ) for each input vector x^{p}. These transformation models can be categorised as unsupervised or supervised classifiers. In the case of supervised classification, the user chooses the scheme P and the classifier learns an approximation G '(n, p) to the required transfer function G (n, p). This is accomplished by examining a small set (the training set) of the data for which the correct classification has already been determined (by ground survey, or a previous classification); hence, the (common) scheme of ground cover type is often derived from an attribute domain that may comprise several bands of LANDSAT data, as well as ancillary data such as digital elevation models, rainfall, etc.
The classifier must then be of such complexity as to enable encoding of the approximated transformation function G ¢ (n, p). For statistical classifiers, such as the gaussian maximum likelihood classifier (MLC) or minimumdistancetomean (MDM – see Richards, 1986), this often translates into defining some set of probability density functions (p.d.f.) that describe the class distributions. A discriminant function is then constructed that uses these p.d.f.s to provide the required transformation, i.e., these classifiers operate by use of a discriminant rule, which divides the p dimensional attributespace Â ^{p} into disjoint regions, one region per class. The problem for these statistical methodologies lies in the production of a satisfactory p.d.f. for each class, which may require certain assumptions and limitations on the data. For instance, the popular MLC assumes a gaussian distribution for the class variability and requires as an absolute minimum p+1 input vectors per class in the training set to avoid the class covariance matrix (S _{i}) becoming singular. Generally, it is recommended to have at least 10p to 100p input vectors per class to accurately model the class distribution with the gaussian p.d.f. (Mardia et al., 1979; Swain & Davis, 1978).
In contrast, the taskbased multilayered perceptron (MLP), a form of neural network based on machinelearning techniques, requires no assumptions on the class data distributions (although in general, it does assume a gaussian noise distribution through the data) and has no real lower limit on the size of the training classes. The MLP consists of three layers of nodes, as per Figure 1. Each node of the (single) hidden layer, along with its subset of the W weight matrix, is responsible for the generation of a separating hyperplane within the p dimensional attributespace. The task of each node is to separate out one particular class (c_{i}) from one other (c_{j}). The output layer nodes and associated connections are then responsible for the amalgamation of one or more of these hyperplanes into q1 class boundary decision surfaces and in a process analogous to a ‘superclass’ formation, can sum disjoint regions to provide a final classification strategy. Used correctly, the MLP can give a significant improvement over the classification performance of an MLC on complex, realworld data, as shown later (or see Gahegan & West, 1998 for comparison details).
The MLP classifier’s operation is covered in detail in German & Gahegan (1996). In brief, an input vector is placed on the input nodes and is propagated to the output layer via the weight connections and the hiddenlayer. This is done for each vector in the training set (one iteration). Each node in the hidden and output layers transforms the sum of its inputs via an activation function, normally a sigmoid of the form:
Figure 1: A simplified MLP network architecture.
.............(2)
where y is the node’s output, x is the summed input, q is a threshold, or bias value and f is a gain, normally treated as a constant and set to 1. During the training phase, the network seeks to reduce its output error E by varying the connecting weight matrices W and U after each iteration. The cost function E is the metric used by the network to judge the effectiveness with which it has learned the transfer function G (n, p) of Eqn. 1. The most commonly used cost function is the meansquared error (MSE) function given by:
.............(3)
where n = number of input patterns (vectors) in the training set, q = number of output nodes, t_{j} is the expected output at node j (of the output layer) and z_{j} is the actual output of that node. Ideally, E should be a measure of how far from optimal the current state of the network is. In this guise, the MLP classifier is a "winnertakesall" classifier, i.e., it provides a single class label for each input pixel.
The weight variation is performed by some form of minimisation routine (converging on a minimal E across the multidimensional weightspace) and there are many numerical routines available to implement this (e.g., Battiti & Masulli, 1990; Luenberger, 1984; Bryson & Ho, 1969). Traditionally, the standard neural network minimisation function is based on gradient descent, a rather simplistic numerical method using firstorder information derived from the partial derivatives of the weight vectors that require the user to "guess" certain parameters prior to implementation. An MLP implementing gradient descent is commonly referred to as a "vanilla backprop" neural network and has a reputation for being difficult to configure and initialise (e.g., Murata et al., 1991; Knerr et al., 1990; Jacobs, 1988) and may take 5,000 to 100,000 iterations to converge. Even so, it is unfortunately the method of choice for many GIS researchers when implementing neural network classifiers (for example, see Kamata & Kawaguchi, 1993; Foody, 1995; Bischof et al., 1992). The minimisation routine favoured by the author is a modified, multiplestrategy scaled conjugate gradient (SCG) routine, adapted from Moller (1993), which requires no userprovided parameters. It uses estimated secondorder information and performs better in complex spaces than gradient descent without the overhead of true secondorder routines such as NewtonRaphson and can choose between several search strategies, depending on the local error surface (see German, 1999). Acceptable convergence is often in the order of a few hundred or so iterations.
Ideally, during training, one would expect an inverse relationship between the cost function and the classification performance; for every decrease in E, there would be a corresponding increase in the percentagecorrect (PCC) classification rate (see Figure 2); however, for the MSE cost function of Eqn. 3, the relationship on the Kioloa dataset is as Figure 3.
Figure 2: Ideal relationship between the cost function and classification performance.
Obviously, a reduction in E does not always lead to an increase in classification. Alternative cost functions, such as the classification figureofmerit (CFM function, see Hampshire & Kumar, 1993) and the buffered MSE (BMSE, see German, 1999), have been devised, which attempt to alleviate this problem. Unfortunately, CFM cost functions, although highly efficient, do not model output noise as a gaussian distribution
Figure 3: Relationship of MSE cost function and performance on the Kioloa dataset.
and so cannot be used directly with the SCGbased minimisation routines. The author has used CFM functions with vanilla backprop, where they do exhibit greater performance, but with all the drawbacks that go along with that form of neural architecture. (See German (1999) for details.) On the other hand, BMSE can be used readily with SCG. The underlying idea of BMSE is that, so long as the correct output node has the highest output value (hence the input vector is assigned to the correct class), the network should not be penalised for small ‘errors’ in the output of losing nodes, provided they are below some error threshold (k ). By comparison, the MSE algorithm penalises all nodes regardless. Figure 4 shows the performance of the BMSE algorithm on the Kioloa dataset, which is much closer to the ideal relationship of Figure 2.
Figure 4: Relationship of BMSE cost function and performance on the Kioloa dataset.
Table 1 gives some comparative performance figures for various classifiers on the same complex dataset. The dataset is from the Kioloa area of New South Wales, Australia, which has been made available as a NASA pathfinder dataset through the Australian National University in Canberra (Lees and Ritman, 1991). A Landsat composite image is shown in Figure 5. The study area is a complex natural forest habitat of approximately 15 km square and there are 9 floristiclevel classes to be delineated from 11 attribute layers (four of these represent nominal or ordinal data, four are Landsat TM bands) i.e. q = 9
Figure 5: Landsat TM composite of the Kioloa region (bands 2, 4, 5, 1990).
and p = 11. The output classes, in order, are: dry sclerophyl, E. Botryoides, lower slope, wet E. Maculata, dry E. Maculata, rainforest ecotone, rainforest, cleared land and water. The ground truth (1705 pixels from an image of 275 625 pixels) does not consist of homogenous regions of pixels, rather they are randomly scattered throughout the dataset, in all a difficult task for any classifier, as can be seen from Table 1. The PCC figure gives the overall percentage correctly identified by the classifier, and the ANR figure is an average of the performance over each of the 9 classes. The Discrete Output Neural NETwork (DONNET) MLP is a simulation written by the author to implement the results of the current research at Curtin University into neural network classifiers. The code is available on the Web at www.cs.curtin.edu.au/~gordon/donnet. The DONNET MLP’s architecture and weights are derived as per German & Gahegan (1996). The DONNET MLP, Vanilla backprop and the C4.5 figures were all produced by the author, with the C4.5 decision tree optimised for maximum generalisation. The MLC classification is from Gahegan & West (1998), who do not provide validation figures.
Table 1: Comparison of various classifiers on the Kioloa dataset.
Training ..................................Validation
Net Type 
PCC 
ANR 
PCC 
ANR 
DONNET MLP 
81.13% 
73.38% 
72.61% 
60.37% 
C4.5 Decision Tree 
78.31% 
65.27% 
66.96% 
52.36% 
Vanilla backprop 
68.88% 
47.18% 
50.77% 
43.23% 
MLC 
50.50% 
48.50% 
 
 
As is described in German et al. (1998), the MLP’s performance can be further improved by the addition of particular hiddenlayer nodes to aid in the formation of more complex decisionsurfaces. With prudently placed additional nodes (determined from analysis of the zeroconfusion task matrix and the boundary misclassification rate, or BMR matrix as per German et al., 1998), the performance can be improved to 67.32% ANR for the validation set (82.11% ANR for the training set) after only 300 iterations, considerably better than the vanilla backprop, MLC or decision tree figures.
Prior to training, the taskbased MLP (German, 1994) has its initial weights, and therefore separating hyperplanes, calculated from the set of linear discriminant functions that maximise the ratio of the withinclass variance to the betweenclass variance for each pair of classes. The class confusion matrix of the classifier prior to training, is shown in Table 2. Note this already shows some classification ability due to the nonrandom way in which the initial weights are selected. For such a classifier, we can define a Boundary Misclassification Rate (BMR) matrix The initial BMR matrix for an MLP with the Kioloa dataset is given in Table 3. This table gives the performance at each classpair boundary and is derived from the class confusion matrix of Table 2 (see German et al., 1997). Positive figures are errors of omission, negative figures are errors of commission, and figures in bold are errors comprising of both (dualerror figures). For example, the table shows that the hyperplane defining the class 1:3 boundary (9.7%, hidden node 2) has incorrectly assigned 9.7% of class 1 to class 3. Similarly, the class 2:3 boundary (+2.2%, node 9) has incorrectly assigned 2.2% of class 3 to class 2 and the class 3:4 boundary (0%, node 16) has correctly assigned all of classes 4 and 5. In fact, there are 13 zero error boundaries. This implies that the weights incident to hiddenlayer nodes 8, 15, 16, 21, 25, 26, 29, 30, 32, 33, 34, 35 and 36 (reading across the columns of Table 3) are optimally positioned to accomplish their specific primary tasks.
Some of these "zeroconfusion" hyperplanes will be redundant in the sense that one (or more) of them may, in fact, do more than one task (see Figure 6). The redundant hyperplanes can be identified by the network (German et al., 1997) and can be used by the network to aid in the construction of more complex decision surfaces elsewhere in the attributespace.
Table 2: 9 Class Confusion Matrix (0 iterations)
Class1 
Class2 
Class3 
Class4 
Class5 
Class6 
Class7 
Class8 
Class9 
Totals 

True 1 
162 
1 
0 
17 
15 
0 
4 
5 
0 
204 
True 2 
24 
0 
0 
5 
3 
3 
5 
4 
0 
44 
True 3 
25 
0 
1 
1 
3 
0 
4 
1 
0 
35 
True 4 
42 
0 
0 
97 
21 
2 
6 
0 
0 
168 
True 5 
22 
0 
0 
22 
73 
1 
3 
0 
0 
121 
True 6 
10 
0 
0 
31 
6 
17 
1 
0 
0 
65 
True 7 
14 
3 
0 
9 
7 
0 
25 
0 
0 
58 
True 8 
0 
2 
0 
0 
0 
0 
0 
109 
0 
111 
True 9 
0 
0 
0 
0 
0 
0 
0 
0 
333 
333 
Table 3: 9 Class BMR Matrix (0 iterations) on the Kioloa dataset.

Class1 
Class2 
Class3 
Class4 
Class5 
Class6 
Class7 
Class8 
Class9 
Avg 
Class1 
 
10.4% 
9.7% 
12.4% 
10.7% 
7.3% 
+5.7% 
2.2% 
0% 
7.3% 
Class2 
 
 
+2.2% 
2.5% 
1.1% 
1.6% 
4.3% 
1.1% 
0% 
2.9% 
Class3 
 
 
 
0% 
2.3% 
0.9% 
2.9% 
1.2% 
0% 
2.4% 
Class4 
 
 
 
 
17.8% 
11.8% 
9.8% 
0% 
0% 
6.8% 
Class5 
 
 
 
 
 
+1.0% 
3.0% 
0% 
0% 
4.5% 
Class6 
 
 
 
 
 
 
4.7% 
0% 
0% 
3.4% 
Class7 
 
 
 
 
 
 
 
0% 
0% 
3.8% 
Class8 
 
 
 
 
 
 
 
 
0% 
0.6% 
Class9 
 
 
 
 
 
 
 
 
 
0.0% 
As can be seen from Figure 6, as long as one of the redundant hyperplanes is kept in its optimal position, the others can be moved by the network for use in other areas; however, during training, this optimally positioned hyperplane is still often repositioned by the minimisation routine in its search for a global minimum for E. Even for the BMSE cost function, there are still areas of Figure 4 where the classification rate does not increase for a given decrease in the cost function.
Figure 6: Hyperplane ‘A’ is constructed to separate classes 1 and 2 and ‘C’ for classes 2 and 3; however, each does both task perfectly, so one of them is redundant.
Removing the associated weights from the minimisation routine's search space could eliminate this unnecessary "jittering" of the optimally placed hyperplanes. In essence, we "freeze" the values of the W weights associated with one of these optimally placed hyperplanes. This should produce shorter training times and an incremental increase in classification performance.
Freezing of these weights will necessitate some method of setting the U weights that are associated with the hyperplane, as these are not static, but vary as the network attempts to produce a final classification scheme. The values for this subset of the U matrix (u_tmp) can be calculated via the methodology shown in German & Gahegan (1996). The procedure is as follows:
The minimisation routine does not use the frozen weights of the W matrix, nor the associated output weights of the U matrix. Because of this restricted search space, we would expect a slight reduction in the computational time per iteration, as well as a reduction in the number of iterations required for equivalent convergence.
The MLP is constructed with 11 input nodes, 36 hidden nodes, and 9 output nodes. The W weights are calculated from the Fisher pairwise linear discriminant functions and the U matrix is calculated from the known training class labels and the hidden layer outputs (see German & Gahegan, 1996, for complete details). In this initial state, prior to training, the class confusion matrix of Table 2, the previous BMR matrix of Table 3 and the zeroconfusion task matrix of Table 4 can be derived. The zeroconfusion task matrix shows the group of 7 compatibly redundant hyperplanes that do exactly the same tasks with 100% accuracy, or zero confusion and is derived by counting the number of times each hiddenlayer node "fires" for a particular class. In this particular example there is only one group, unsurprisingly associated with class 9 (water), the confusion matrix showing that only class 9 is initially perfectly separated (generally, the zeroconfusion task matrix may describe multiple groups with identical zeroconfusion tasks). From these 7 hyperplanes, we require one to be left in position (which one is arbitrary), while the others can be used by the network in other areas that may require more complex decision surfaces (e.g., the class 4:5 boundary, as can be seen from Table 2). Let us choose hyperplane h8 to be "frozen" in position. This hyperplane will perform all of the tasks in the 3^{rd} column of Table 4 with 100% accuracy. To accomplish this we must ensure that the network’s minimisation routine does not change any of the weights feeding into node 8 and modify the weights feeding out of it appropriately, at each iteration.
Table 4: Zero Confusion Task Matrix for 9 Class Example (0 iterations)
Hyperplane 
Primary Class Separation 
Zero Confusion Tasks 
8 
Task 8 (c_{1}:c_{9}) 
8,15,21,26,30,33,35,36 
15 
Task 15 (c_{2}:c_{9}) 
8,15,21,26,30,33,35,36 
21 
Task 21 (c_{3}:c_{9}) 
8,15,21,26,30,33,35,36 
26 
Task 26 (c_{4}:c_{9}) 
8,15,21,26,30,33,35,36 
30 
Task 30 (c_{5}:c_{9}) 
8,15,21,26,30,33,35,36 
33 
Task 33 (c_{6}:c_{9}) 
8,15,21,26,30,33,35,36 
35 
Task 35 (c_{7}:c_{9}) 
8,15,21,26,30,33,35,36 
We run the network for 300 iterations and produce the tables and figures below. Table 5 gives the classification results, compared to the original MLP model, Table 6 shows the class confusion matrix, and Figure 7 shows the comparative % classification as a function of the number of training iterations.
Table 5: Comparison of various classifiers on the Kioloa dataset.
Training........................................ Validation
Net Type 
PCC 
ANR 
PCC 
ANR 
Restricted DONNET 
84.72% 
76.53% 
75.44% 
64.34% 
Unrestricted DONNET 
81.13% 
73.38% 
72.61% 
60.37% 
Vanilla backprop 
68.88% 
47.18% 
50.77% 
43.23% 
Table 6: 9 Class Confusion Matrix (300 iterations)
Class1 
Class2 
Class3 
Class4 
Class5 
Class6 
Class7 
Class8 
Class9 
Totals 

True 1 
170 
6 
1 
13 
13 
0 
1 
0 
0 
204 
True 2 
5 
28 
0 
3 
2 
0 
3 
3 
0 
44 
True 3 
13 
1 
15 
0 
2 
0 
3 
1 
0 
35 
True 4 
17 
3 
2 
125 
13 
6 
2 
0 
0 
168 
True 5 
10 
3 
0 
14 
91 
3 
0 
0 
0 
121 
True 6 
3 
3 
0 
8 
1 
50 
0 
0 
0 
65 
True 7 
5 
1 
0 
6 
4 
0 
42 
0 
0 
58 
True 8 
0 
0 
0 
0 
0 
0 
0 
111 
0 
111 
True 9 
0 
0 
0 
0 
0 
0 
0 
0 
333 
333 
Figure 7: Classification performance during training on the Kioloa training set. The upper plot is for the restricted network.
As can be seen from Table 5 and Figure 7, the restricted MLP produces a better final classification (up by 4%) of 64.34% ANR on the validation set, with a slightly higher learning rate. The final classified image is shown in Figure 8, along with an image classified by the MLC. The superiority of the MLP image is evident in the lower granularity of the image, especially in the central regions.
We have shown how to improve the performance of a neural network classifier on GIS data, by intelligent restriction of hyperplane movement. The freezing of the appropriate weights and recalculation of the hiddenlayer/outputlayer weights achieve this. The method requires additional analysis prior to starting the training phase; however, by reducing the amount of unnecessary hyperplane readjustments, an improvement in the learning rate is achieved. Although this is only slight, over the few hundred iterations that are typical of the training phase, it leads to a significant overall benefit.
Figure 8: Classified images of the Kioloa dataset. The upper image is from the MLC classifier, the lower from the DONNET MLP.
Battiti, R. and F. Masulli, 1990 BFGS Optimization for Faster and Automated Supervised Learning, Proc. 1990 International Neural Network Conference, Paris, Vol. 2, pp. 757760.
Bischof, H., W. Schneider and A.J. Pinz, 1992. Multispectral Classification of LandsatImages Using Neural Networks, IEEE Transactions on Geoscience and Remote Sensing, Vol. 30, No. 3, pp. 482490.
Bryson, A.E. and Y. Ho, 1969. Applied Optimal Control : Optimization, Estimation, and Control. Ginn Publishing, Waltham, MA.
Cybenko, G., 1990. Complexity Theory of Neural Networks and Classification Problems, Proc. Neural Networks EURASIP Workshop, Ed., L. B. Almeida, C. J. Wellekens, Sesimbra Portugal, pp. 2444.
Foody, G. M., 1995. Land Cover Classification by an Artificial Neural Network with Ancillary Information, International Journal of Geographical Information Systems, Vol. 9, No. 5, pp. 527542.
Gahegan, M. and G.W. West, 1998. The Classification of Complex Geographic Datasets: An Operational Comparison of Artificial Neural Network and Decision Tree Classifiers, Proceedings of the 3rd International Conference on Geocomputation, Bristol, U.K.
German G.W.H., 1994. Multi Layered Perceptrons and the Classification of Remotely Sensed Data, Honours Thesis, School Of Computing, Curtin University of Technology, Bentley Western Australia.
German, G.W.H. and M.N. Gahagan, 1996. Neural Network Architectures for the Classification of Temporal Image Sequences, Computers and Geosciences, Vol. 22, No. 9, pp. 969979.
German, G., M. Gahegan and G. West, 1997. Predictive Assessment of Neural Network Classifiers For Applications in GIS, Proceedings of the 2nd International Conference on Geocomputation, University of Otago, NZ.
German, G.W.H., M.N. Gahagan and G.W. West, 1998. Improving the Learning Abilities of a Neural Networkbased Geocomputational Classifier, Proceedings of the International Society of Photogrammetry and Remote Sensing, Commission II, Cambridge, U.K., Vol. 32, No. 2, pp. 8087.
Hampshire, J. B. and B.V.K. Kumar, 1993. Differentially Generated Neural Network Classifiers are Efficient, Neural Networks for Signal Processing 3  Proc. 1993 IEEE Workshop, Eds. Kamm, Kuhn, Yoon, Chellappa and Kung, IEEE Press, NY.
Hornik, K., M. Stinchcombe and H. White, 1989. Multilayer Feedforward Networks are Universal Approximators, Neural Networks, Vol. 2, pp. 359366.
Jacobs, R.A., 1988. Increased Rates of Convergence Through Learning Rate Adaptation, Neural Networks, Vol. 1, pp. 295307.
Kamata, S. and E. Kawaguchi, 1993. A Neural Net Classifier for MultiTemporal LANDSAT Images Using Spacial and Spectral Information, Proc. of the 1993 International Joint Conference on Neural Networks, pp. 2,1992,202.
Knerr, S., L. Personnaz and G. Dreyfus, 1990. Single Layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network, Neurocomputing: Algorithms, Architectures, and Applications, Eds., F. FogelmanSoulie and J. Herault, Vol. F68, NATO ASI Series, SpringerVerlag, Berlin, Germany.
Lees, B.G. and K. Ritman, 1991. Decision Tree and Rule Induction Approach to Integration of Remotely Sensed and GIS Data in Mapping Vegetation in Disturbed or Hilly Environments, Environmental Management, Vol. 15, pp. 5071.
Luenberger, D., 1984. Linear and Nonlinear Programming. AddisonWesley Publishing, Reading, MA.
Mardia, K.V., T. Kent and J.M. Bibby, 1979. Multivariate Analysis, Academic Press, London, England.
Moller, M.F., 1993. A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, Vol. 6, pp. 525533.
Murata, N., S. Yoshizawa and S. Amari, 1991. A Criterion for Determining the Number of Parameters in an Artificial Neural Network Model, Artificial Neural Networks, Ed., T. Kohonen, Elsevier Science Publishers, Holland, pp. 914.
Richards, J.A., 1986. Remote Sensing Digital Image Analysis: An Introduction. SpringerVerlag, Berlin, Germany.
Swain, P.H. and S.M. Davis, 1978. Remote Sensing: The Quantitative Approach, McGrawHill.
Wilkinson, G.G., S. Folving, I. Kanellopoulos, N. McCormick, K. Fullerton and J. Megier, 1995. Forest Mapping From MultiSource Satellite Data Using Neural Network Classifiers – An Experiment in Portugal, Remote Sensing Reviews, Vol. 12, pp. 83106.