LEARNING INTERNAL REPRESENTATIONS BY ERROR PROPAGATION

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams

September 1985
ICS Report 8506

INSTITUTE FOR COGNITIVE SCIENCE
UNIVERSITY OF CALIFORNIA, SAN DIEGO
LA JOLLA, CALIFORNIA 92093

David E. Rumelhart, Institute for Cognitive Science, University of California, San Diego
Geoffrey E. Hinton, Department of Computer Science, Carnegie-Mellon University
Ronald J. Williams, Institute for Cognitive Science, University of California, San Diego

Approved for public release; distribution unlimited.

To be published in D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Cambridge, MA: Bradford Books/MIT Press.

This research was supported by Contract N00014-85-K-0450, NR 667-548 with the Personnel and Training Research Programs of the Office of Naval Research and by grants from the System Development Foundation. Requests for reprints should be sent to David E. Rumelhart, Institute for Cognitive Science, C-015; University of California, San Diego; La Jolla, CA 92093.

Copyright © 1985 by David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams.
REPORT DOCUMENTATION PAGE (DD Form 1473)

Security classification: Unclassified. Distribution/availability: Approved for public release; distribution unlimited.
Performing organization report number: ICS 8506. Performing organization: Institute for Cognitive Science, C-015, University of California, San Diego, La Jolla, CA 92093.
Funding/sponsoring organization: Personnel & Training Research Programs, Code 1142 PT, Office of Naval Research, 800 N. Quincy St., Arlington, VA 22217-5000. Procurement instrument: N00014-85-K-0450. Project no.: NR 667-548.
Title: Learning Internal Representations by Error Propagation. Authors: David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams.
Type of report: Technical. Time covered: March 1985 to September 1985. Date of report: September 1985. Page count: 34.
Supplementary notation: To be published in J. L. McClelland, D. E. Rumelhart, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Vol. 1. Foundations. Cambridge, MA: Bradford Books/MIT Press.
Subject terms: learning; networks; perceptrons; adaptive systems; learning machines; back propagation.
ABSTRACT: This paper presents a generalization of the perceptron learning procedure for learning the correct sets of connections for arbitrary networks. The rule, called the generalized delta rule, is a simple scheme for implementing a gradient descent method for finding weights that minimize the sum squared error of the system's performance. The major theoretical contribution of the work is the procedure called error propagation, whereby the gradient can be determined by individual units of the network based only on locally available information. The major empirical contribution of the work is to show that the problem of local minima is not serious in this application of gradient descent.
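The procedure summarized in the abstract can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' code: the architecture (three hidden units), the learning rate, the epoch count, and the random initialization are arbitrary choices made here, and the logistic activation anticipates the "useful activation function" discussed later in the report. It trains a small network on XOR by gradient descent on the sum squared error, with each unit's weight change computed from locally available delta terms.

```python
import math
import random

random.seed(1)

# XOR training pairs: (input vector, target output)
PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
N_HID = 3    # number of hidden units (arbitrary choice for this sketch)
ETA = 0.5    # learning rate (also an arbitrary choice)

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Small random initial weights: hidden layer, then the single output unit.
wh = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(N_HID)]
bh = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
wo = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
bo = random.uniform(-0.5, 0.5)

def forward(x):
    h = [sigmoid(wh[j][0] * x[0] + wh[j][1] * x[1] + bh[j]) for j in range(N_HID)]
    o = sigmoid(sum(wo[j] * h[j] for j in range(N_HID)) + bo)
    return h, o

def sum_squared_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in PATTERNS)

err_start = sum_squared_error()
for _ in range(5000):
    for x, t in PATTERNS:
        h, o = forward(x)
        # Output-unit delta: error times the derivative of the logistic function.
        delta_o = (t - o) * o * (1 - o)
        for j in range(N_HID):
            # Hidden-unit delta: only the error signal propagated back
            # through the output weight is needed, i.e. local information.
            delta_h = delta_o * wo[j] * h[j] * (1 - h[j])
            wo[j] += ETA * delta_o * h[j]
            wh[j][0] += ETA * delta_h * x[0]
            wh[j][1] += ETA * delta_h * x[1]
            bh[j] += ETA * delta_h
        bo += ETA * delta_o

err_end = sum_squared_error()
print(err_start, err_end)  # gradient descent drives the error down
```

On a typical run the error falls close to zero, consistent with the report's empirical point that local minima are rarely a problem in practice, though convergence to a perfect solution on any particular seed is not guaranteed.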
Contents

THE PROBLEM
THE GENERALIZED DELTA RULE
    The delta rule and gradient descent
    The delta rule for semilinear activation functions in feedforward networks
SIMULATION RESULTS
    A useful activation function
    The learning rate
    Symmetry breaking
    The XOR Problem
    Parity
    The Encoding Problem
    Symmetry
    Addition
    The Negation Problem
    The T-C Problem
    More Simulation Results
SOME FURTHER GENERALIZATIONS
    The Generalized Delta Rule and Sigma-Pi Units
    Recurrent Nets
    Learning to be a shift register
    Learning to complete sequences
CONCLUSION
REFERENCES
Learning Internal Representations by Error Propagation

DAVID E. RUMELHART, GEOFFREY E. HINTON, and RONALD J. WILLIAMS

THE PROBLEM

We now have a rather good understanding of simple two-layer associative networks in which a set of input patterns arriving at an input layer are mapped directly to a set of output patterns at an output layer. Such networks have no hidden units. They involve only input and output units. In these cases there is no internal representation. The coding provided by the external world must suffice. These networks have proved useful in a wide variety of applications (cf. Chapters 2, 17, and 18). Perhaps the essential character of such networks is that they map similar input patterns to similar output patterns. This is what allows these networks to make reasonable generalizations and perform reasonably on patterns that have never before been presented. The similarity of patterns in a PDP system is determined by their overlap. The overlap in such networks is determined outside the learning system itself, by whatever produces the patterns.

The constraint that similar input patterns lead to similar outputs can lead to an inability of the system to learn certain mappings from input to output. Whenever the representation provided by the outside world is such that the similarity structure of the input and output patterns are very different, a network without internal representations (i.e., a network without hidden units) will be unable to perform the necessary mappings. A classic example of this case is the exclusive-or (XOR) problem illustrated in Table 1. Here we see that those patterns which overlap least are supposed to generate identical output values. This problem and many others like it cannot be performed by networks without hidden units with which to create

TABLE 1

Input Patterns    Output Patterns
00             ->  0
01             ->  1
10             ->  1
11             ->  0
their own internal representations of the input patterns. It is interesting to note that had the input patterns contained a third input taking the value 1 whenever the first two have value 1, as shown in Table 2, a two-layer system would be able to solve the problem.

TABLE 2

Input Patterns    Output Patterns
000            ->  0
010            ->  1
100            ->  1
111            ->  0

Minsky and Papert (1969) have provided a very careful analysis of conditions under which such systems are capable of carrying out the required mappings. They show that in a large number of interesting cases, networks of this kind are incapable of solving the problems. On the other hand, as Minsky and Papert also pointed out, if there is a layer of simple perceptron-like hidden units, as shown in Figure 1, with which the original input pattern can be augmented, there is always a recoding (i.e., an internal representation) of the input patterns in the hidden units in which the similarity of the patterns among the hidden units can support any required mapping from the input to the output units. Thus, if we have the right connections from the input units to a large enough set of hidden units, we can always find a representation

FIGURE 1. A multilayer network. In this case the information coming to the input units is recoded into an internal representation and the outputs are generated by the internal representation rather than by the original pattern. Input patterns can always be encoded, if there are enough hidden units, in a form so that the appropriate output pattern can be generated from any input pattern.
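The contrast between Table 1 and Table 2 can be checked directly with the perceptron convergence procedure. The sketch below is an illustration, not from the report; the learning rate and epoch budget are arbitrary choices here. Training never converges on the raw XOR patterns, which are not linearly separable, but it converges on the augmented patterns of Table 2, where the third input (the conjunction of the first two) changes the similarity structure.

```python
# A single threshold unit trained with the perceptron convergence procedure.
def train_perceptron(patterns, epochs=500, eta=0.1):
    n = len(patterns[0][0])
    w = [0.0] * n   # connection weights
    b = 0.0         # bias (negative threshold)
    for _ in range(epochs):
        mistakes = 0
        for x, t in patterns:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if y != t:
                mistakes += 1
                for i in range(n):
                    w[i] += eta * (t - y) * x[i]
                b += eta * (t - y)
        if mistakes == 0:
            return True   # converged: the patterns are linearly separable
    return False          # never converged within the epoch budget

# Table 1: plain XOR.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
# Table 2: a third input equal to the AND of the first two.
xor_aug = [((0, 0, 0), 0), ((0, 1, 0), 1), ((1, 0, 0), 1), ((1, 1, 1), 0)]

print(train_perceptron(xor))      # False: no two-layer solution exists
print(train_perceptron(xor_aug))  # True: the extra feature makes it separable
```

Since the augmented problem is separable (for example by weights (1, 1, -2) and threshold 0.5), the perceptron convergence theorem guarantees the second call terminates after finitely many mistakes.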
that will perform any mapping from input to output through these hidden units. In the case of the XOR problem, the addition of a feature that detects the conjunction of the input units changes the similarity structure of the patterns sufficiently to allow the solution to be learned. As illustrated in Figure 2, this can be done with a single hidden unit. The numbers on the arrows represent the strengths of the connections among the units. The numbers written in the circles represent the thresholds of the units. The value of +1.5 for the threshold of the hidden unit insures that it will be turned on only when both input units are on. The value 0.5 for the threshold of the output unit insures that it will turn on only when it receives a net positive input greater than 0.5. The weight of -2 from the hidden unit to the output unit insures that the output unit will not come on when both input units are on. Note that from the point of view of the output unit, the hidden unit is treated as simply another input unit. It is as if the input patterns consisted of three rather than two units. The existence of networks such as this illustrates the potential power of hidden units and internal representations. The problem, as noted by Minsky and Papert, is that whereas there is a very simple guaranteed learning rule for all problems that can be solved without hidden units, namely, the perceptron convergence procedure (or the variation due originally to Widrow and Hoff, 1960, which we call the delta rule; see Chapter 11), there is no equally powerful rule for learning in networks with hidden units. There have been three basic responses to this lack. One response is represented by competitive learning (Chapter 5) in which simple unsupervised learning rules are employed so that useful hidden units develop. Although these approaches are promising, there is no external force to insure that hidden units appropriate for the required mapping are developed.
The second response is to simply assume an internal representation that, on some a priori grounds, seems reasonable. This is the tack taken in the chapter on verb learning (Chapter 18) and in the interactive activation model of word perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982). The third approach is to attempt to develop a learning procedure capable of learning an internal representation adequate for performing the task at hand. One such development is presented in the discussion of Boltzmann machines in Chapter 7. As we have seen, this procedure involves the use of stochastic units, requires the network to reach equilibrium in two different phases, and is limited to symmetric networks. Another recent approach, also employing stochastic units, has been developed by Barto (1985) and various of his colleagues (cf. Barto & Anandan, 1985). In this chapter we

FIGURE 2. A simple XOR network with one hidden unit. See text for explanation.
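The Figure 2 network is small enough to verify exhaustively. The sketch below simply encodes the weights and thresholds given in the text (hidden-unit threshold +1.5, output-unit threshold 0.5, input weights +1, and a -2 weight from the hidden unit to the output unit), modeling each unit as a step function for this illustration.

```python
def step(net, threshold):
    """Threshold unit: fires when its net input exceeds its threshold."""
    return 1 if net > threshold else 0

def xor_net(x1, x2):
    # Hidden unit (threshold +1.5): detects the conjunction of the inputs.
    hidden = step(x1 + x2, 1.5)
    # Output unit (threshold 0.5): treats the hidden unit as a third input
    # with weight -2, so it is suppressed when both input units are on.
    return step(x1 + x2 - 2 * hidden, 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', xor_net(x1, x2))  # reproduces the mapping of Table 1
```

Running the loop prints 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0, confirming that a single hidden unit suffices.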