Monographs in Computer Science
Editors
David Gries
Fred B. Schneider
Monographs in Computer Science
Abadi and Cardelli, A Theory of Objects
Benosman and Kang [editors], Panoramic Vision: Sensors, Theory, and Applications
Bhanu, Lin, Krawiec, Evolutionary Synthesis of Pattern Recognition Systems
Broy and Stølen, Specification and Development of Interactive Systems: FOCUS on
Streams, Interfaces, and Refinement
Brzozowski and Seger, Asynchronous Circuits
Burgin, Super-Recursive Algorithms
Cantone, Omodeo, and Policriti, Set Theory for Computing: From Decision Procedures
to Declarative Programming with Sets
Castillo, Gutiérrez, and Hadi, Expert Systems and Probabilistic Network Models
Downey and Fellows, Parameterized Complexity
Feijen and van Gasteren, On a Method of Multiprogramming
Grune and Jacobs, Parsing Techniques: A Practical Guide, Second Edition
Herbert and Spärck Jones [editors], Computer Systems: Theory, Technology, and
Applications
Leiss, Language Equations
Levin, Heydon, Mann, and Yu, Software Configuration Management Using VESTA
Mclver and Morgan [editors], Programming Methodology
Mclver and Morgan [editors], Abstraction, Refinement and Proof for Probabilistic
Systems
Misra, A Discipline of Multiprogramming: Programming Theory for Distributed
Applications
Nielson [editor], ML with Concurrency
Paton [editor], Active Rules in Database Systems
Poernomo, Crossley, and Wirsing, Adapting Proof-as-Programs: The Curry-Howard
Protocol
Selig, Geometrical Methods in Robotics
Selig, Geometric Fundamentals of Robotics, Second Edition
Shasha and Zhu, High Performance Discovery in Time Series: Techniques and Case
Studies
Tonella and Potrich, Reverse Engineering of Object Oriented Code
Dick Grune
Ceriel J.H. Jacobs
Parsing Techniques
A Practical Guide
Second Edition
Dick Grune and Ceriel J.H. Jacobs
Faculteit Exacte Wetenschappen
Vrije Universiteit
De Boelelaan 1081
1081 HV Amsterdam
The Netherlands
Series Editors
David Gries
Department of Computer Science
Cornell University
4130 Upson Hall
Ithaca, NY 14853-7501
USA
Fred P. Schneider
Department of Computer Science
Cornell University
4130 Upson Hall
Ithaca, NY 14853-7501
USA
ISBN-13: 978-0-387-20248-8
e-ISBN-13: 978-0-387-68954-8
Library of Congress Control Number: 2007936901
©2008 Springer Science+Business Media, LLC
©1990 Ellis Horwood Ltd.
All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer Science+Business Media LLC, 233 Spring Street,
New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed is forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or
not they are subject to proprietary rights.
Printed on acid-free paper.
9 8 7 6 5 4 3 2 1
springer.com
(SB)
Preface to the Second Edition
As is fit, this second edition arose out of our readers’ demands to read about new
developments and our desire to write about them. Although parsing techniques is
not a fast moving field, it does move. When the first edition went to press in 1990,
there was only one tentative and fairly restrictive algorithm for linear-time substring
parsing. Now there are several powerful ones, covering all deterministic languages;
we describe them in Chapter 12. In 1990 Theorem 8.1 from a 1961 paper by Bar-
Hillel, Perles, and Shamir lay gathering dust; in the last decade it has been used to
create new algorithms, and to obtain insight into existing ones. We report on this in
Chapter 13.
More and more non-Chomsky systems are used, especially in linguistics. None
except two-level grammars had any prominence 20 years ago; we now describe six
of them in Chapter 15. Non-canonical parsers were considered oddities for a very
long time; now they are among the most powerful linear-time parsers we have; see
Chapter 10.
Although still not very practical, marvelous algorithms for parallel parsing have
been designed that shed new light on the principles; see Chapter 14. In 1990 a gen-
eralized LL parser was deemed impossible; now we describe two in Chapter 11.
Traditionally, and unsurprisingly, parsers have been used for parsing; more re-
cently they are also being used for code generation, data compression and logic
language implementation, as shown in Section 17.5. Enough. The reader can find
more developments in many places in the book and in the Annotated Bibliography
in Chapter 18.
Kees van Reeuwijk has — only half in jest — called our book “a reservation
for endangered parsers”. We agree — partly; it is more than that — and we make
no apologies. Several algorithms in this book have very limited or just no practical
value. We have included them because we feel they embody interesting ideas and
offer food for thought; they might also grow and acquire practical value. But we
also include many algorithms that do have practical value but are sorely underused;
describing them here might raise their status in the world.
vi
Preface to the Second Edition
Exercises and Problems
This book is not a textbook in the school sense of the word. Few universities have
a course in Parsing Techniques, and, as stated in the Preface to the First Edition, read-
ers will have very different motivations to use this book. We have therefore included
hardly any questions or tasks that exercise the material contained within this book;
readers can no doubt make up such tasks for themselves. The questions posed in the
problem sections at the end of each chapter usually require the reader to step outside
the bounds of the covered material. The problems have been divided into three not
too well-defined classes:
•
not marked — probably doable in a few minutes to a couple of hours.
• marked Project — probably a lot of work, but almost certainly doable.
• marked Research Project — almost certainly a lot of work, but hopefully doable.
We make no claims as to the relevance of any of these problems; we hope that some
readers will find some of them enlightening, interesting, or perhaps even useful.
Ideas, hints, and partial or complete solutions to a number of the problems can be
found in Chapter A.
There are also a few questions on formal language that were not answered eas-
ily in the existing literature but have some importance to parsing. These have been
marked accordingly in the problem sections.
Annotated Bibliography
For the first edition, we, the authors, read and summarized all papers on parsing
that we could lay our hands on. Seventeen years later, with the increase in publica-
tions and easier access thanks to the Internet, that is no longer possible, much to our
chagrin. In the first edition we included all relevant summaries. Again that is not pos-
sible now, since doing so would have greatly exceeded the number of pages allotted
to this book. The printed version of this second edition includes only those refer-
ences to the literature and their summaries that are actually referred to in this book.
The complete bibliography with summaries as far as available can be found on the
web site of this book; it includes its own authors index and subject index. This setup
also allows us to list without hesitation technical reports and other material of possi-
bly low accessibility. Often references to sections from Chapter 18 refer to the Web
version of those sections; attention is drawn to this by calling them “(Web)Sections”.
We do not supply URLs in this book, for two reasons: they are ephemeral and
may be incorrect next year, tomorrow, or even before the book is printed; and, es-
pecially for software, better URLs may be available by the time you read this book.
The best URL is a few well-chosen search terms submitted to a good Web search
engine.
Even in the last ten years we have seen a number of Ph.D theses written in lan-
guages other than English, specifically German, French, Spanish and Estonian. This
choice of language has the regrettable but predictable consequence that their con-
tents have been left out of the main stream of science. This is a loss, both to the
authors and to the scientific community. Whether we like it or not, English is the
de facto standard language of present-day science. The time that a scientifically in-
Preface to the Second Edition
vii
terested gentleman of leisure could be expected to read French, German, English,
Greek, Latin and a tad of Sanskrit is 150 years in the past; today, students and sci-
entists need the room in their heads and the time in their schedules for the vastly
increased amount of knowledge. Although we, the authors, can still read most (but
not all) of the above languages and have done our best to represent the contents of
the non-English theses adequately, this will not suffice to give them the international
attention they deserve.
The Future of Parsing, aka The Crystal Ball
If there will ever be a third edition of this book, we expect it to be substantially
thinner (except for the bibliography section!). The reason is that the more parsing
algorithms one studies the more they seem similar, and there seems to be great op-
portunity for unification. Basically almost all parsing is done by top-down search
with left-recursion protection; this is true even for traditional bottom-up techniques
like LR(1), where the top-down search is built into the LR(1) parse tables. In this
respect it is significant that Earley’s method is classified as top-down by some and
as bottom-up by others. The general memoizing mechanism of tabular parsing takes
the exponential sting out of the search. And it seems likely that transforming the
usual depth-first search into breadth-first search will yield many of the generalized
deterministic algorithms; in this respect we point to Sikkel’s Ph.D thesis [158]. To-
gether this seems to cover almost all algorithms in this book, including parsing by
intersection. Pure bottom-up parsers without a top-down component are rare and not
very powerful.
So in the theoretical future of parsing we see considerable simplification through
unification of algorithms; the role that parsing by intersection can play in this is not
clear. The simplification does not seem to extend to formal languages: it is still as
difficult to prove the intuitively obvious fact that all LL(1) grammars are LR(1) as it
was 35 years ago.
The practical future of parsing may lie in advanced pattern recognition, in addi-
tion to its traditional tasks; the practical contributions of parsing by intersection are
again not clear.
Amsterdam, Amstelveen
June 2007
Dick Grune
Ceriel J.H. Jacobs
Acknowledgments
We thank Manuel E. Bermudez, Stuart Broad, Peter Bumbulis, Salvador Cavadini,
Carl Cerecke, Julia Dain, Akim Demaille, Matthew Estes, Wan Fokkink, Brian Ford,
Richard Frost, Clemens Grabmayer, Robert Grimm, Karin Harbusch, Stephen Horne,
Jaco Imthorn, Quinn Tyler Jackson, Adrian Johnstone, Michiel Koens, Jaroslav Král,
Olivier Lecarme, Lillian Lee, Olivier Lefevre, Joop Leo, JianHua Li, Neil Mitchell,
Peter Pepper, Wim Pijls, José F. Quesada, Kees van Reeuwijk, Walter L. Ruzzo,
Lothar Schmitz, Sylvain Schmitz, Thomas Schoebel-Theuer, Klaas Sikkel, Michael
Sperberg-McQueen, Michal Žemliˇcka, Hans Åberg, and many others, for helpful cor-
respondence, comments on and errata to the First Edition, and support for the Second
Edition. In particular we want to thank Kees van Reeuwijk and Sylvain Schmitz for
their extensive “beta reading”, which greatly helped the book — and us.
We thank the Faculteit Exacte Wetenschappen of the Vrije Universiteit for the
use of their equipment.
In a wider sense, we extend our thanks to the close to 1500 authors listed in the
(Web)Authors Index, who have been so kind as to invent scores of clever and elegant
algorithms and techniques for us to exhibit. Every page of this book leans on them.