GATE 8 User Guide.pdf

I GATE Basics
Introduction
How to Use this Text
Context
Overview
Developing and Deploying Language Processing Facilities
Built-In Components
Additional Facilities in GATE Developer/Embedded
An Example
Some Evaluations
Recent Changes
Next Release
Version 8.0 (May 2014)
Further Reading
Installing and Running GATE
Downloading GATE
Installing and Running GATE
The Easy Way
The Hard Way (1)
The Hard Way (2): Subversion
Running GATE Developer on Unix/Linux
Using System Properties with GATE
Changing GATE's launch configuration
Configuring GATE
Building GATE
Using GATE with Maven/Ivy
Uninstalling GATE
Troubleshooting
Using GATE Developer
The GATE Developer Main Window
Loading and Viewing Documents
Creating and Viewing Corpora
Working with Annotations
The Annotation Sets View
The Annotations List View
The Annotations Stack View
The Co-reference Editor
Creating and Editing Annotations
Schema-Driven Editing
Printing Text with Annotations
Using CREOLE Plugins
Installing and updating CREOLE Plugins
Loading and Using Processing Resources
Creating and Running an Application
Running an Application on a Datastore
Running PRs Conditionally on Document Features
Doing Information Extraction with ANNIE
Modifying ANNIE
Saving Applications and Language Resources
Saving Documents to File
Saving and Restoring LRs in Datastores
Saving Application States to a File
Saving an Application with its Resources (e.g. GATECloud.net)
Keyboard Shortcuts
Miscellaneous
Stopping GATE from Restoring Developer Sessions/Options
Working with Unicode
CREOLE: the GATE Component Model
The Web and CREOLE
The GATE Framework
The Lifecycle of a CREOLE Resource
Processing Resources and Applications
Language Resources and Datastores
Built-in CREOLE Resources
CREOLE Resource Configuration
Configuration with XML
Configuring Resources using Annotations
Mixing the Configuration Styles
Loading Third-Party Libraries using Apache Ivy
Tools: How to Add Utilities to GATE Developer
Putting Your Tools in a Sub-Menu
Adding Tools To Existing Resource Types
Language Resources: Corpora, Documents and Annotations
Features: Simple Attribute/Value Data
Corpora: Sets of Documents plus Features
Documents: Content plus Annotations plus Features
Annotations: Directed Acyclic Graphs
Annotation Schemas
Examples of Annotated Documents
Creating, Viewing and Editing Diverse Annotation Types
Document Formats
Detecting the Right Reader
XML
HTML
SGML
Plain text
RTF
Email
PDF Files and Office Documents
UIMA CAS Documents
CoNLL/IOB Documents
XML Input/Output
ANNIE: a Nearly-New Information Extraction System
Document Reset
Tokeniser
Tokeniser Rules
Token Types
English Tokeniser
Gazetteer
Sentence Splitter
RegEx Sentence Splitter
Part of Speech Tagger
Semantic Tagger
Orthographic Coreference (OrthoMatcher)
GATE Interface
Resources
Processing
Pronominal Coreference
Quoted Speech Submodule
Pleonastic It Submodule
Pronominal Resolution Submodule
Detailed Description of the Algorithm
A Walk-Through Example
Step 1 - Tokenisation
Step 2 - List Lookup
Step 3 - Grammar Rules
II GATE for Advanced Users
GATE Embedded
Quick Start with GATE Embedded
Resource Management in GATE Embedded
Using CREOLE Plugins
Language Resources
GATE Documents
Feature Maps
Annotation Sets
Annotations
GATE Corpora
Processing Resources
Controllers
Modelling Relations between Annotations
Duplicating a Resource
Sharable properties
Persistent Applications
Ontologies
Creating a New Annotation Schema
Creating a New CREOLE Resource
Adding Support for a New Document Format
Using GATE Embedded in a Multithreaded Environment
Using GATE Embedded within a Spring Application
Duplication in Spring
Spring pooling
Further reading
Using GATE Embedded within a Tomcat Web Application
Recommended Directory Structure
Configuration Files
Initialization Code
Groovy for GATE
Groovy Scripting Console for GATE
Groovy scripting PR
The Scriptable Controller
Utility methods
Saving Config Data to gate.xml
Annotation merging through the API
Using Resource Helpers to Extend the API
JAPE: Regular Expressions over Annotations
The Left-Hand Side
Matching Entire Annotation Types
Using Features and Values
Using Meta-Properties
Building complex patterns from simple patterns
Matching a Simple Text String
Using Templates
Multiple Pattern/Action Pairs
LHS Macros
Multi-Constraint Statements
Using Context
Negation
Escaping Special Characters
LHS Operators in Detail
Equality Operators
Comparison Operators
Regular Expression Operators
Contextual Operators
Custom Operators
The Right-Hand Side
A Simple Example
Copying Feature Values from the LHS to the RHS
Optional or Empty Labels
RHS Macros
Use of Priority
Using Phases Sequentially
Using Java Code on the RHS
A More Complex Example
Adding a Feature to the Document
Finding the Tokens of a Matched Annotation
Using Named Blocks
Java RHS Overview
Optimising for Speed
Ontology Aware Grammar Transduction
Serializing JAPE Transducer
How to Serialize?
How to Use the Serialized Grammar File?
Notes for Montreal Transducer Users
JAPE Plus
ANNIC: ANNotations-In-Context
Instantiating SSD
Search GUI
Overview
Syntax of Queries
Top Section
Central Section
Bottom Section
Using SSD from GATE Embedded
How to instantiate a searchabledatastore
How to search in this datastore
Performance Evaluation of Language Analysers
Metrics for Evaluation in Information Extraction
Annotation Relations
Cohen's Kappa
Precision, Recall, F-Measure
Macro and Micro Averaging
The Annotation Diff Tool
Performing Evaluation with the Annotation Diff Tool
Creating a Gold Standard with the Annotation Diff Tool
Corpus Quality Assurance
Description of the interface
Step by step usage
Details of the Corpus statistics table
Details of the Document statistics table
GATE Embedded API for the measures
Quality Assurance PR
Corpus Benchmark Tool
Preparing the Corpora for Use
Defining Properties
Running the Tool
The Results
A Plugin Computing Inter-Annotator Agreement (IAA)
IAA for Classification
IAA For Named Entity Annotation
The BDM-Based IAA Scores
A Plugin Computing the BDM Scores for an Ontology
Quality Assurance Summariser for Teamware
Profiling Processing Resources
Overview
Features
Limitations
Graphical User Interface
Command Line Interface
Application Programming Interface
Log4j.properties
Benchmark log format
Enabling profiling
Reporting tool
Developing GATE
Reporting Bugs and Requesting Features
Contributing Patches
Creating New Plugins
What to Call your Plugin
Writing a New PR
Writing a New VR
Writing a 'Ready Made' Application
Distributing Your New Plugins
Updating this User Guide
Building the User Guide
Making Changes to the User Guide
III CREOLE Plugins
Gazetteers
Introduction to Gazetteers
ANNIE Gazetteer
Creating and Modifying Gazetteer Lists
ANNIE Gazetteer Editor
OntoGazetteer
Gaze Ontology Gazetteer Editor
The Gaze Gazetteer List and Mapping Editor
The Gaze Ontology Editor
Hash Gazetteer
Prerequisites
Parameters
Flexible Gazetteer
Gazetteer List Collector
OntoRoot Gazetteer
How Does it Work?
Initialisation of OntoRoot Gazetteer
Simple steps to run OntoRoot Gazetteer
Large KB Gazetteer
Quick usage overview
Dictionary setup
Additional dictionary configuration
Dictionary for Gazetteer List Files
Processing Resource Configuration
Runtime configuration
Semantic Enrichment PR
The Shared Gazetteer for multithreaded processing
Working with Ontologies
Data Model for Ontologies
Hierarchies of Classes and Restrictions
Instances
Hierarchies of Properties
URIs
Ontology Event Model
What Happens when a Resource is Deleted?
The Ontology Plugin: Current Implementation
The OWLIMOntology Language Resource
The ConnectSesameOntology Language Resource
The CreateSesameOntology Language Resource
The OWLIM2 Backwards-Compatible Language Resource
Using Ontology Import Mappings
Using BigOWLIM
The sesameCLI command line interface
The Ontology_OWLIM2 plugin: backwards-compatible implementation
The OWLIMOntologyLR Language Resource
GATE Ontology Editor
Ontology Annotation Tool
Viewing Annotated Text
Editing Existing Annotations
Adding New Annotations
Options
Relation Annotation Tool
Description of the two views
Create new annotation and instance from text selection
Create new annotation and add label to existing instance from text selection
Create and set properties for annotation relation
Delete instance, label or property
Differences with OAT and Ontology Editor
Using the ontology API
Using the ontology API (old version)
Ontology-Aware JAPE Transducer
Annotating Text with Ontological Information
Populating Ontologies
Ontology API and Implementation Changes
Differences between the implementation plugins
Changes in the Ontology API
Non-English Language Support
Language Identification
Fingerprint Generation
French Plugin
German Plugin
Romanian Plugin
Arabic Plugin
Chinese Plugin
Chinese Word Segmentation
Hindi Plugin
Russian Plugin
Bulgarian Plugin
Domain Specific Resources
Biomedical Support
ABNER
MetaMap
GSpell biomedical spelling suggestion and correction
BADREX
MiniChem/Drug Tagger
AbGene
GENIA
Penn BioTagger
MutationFinder
NormaGene
Tools for Social Media Data
Tools for Twitter
Twitter JSON format
Low-level PRs for Tweets
Handling multi-word hashtags
The TwitIE Pipeline
Parsers
MiniPar Parser
Platform Supported
Resources
Parameters
Prerequisites
Grammatical Relationships
RASP Parser
SUPPLE Parser
Requirements
Building SUPPLE
Running the Parser in GATE
Viewing the Parse Tree
System Properties
Configuration Files
Parser and Grammar
Mapping Named Entities
Upgrading from BuChart to SUPPLE
Stanford Parser
Input Requirements
Initialization Parameters
Runtime Parameters
Machine Learning
ML Generalities
Some Definitions
GATE-Specific Interpretation of the Above Definitions
Batch Learning PR
Batch Learning PR Configuration File Settings
Case Studies for the Three Learning Types
How to Use the Batch Learning PR in GATE Developer
Output of the Batch Learning PR
Using the Batch Learning PR from the API
Machine Learning PR
The DATASET Element
The ENGINE Element
The WEKA Wrapper
The MAXENT Wrapper
The SVM Light Wrapper
Example Configuration File
Tools for Alignment Tasks
Introduction
The Tools
Compound Document
CompoundDocumentFromXml
Compound Document Editor
Composite Document
DeleteMembersPR
SwitchMembersPR
Saving as XML
Alignment Editor
Saving Files and Alignments
Section-by-Section Processing
Crowdsourcing Data with GATE
The Basics
Entity classification
Creating a classification job
Loading data into a job
Importing the results
Entity annotation
Creating an annotation job
Loading data into a job
Importing the results
Combining GATE and UIMA
Embedding a UIMA AE in GATE
Mapping File Format
The UIMA Component Descriptor
Using the AnalysisEnginePR
Embedding a GATE CorpusController in UIMA
Mapping File Format
The GATE Application Definition
Configuring the GATEApplicationAnnotator
More (CREOLE) Plugins
Verb Group Chunker
Noun Phrase Chunker
Differences from the Original
Using the Chunker
TaggerFramework
TreeTagger—Multilingual POS Tagger
GENIA and Double Quotes
Chemistry Tagger
Using the Tagger
Zemanta Semantic Annotation Service
Lupedia Semantic Annotation Service
TextRazor Annotation Service
Annotating Numbers
Numbers in Words and Numbers
Roman Numerals
Annotating Measurements
Annotating and Normalizing Dates
Snowball Based Stemmers
Algorithms
GATE Morphological Analyzer
Rule File
Flexible Exporter
Configurable Exporter
Annotation Set Transfer
Schema Enforcer
Information Retrieval in GATE
Using the IR Functionality in GATE
Using the IR API
Websphinx Web Crawler
Using the Crawler PR
Proxy configuration
WordNet in GATE
The WordNet API
Kea - Automatic Keyphrase Detection
Using the 'KEA Keyphrase Extractor' PR
Using Kea Corpora
Annotation Merging Plugin
Copying Annotations between Documents
OpenCalais Plugin
LingPipe Plugin
LingPipe Tokenizer PR
LingPipe Sentence Splitter PR
LingPipe POS Tagger PR
LingPipe NER PR
LingPipe Language Identifier PR
OpenNLP Plugin
Init parameters and models
OpenNLP PRs
Obtaining and generating models
Stanford CoreNLP
Stanford Tagger
Stanford Parser
Stanford Named Entity Recognition
Content Detection Using Boilerpipe
Inter Annotator Agreement
Schema Annotation Editor
Coref Tools Plugin
Pubmed Format
MediaWiki Format
Fast Infoset Document Format
CSV Document Support
TermRaider term extraction tools
Termbank language resources
Termbank Score Copier
The PMI bank language resource
Document Normalizer
Developer Tools
Linguistic Simplifier
IV The GATE Family: Cloud, MIMIR, Teamware
GATE Cloud
GATE Cloud services: an overview
Comparison with other systems
How to buy services
Pricing and discounts
Annotation Jobs on GATECloud.net
The Annotation Service Charges Explained
Annotation Job Execution in Detail
Running Custom Annotation Jobs on GATECloud.net
Preparing Your Application: The Basics
The GATECloud.net environment
GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
Introduction
Requirements for Multi-Role Collaborative Annotation Environments
Typical Division of Labour
Remote, Scalable Data Storage
Automatic annotation services
Workflow Support
Teamware: Architecture, Implementation, and Examples
Data Storage Service
Annotation Services
The Executive Layer
The User Interfaces
Practical Applications
GATE Mímir
Appendices
Change Log
Next Release
October 2014
September 2014
August 2014
July 2014
June 2014
May 2014
Version 8.0 (May 2014)
Major changes
Other new and improved plugins
Bug fixes and other improvements
For developers
Version 7.1 (November 2012)
New plugins
Library updates
GATE Embedded API changes
Version 7.0 (February 2012)
Major new features
Removal of deprecated functionality
Other enhancements and bug fixes
Version 6.1 (April 2011)
New CREOLE Plugins
Other new features and improvements
Version 6.0 (November 2010)
Major new features
Breaking changes
Other new features and bugfixes
Version 5.2.1 (May 2010)
Version 5.2 (April 2010)
JAPE and JAPE-related
Other Changes
Version 5.1 (December 2009)
New Features
JAPE improvements
Other improvements and bug fixes
Version 5.0 (May 2009)
Major New Features
Other New Features and Improvements
Specific Bug Fixes
Version 4.0 (July 2007)
Major New Features
Other New Features and Improvements
Bug Fixes and Optimizations
Version 3.1 (April 2006)
Major New Features
Other New Features and Improvements
Bug Fixes
January 2005
December 2004
September 2004
Version 3 Beta 1 (August 2004)
July 2004
June 2004
April 2004
March 2004
Version 2.2 – August 2003
Version 2.1 – February 2003
June 2002
Version 5.1 Plugins Name Map
Obsolete CREOLE Plugins
Ontotext JapeC Compiler
Google Plugin
Yahoo Plugin
Using the YahooPR
Gazetteer Visual Resource - GAZE
Display Modes
Linear Definition Pane
Linear Definition Toolbar
Operations on Linear Definition Nodes
Gazetteer List Pane
Mapping Definition Pane
Google Translator PR
Design Notes
Patterns
Components
Model, view, controller
Interfaces
Exception Handling
Ant Tasks for GATE
Declaring the Tasks
The packagegapp task - bundling an application with its dependencies
Introduction
Basic Usage
Handling Non-Plugin Resources
Streamlining your Plugins
Bundling Extra Resources
The expandcreoles Task - Merging Annotation-Driven Config into creole.xml
Named-Entity State Machine Patterns
Main.jape
first.jape
firstname.jape
name.jape
Person
Location
Organization
Ambiguities
Contextual information
name_post.jape
date_pre.jape
date.jape
reldate.jape
number.jape
address.jape
url.jape
identifier.jape
jobtitle.jape
final.jape
unknown.jape
name_context.jape
org_context.jape
loc_context.jape
clean.jape
Part-of-Speech Tags used in the Hepple Tagger
References
Developing Language Processing Components with GATE Version 8 (a User Guide)

For GATE version 8.1-snapshot (development builds) (built October 14, 2014)

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim Peters, et al

©The University of Sheffield, Department of Computer Science 2001-2014
https://gate.ac.uk/

This user manual is free, but please consider making a donation.

HTML version: https://gate.ac.uk/userguide

Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Ontotext, Matrixware, the Information Retrieval Facility and several EU-funded projects (TrendMiner, uComp, Arcomem, SEKT, TAO, NeOn, MediaCampaign, Musing, KnowledgeWeb, PrestoSpace, h-TechSight, and enIRaF).
Developing Language Processing Components with GATE Version 8
©2014 The University of Sheffield, Department of Computer Science

The University of Sheffield, Department of Computer Science
Regent Court
211 Portobello
Sheffield
S1 4DP
United Kingdom
https://gate.ac.uk

This work is licenced under the Creative Commons Attribution-No Derivative Licence. You are free to copy, distribute, display, and perform the work under the following conditions:

• Attribution: You must give the original author credit.
• No Derivative Works: You may not alter, transform, or build upon this work.

With the understanding that:

• Waiver: Any of the above conditions can be waived if you get permission from the copyright holder.
• Other Rights: In no way are any of the following rights affected by the license: your fair dealing or fair use rights; the author's moral rights; rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
• Notice: For any reuse or distribution, you must make clear to others the licence terms of this work.

For more information about the Creative Commons Attribution-No Derivative License, please visit this web address: http://creativecommons.org/licenses/by-nd/2.0/uk/
Brief Contents

I GATE Basics
1 Introduction
2 Installing and Running GATE
3 Using GATE Developer
4 CREOLE: the GATE Component Model
5 Language Resources: Corpora, Documents and Annotations
6 ANNIE: a Nearly-New Information Extraction System

II GATE for Advanced Users
7 GATE Embedded
8 JAPE: Regular Expressions over Annotations
9 ANNIC: ANNotations-In-Context
10 Performance Evaluation of Language Analysers
11 Profiling Processing Resources
12 Developing GATE

III CREOLE Plugins
13 Gazetteers
14 Working with Ontologies
15 Non-English Language Support
16 Domain Specific Resources
17 Tools for Social Media Data
18 Parsers
19 Machine Learning
20 Tools for Alignment Tasks
21 Crowdsourcing Data with GATE
22 Combining GATE and UIMA
23 More (CREOLE) Plugins

IV The GATE Family: Cloud, MIMIR, Teamware
24 GATE Cloud
25 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
26 GATE Mímir

Appendices
A Change Log
B Version 5.1 Plugins Name Map
C Obsolete CREOLE Plugins
D Design Notes
E Ant Tasks for GATE
F Named-Entity State Machine Patterns
G Part-of-Speech Tags used in the Hepple Tagger
References
Contents

I GATE Basics

1 Introduction
1.1 How to Use this Text
1.2 Context
1.3 Overview
1.3.1 Developing and Deploying Language Processing Facilities
1.3.2 Built-In Components
1.3.3 Additional Facilities in GATE Developer/Embedded
1.3.4 An Example
1.4 Some Evaluations
1.5 Recent Changes
1.5.1 Next Release
1.5.2 Version 8.0 (May 2014)
1.6 Further Reading

2 Installing and Running GATE
2.1 Downloading GATE
2.2 Installing and Running GATE
2.2.1 The Easy Way
2.2.2 The Hard Way (1)
2.2.3 The Hard Way (2): Subversion
2.2.4 Running GATE Developer on Unix/Linux
2.3 Using System Properties with GATE
2.4 Changing GATE's launch configuration
2.5 Configuring GATE
2.6 Building GATE
2.6.1 Using GATE with Maven/Ivy
2.7 Uninstalling GATE
2.8 Troubleshooting

3 Using GATE Developer
3.1 The GATE Developer Main Window
3.2 Loading and Viewing Documents
3.3 Creating and Viewing Corpora
3.4 Working with Annotations
3.4.1 The Annotation Sets View
3.4.2 The Annotations List View
3.4.3 The Annotations Stack View
3.4.4 The Co-reference Editor
3.4.5 Creating and Editing Annotations
3.4.6 Schema-Driven Editing
3.4.7 Printing Text with Annotations
3.5 Using CREOLE Plugins
3.6 Installing and updating CREOLE Plugins
3.7 Loading and Using Processing Resources
3.8 Creating and Running an Application
3.8.1 Running an Application on a Datastore
3.8.2 Running PRs Conditionally on Document Features
3.8.3 Doing Information Extraction with ANNIE
3.8.4 Modifying ANNIE
3.9 Saving Applications and Language Resources
3.9.1 Saving Documents to File
3.9.2 Saving and Restoring LRs in Datastores
3.9.3 Saving Application States to a File
3.9.4 Saving an Application with its Resources (e.g. GATECloud.net)
3.10 Keyboard Shortcuts
3.11 Miscellaneous
3.11.1 Stopping GATE from Restoring Developer Sessions/Options
3.11.2 Working with Unicode

4 CREOLE: the GATE Component Model
4.1 The Web and CREOLE
4.2 The GATE Framework
4.3 The Lifecycle of a CREOLE Resource
4.4 Processing Resources and Applications
4.5 Language Resources and Datastores
4.6 Built-in CREOLE Resources
4.7 CREOLE Resource Configuration
4.7.1 Configuration with XML
4.7.2 Configuring Resources using Annotations
4.7.3 Mixing the Configuration Styles
4.7.4 Loading Third-Party Libraries using Apache Ivy
4.8 Tools: How to Add Utilities to GATE Developer
4.8.1 Putting Your Tools in a Sub-Menu
4.8.2 Adding Tools To Existing Resource Types

5 Language Resources: Corpora, Documents and Annotations
5.1 Features: Simple Attribute/Value Data
5.2 Corpora: Sets of Documents plus Features
5.3 Documents: Content plus Annotations plus Features
5.4 Annotations: Directed Acyclic Graphs
5.4.1 Annotation Schemas
5.4.2 Examples of Annotated Documents
5.4.3 Creating, Viewing and Editing Diverse Annotation Types
5.5 Document Formats
5.5.1 Detecting the Right Reader
5.5.2 XML
5.5.3 HTML
5.5.4 SGML
5.5.5 Plain text
5.5.6 RTF
5.5.7 Email
5.5.8 PDF Files and Office Documents
5.5.9 UIMA CAS Documents
5.5.10 CoNLL/IOB Documents
5.6 XML Input/Output

6 ANNIE: a Nearly-New Information Extraction System
6.1 Document Reset
6.2 Tokeniser
6.2.1 Tokeniser Rules
6.2.2 Token Types
6.2.3 English Tokeniser
6.3 Gazetteer
6.4 Sentence Splitter
6.5 RegEx Sentence Splitter
6.6 Part of Speech Tagger
6.7 Semantic Tagger
6.8 Orthographic Coreference (OrthoMatcher)
6.8.1 GATE Interface
6.8.2 Resources
6.8.3 Processing
6.9 Pronominal Coreference
6.9.1 Quoted Speech Submodule
6.9.2 Pleonastic It Submodule
6.9.3 Pronominal Resolution Submodule
6.9.4 Detailed Description of the Algorithm
6.10 A Walk-Through Example
6.10.1 Step 1 - Tokenisation
6.10.2 Step 2 - List Lookup
6.10.3 Step 3 - Grammar Rules

II GATE for Advanced Users

7 GATE Embedded
7.1 Quick Start with GATE Embedded
7.2 Resource Management in GATE Embedded
7.3 Using CREOLE Plugins
7.4 Language Resources
7.4.1 GATE Documents
7.4.2 Feature Maps
7.4.3 Annotation Sets
7.4.4 Annotations
7.4.5 GATE Corpora
7.5 Processing Resources
7.6 Controllers
7.7 Modelling Relations between Annotations
7.8 Duplicating a Resource
7.8.1 Sharable properties
7.9 Persistent Applications
7.10 Ontologies
7.11 Creating a New Annotation Schema
7.12 Creating a New CREOLE Resource
7.13 Adding Support for a New Document Format
7.14 Using GATE Embedded in a Multithreaded Environment
7.15 Using GATE Embedded within a Spring Application
7.15.1 Duplication in Spring
7.15.2 Spring pooling
7.15.3 Further reading
7.16 Using GATE Embedded within a Tomcat Web Application
7.16.1 Recommended Directory Structure
7.16.2 Configuration Files
7.16.3 Initialization Code
7.17 Groovy for GATE
7.17.1 Groovy Scripting Console for GATE
7.17.2 Groovy scripting PR
7.17.3 The Scriptable Controller
7.17.4 Utility methods
7.18 Saving Config Data to gate.xml
7.19 Annotation merging through the API
7.20 Using Resource Helpers to Extend the API

8 JAPE: Regular Expressions over Annotations
8.1 The Left-Hand Side
8.1.1 Matching Entire Annotation Types
8.1.2 Using Features and Values