Chapter 3 Using GATE Developer
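This chapter introduces GATE Developer, which is the GATE graphical user interface. It is
analogous to systems like Mathematica for mathematicians, or Eclipse for Java programmers,
providing a convenient graphical environment for research and development of language
processing software. As well as being a powerful research tool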
in its own right, it is also very useful in conjunction with GATE Embedded (the
GATE API by which GATE functionality can be included in your own applications);
for example, GATE Developer can be used to create applications that can then be
embedded via the API. This chapter describes how to complete common tasks using
GATE Developer. It is intended to provide a good entry point to GATE functionality,
and so explanations are given assuming only basic knowledge of GATE. However,
probably the best way to learn how to use GATE Developer is to use this chapter in
conjunction with the demonstration and tutorial movies. There are specific links to
them throughout the chapter. There is also a complete new set of video tutorials here.
The basic business of GATE is annotating documents, and all the functionality we
will introduce relates to that. The core concepts are:
• the documents to be annotated,
• corpora comprising sets of documents, grouping documents for the purpose of
running uniform processes across them,
• annotations that are created on documents,
• annotation types such as `Name' or `Date',
• annotation sets comprising groups of annotations,
• processing resources that manipulate and create annotations on documents, and
• applications, comprising sequences of processing resources, that can be applied to a
document or corpus.
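The relationships between these concepts can be sketched as a small data model. The following is a minimal illustration in plain Python, not the GATE API: the class names, the `find_years` resource and the `run_application` helper are all hypothetical, and the empty string is used here as an assumed key for the default annotation set.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span of text, such as a `Name' or a `Date'."""
    type: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Text plus named annotation sets ("" stands for the default set)."""
    name: str
    text: str
    annotation_sets: dict = field(default_factory=dict)

    def annotate(self, set_name, ann_type, start, end):
        annotation = Annotation(ann_type, start, end)
        self.annotation_sets.setdefault(set_name, []).append(annotation)
        return annotation


@dataclass
class Corpus:
    """A named group of documents that are processed together."""
    name: str
    documents: list = field(default_factory=list)


def find_years(doc):
    """Toy processing resource: mark every four-digit number as a `Date'."""
    for match in re.finditer(r"\b\d{4}\b", doc.text):
        doc.annotate("", "Date", match.start(), match.end())


def run_application(resources, corpus):
    """An application applies a sequence of processing resources to a corpus."""
    for doc in corpus.documents:
        for resource in resources:
            resource(doc)


doc = Document("example", "The meeting was held in 2010.")
run_application([find_years], Corpus("demo", [doc]))
for annotation in doc.annotation_sets[""]:
    print(annotation.type, doc.text[annotation.start:annotation.end])
```

In this sketch the output of running an application is simply the annotated document, which matches the focus of this chapter.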
What is considered to be the end result of the process varies depending on the task,
but for the purposes of this chapter, output takes the form of the annotated
document/corpus. Researchers might be more interested in figures demonstrating how
successfully their application compares to a `gold standard' annotation set; Chapter 10
in Part II will cover ways of comparing annotation sets to each other and obtaining
measures such as F1. Implementers might be more interested in using the annotations
programmatically; Chapter 7, also in Part II, talks about working with annotations
from GATE Embedded. For the purposes of this chapter, however, we will focus only
on creating the annotated documents themselves, and creating GATE applications for
future use.
GATE includes a complete information extraction system that you are free to use,
called ANNIE (a Nearly-New Information Extraction System). Many users find this is
a good starting point for their own application, and so we will cover it in this chapter.
Chapter 6 talks in a lot more detail about the inner workings of ANNIE, but we aim to
get you started using ANNIE from inside of GATE Developer in this chapter.
We start the chapter with an exploration of the GATE Developer GUI, in Section 3.1.
We describe how to create documents (Section 3.2) and corpora (Section 3.3). We
talk about viewing and manually creating annotations (Section 3.4).
We then talk about loading the plugins that contain the processing resources you will
use to construct your application, in Section 3.5. We then talk about instantiating
processing resources (Section 3.6). Section 3.7 covers applications, including using
ANNIE (Section 3.7.3). Saving applications and language resources (documents and
corpora) is covered in Section 3.8. We conclude with a few assorted topics that might
be useful to the GATE Developer user, in Section 3.10.
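3.1 The GATE Developer Main Window
Figure 3.1 shows the main window of GATE Developer. When you run it, you will see five areas:
(1) at the top, the menu bar with the `File', `Options', `Tools' and `Help' menus, and icons
for the most frequently used actions;
(2) on the left, the resources tree, which starts from `GATE' and contains `Applications',
`Language Resources', etc.;
(3) in the bottom left corner, a rectangular pane, which is the small resource viewer;
(4) in the centre, the main resource viewer, containing tabs labelled `Messages' or with the
name of a resource from the resources tree;
(5) at the bottom, the messages bar.
Figure 3.1: The GATE Developer main window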
The menu and the messages bar do the usual things. Longer messages are displayed in
the messages tab in the main resource viewer area.
The resource tree and resource viewer areas work together to allow the system to
display diverse resources in various ways. The many resources integrated with GATE
can have either a small view, a large view, or both.
At any time, the main viewer can also be used to display other information, such as
messages, by clicking on the appropriate tab at the top of the main window. If an error
occurs in processing, the messages tab will flash red, and an additional popup error
message may also occur.
In the options dialogue from the Options menu you can choose if you want to link the selection in
the resources tree and the selected main view.
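3.2 Loading and Viewing Documents
If you right-click on `Language Resources' in the resources pane and select `New' then `GATE
Document', the `Parameters for the new GATE Document' window will appear (Figure 3.2). Here
you can specify the GATE document to be created. If you do not specify a name for the
document, one will be created for you. Enter the URL of your document, or use the file
browser to indicate the file you wish to use as the document source. For example, you might
use `http://gate.ac.uk', or browse to a text or XML file on disk. Click on `OK' and a GATE
document will be created from the source you specified.
Figure 3.2: Creating a new document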
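1. Viewing the created document
The document editor is contained in the central tabbed pane of GATE Developer. Double-click
on your document in the resources pane to view the document editor. At the top of the
document editor are the buttons and icons that control the display of the different views,
and the search box. Initially, you will see just the text of your document, as shown in
Figure 3.3. Click on `Annotation Sets' and `Annotations List' to view the annotation sets on
the right and the annotations list at the bottom. You will see a view similar to Figure 3.4.
In place of the annotations list, you can choose to view the annotations stack. In place of
the annotation sets, you can choose to view the co-reference editor. More information about
this functionality is given in Section 3.4.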
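2. Several options can be set from the triangle icon in the top right corner
`Save Current Layout' stores the way the different views are shown and the annotation types
highlighted in the document. Then, if you select `Restore Layout Automatically', you will
get the same views and highlighted annotation types each time you open a document.
Figure 3.3: The document editor
If you set `Read-only', you cannot edit the text in the editor, but you can still edit the
annotations. This option is useful for avoiding accidental modification of the original text.
Finally, you can choose between `Insert Append' and `Insert Prepend'. This setting matters
when you insert text at the very border of an annotation.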
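With `Insert Append' selected, text typed with the cursor at the end of an annotation becomes
part of the annotation (appended at its end), whereas text typed with the cursor at the start
of the annotation stays outside it. With `Insert Prepend' selected, the opposite happens: text
typed at the start of an annotation becomes part of the annotation (prepended at its
beginning), whereas text typed at its end stays outside it.
Take the text `This is an [annotation].', where the square brackets [] denote the boundaries
of the annotation. If we insert an `x' just before the `a' or just after the final `n' of
`annotation', we get:
With Append:
• This is an x[annotation].
• This is an [annotationx].
With Prepend:
• This is an [xannotation].
• This is an [annotation]x.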
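The border behaviour in the bracketed examples above can be modelled with character offsets. The following is a minimal sketch in plain Python, not GATE code: `insert_text` and `show` are hypothetical helpers, an annotation is reduced to a `(start, end)` pair with an exclusive end, and the annotation is assumed to be non-empty.

```python
def insert_text(text, span, offset, new_text, mode):
    """Insert new_text at offset; return the new text and the new span.

    span is (start, end) with end exclusive. mode is "append" or "prepend"
    and only makes a difference when offset is exactly on a border.
    """
    start, end = span
    if not 0 <= start < end <= len(text):
        raise ValueError("span must be a non-empty range inside the text")
    if not 0 <= offset <= len(text):
        raise ValueError("offset must lie inside the text")
    if mode not in ("append", "prepend"):
        raise ValueError("mode must be 'append' or 'prepend'")

    length = len(new_text)
    new = text[:offset] + new_text + text[offset:]
    if offset < start or (offset == start and mode == "append"):
        # The typed text lands before the annotation, which shifts right.
        return new, (start + length, end + length)
    if offset > end or (offset == end and mode == "prepend"):
        # The typed text lands after the annotation, which is unchanged.
        return new, (start, end)
    # Inside the annotation, or on a border that this mode absorbs.
    return new, (start, end + length)


def show(text, span):
    """Render the text with square brackets around the annotated span."""
    start, end = span
    return text[:start] + "[" + text[start:end] + "]" + text[end:]


text = "This is an annotation."
span = (11, 21)  # covers the word "annotation"
for mode in ("append", "prepend"):
    for offset in span:
        print(mode, show(*insert_text(text, span, offset, "x", mode)))
```

Treating the two borders symmetrically keeps the rule easy to state: each mode absorbs typed text at exactly one border and leaves it outside at the other.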
Figure 3.4: The document editor with annotation sets and annotations list
Text in a loaded document can be edited in the document viewer. The usual
platform specific cut, copy and paste keyboard shortcuts should also work, depending
on your operating system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a
magnifying glass, at the top of the document editor is for searching in the document.
To prevent the new annotation windows popping up when a piece of text is selected,
hold down the CTRL key. Alternatively, you can hide the annotation sets view by
clicking on its button at the top of the document view; this will also cause the
highlighted portions of the text to become un-highlighted.
See also Section 18.2.3 for the compound document editor.
3.3 Creating and Viewing Corpora
You can create a new corpus in a similar manner to creating a new document; simply
right-click on `Language Resources' in the resources pane, select `New' then `GATE
corpus'. A brief dialogue box will appear in which you can optionally give a name for
your corpus (if you leave this blank, a corpus name will be created for you) and
optionally add documents to the corpus from those already loaded into GATE.
There are three ways of adding documents to a corpus:
1. When creating the corpus, clicking on the icon next to the `documentsList' input
field brings up a popup window with a list of the documents already loaded into
GATE Developer. This enables the user to add any documents to the corpus.
2. Alternatively, the corpus can be loaded first, and documents added later by
double-clicking on the corpus and using the + and - icons to add documents to, or remove
them from, the corpus. Note that the documents must have been loaded into GATE Developer before
they can be added to the corpus.
3. Once loaded, the corpus can be populated by right-clicking on the corpus in the
resources pane and selecting `Populate'. With this method, documents do not have to have
been previously loaded into GATE Developer, as they will be loaded during the population
process. Selecting `Populate' opens a dialogue box in which you can specify a directory in
which GATE will search for documents. You can specify the extensions allowable; for
example, XML or TXT.
This will restrict the corpus population to only those documents with the extensions
you wish to load. You can choose whether to recurse through the directories contained
within the target directory or restrict the population to those documents contained in
the top level directory. Click on `OK' to populate your corpus. This option provides a
quick way to create a GATE Corpus from a directory of documents.
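The file selection that the populate dialogue describes (a target directory, an optional list of allowed extensions, and a choice about recursing into subdirectories) can be sketched as follows. This is an illustration in plain Python rather than GATE's own implementation: `select_documents` is a hypothetical helper, and matching extensions case-insensitively is an assumption of the sketch.

```python
from pathlib import Path


def select_documents(directory, extensions=None, recurse=False):
    """Return the files a populate step would pick up, sorted by path.

    extensions: iterable such as ["xml", "txt"]; None or empty accepts all.
    recurse: if True, descend into subdirectories of the target directory.
    """
    root = Path(directory)
    candidates = root.rglob("*") if recurse else root.iterdir()
    wanted = None
    if extensions:
        # Assumption: compare extensions without case and without the dot.
        wanted = {ext.lower().lstrip(".") for ext in extensions}
    selected = []
    for path in candidates:
        if not path.is_file():
            continue
        if wanted is None or path.suffix.lower().lstrip(".") in wanted:
            selected.append(path)
    return sorted(selected)


if __name__ == "__main__":
    # Example: list XML and TXT files in the current directory only.
    for path in select_documents(".", ["xml", "txt"], recurse=False):
        print(path)
```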
Additionally, right-clicking on a loaded document in the tree and selecting the `New
corpus with this document' option creates a new transient corpus, named `Corpus for'
followed by the document name, containing just this document.
See also the movie for creating and populating corpora.
Double click on your corpus in the resources pane to see the corpus editor, shown in
Figure 3.5. You will see a list of the documents contained within the corpus.
In the top left of the corpus editor, plus and minus buttons allow you to add
documents to the corpus from those already loaded into GATE and remove
documents from the corpus (note that removing a document from a corpus does not
remove it from GATE).
Figure 3.5: The corpus editor
Up and down arrows at the top of the view allow you to reorder the documents in the
corpus. The rightmost button in the view opens the currently selected document in a
document editor.
At the bottom, you will see that tabs entitled `Initialisation Parameters' and `Corpus
Quality Assurance' are also available in addition to the corpus editor tab you are
currently looking at. Clicking on the `Initialisation Parameters' tab allows you to view
the initialisation parameters for the corpus. The `Corpus Quality Assurance' tab
allows you to calculate agreement measures between the annotations in your corpus.
Agreement measures are discussed in depth in Chapter 10. The use of corpus quality
assurance is discussed in Section 10.3.
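3.4 Working with Annotations
In this section, we look in more detail at viewing annotations, as well as creating and
editing them manually. As discussed at the start of the chapter, the main purpose of GATE is
annotating documents. Whilst applications can annotate documents automatically, annotation
can also be done manually, e.g. by the user, or semi-automatically, by modifying and adding
new annotations. Section 3.4.5 focuses on manual annotation. In Section 3.6 we talk about
running processing resources on documents. We begin by outlining the functionality around
viewing annotations, organised by the GUI area to which the functionality pertains.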
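3.4.1 The Annotation Sets View
To view the annotation sets, click on the `Annotation Sets' button at the top of the
document editor, or use the F3 key (see Section 3.9 for more keyboard shortcuts). This
brings up the annotation sets view, which displays the available annotation sets and their
corresponding annotation types.
The annotation sets view is displayed on the right-hand side of the document editor. It is a
tree-like view with a root for each annotation set. The first annotation set in the list is
always a nameless set. This is the default annotation set. In Figure 3.4 you can see a
down arrow with no name beside it. The other annotation sets on the document shown in
Figure 3.4 are `Key' and `Original markups'. Because the document is an XML document, the
original XML markup is retained in the form of an annotation set. This annotation set is
expanded, so you can see the `TEXT', `body', `font', `html', `p', `table', `td' and `tr'
annotations (Figure 3.4).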
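To display all the annotations of one type, tick its checkbox, or select it and press the
space key. The text segments corresponding to these annotations will be highlighted in the
main text window. To delete an annotation type, use the delete key. To change its colour,
use the enter key. All these actions are also available from a context menu that you can
display by right-clicking on an annotation type.
If you keep the Shift key pressed when you open the annotation sets view, GATE Developer
will try to select any annotations that were selected in the previously viewed document (if
any); otherwise no annotations will be selected.
Once an annotation type has been selected in the annotation sets view, hovering over an
annotation in the main resource viewer, or right-clicking on it, will bring up a popup box
containing a list of the annotations associated with it, from which you can select an
annotation to view in the annotation editor (Figure 3.6).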
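3.4.2 The Annotations List View
Figure 3.6: The annotation editor
To view the list of annotations and their features, click on the `Annotations list' button
at the top of the main window, or use the F4 key. The annotations list view will appear below
the main text window. It contains only the annotations selected in the annotation sets view.
The list can be sorted in ascending or descending order on any column by clicking on the
corresponding column heading. You can also hide a column by right-clicking on the column
headings and using the context menu. Selecting rows in the table makes the corresponding
annotations blink in the document. Right-click on a selected row to delete or edit the
annotation. The Delete key is a shortcut for deleting the selected annotations.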
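The two behaviours described for the annotations list (showing only the types selected in the annotation sets view, and sorting on one column in either direction) amount to a filter followed by a sort. The following is a minimal sketch in plain Python; the `annotations_list` function and the row keys `type`, `set`, `start` and `end` are assumptions made for illustration, not GATE identifiers.

```python
def annotations_list(annotations, selected_types, sort_by="start", descending=False):
    """Rows for the list view: selected types only, ordered by one column."""
    rows = [row for row in annotations if row["type"] in selected_types]
    return sorted(rows, key=lambda row: row[sort_by], reverse=descending)


annotations = [
    {"type": "Date", "set": "", "start": 24, "end": 28},
    {"type": "p", "set": "Original markups", "start": 0, "end": 29},
    {"type": "Name", "set": "Key", "start": 4, "end": 11},
]

# Only `Name' and `Date' are ticked in the annotation sets view.
for row in annotations_list(annotations, {"Name", "Date"}, sort_by="start"):
    print(row["type"], row["set"], row["start"], row["end"])
```

Clicking a column heading a second time corresponds to calling the function again with `descending=True`.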