一. 概念
1. 正则表达式
2. NFA
使用 bison 来解析输入文件,将输入文件的规则区中的正则表达式转换为 NFA 图
典型的 NFA 状态图可见下面的示例
3. DFA
NFA 到 DFA 的计算过程:
a. 从 NFA 图中得到每个对应 DFA 状态的 NFA 状态集合 c1(每次转换一步)
b. 求 c1 的 epsilon 闭包,得到一个新的 NFA 状态集合 c2
c. 根据 c2 求出一个对应的 DFA 状态,对应的节点集合为 c3
重复步骤 a-c,直至所有的转换都已完成(已无新的 NFA 状态集合),整个构造过程是动态的,
NFA 状态集合是动态计算得到,一旦有新的 NFA 状态集合求出,对应的 DFA 状态也对应增加
4. 等价类
等价类(Equivalence Classes)指的是将输入字符根据规则需要分类,例如下面的例子 EC 总数为
8;而 meta-等价类(Meta-Equivalence Classes)则是用于模型机的(template),是一种更抽象的
分类,如下面的例子 meta-EC 总数为 3
5. 转换表和转换算法
a. 转换矩阵
根据上面 DFA 计算结果以及所有的等价类,求出从一个 DFA 状态转换到另一 DFA 状态的转换
矩阵,两个转换状态的转换边(转换条件)为 EC
DFA 状态集合和转换矩阵可见下面的示例
b. Template 和 proto
Template 和 proto 用于减少转换表项的空间,及加速查找和转换过程;
两种表项均为双向链表
Template 和 proto 表项可见下面的示例
c. 四个一维数组
1. def
2. base
3.
chk
4. nxt
d. 快速转换算法
/* mk1tbl - create table entries for a state (or state fragment) which
*
*/
has only one out-transition
void mk1tbl( state, sym, onenxt, onedef )
int state, sym, onenxt, onedef;
{
if ( firstfree < sym )
firstfree = sym;
while ( chk[firstfree] != 0 )
if ( ++firstfree >= current_max_xpairs )
expand_nxt_chk();
base[state] = firstfree - sym;
def[state] = onedef;
chk[firstfree] = state;
nxt[firstfree] = onenxt;
if ( firstfree > tblend )
{
tblend = firstfree++;
if ( firstfree >= current_max_xpairs )
expand_nxt_chk();
}
}
或者:
base[statenum] = tblbase;
def[statenum] = deflink;
for (i = minec; i <= maxec; ++i)
if (state[i] != SAME_TRANS)
if (state[i] != 0 || deflink != JAMSTATE) {
nxt[tblbase + i] = state[i];
chk[tblbase + i] = statenum;
}
if (baseaddr == firstfree)
/* Find next free slot in tables. */
for (++firstfree; chk[firstfree] != 0; ++firstfree) ;
tblend = MAX (tblend, tbllast);
从以上的算法中可以看到 firstfree= base[state]+sym(sym 代表 EC 或者 meta-EC),因此
chk[firstfree] = chk[base[state]+sym] = state 如果成立,则表示存在这个转换表项,
就将 nxt[firstfree] = nxt[base[state]+sym] = onenxt 的值赋给 next(下一表项)
printf("KEY: %s\n",yytext);
printf("ID: %s\n",yytext);
printf("NUM: %s\n", yytext);
printf("OPER: %s\n",yytext);
printf("OPER: %s\n",yytext);
printf("ANNOTATION: %s\n",yytext);
二. 一个例子:
1. Test.l 文件内容:
%%
if
[a-z][a-z0-9]*
[0-9]+
"/"
"*"
"/*"(.)*"*/"
%%
void main(int argc, char** argv)
{
yylex();
}
int yywrap()
{
return 1;
}
2. 输出的各类表项:
G:\meterial\compiler\726lexYacc\flex-2.5.4a-1-src\src\flex\2.5.4a\flex-2.5.4a>flex.exe -T test.l
a.正则表达式
%%
1
2
3
4
if
[a-z][a-z0-9]*
[0-9]+
"\/"
"\*"
"\/\*"(.)*"\*\/"
End Marker
5
6
7
加上默认的“.”规则总共有 7 个规则
********** beginning dump of nfa with start state 38
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
state #
0
0
0
0
0 [1]
3
0
0
10
0 [2]
10
7
0
0 [3]
13
0
0
0 [4]
16
0
0
0 [5]
20
0
0
0
0
29
0
29
0
0
0
0 [6]
24
0
0 [7]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
257:
257:
105:
102:
257:
257:
-1:
-2:
257:
257:
257:
257:
-3:
257:
257:
257:
47:
257:
257:
257:
42:
257:
257:
257:
47:
42:
-4:
257:
257:
257:
257:
42:
47:
257:
257:
-5:
257:
0,
0,
4,
5,
0,
1,
11,
9,
8,
0,
8,
6,
14,
13,
12,
17,
18,
0,
15,
21,
22,
0,
19,
25,
26,
30,
28,
27,
31,
27,
32,
33,
34,
0,
23,
37,
0,
38
state #
********** end of dump
257:
35,
36
由此得到的 NFA 图为:
b. DFA Dump:
state # 1:
state # 2:
state # 3:
state # 4:
state # 5:
state # 6:
state # 7:
state # 8:
1
2
3
4
5
6
7
8
1
2
3
4
5
6
7
8
3
5
5
6
7
8
state # 9:
5
6
7
8
state # 10:
1
3
4
5
6
7
8
state # 11:
4
4
5
6
7
8
8
9
4
4
5
6
7
8
8
9
10
11
12
12
12
12
12
12
13
12
14
15
14
14
14
14
14
5
state # 12:
5
6
7
8
state # 13:
5
6
7
8
state # 14:
1
3
4
5
6
7
8
state # 15:
1
3
4
5
6
7
8
state # 16:
1
3
4
5
6
7
8
11
12
12
12
12
12
12
12
12
14
15
14
14
14
14
14
14
15
16
14
14
14
14
14
15
14
14
14
14
14
得到的 DFA 矩阵为:
从上图可以看到总共有 16 个 DFA 状态,每个状态是一个对应 NFA 状态的 epsilon 闭包
out-transitions: [ \000-\t
jam-transitions: EOF [ \n ]
\v-\377 ]
c.可接受的状态
state # 3 accepts: [8]
state # 4 accepts: [7]
state # 5 accepts: [5]
state # 6 accepts: [4]
state # 7 accepts: [3]
state # 8 accepts: [2]
state # 9 accepts: [2]
state # 11 accepts: [3]
state # 12 accepts: [2]
state # 13 accepts: [1]
state # 16 accepts: [6]
d.Equivalence Classes:
\000 = -1 ' ' = 1
\001 = 1
\002 = 1
\003 = 1
! = 1
" = 1
# = 1
@ = 1
A = 1
B = 1
C = 1
` = 1 \200 = 1 \240 = 1 \300 = 1 \340 = 1
a = 6 \201 = 1 \241 = 1
\301 = 1 \341 = 1
b = 6 \202 = 1 \242 = 1 \302 = 1 \342 = 1
\203 = 1 \243 = 1 \303 = 1 \343 = 1
c = 6