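Article ID: 1671-2021(2008)02-0333-04
Chinese Document Layout Analysis and Reconstruction
ZHONG Hui, SUN Shilan, LIU Qian
(School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning 110168, China)
Abstract: Objective: To solve the problem of automatically extracting and restoring the layout information of Chinese documents when paper documents are digitized. Methods: Connected components are searched and, based on their size features, non-text regions are extracted first. The extracted non-text regions are classified as tables, lines, or images according to features such as the projection histogram, the aspect ratio, and the black-to-white pixel ratio. Text regions are segmented with an improved projection-based method of vertical and horizontal cuts. The XML document format is used to describe, organize, and restore the data and style of the original layout. Reconstruction generates a general-purpose electronic document that keeps the original layout format, so that the original is reproduced. Results: Experiments on a large number of book sample pages and on complex sample pages containing tables, images, and mixed horizontal and vertical text show that the improved layout analysis method segments accurately and quickly, and that the XML-based reconstruction method reconstructs the document layout fairly precisely. Conclusion: Threshold parameters derived from statistical features are used in the improved layout analysis method, which improves the adaptability of the system. The method works well on fairly regular documents and is basically applicable to complex layouts with some manual intervention.
Keywords: layout analysis; layout understanding; layout reconstruction; XML
CLC number: TP391 Document code: A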
Research on Chinese Document Layout Analysis and Reconstruction
ZHONG Hui, SUN Shilan, LIU Qian
(School of Information & Control Engineering, Shenyang Jianzhu University, Shenyang, China, 110168)
Abstract: We aim to automatically extract and restore the layout of Chinese documents when paper documents are converted into electronic format. First, non-text regions are extracted by searching for connected components and filtering them by size. The extracted non-text regions are then classified as tables, lines, or images according to their projection histograms, aspect ratios, and black-to-white pixel ratios. Text regions are segmented with an improved projection-based method of vertical and horizontal cuts. The data and style of the original layout are described, organized, and restored in an XML document format. Reconstruction generates a general-purpose electronic document that keeps the format of the original layout, so that the original is reproduced. The results show that the improved layout analysis segments accurately and quickly, and that the XML-based reconstruction method reconstructs the document layout fairly precisely. The threshold parameters, derived from statistical features, are used in the improved layout analysis method and improve the adaptability of the system. The method works well on fairly regular documents and can be applied to complex layouts with some manual intervention.
Key words: layout analysis; layout understanding; layout reconstruction; XML
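To make the pipeline in the abstract concrete, the following is a minimal sketch of a plain recursive projection cut followed by an XML layout description. It is not the authors' improved method: the connected-component step that removes non-text regions is omitted, and the names `xy_cut`, `layout_to_xml`, the `min_gap` threshold, and the `page`/`block` element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def _runs(profile, min_gap):
    """Return (start, end) runs of ink in a projection profile (end exclusive).
    Blank stretches shorter than min_gap do not split a run."""
    runs, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs

def xy_cut(img, x0, y0, x1, y1, min_gap=2, horizontal=True, tried_other=False):
    """Recursively split region [x0, x1) x [y0, y1) of a binary image (1 = ink).
    Cuts alternate between row and column projections; a region that cannot be
    split in either direction is returned as one box (x0, y0, x1, y1)."""
    if horizontal:   # project onto the y axis: ink count per row
        profile = [sum(img[y][x0:x1]) for y in range(y0, y1)]
    else:            # project onto the x axis: ink count per column
        profile = [sum(img[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    runs = _runs(profile, min_gap)
    if not runs:
        return []    # blank region
    subs = [(x0, y0 + a, x1, y0 + b) if horizontal else (x0 + a, y0, x0 + b, y1)
            for a, b in runs]
    if len(subs) == 1:
        if tried_other:          # no cut possible in either direction
            return subs
        return xy_cut(img, *subs[0], min_gap, not horizontal, True)
    boxes = []
    for sub in subs:
        boxes += xy_cut(img, *sub, min_gap, not horizontal, False)
    return boxes

def layout_to_xml(boxes):
    """Describe the boxes in a small, hypothetical XML layout format."""
    page = ET.Element("page")
    for bx0, by0, bx1, by1 in boxes:
        ET.SubElement(page, "block", x=str(bx0), y=str(by0),
                      width=str(bx1 - bx0), height=str(by1 - by0))
    return ET.tostring(page, encoding="unicode")

# Toy page: one block on top, two side-by-side blocks below.
img = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 1, 1, 1],
]
boxes = xy_cut(img, 0, 0, 8, 6)
xml_text = layout_to_xml(boxes)
print(xml_text)
```

In this sketch the fixed `min_gap` stands in for the statistically derived thresholds that the abstract mentions. A real system would estimate it from the page, for example from typical line and character spacing.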