Project Description
This project extracts bilingual sentence pairs from text. Bilingual sentence alignment is a basic building block for many natural language processing (NLP) tasks, such as machine translation, search engines, and language learning.

The solution contains three projects. MatchBilingualSent is the core algorithm project for bilingual text matching. CrawlBilingualSentPairFromWeb and SplitSentPair are two demo projects that show how to use MatchBilingualSent on real problems. Currently, the project supports Chinese and English. To support another language, a word breaker and sentence-splitting rules for that language must be provided.

CrawlBilingualSentPairFromWeb crawls and parses web pages and automatically extracts bilingual sentence pairs from them. When the program starts, the main logic is as follows:
1. Load the seed URLs and their constraint rules from a file and push them into the crawling queue.
2. Pop a URL and its rule from the crawling queue. If the crawling queue is empty, exit.
3. Check whether the URL is in the crawled list. If yes, go to step 2; otherwise go to step 4.
4. Download and parse the web page at the URL. All hyperlinks in the page that match the rule are pushed into the crawling queue.
5. Extract bilingual sentence pairs from the parsed page.
6. Save the page's URL into the crawled list and go to step 2.
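
The six steps above can be sketched in Python as follows. This is only an illustration of the control flow, not the project's actual code: the names crawl, fetch and extract_pairs are hypothetical. fetch is assumed to return the page text together with the hyperlinks found in it, and extract_pairs stands in for the call into MatchBilingualSent. The sketch also assumes that a discovered link inherits the rule of the page it was found on.

```python
import re
from collections import deque


def crawl(seeds, fetch, extract_pairs):
    """Run the crawl loop described in steps 1-6.

    seeds:         iterable of (url, rule) tuples, rule being a regex string
    fetch:         hypothetical callable, url -> (page_text, list_of_links)
    extract_pairs: hypothetical callable, page_text -> list of sentence pairs
    """
    queue = deque(seeds)              # step 1: seed URLs and rules
    crawled = set()                   # the crawled list
    results = []

    while queue:                      # step 2: exit when the queue is empty
        url, rule = queue.popleft()
        if url in crawled:            # step 3: skip pages already crawled
            continue

        text, links = fetch(url)      # step 4: download and parse the page
        pattern = re.compile(rule)
        for link in links:
            if pattern.match(link):   # only links matching the rule are queued
                queue.append((link, rule))

        for pair in extract_pairs(text):   # step 5: extract sentence pairs
            results.append((pair, url))

        crawled.add(url)              # step 6: remember the page
    return results, crawled
```

Passing fetch and extract_pairs in as arguments keeps the loop independent of any particular HTTP or HTML library, so it can be exercised with an in-memory dictionary of pages.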

Seed URLs and constraint rules example

http://www.ftchinese.com/
^http:\/\/www\.ftchinese\.com\/story\/.+?$
http://article.yeeyan.org/
^http:\/\/article\.yeeyan\.org\/view\/\d?\/\d?$
http://www.hjenglish.com/fanyi/shuangyu/
^http:\/\/www\.hjenglish\.com\/fanyi\/.+?$
http://www.chinadaily.com.cn/language_tips/news/news_bilingual.html
^http:\/\/www\.chinadaily\.com\.cn\/language_tips\/news\/.+?$

As the examples above show, each entry contains two rows. The first row is the seed URL and the second row is the constraint rule. The constraint rule is a regular expression that describes which URLs found in crawled pages should be pushed into the crawling queue.
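
As an illustration of the two-row layout, the following Python sketch reads such a file's contents into (seed URL, compiled rule) entries. The function name parse_seed_rules is hypothetical, and skipping blank lines is an assumption; the project's own loader may behave differently.

```python
import re


def parse_seed_rules(text):
    """Turn alternating 'seed URL' / 'constraint rule' rows into a list of
    (url, compiled_regex) tuples. Blank lines are ignored (an assumption)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of rows: URL, then rule")
    return [
        (lines[i], re.compile(lines[i + 1]))
        for i in range(0, len(lines), 2)
    ]


example = r"""
http://www.ftchinese.com/
^http:\/\/www\.ftchinese\.com\/story\/.+?$
"""

for url, rule in parse_seed_rules(example):
    # The rule accepts story pages but not the seed page itself.
    print(url, bool(rule.match("http://www.ftchinese.com/story/001044221/ce")))
```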

Extracted bilingual sentence pairs example

Though her choice of career may no longer be that of a conventional MBA graduate , her business strategy certainly is .
尽管她的职业选择可能不再是传统MBA毕业生的选择,但她的商业策略显然仍符合传统的MBA毕业生.
0.172857142857143
http://www.ftchinese.com/story/001044221/ce

She spends considerable time fundraising across the UK , but is wary of becoming what she describes as " donor-led " .
她花费大量时间在英国各地筹集资金,但对自己所称的"受捐赠者主导"心存警惕.
0.164473684210526
http://www.ftchinese.com/story/001044221/ce

These days she dedicates herself to her NGO , which , if she is successful , might yet become a competitive recruiter of MBA graduates .
这些日子她全身心地投入到自己的非政府组织上,如果她做得成功的话,那么它可能会成为富有竞争力的MBA毕业生雇主.
0.181720430107527
http://www.ftchinese.com/story/001044221/ce

Each bilingual sentence pair consists of four rows, in the following format:
[sentence in language 1]
[sentence in language 2]
[confidence score]
[source url]
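
A minimal Python sketch for reading this four-row format back in is shown below. It assumes that records are separated by blank lines, as in the example above; the SentencePair class and the parse_pairs function are hypothetical names, not part of the project.

```python
from dataclasses import dataclass


@dataclass
class SentencePair:
    source: str    # sentence in language 1
    target: str    # sentence in language 2
    score: float   # confidence score
    url: str       # source URL


def parse_pairs(text):
    """Parse blank-line-separated four-row records into SentencePair objects."""
    pairs = []
    record = []
    # The trailing "" flushes the last record even without a final blank line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            record.append(line)
            continue
        if not record:
            continue
        if len(record) != 4:
            raise ValueError(f"expected 4 rows per record, got {len(record)}")
        pairs.append(
            SentencePair(record[0], record[1], float(record[2]), record[3])
        )
        record = []
    return pairs
```

Once parsed, pairs can be filtered or sorted on the score field, for example to review the lowest-scoring pairs first.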

SplitSentPair is another simple tool. It checks whether two sentences in different languages form a pair.

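Chinese-English Bilingual Example Sentence Alignment Project

This project matches and aligns corresponding sentences in bilingual text paragraphs. As an important basic tool, it is widely used across natural language processing (NLP), for example in building corpora for statistical machine translation, crawling and processing data for bilingual search engines, and bilingual language learning. The project contains three main modules:
MatchBilingualSent is the core algorithm module. It contains the core algorithms for splitting bilingual text paragraphs into sentences and for aligning bilingual sentences.
CrawlBilingualSentPairFromWeb builds on the algorithms in MatchBilingualSent. It extracts bilingual sentence pairs from a given set of web pages and, following the constraint rules, collects further web pages and extracts sentence pairs from them, traversing the site as it crawls. The tool can be used to mine example sentences within a site.
SplitSentPair determines whether the given sentences can form a sentence pair.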

Last edited May 10, 2012 at 4:23 AM by monkeyfu, version 4