Project Description
This project extracts bilingual sentence pairs from text. Bilingual sentence extraction is a basic building block that is widely used in natural language processing (NLP) tasks such as machine translation, search engines, and language learning.

The solution contains three projects. MatchBilingualSent is the core algorithm project for bilingual text matching. CrawlBilingualSentPairFromWeb and SplitSentPair are two demo projects that show how to use MatchBilingualSent to solve real problems. Currently, the project supports Chinese and English. To support another language, that language's word breaker and sentence-splitting rules must be provided.

CrawlBilingualSentPairFromWeb crawls and parses web pages and automatically extracts bilingual sentence pairs from them. When the program starts, the main logic is as follows:
1. Load the seed urls and the constraint rule list from a file, and push them into the crawling-queue.
2. Pop a url and its rule from the crawling-queue. If the crawling-queue is empty, exit.
3. Check whether the url is in the crawled-list. If yes, go to 2; otherwise go to 4.
4. Download and parse the web page at the url. All hyperlinks in the page that match the rule are pushed into the crawling-queue.
5. Extract bilingual sentence pairs from the parsed page.
6. Save the page's url into the crawled-list and go to 2.
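
The six steps above can be sketched in a few lines of Python. This is only an illustration of the loop, not the project's actual code: the names `crawl`, `fetch`, `extract_pairs` and `max_pages` are hypothetical, links are found with a simple regular expression instead of a real HTML parser, and the rule is assumed to be applied with `re.search` to the absolute link.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive href matcher; a real crawler would use an HTML parser (assumption).
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seeds, fetch, extract_pairs, max_pages=100):
    """seeds: iterable of (url, rule) tuples.
    fetch(url) returns the page text; extract_pairs(html, url) returns a list.
    max_pages is a safety limit added for this sketch only."""
    queue = deque(seeds)                 # step 1: crawling-queue
    crawled = set()                      # crawled-list
    pairs = []
    while queue and len(crawled) < max_pages:
        url, rule = queue.popleft()      # step 2: pop a url and its rule
        if url in crawled:               # step 3: skip pages already visited
            continue
        html = fetch(url)                # step 4: download the page
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)    # resolve relative links
            if re.search(rule, link):    # keep only links matching the rule
                queue.append((link, rule))
        pairs.extend(extract_pairs(html, url))  # step 5
        crawled.add(url)                 # step 6
    return pairs, crawled
```

Passing `fetch` and `extract_pairs` as parameters keeps the loop independent of how pages are downloaded and of the matching algorithm, so the sketch can be tried with an in-memory dictionary of pages before any network access is involved.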

Seed urls and constraint rule example

Each example contains two rows. The first row is the seed url and the second row is the constraint rule. The constraint rule is a regular expression that describes which urls should be extracted and saved into the crawling-queue.
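
A minimal sketch of reading such a file, assuming that blank lines can be ignored and that every seed url is immediately followed by its rule. The function name `parse_seeds` and the url and rule in the usage lines are invented for illustration and do not come from the project.

```python
import re

def parse_seeds(text):
    """Return a list of (seed_url, compiled_rule) tuples from two-row entries."""
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    if len(rows) % 2 != 0:
        raise ValueError("expected a rule row after every seed url row")
    seeds = []
    for i in range(0, len(rows), 2):
        url, rule = rows[i], rows[i + 1]
        seeds.append((url, re.compile(rule)))  # raises re.error on a bad rule
    return seeds

# Hypothetical entry: crawl only .html pages under /news/ of one site.
example = "http://example.com/news/\n^http://example\\.com/news/.*\\.html$\n"
seeds = parse_seeds(example)
```

Compiling each rule while loading means a malformed regular expression is reported before crawling starts rather than in the middle of a run.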

Extracted bilingual sentence pairs example

Though her choice of career may no longer be that of a conventional MBA graduate , her business strategy certainly is .

She spends considerable time fundraising across the UK , but is wary of becoming what she describes as " donor-led " .

These days she dedicates herself to her NGO , which , if she is successful , might yet become a competitive recruiter of MBA graduates .

Each bilingual sentence pair occupies four rows, in the following format:
[sentence in language 1]
[sentence in language 2]
[confidence score]
[source url]
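
The four-row format can be read back with a short parser. The sketch below assumes that blank lines between records can be skipped and that the confidence score is a decimal number. The names `SentencePair` and `parse_pairs` are hypothetical, and the record in the usage lines is made up.

```python
from dataclasses import dataclass

@dataclass
class SentencePair:
    sentence1: str   # sentence in language 1
    sentence2: str   # sentence in language 2
    score: float     # confidence score (assumed to be numeric)
    url: str         # source url

def parse_pairs(text):
    """Group non-blank rows into records of four rows each."""
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("each sentence pair must span exactly four rows")
    return [
        SentencePair(rows[i], rows[i + 1], float(rows[i + 2]), rows[i + 3])
        for i in range(0, len(rows), 4)
    ]

# Made-up record for illustration only.
sample = "Hello , world .\n你好 , 世界 。\n0.87\nhttp://example.com/page.html\n"
records = parse_pairs(sample)
```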

SplitSentPair is another simple tool; it checks whether two sentences in different languages form a bilingual pair.


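CrawlBilingualSentPairFromWeb: this module is based on the algorithm in the MatchBilingualSent module. It extracts bilingual sentence pairs from a given set of web pages, then collects further web pages according to the constraint rules and extracts sentence pairs from those as well, traversing and crawling the web in this way. The tool can be used to mine example sentences within a site.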

Last edited May 10, 2012 at 3:23 AM by monkeyfu, version 4