This project extracts bilingual sentence pairs from text. As a basic building block, bilingual sentence extraction is used in many natural language processing (NLP) tasks, such as machine translation, search engines, and language learning.
The solution contains three projects. MatchBilingualSent is the core algorithm project for bilingual text matching.
CrawlBilingualSentPairFromWeb and SplitSentPair are two demo projects that show how to use MatchBilingualSent on real problems. Currently, the project supports Chinese and English. To support another language,
a word breaker and sentence-splitting rules for that language must be provided.
CrawlBilingualSentPairFromWeb crawls and parses web pages and automatically extracts bilingual sentence pairs from them. When the program starts, the main logic is as follows:
1. Load the seed URLs and their constraint rules from a file and push them into crawling-queue.
2. Pop a URL and its rule from crawling-queue. If crawling-queue is empty, exit.
3. Check whether the URL is in crawled-list. If yes, go to step 2; otherwise go to step 4.
4. Download and parse the web page at the URL. All hyperlinks in the page that match the rule are pushed into crawling-queue.
5. Extract bilingual sentence pairs from the parsed page.
6. Save the page's URL into crawled-list and go to step 2.
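The six steps above can be sketched in Python as follows. This is only an illustration of the control flow, not the project's actual code: the function name `crawl` and the two callbacks `fetch_links` (download and parse a page) and `extract_pairs` (run the sentence matching) are hypothetical stand-ins, and the sketch assumes that new links inherit the rule of the page they were found on and that a rule matches anywhere in a URL.

```python
import re
from collections import deque


def crawl(seeds, fetch_links, extract_pairs):
    """Run the crawl loop described in steps 1-6.

    seeds         -- list of (url, rule) tuples; rule is a regex string
    fetch_links   -- hypothetical callback: url -> (page_text, list_of_links)
    extract_pairs -- hypothetical callback: page_text -> list of sentence pairs
    """
    queue = deque(seeds)           # step 1: seeds go into crawling-queue
    crawled = set()                # crawled-list
    pairs = []

    while queue:                   # step 2: exit when the queue is empty
        url, rule = queue.popleft()
        if url in crawled:         # step 3: skip URLs already processed
            continue
        page, links = fetch_links(url)      # step 4: download and parse
        for link in links:
            # Assumption: the rule is applied with re.search, so anchors
            # such as ^ and $ must be written explicitly in the rule.
            if re.search(rule, link):
                queue.append((link, rule))
        pairs.extend(extract_pairs(page))   # step 5: extract sentence pairs
        crawled.add(url)                    # step 6: remember the URL
    return pairs, crawled
```

Passing the download and extraction steps in as callbacks keeps the loop testable without network access; a real run would supply functions that fetch pages over HTTP and call the matching algorithm.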
Seed URLs and constraint rule example
As the examples above show, each example contains two rows. The first row is the seed URL and the second row is its constraint rule. The constraint rule is a regular expression that describes which URLs should be extracted and pushed into crawling-queue.
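To make the two-row layout concrete, here is a minimal Python sketch that reads seed URL and rule rows in alternating order and uses the rule to filter links. The helper names `load_seeds` and `matching_links`, the `example.com` URLs, and the rule itself are invented for illustration; the project's real seed file and regular-expression dialect may differ.

```python
import re


def load_seeds(lines):
    """Pair up rows: odd rows are seed URLs, even rows are their rules."""
    rows = [ln.strip() for ln in lines if ln.strip()]  # drop blank rows
    if len(rows) % 2:
        raise ValueError("expected alternating seed URL / rule rows")
    return [(rows[i], re.compile(rows[i + 1])) for i in range(0, len(rows), 2)]


def matching_links(rule, links):
    """Keep only the links that the constraint rule accepts."""
    return [link for link in links if rule.search(link)]


# Hypothetical seed file content: one URL row followed by one rule row.
seed_rows = [
    "http://example.com/news/\n",
    r"^http://example\.com/news/\d+\.html$" + "\n",
]
seeds = load_seeds(seed_rows)
links = ["http://example.com/news/12.html", "http://example.com/about.html"]
kept = matching_links(seeds[0][1], links)  # only the news article URL remains
```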
Extracted bilingual sentence pairs example
Though her choice of career may no longer be that of a conventional MBA graduate , her business strategy certainly is .
She spends considerable time fundraising across the UK , but is wary of becoming what she describes as " donor-led " .
These days she dedicates herself to her NGO , which , if she is successful , might yet become a competitive recruiter of MBA graduates .
Each bilingual sentence pair includes four rows, in the following format:
[sentence in language 1]
[sentence in language 2]
SplitSentPair is another simple tool that checks whether two sentences in different languages form a bilingual pair.