Projects tagged ‘chinese’ and ‘cjk’


[12 total ]

1 Users

Eclectus is a small Han character dictionary designed especially for learners of languages written with Chinese characters, such as Mandarin Chinese or Japanese.
Created 4 months ago.

1 Users

Cjklib provides language routines related to Han characters (characters based on Chinese characters, named Hanzi, Kanji, Hanja and chu Han respectively) used in writing Chinese, Japanese, infrequently Korean and formerly Vietnamese. Functionality is included for character pronunciations, radicals, glyph components, stroke decomposition and variant information. Cjklib is implemented in Python.
Created 12 months ago.

0 Users
 

pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is an implementation of the MMSEG Chinese word segmentation algorithm in C++ with a Ruby interface.
Created about 1 year ago.

0 Users

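It is a little tool to typeset CJK characters vertically. Recently it has become popular online to convert messages that internet censorship prevents from being sent normally into a vertical layout before sending them. I found a script on Chinaunix that does the same thing and extended it into a Unicode-compatible version. I think the versions of this feature offered on some websites are really too simple.

Sample output (the Chinese original of the description above, read top to bottom, columns from right to left):

┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│能┆些┆d┆扩┆一┆找┆C┆息┆查┆最│
│实┆网┆e┆展┆功┆到┆h┆转┆中┆近│
│在┆站┆的┆成┆能┆了┆i┆成┆不┆网│
│是┆上┆版┆兼┆的┆一┆n┆竖┆能┆上│
│太┆提┆本┆容┆脚┆个┆a┆版┆正┆流│
│简┆供┆。┆u┆本┆可┆u┆发┆常┆行│
│单┆的┆我┆n┆,┆以┆n┆送┆发┆将│
│了┆这┆觉┆i┆并┆实┆i┆。┆送┆网│
│。┆个┆得┆c┆将┆现┆x┆我┆的┆络│
│ ┆功┆一┆o┆其┆同┆上┆在┆信┆审│
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘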
Created 12 months ago.
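
The layout used by the vertical typesetting tool above can be sketched in a few lines of Python. This is a minimal illustration and not the project's script: the function name `vertical_box`, the default column height of 10 and the ideographic-space padding are assumptions chosen to mirror the sample box. The text is cut into columns of fixed height, the columns are placed from right to left, and the grid is drawn with box-drawing characters.

```python
def vertical_box(text: str, rows: int = 10, pad: str = "\u3000") -> str:
    """Lay out `text` in vertical columns of height `rows`, first column on the right.

    Hypothetical helper for illustration only. Python strings are sequences of
    Unicode code points, so slicing never splits a CJK character.
    """
    if rows < 1:
        raise ValueError("rows must be at least 1")
    if not text:
        return ""
    # Cut the text into columns and pad the last one to full height.
    columns = [text[i:i + rows].ljust(rows, pad) for i in range(0, len(text), rows)]
    columns.reverse()  # the first column of text goes on the right-hand side
    n = len(columns)
    top = "┌" + "┬".join("─" * n) + "┐"
    bottom = "└" + "┴".join("─" * n) + "┘"
    # Each printed line takes one character from every column.
    body = ["│" + "┆".join(col[r] for col in columns) + "│" for r in range(rows)]
    return "\n".join([top, *body, bottom])


if __name__ == "__main__":
    print(vertical_box("最近网上流行将网络审查中不能正常发送的信息转成竖版发送。"))
```

Half-width Latin letters occupy one terminal cell while CJK characters occupy two, so mixed text will not line up perfectly in a monospaced font unless the letters are first converted to their fullwidth forms; the sketch leaves that step out.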

0 Users

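Introduction in English

HTTPCWS is an open-source Chinese word segmentation system based on the HTTP protocol, using the ICTCLAS Chinese word segmentation algorithms. ICTCLAS is a Chinese lexical analysis system that performs Chinese word segmentation, part-of-speech tagging, word sense disambiguation and named entity recognition. The detailed linguistic information provided by ICTCLAS increases the accuracy and depth of any application related to the Chinese language, such as machine translation, retrieval, filtering, text mining and many others.

Release: httpcws 1.0.0 (latest version, released 2009-08-10)
Installation and usage manual: http://blog.s135.com/httpcws_v100/
Download (32-bit): http://httpcws.googlecode.com/files/httpcws-1.0.0-i386-bin.tar.gz
Download (64-bit): http://httpcws.googlecode.com/files/httpcws-1.0.0-x86_64-bin.tar.gz
Online Chinese word segmentation demo: http://blog.s135.com/demo/httpcws/
PHP demo program download: http://blog.s135.com/demo/httpcws/httpcws-php-demo.zip

Introduction translated from the Chinese

1. What is httpcws?

HTTPCWS is an open-source Chinese word segmentation system based on the HTTP protocol. It currently supports Linux only. HTTPCWS calls the API of the "ICTCLAS 3.0 2009 shared edition Chinese word segmentation algorithm" to perform segmentation and produce the result.

ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is a Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences on the basis of many years of research and built on a multi-layer hidden Markov model. Its main functions are Chinese word segmentation, part-of-speech tagging, named entity recognition and new word recognition, and it also supports user dictionaries. ICTCLAS has been refined over five years with six kernel upgrades and is now at version 3.0, with a stated segmentation accuracy of 98.45% and dictionary data that compresses to under 3 MB. According to the project, ICTCLAS took first place in the evaluation organized by the national 973 expert group and several first places in the first evaluation organized by SigHan, the international Chinese language processing research body, and is described as the best Chinese lexical analyzer currently available.

The commercial edition of ICTCLAS 3.0 is paid. The free shared edition of ICTCLAS 3.0 is not open source, and its lexicon is derived from one month of People's Daily text, so many words are missing. The author therefore added a custom dictionary of 190,000 entries, merges it with the ICTCLAS segmentation result, and outputs the final segmentation.

Because the ICTCLAS 3.0 2009 shared edition supports only GBK encoding, a UTF-8 string should first be converted to GBK with the iconv function, then segmented with httpcws, and finally converted back to UTF-8.

The HTTPCWS software itself (including the httpcws.cpp source file and the dict/httpcws_dict.txt custom dictionary) is released under the New BSD license and may be modified freely. The ICTCLAS shared edition API used by HTTPCWS and the corpus in the dict/Data/ directory are copyrighted by the Institute of Computing Technology, Chinese Academy of Sciences, and ictclas.org; their use is subject to the corresponding license terms.

2. Online demo

Demo URL: http://blog.s135.com/demo/httpcws/

3. Download and installation

32-bit version:

    cd /usr/local/
    wget http://httpcws.googlecode.com/files/httpcws-1.0.0-i386-bin.tar.gz
    tar zxvf httpcws-1.0.0-i386-bin.tar.gz
    rm -f httpcws-1.0.0-i386-bin.tar.gz
    cd httpcws-1.0.0-i386-bin/
    ulimit -SHn 65535
    /usr/local/httpcws-1.0.0-i386-bin/httpcws -d -x /usr/local/httpcws-1.0.0-i386-bin/dict/

64-bit version:

    cd /usr/local/
    wget http://httpcws.googlecode.com/files/httpcws-1.0.0-x86_64-bin.tar.gz
    tar zxvf httpcws-1.0.0-x86_64-bin.tar.gz
    rm -f httpcws-1.0.0-x86_64-bin.tar.gz
    cd httpcws-1.0.0-x86_64-bin/
    ulimit -SHn 65535
    /usr/local/httpcws-1.0.0-x86_64-bin/httpcws -d -x /usr/local/httpcws-1.0.0-x86_64-bin/dict/

Command-line startup parameters: (not reproduced in this listing)

4. Usage

GET method (the text length is limited by the maximum URL length; the text to segment must be GBK-encoded and should preferably be urlencoded):

    http://192.168.8.42:1985/?w=有人的地方就有江湖
    http://192.168.8.42:1985/?w=%D3%D0%C8%CB%B5%C4%B5%D8%B7%BD%BE%CD%D3%D0%BD%AD%BA%FE

POST method (no limit on text length, suitable for segmenting large texts; the text to segment must be GBK-encoded and should preferably be urlencoded):

    curl -d "有人的地方就有江湖" http://192.168.8.42:1985
    curl -d "%D3%D0%C8%CB%B5%C4%B5%D8%B7%BD%BE%CD%D3%D0%BD%AD%BA%FE" http://192.168.8.42:1985

For an example of calling HTTPCWS from PHP, see http://blog.s135.com/httpcws_v100/

5. Segmentation speed and use cases

Within a local network, the average processing time (wait time) of the HTTPCWS interface is 0.001 seconds, and it can handle 5000 to 20000 requests per second.

HTTPCWS is part of the author's "Architecture design of a high-concurrency general-purpose search engine for hundreds of millions of records" (《亿级数据的高并发通用搜索引擎架构设计》), where it segments keywords for the "search query interface". In that architecture, the Sphinx indexing engine handles CJK (Chinese, Japanese, Korean) text by unigram splitting. Given the text 【反恐行动是国产主视角射击网络游戏】 (roughly "Counter-Terrorism Operation is a domestically produced first-person shooter online game"), Sphinx splits it into 【反 恐 行 动 是 国 产 主 视 角 射 击 网 络 游 戏】 and builds an inverted index entry for each character. A nonexistent word assembled from characters of this sentence, such as 【恐动】, would therefore also match, so queries have to be quoted: searching for 【"反恐行动"】 matches exactly those four consecutive characters, while the non-contiguous 【"恐动"】 no longer matches. That leaves another problem: searching for 【"反恐行动游戏"】 or 【"国产网络游戏"】 finds nothing. So I wrote a PHP Chinese word segmentation extension for the search layer. The queries “反恐行动游戏” and “国产网络游戏” are split by the httpcws segmentation function into “反恐行动 游戏” and “国产 网络游戏” respectively; a PHP function then wraps each space-separated word in quotes, and searching for 【"反恐行动" "游戏"】 or 【"国产" "网络游戏"】 finds the record. Because httpcws sits in the search layer, adding, deleting or changing dictionary entries only requires restarting the httpcws process; the search index does not need to be rebuilt.

For the same reason, httpcws is equally suitable for search engines that use overlapping bigram splitting, where it segments the keywords and phrases entered by users in the front-end search layer. This is the purpose httpcws was developed for: it is very fast at segmenting short sentences and small texts.

6. Custom dictionary

Edit the file dict/httpcws_dict.txt to add whatever words you need. Restart httpcws for the changes to take effect.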
Created 4 months ago.
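
The HTTPCWS usage notes above (GBK encoding, urlencoding, POST to the service, then quoting each segmented word for the search engine) can be sketched in Python. This is a hedged illustration rather than an official client: the function names are hypothetical, the endpoint is simply the example address from the description, and decoding the response as GBK is an assumption based on the statement that the shared edition of ICTCLAS works in GBK.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def gbk_urlencode(text: str) -> str:
    """Percent-encode `text` using its GBK byte representation."""
    return quote(text, safe="", encoding="gbk")


def quote_terms(segmented: str) -> str:
    """Wrap each space-separated word in double quotes for a phrase query."""
    return " ".join(f'"{word}"' for word in segmented.split())


def segment(text: str, endpoint: str = "http://192.168.8.42:1985") -> str:
    """POST `text` to an httpcws server and return the segmented text.

    Assumptions: the server is reachable at `endpoint` (the example address
    from the description) and answers with GBK-encoded, space-separated words.
    """
    body = gbk_urlencode(text).encode("ascii")
    with urlopen(Request(endpoint, data=body), timeout=5) as response:
        return response.read().decode("gbk")


if __name__ == "__main__":
    # Encoding step only; no server is contacted here.
    print(gbk_urlencode("有人的地方就有江湖"))
    # Quoting step applied to a segmentation result quoted in the description.
    print(quote_terms("反恐行动 游戏"))
```

Keeping the encoding and quoting steps as separate pure functions makes them easy to check without a running server; only `segment` touches the network.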

0 Users

zhspacing fine-tunes several details of typesetting Chinese with XeTeX and XeLaTeX, such as automatic font switching between Chinese and Western characters, skip adjustment for fullwidth punctuation, punctuation prohibitions, automatic skip insertion between Chinese and Western characters or math formulas, etc.

Notice (08.4.5): zhspacing is currently not the only package for Chinese typesetting under XeLaTeX, and it is not updated often. Mr. Sun Wenchang's xeCJK is probably a better choice for most users.

The basic usage is like this:

    \documentclass{article}
    \usepackage{zhspacing}
    \zhspacing
    \begin{document}
    中Eng文混排,“标点压缩”,间 距 调 整 ……
    \end{document}

Typesetting a Chinese document with XeLaTeX + zhspacing is much like using LaTeX + CJK + CJKpunct, except that you don't need to add the annoying ~ between Chinese and English for spacing adjustment, and opening punctuation (which is prohibited at the end of a line) can occur at the beginning of a line, which is forbidden in CJKpunct. The amount of adjusted space can easily be customized to your own taste.

A recent feature of zhspacing is support for typesetting Chinese in math formulas, using zhmath.sty. Using zhfont.sty can simplify font definitions, and underdots are now supported through zhfont.
Created 12 months ago.

0 Users

DESCRIPTION
This module is a word tokenizer for CJK texts. It supports n-gram tokenization. It is handy for users who are building inverted indexes with Xapian or any other search engine tool. The module was originally written to be used with Xapian. Please also read this post on the xapian-discuss mailing list. If you are a Perl user, you can also use the Perl binding. Currently there is no documentation at all; please check out the repository and hack on it.

FEATURES
N-gram tokenization of CJK texts.
Conversion from Traditional Chinese to Simplified Chinese, and vice versa.

USERS
http://code.google.com/p/cjk-tokenizer/wiki/Users

TODO
Full-width/half-width conversion
Created about 1 year ago.
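
To illustrate what n-gram tokenization of CJK text means in the tokenizer description above, here is a minimal Python sketch. It is not the module's API: the names `tokenize` and `ngrams`, the Unicode ranges treated as CJK, and the choice to keep non-CJK words whole and lowercased are all assumptions made for the example.

```python
import re

# Rough set of ranges treated as CJK here (an assumption, not exhaustive):
# Hiragana/Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
_CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"
_TOKEN = re.compile(rf"(?P<cjk>[{_CJK}]+)|(?P<word>[^\W{_CJK}]+)")


def ngrams(run: str, n: int) -> list[str]:
    """Return the overlapping n-grams of `run`, or the run itself if it is shorter than n."""
    if len(run) < n:
        return [run]
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def tokenize(text: str, n: int = 2) -> list[str]:
    """Split CJK runs into overlapping n-grams and keep other words whole."""
    tokens: list[str] = []
    for match in _TOKEN.finditer(text):
        if match.lastgroup == "cjk":
            tokens.extend(ngrams(match.group(), n))
        else:
            tokens.append(match.group().lower())
    return tokens


if __name__ == "__main__":
    print(tokenize("反恐行动"))
    print(tokenize("中Eng文混排"))
```

Overlapping n-grams need no dictionary, which is why they are a common way to feed CJK text into an inverted index; the price is a larger index and occasional matches that cross real word boundaries.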

0 Users

A Python language binding for SCIM (Smart Common Input Method).
Created about 1 year ago.

0 Users

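Hexie Xiaoxue Cidian (和協小學辭典): What's this

Hexie Xiaoxue Cidian (ideodict) is a dynamic retrieval system for CJKV ideographs. As the name suggests, the word "dictionary" (辭典) means a reference work arranged by words and used to look up their explanations. Ideodict, however, is more than one dictionary: it is a [public domain] collection of dictionaries, and even that collection is not all of it. It also provides an interface for dictionary users and a platform on which volunteers cooperate in compiling and collating the dictionaries.

和協 (hexie) means working together with one mind and combined effort.
小學 (xiaoxue) refers to traditional philology, that is, the study of script, exegesis and phonology.
辭典 (cidian, dictionary) is the core data set of ideodict. It is what users query for glosses, and it is also what volunteers work on when they edit, organize, supplement, annotate and correct entries.

The scope of the dictionary collection

Ideodict collects CJKV character dictionaries (字書), rhyme dictionaries (韻書), Erya-style glossaries (雅書) and other dictionaries that take characters or words as headwords and are organized by entries. In the initial phase it plans to import Unihan (unicode.org/charts/unihan.html), IDS (Ideographic Description Sequences), Kangxi Zidian (康熙字典), Shuowen Jiezi (說文解字), Shuowen Jiezi Zhu (說文解字注), Guangyun (廣韻), Erya (爾雅) and other works.

Framework

Local dictionary collection: Ideodict stores the dictionaries, keyed by CJKV characters and words as headwords, in sqlite database files.

Retrieval system: Ideodict implements the application back end that wraps the dictionary database in Python. It acts as a server that responds to the user operations coming from the front end (queries, edits and so on) and interacts with the remote dictionary collection on the internet to keep the local collection up to date. Users' edits (changes, corrections, additions) are submitted to the remote side and update the online collection, which makes collaborative editing possible.

Remote side: The active editing version that corresponds to the local dictionary collection will be stored as a spreadsheet on Google's site; we treat it as a master database. Volunteers who take part in organizing the collection can talk to the remote database through the local retrieval system and submit new edits to the master database. Ordinary users can also download the most recently edited entries from the master database to update their local collection.

User interface: Ideodict uses a web browser (Internet Explorer, Firefox, etc.) as its user interface, which will be implemented with JavaScript and Ajax. Python is used to set up a standalone web server on the user's local disk, and the JavaScript front end communicates with this web server to perform queries and other operations.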
Created 11 months ago.
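
The local storage described for ideodict above (dictionaries kept in sqlite files, keyed by headword, queried by a Python back end) could look roughly like the following sketch. The table layout, column names and function names are hypothetical and only illustrate the idea; the project description does not specify a schema, and the sample rows are placeholders rather than real dictionary content.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a dictionary store with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " dictionary TEXT NOT NULL,"  # which source dictionary the entry comes from
        " headword   TEXT NOT NULL,"  # character or word being defined
        " body       TEXT NOT NULL)"  # the entry text
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_headword ON entries (headword)")
    return conn


def add_entry(conn: sqlite3.Connection, dictionary: str, headword: str, body: str) -> None:
    """Insert one entry and commit it."""
    conn.execute("INSERT INTO entries VALUES (?, ?, ?)", (dictionary, headword, body))
    conn.commit()


def lookup(conn: sqlite3.Connection, headword: str) -> list[tuple[str, str]]:
    """Return (dictionary, body) pairs for a headword, ordered by dictionary name."""
    cursor = conn.execute(
        "SELECT dictionary, body FROM entries WHERE headword = ? ORDER BY dictionary",
        (headword,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = open_store()
    add_entry(store, "demo-dictionary", "字", "placeholder entry text")
    print(lookup(store, "字"))
```

A single table with a `dictionary` column keeps lookups across all sources to one indexed query; one table per dictionary would be an equally plausible design.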

0 Users

In recent years, Plone (a content management system) has been attracting attention in Japan. It is feature-rich and highly secure, and many businesses and public institutions use it to build websites. However, its search system does not handle the East Asian languages (Chinese, Japanese, Korean), which puts Plone at a disadvantage for sites in those languages. I am therefore developing a search system for East Asian languages for Plone 3.x.x. If development succeeds, Plone can be used more effectively in East Asia.

At present, Plone's search compares the query with the words on each page. In English and other European languages, words are separated by spaces, but CJK languages have no spaces between words, and this is why search fails. I think N-gram indexing is a way to make CJK search work. The method splits a string at regular intervals, builds the index from the resulting pieces, and searches by the occurrence frequency of the search terms. It can be applied to Korean and Chinese as well as Japanese.

This project modifies the Plone source code, which is written in Python. After implementation, the search system will be verified on the following points:

1. The operating environment of Plone.
2. Chinese, Japanese and Korean text can be searched properly.
3. The impact on English and other languages, that is, whether recall and precision are maintained.

Thank you very much for reading.
Created 12 months ago.
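
The reasoning in the Plone description above (whitespace splitting fails for CJK, while an index built from fixed-size character pieces still finds substrings) can be sketched as a tiny inverted index in Python. This is an illustration under stated assumptions, not Plone code: the class `NgramIndex`, the helper `char_ngrams` and the sample sentences are made up for the example, and matching simply requires every query bigram to occur in a document rather than ranking by frequency.

```python
from collections import defaultdict


def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Overlapping character n-grams of `text` with whitespace removed."""
    s = "".join(text.split())
    if len(s) <= n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NgramIndex:
    """Hypothetical inverted index mapping each n-gram to the ids of documents containing it."""

    def __init__(self, n: int = 2) -> None:
        self.n = n
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for gram in char_ngrams(text, self.n):
            self.postings[gram].add(doc_id)

    def search(self, query: str) -> set[int]:
        """Return ids of documents that contain every n-gram of the query."""
        grams = char_ngrams(query, self.n)
        if not grams:
            return set()
        result = set(self.postings.get(grams[0], set()))
        for gram in grams[1:]:
            result &= self.postings.get(gram, set())
        return result


if __name__ == "__main__":
    index = NgramIndex()
    index.add(1, "東京都の天気")  # made-up sample sentences
    index.add(2, "大阪の天気")
    # Whitespace splitting yields one giant token, so a word query cannot match it.
    print("東京都の天気".split())
    print(index.search("天気"), index.search("東京"), index.search("京都"))
```

The query 京都 matches document 1 only because it is a substring of 東京都, which is the kind of precision loss the verification step on recall and precision would need to measure. Queries shorter than the n-gram size find nothing in this sketch, so a real implementation would need a fallback for single characters.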