Projects tagged ‘corpus’


[28 total]

2 Users
 

Use the internet as a linguistic corpus: provide tools and infrastructure for the acquisition, visual annotation, merging, and storage of web pages as parts of bigger corpora. Develop a classification engine that learns to annotate pages automatically, and provide visual tools for inspecting the results.
Created about 1 year ago.

1 User

CORSIS (formerly Tenka Text) is a performance-oriented, open-source library for corpus analysis. It uses typed assembly, task-specific compilers, and parallelization to deliver the best performance with elegant design. The project's demonstration GUI comes with Wordlister, an advanced, extremely fast graphical wordlist tool, and a regex concordance tool. CORSIS: the open-source answer to WordSmith Tools.
Created over 3 years ago.

1 User

Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO.
Created over 2 years ago.

1 User

NLTK, the Natural Language Toolkit, is a suite of open source Python modules, data and documentation for research and development in natural language processing. This site is the home of NLTK development. NLTK Home: the starting point for users, including download instructions; Source: browse NLTK source code (Git mirror updated hourly); Issue Tracker: submit feature requests, bug reports, and patches; DevelopersGuide: useful guidelines for code cutters; BuildBot: automated testing of NLTK code; People: the NLTK team.
Created 10 months ago.

1 User

LexAt ("lexical attraction"), also known as the RelEx Statistical Linguistics package, adds statistical algorithms to RelEx. Corpus statistics, including mutual information, are maintained in an SQL database and drawn on to enhance various RelEx functions, such as parse ranking, chunk ranking, and word-sense disambiguation (the Mihalcea algorithm).
Created 9 months ago.
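
As an illustration of the kind of corpus statistic mentioned above, here is a minimal sketch of pointwise mutual information for adjacent word pairs. It is not LexAt's code: the function name and the toy sentence are hypothetical, and in-memory counters stand in for the SQL database the project describes.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens):
    """Return PMI (in bits) for every adjacent word pair in `tokens`."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)  # avoid division by zero on tiny inputs

    pmi = {}
    for (left, right), count in pair_counts.items():
        p_pair = count / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        pmi[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return pmi


# Hypothetical toy corpus, used only to exercise the function.
tokens = "the cat sat on the mat".split()
scores = pointwise_mutual_information(tokens)
```

Pairs made of rarer words score higher than pairs involving a frequent word such as "the", which is the property a ranking step could draw on.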

0 Users

This is a project of the Nara Institute of Science and Technology Computational Linguistics lab. It is a Ruby on Rails Corpus Search web application.
Created about 1 year ago.

0 Users

Poliqarp is a universal suite of utilities for processing large corpora.
Created about 1 year ago.

0 Users

Study of the year 1985
Created 4 months ago.

0 Users

Hunpos is an open source reimplementation of TnT, the well-known part-of-speech tagger by Thorsten Brants.

Features:
- Free and open source, even for commercial use.
- For languages with more complex morphologies, HMM tagging can be quite competitive with the current generation of learning algorithms, such as SVM and CRF methods. A major advantage is that the training/tagging cycle is orders of magnitude faster than in more complex models.
- Precision of tagging on unknown and unseen words was a major priority during the development of Hunpos.
- Works smoothly with large tag sets. In Hungarian, as in other highly inflecting languages, it is important to preserve detailed morphological information in the POS tags in order to provide useful clues for higher-level processing tasks. This leads to a significantly larger tagset than is common in English (744 tags here, as opposed to the 36 standardly used in Treebank work), but it does not degrade training and tagging performance, although it would make the training process of non-generative models computationally expensive.
- Effortless integration of knowledge from morphological analyzers and dictionaries into the best-path calculation.
- Contextualized lexical probabilities with a context window of any size. Unlike traditional HMM models, Hunpos estimates emission (lexical) probabilities based on the previous tags as well as the current tag.
- Hunpos is implemented in OCaml, a high-level language that supports a succinct, maintainable coding style. OCaml has a high-performance compiler that produces native code with speed comparable to C/C++ implementations.
Created about 1 year ago.
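
To make the idea of contextualized emission probabilities concrete, here is a minimal sketch that estimates P(word | previous tag, current tag) by relative frequency, using a context window of one previous tag. It is an illustration only, not Hunpos's OCaml implementation: the function name, the `<s>` start marker, and the toy corpus are hypothetical, and no smoothing or unknown-word handling is applied.

```python
from collections import Counter


def contextual_emission_probs(tagged_sentences):
    """Estimate P(word | previous tag, current tag) by relative frequency."""
    context_counts = Counter()   # counts of (previous tag, current tag)
    emission_counts = Counter()  # counts of (previous tag, current tag, word)

    for sentence in tagged_sentences:
        prev_tag = "<s>"  # assumed sentence-start marker
        for word, tag in sentence:
            context_counts[(prev_tag, tag)] += 1
            emission_counts[(prev_tag, tag, word)] += 1
            prev_tag = tag

    # Divide each emission count by the count of its tag context.
    return {
        key: count / context_counts[key[:2]]
        for key, count in emission_counts.items()
    }


# Hypothetical toy corpus of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dogs", "NNS"), ("bark", "VBP")],
]
probs = contextual_emission_probs(corpus)
```

A traditional HMM would condition the word on the current tag alone; keying the counts on the tag pair is the only change needed for this one-tag window, and a longer window would extend the key in the same way.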

0 Users

Spelt is a simple graphical program that can be used to classify words in a language. It is particularly designed to identify word roots and to classify them according to part-of-speech. The initial development of this program was specifically meant to simplify work on spell checkers, but you might find it useful for many other purposes.
Created about 1 year ago.