Browsing projects by Tag(s)

Apache OpenNLP is a Java machine learning toolkit for natural language processing (NLP).

5.0
 
  0 reviews  |  10 users  |  467,231 lines of code  |  5 current contributors  |  Analyzed 3 days ago
 
 

DKPro is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. Many powerful and state-of-the-art NLP components are already freely available in the NLP research community. New and improved components are being developed and released continuously. The components cover the whole range of NLP-related processing tasks. DKPro provides wrappers for such third-party tools as well as original NLP components. DKPro builds heavily on uimaFIT, which allows for rapid and easy development of NLP processing pipelines.

4.75
   
  0 reviews  |  7 users  |  162,207 lines of code  |  13 current contributors  |  Analyzed 10 days ago
 
 

CORSIS (formerly Tenka Text) is a performance-oriented, open-source library for corpus analysis. It utilizes typed assembly, task-specific compilers and parallelization to deliver the best performance with elegant design. The project's demonstration GUI comes with Wordlister, an advanced, extremely fast graphical wordlist tool, and a regex concordance tool. CORSIS - the open-source answer to WordSmith Tools.

0
 
  0 reviews  |  1 user  |  50,431 lines of code  |  0 current contributors  |  Analyzed about 22 hours ago
 
 

Splender is a JavaScript-based, token-driven syntax highlighting engine with theme support. It allows for very efficient syntax highlighting of plain text content embedded in HTML documents. By utilizing a proper lexer/tokenizer, Splender offers optimal performance. Other similar solutions use cryptic and inefficient regular expressions to achieve similar results, resulting in poor performance. The design of Splender allows for new lexers to be written that provide support for different formats. Splender's base lexer engine provides support for all languages with a similar syntax to C. Such languages include C, C#, Java, and JavaScript.

0
 
  0 reviews  |  0 users  |  463 lines of code  |  0 current contributors  |  Analyzed 5 days ago
 
 

PyGrams converts text to n-grams. Conversion is a three-step process:

1) Extract all possible n-grams. Run "form_candidates.py" to create a file containing all possible n-grams.
2) Filter possible n-grams. Run "filter_candidates.py" to find just the n-grams which appear sufficiently frequently in relation to the frequency of their components.
3) Convert documents to n-grams. Run "convert_docs.py" to convert documents into approved n-grams.

Sample Input: We introduce a family of rings of symmetric functions depending on an infinite sequence of parameters.
Sample Output: introduc famili ring symmetr_function depend infinit_sequenc paramet

Additional documentation appears in the README file. Note that this software depends on the porter_stemmer.py module, which is available from http://tartarus.org/~martin/PorterStemmer/python.txt. PyGrams has been tested with Python 2.5 on Linux. PyGrams has been developed with support from The Bibliographic Knowledge Network.
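The three steps in the PyGrams description can be sketched roughly as follows. This is an illustrative sketch, not PyGrams' own code: the function names only echo the script names, the sketch handles bigrams by default, the `min_count` and `min_ratio` thresholds and the "bigram count divided by the count of its rarest component" score are assumptions, and the input is assumed to be already stemmed and stripped of stop words.

```python
from collections import Counter


def form_candidates(docs, n=2):
    """Step 1: count every contiguous n-gram in the tokenized documents."""
    counts = Counter()
    for tokens in docs:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_candidates(candidates, docs, min_count=2, min_ratio=0.5):
    """Step 2: keep n-grams that are frequent relative to their components.

    The scoring rule here is an assumption, not PyGrams' actual formula.
    """
    unigrams = Counter(tok for tokens in docs for tok in tokens)
    approved = set()
    for gram, count in candidates.items():
        rarest = min(unigrams[tok] for tok in gram)
        if count >= min_count and count / rarest >= min_ratio:
            approved.add(gram)
    return approved


def convert_doc(tokens, approved, n=2):
    """Step 3: replace approved n-grams with a single underscore-joined token."""
    out, i = [], 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if len(gram) == n and gram in approved:
            out.append("_".join(gram))
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


docs = [
    ["symmetr", "function", "depend"],
    ["ring", "symmetr", "function"],
    ["infinit", "sequenc"],
]
approved = filter_candidates(form_candidates(docs), docs)
print(convert_doc(["ring", "symmetr", "function", "depend"], approved))
```

With this toy corpus only "symmetr function" occurs often enough to be approved, so it is the only pair joined in the converted document.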

0
 
  0 reviews  |  0 users  |  148 lines of code  |  0 current contributors  |  Analyzed 11 days ago
 
 

Summary
The tokstream library allows you to read text files and split them up into individual tokens. It is, in a sense, a glorified version of strtok with file reading and a few tricks to make the process as efficient as possible.

Features
- clean and minimal interface
- simple to use
- wraps file I/O
- high performance
- no reading overhead, input buffering

File reading
Getting tokens from files is easy enough:

    tokstream* ts = ts_open("myfile");
    while (!ts_eof(ts)) {
        const char* tok = ts_get(ts);
        ...
    }
    ts_close(ts);

Additionally, the library contains a state stack onto which you can push the current input state. It is then possible to further process the file and jump back to this state, with minimal reading overhead. This can be very handy in certain situations, e.g. if one needs to count input elements.

Structure
The library itself consists of nothing more than a pair of header and implementation files, allowing for direct inclusion into projects. (Of course, it can be built as an external library as well.) CMake files for building tokstream and examples are included.
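The state-stack idea in the tokstream description can be illustrated with a small Python sketch. It is not a port of tokstream: the class and method names (`TokenStream`, `push_state`, `pop_state`) are hypothetical, tokens are assumed to be whitespace-separated, and a saved state is simply the offset of the current line plus an index into that line's tokens.

```python
import io


class TokenStream:
    """Whitespace tokenizer over a seekable text file with a state stack."""

    def __init__(self, fileobj):
        self._f = fileobj
        self._tokens = []                  # tokens of the current line
        self._index = 0                    # next token to hand out
        self._line_start = fileobj.tell()  # offset of the current line
        self._states = []                  # saved (line offset, index) pairs

    def _fill(self):
        """Read lines until one yields tokens; return False at end of input."""
        while self._index >= len(self._tokens):
            self._line_start = self._f.tell()
            line = self._f.readline()
            if not line:
                return False
            self._tokens = line.split()
            self._index = 0
        return True

    def eof(self):
        return not self._fill()

    def get(self):
        """Return the next token, or None at end of input."""
        if not self._fill():
            return None
        tok = self._tokens[self._index]
        self._index += 1
        return tok

    def push_state(self):
        """Remember the current input position."""
        self._states.append((self._line_start, self._index))

    def pop_state(self):
        """Jump back to the most recently pushed position."""
        line_start, index = self._states.pop()
        self._f.seek(line_start)
        self._tokens = self._f.readline().split()
        self._index = index
        self._line_start = line_start


# Count the remaining tokens first, then rewind and process them.
ts = TokenStream(io.StringIO("a b\n\nc d e\n"))
ts.push_state()
count = 0
while not ts.eof():
    ts.get()
    count += 1
ts.pop_state()
print(count, ts.get())
```

Restoring a state re-reads only the one line that was current when the state was pushed, which is one way to keep the cost of jumping back small.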

0
 
  0 reviews  |  0 users  |  2,869 lines of code  |  0 current contributors  |  Analyzed 4 months ago
 
 

html5cpp: This library aims to implement the tokenization and tree construction algorithms described in the WHATWG HTML5 working draft. It will not handle XHTML parsing.

0
 
  0 reviews  |  0 users  |  0 current contributors  |  Analyzed 6 days ago
 
 

jTokeniser is a set of classes that provide a variety of tokenisers for your Java projects. Simple tokenisers such as WhiteSpaceTokeniser or StringTokeniser provide basic token extraction, whereas RegexTokeniser and BreakIteratorTokeniser offer more advanced possibilities for more thorough tokenisers that discard punctuation too. Recent additions include RegexSeparatorTokeniser, which allows complex definitions of token delimiters. A SentenceTokeniser is also provided for segmenting text into a set of sentences. There is also a GUI frontend for experimenting without having to write code.
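The difference between the tokeniser styles named in the jTokeniser description can be shown with a short Python sketch. The function names below are hypothetical analogues written for illustration; they are not jTokeniser's Java API, and the sentence splitter is deliberately naive (it would split after abbreviations such as "e.g.").

```python
import re


def whitespace_tokenise(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def regex_tokenise(text, pattern=r"\w+"):
    """Keep only what matches the pattern, which discards punctuation."""
    return re.findall(pattern, text)


def regex_separator_tokenise(text, separator=r"[,;]\s*"):
    """Treat whatever matches the separator pattern as a delimiter."""
    return [tok for tok in re.split(separator, text) if tok]


def sentence_tokenise(text):
    """Naively split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(whitespace_tokenise("Hello, world!"))
print(regex_tokenise("Hello, world!"))
print(regex_separator_tokenise("a, b;c"))
print(sentence_tokenise("One. Two! Three?"))
```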

0
 
  0 reviews  |  0 users  |  2,443 lines of code  |  0 current contributors  |  Analyzed 4 days ago
 
 

Creates a compressed trie that maps keys to values and values to keys. Compression is on the front end of keys. Useful for lightweight reserved-word creation in situations with constrained memory or processor power. Written in C.
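A two-way trie of the kind described above might look like the following Python sketch. It is an assumption-laden illustration rather than the project's C implementation: the `Node` and `BiTrie` names are hypothetical, "compression on the front end of keys" is read here as keys sharing their common prefixes, and no further node merging is attempted.

```python
class Node:
    __slots__ = ("children", "parent", "char", "value")

    def __init__(self, parent=None, char=""):
        self.children = {}    # next character -> child node
        self.parent = parent  # used to rebuild a key from its last node
        self.char = char
        self.value = None     # set only on nodes that end a key


class BiTrie:
    """Prefix-sharing trie with key -> value and value -> key lookup."""

    def __init__(self):
        self.root = Node()
        self.by_value = {}    # value -> node that ends the matching key

    def insert(self, key, value):
        node = self.root
        for ch in key:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = node.children[ch] = Node(node, ch)
            node = nxt
        node.value = value
        self.by_value[value] = node

    def value_of(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

    def key_of(self, value):
        node = self.by_value.get(value)
        if node is None:
            return None
        chars = []
        while node.parent is not None:  # walk up to the root
            chars.append(node.char)
            node = node.parent
        return "".join(reversed(chars))


# Reserved words mapped to token ids; "if", "in" and "int" share the "i" node.
words = BiTrie()
for token_id, word in enumerate(["if", "in", "int"], start=1):
    words.insert(word, token_id)
print(words.value_of("int"), words.key_of(2))
```

Storing a parent pointer on each node lets the reverse lookup rebuild a key without keeping a second copy of every string.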

0
 
  0 reviews  |  0 users  |  1,181 lines of code  |  0 current contributors  |  Analyzed 1 day ago
 
 

WDependency is a PHP tool that analyzes the contents of a directory to find dependencies between files and classes, and generates a dependency schema in various export formats (dot, png, graphml, json, php...).

0
 
  0 reviews  |  0 users  |  3,954 lines of code  |  0 current contributors  |  Analyzed 3 days ago
 
 
 
 

Creative Commons License Copyright © 2013 Black Duck Software, Inc. and its contributors, Some Rights Reserved. Unless otherwise marked, this work is licensed under a Creative Commons Attribution 3.0 Unported License. Ohloh ® and the Ohloh logo are trademarks of Black Duck Software, Inc. in the United States and/or other jurisdictions. All other trademarks are the property of their respective holders.