This project implements many popular data-mining and machine-learning algorithms. Every candidate algorithm must be suitable for implementation on a distributed and/or parallel computing platform, such as Hadoop.
The ultimate goal of this project is to solve storage and computation for very large datasets, especially high-dimensional ones. I know this is a very difficult topic; if you would like to join the challenge, please mail me: moonblue333@hotmail.com.
Thanks to Wei.Dong at cs.princeton.edu for LSH.
In addition, there is a proof-of-concept program for a distributed database, attached as ting-0.5.0.zip. For more information, please refer to http://www.sadbit.com or sadbit333.appspot.com. (Please do not ask for the source-code password for ting-0.5.0.zip (the binary is OK); asking for the password of any other package is OK.)
The research focus in 2009:
1) How to prepare the input data, for example with a special normalization, so that it fits LSH and gives a better 3-rate.
2) How to construct a better kernel-LSH that fits the final similarity metric, such as EMD or grid-feature.
3) Search for and research better similarity-metric algorithms. (So far, EMD and grid-feature are better, at least better than the original L1 and L2.)
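To illustrate why a metric such as EMD can behave better than L1 on histogram features, here is a minimal sketch in Python. It is not taken from the project's code: the function names are hypothetical, and it assumes 1-D histograms of equal total mass with ground distance |i - j| between bins, in which case EMD reduces to the sum of absolute differences between the cumulative histograms.

```python
from itertools import accumulate

def l1_distance(p, q):
    """Bin-by-bin L1 distance between two histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))

def emd_1d(p, q):
    """EMD between two 1-D histograms of equal total mass.

    Assumes ground distance |i - j| between bins i and j; under that
    assumption EMD equals the L1 distance between the cumulative sums.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

# All mass in one bin; "near" moves it by one bin, "far" by four bins.
base = [1.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0]
far = [0.0, 0.0, 0.0, 0.0, 1.0]

print(l1_distance(base, near), l1_distance(base, far))  # 2.0 2.0
print(emd_1d(base, near), emd_1d(base, far))            # 1.0 4.0
```

L1 cannot tell the small shift from the large one, while EMD grows with the distance the mass has to move.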
I will update this summary to introduce all of the main implemented algorithms:
Hash-Family:
LocalitySensitiveHash ConsistentHash PerfectHash MinimalPerfectHash BloomFilter(Hash) CuckooHash DynamicHash ExtendableHash LinearHash
Image-Processing:
Color-Space Transformation Edge-Histogram EMD ImageGridFeatureExtraction Others
Dimension-Reduction/Feature-Extraction:
LLE Wavelet PCA ICA
AI-Related:
ANN SVM
Distribution-Computing:
Paxos (TODO)Failure-Detection-Algorithms
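As a small illustration of the first entry in the Hash-Family list, here is a minimal sketch of one common locality-sensitive hashing scheme, random-hyperplane hashing for cosine similarity. It is an assumption for illustration only: the project may use a different LSH family, and all names below are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(dim=3, n_bits=16)
v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]       # same direction as v
opposite = [-1.0, -2.0, -3.0]  # opposite direction

sig_v = signature(v, planes)
print(hamming(sig_v, signature(scaled, planes)))    # same direction: no bits differ
print(hamming(sig_v, signature(opposite, planes)))  # opposite direction: all bits differ
```

Vectors at a small angle to each other agree on most bits, so they tend to land in the same hash bucket; this particular scheme ignores vector length, which is one reason input normalization interacts with the choice of LSH family.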