Search resource list
Word2VEC_java-master
- word2vec: Google's open-source code for computing the similarity of words.
lda_perplexity
- Uses a trained LDA model to test words and their probabilities, count the words, and compute perplexity.
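As a reference for what the perplexity computation involves, here is a minimal sketch (generic Python, not the repository's own code) given per-word log-probabilities from a trained model:

```python
import math

def perplexity(word_log_probs):
    """Perplexity = exp(-(1/N) * sum(log P(w_i))) over the N test words."""
    n = len(word_log_probs)
    return math.exp(-sum(word_log_probs) / n)

# Three test words assigned probabilities 0.5, 0.25, 0.25 by the model:
pp = perplexity([math.log(0.5), math.log(0.25), math.log(0.25)])
```

Lower perplexity means the model assigns higher probability to the held-out words.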
R4
- R4: an English short-text dataset, drawn from the datasets used in major papers; already stemmed, with stop words removed and the text refined.
InfoRetri
- Text classification based on Naive Bayes, including stop-word removal, word segmentation, feature extraction, and classification.
2
- Every writer has a distinctive writing style, and their word usage differs greatly. Based on the usage of function words in each article, this computes the cosine similarity between two articles to measure how similar they are.
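The function-word cosine comparison described above can be sketched as follows; the function-word list and the two sample texts are illustrative, not taken from the repository:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical function-word list; real stylometry uses a much longer one.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

def fw_vector(text):
    """Count each function word's occurrences in the text."""
    words = text.lower().split()
    return [words.count(w) for w in FUNCTION_WORDS]

a = fw_vector("the cat sat on the mat and looked to the door")
b = fw_vector("the dog ran to the gate and barked in the yard")
sim = cosine(a, b)  # close to 1.0: similar function-word profiles
```

Because only function words enter the vectors, the score reflects writing style rather than topic.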
GoogleNormalDistanceCalculator
- Computes the Google-based similarity of two words, i.e., the Normalized Google Distance (NGD).
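The standard Normalized Google Distance is NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), where f(·) are search hit counts and N is the total number of indexed pages. A minimal sketch (the counts below are illustrative, not real search-engine numbers):

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx, fy, joint count fxy,
    and total page count n. Smaller values mean more related terms."""
    lx, ly, lxy, ln_n = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln_n - min(lx, ly))

# Illustrative counts: two words that co-occur fairly often.
d = ngd(fx=100_000, fy=80_000, fxy=30_000, n=10_000_000)
```

Two words that always co-occur give a distance of 0; unrelated words drift toward 1 and beyond.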
LDA-topic-model
- First, a disclaimer: this LDA topic model code was written by someone else. I have tested it and it runs, but the input and output are somewhat unsatisfactory. The input is word indices and each word's occurrence count in the document; it would be perfect if it could read documents directly. The output is topics and the probability of each word under each topic, but I could not make sense of the resulting topics myself, whether because of the algorithm or my own limited skill. Anyone studying LDA topic models may want to download it and try.
WordSimilarity
- HowNet-based similarity algorithm; computes the similarity between different words.
Naive-bayes
- Using spell checking as an example, this explains how a Naive Bayes classifier is implemented. Given a word the user typed (w), the spell checker tries to infer the most likely intended correct word (c). Of course, the typed word may itself already be correct. For example, for the input "thew", the user may have meant "the" or "thaw". To resolve this, the Naive Bayes classifier uses the posterior probability P(c|w), the probability that c was intended given that w was observed. To find the most likely c, one finds the c that maximizes P(c|w).
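The argmax P(c|w) idea above follows Peter Norvig's well-known spell-corrector pattern: P(c|w) ∝ P(w|c)P(c), approximated here by preferring known words within one edit and ranking them by corpus frequency. A minimal sketch with a toy word-frequency model (the corpus is invented for illustration):

```python
from collections import Counter
import string

# Toy corpus standing in for a large word-frequency model:
WORDS = Counter("the thaw saw the cat thaw the snow the the".split())

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """argmax_c P(c|w): prefer the word itself if known, else known
    one-edit candidates, ranked by corpus frequency as the prior P(c)."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda c: WORDS[c])

best = correct("thew")  # both "the" and "thaw" are one edit away;
                        # "the" wins because it is more frequent
```

The edit model here treats all single edits as equally likely, so the prior P(c) decides among candidates.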
SOPMIPwindowbuilder
- A recently proposed mutual-information algorithm for computing word weights, an important step in fine-grained sentiment analysis.
WordFrequenceCount
- Text-based word-frequency calculation: counts the words within a text, can handle tens of thousands of words, and outputs them in one pass.
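A word-frequency count of this kind can be sketched in a few lines of Python (a generic illustration, not the repository's implementation):

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count word occurrences in a text, case-insensitively,
    ignoring punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

freq = word_frequencies("The cat saw the dog; the dog saw the cat.")
top = freq.most_common(1)  # [('the', 4)]
```

`Counter` scales comfortably to tens of thousands of distinct words, and `most_common()` dumps the whole frequency table in one call.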
kctp
- Implements data preprocessing, including word segmentation, symbol removal, and stop-word removal.
textclustering-master
- Mining and clustering of large texts. Instead of using word-frequency information, the method considers context: it learns word-position tag features for every character against predefined features to obtain a trained model, then tags each character of the string to be segmented, and derives the final word segmentation from the tag definitions.
wordcount3
- Hadoop's wordcount program, with punctuation and some stop words removed.