seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. From the job-monitoring page for yesterday's run you can see that this step comprises 7 jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the parameter help for SparseVectorsFromSequenceFiles shows the following:
Usage:
[--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
--minSupport (-s) minSupport (Optional) Minimum Support. Default
Value: 2
--analyzerName (-a) analyzerName The class name of the analyzer
--chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000 MB
--output (-o) output The directory pathname for output.
--input (-i) input Path to job input directory.
--minDF (-md) minDF The minimum document frequency. Default
is 1
--maxDFSigma (-xs) maxDFSigma What portion of the tf (tf-idf) vectors
to be used, expressed in times the
standard deviation (sigma) of the
document frequencies of these vectors.
Can be used to remove really high
frequency terms. Expressed as a double
value. Good value to be specified is 3.0.
In case the value is less than 0 no
vectors will be filtered out. Default is
-1.0. Overrides maxDFPercent
--maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF.
Can be used to remove really high
frequency terms. Expressed as an integer
between 0 and 100. Default is 99. If
maxDFSigma is also set, it will override
this value.
--weight (-wt) weight The kind of weight to use. Currently TF
or TFIDF
--norm (-n) norm The norm to use, expressed as either a
float or "INF" if you want to use the
Infinite norm. Must be greater or equal
to 0. The default is not to normalize
--minLLR (-ml) minLLR (Optional)The minimum Log Likelihood
Ratio(Float) Default is 1.0
--numReducers (-nr) numReducers (Optional) Number of reduce tasks.
Default Value: 1
--maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to
create (2 = bigrams, 3 = trigrams, etc)
Default Value:1
--overwrite (-ow) If set, overwrite the output directory
--help (-h) Print out help
--sequentialAccessVector (-seq) (Optional) Whether output vectors should
be SequentialAccessVectors. If set true
else false
--namedVector (-nv) (Optional) Whether output vectors should
be NamedVectors. If set true else false
--logNormalize (-lnorm) (Optional) Whether output vectors should
be logNormalize. If set true else false
In yesterday's terminal output, this step was invoked with the following command:
./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf
Looking only at the parameters actually used here: -lnorm means the output vectors are normalized with the log function (true if set); -nv means the output vectors are NamedVectors (what does "named" mean here? unclear for now); -wt tfidf selects the weighting scheme, see
http://zh.wikipedia.org/wiki/TF-IDF.
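As a rough sketch of what the tfidf weight computes, here is the textbook TF-IDF formula. Note this is a simplification for illustration only: Mahout's actual TFIDF weight delegates to Lucene's similarity, which uses sqrt(tf) and a smoothed idf term.

```java
public class TfIdfSketch {
    // Textbook TF-IDF: weight = tf * log(numDocs / df).
    // Simplified sketch; Mahout's real TFIDF weight is based on
    // Lucene's similarity (sqrt(tf) and a smoothed idf), not this exact formula.
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document, appearing in 10 of 1000 docs:
        // weight = 3 * ln(100)
        System.out.printf("%.4f%n", tfIdf(3, 10, 1000));
    }
}
```

The intuition: the weight grows with how often a term appears in a document (tf) but shrinks as the term appears in more documents overall (df), so very common terms contribute little.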
Step (1) is kicked off at line 253 of SparseVectorsFromSequenceFiles:
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Stepping into it, you can see that the Mapper used is SequenceFileTokenizerMapper, and that no Reducer is used. The Mapper code is:
protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  context.write(key, document);
}
This Mapper's setup function mainly configures the Analyzer; for the Analyzer API see
http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html. The method used in map is
reusableTokenStream(String fieldName, Reader reader): creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method.
I wrote the following test program:
package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

  private static Analyzer analyzer = ClassUtils.instantiateAs(
      "org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document" + document);
  }
}
The result is:
key:4096,document[today, also, late.what, about, tomorrow]
Note that the TokenStream has a stopwords attribute whose value is [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so when one of these words is encountered it is skipped and never counted.
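The tokenize-and-filter behavior can be mimicked with a self-contained sketch. This is not the real Lucene analyzer, just a plain-Java approximation: whitespace tokenization, lowercasing, trailing-punctuation stripping, and the stop-word set reported above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFilterSketch {
    // The stop-word set observed on the TokenStream in the test run above
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "but", "be", "with", "such", "then", "for", "no", "will", "not", "are",
        "and", "their", "if", "this", "on", "into", "a", "or", "there", "in",
        "that", "they", "was", "is", "it", "an", "the", "as", "at", "these",
        "by", "to", "of"));

    // Approximation of the analyzer: lowercase, split on whitespace,
    // strip trailing punctuation, drop stop words
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("\\s+")) {
            String t = raw.replaceAll("[?!,;:]+$", "");
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same input as the test program; "is" is dropped as a stop word
        // and the trailing "?" is stripped from "tomorrow?"
        System.out.println(tokenize("today is also late.what about tomorrow?"));
    }
}
```

Running it reproduces the same token list as the real analyzer for this input: [today, also, late.what, about, tomorrow]. Note that "late.what" survives as a single token here just as it did in the real run, since the period is not at the end of the word.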
Ah, it's late again... I'm sleepy, off to brush my teeth.
Share, enjoy, grow.
Please credit the source when reposting: http://blog.csdn.net/fansy1990