Twenty Newsgroups Classification Task, Part 2: seq2sparse (1)

seq2sparse corresponds to org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles in Mahout. The job-monitoring UI from yesterday's run shows that this step consists of seven jobs: (1) DocumentTokenizer, (2) WordCount, (3) MakePartialVectors, (4) MergePartialVectors, (5) VectorTfIdf Document Frequency Count, (6) MakePartialVectors, (7) MergePartialVectors. Printing the help text of SparseVectorsFromSequenceFiles gives the following:

Usage:                                                                          
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize           
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma      
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>      
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>        
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]       
Options                                                                         
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default       
                                      Value: 2                                  
  --analyzerName (-a) analyzerName    The class name of the analyzer            
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB  
  --output (-o) output                The directory pathname for output.        
  --input (-i) input                  Path to job input directory.              
  --minDF (-md) minDF                 The minimum document frequency.  Default  
                                      is 1                                      
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors   
                                      to be used, expressed in times the        
                                      standard deviation (sigma) of the         
                                      document frequencies of these vectors.    
                                      Can be used to remove really high         
                                      frequency terms. Expressed as a double    
                                      value. Good value to be specified is 3.0. 
                                      In case the value is less then 0 no       
                                      vectors will be filtered out. Default is  
                                      -1.0.  Overrides maxDFPercent             
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.    
                                      Can be used to remove really high         
                                      frequency terms. Expressed as an integer  
                                      between 0 and 100. Default is 99.  If     
                                      maxDFSigma is also set, it will override  
                                      this value.                               
  --weight (-wt) weight               The kind of weight to use. Currently TF   
                                      or TFIDF                                  
  --norm (-n) norm                    The norm to use, expressed as either a    
                                      float or "INF" if you want to use the     
                                      Infinite norm.  Must be greater or equal  
                                      to 0.  The default is not to normalize    
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood      
                                      Ratio(Float)  Default is 1.0              
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.        
                                      Default Value: 1                          
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
                                      create (2 = bigrams, 3 = trigrams, etc)   
                                      Default Value:1                           
  --overwrite (-ow)                   If set, overwrite the output directory    
  --help (-h)                         Print out help                            
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should  
                                      be SequentialAccessVectors. If set true   
                                      else false                                
  --namedVector (-nv)                 (Optional) Whether output vectors should  
                                      be NamedVectors. If set true else false   
  --logNormalize (-lnorm)             (Optional) Whether output vectors should  
                                      be logNormalize. If set true else false
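To make two of the pruning options above more concrete, here is a minimal in-memory sketch of how --minSupport and --maxDFPercent could filter terms. It is plain Java, not Mahout code: the class PruneSketch and the method keptTerms are hypothetical names, and the exact comparisons (total count >= minSupport, document-frequency percentage <= maxDFPercent) are assumptions based on the help text. Mahout applies these thresholds across several MapReduce jobs, and its details may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Mahout.
class PruneSketch {

  // Returns the terms that survive both thresholds, in sorted order.
  static List<String> keptTerms(List<List<String>> docs, int minSupport, int maxDFPercent) {
    Map<String, Integer> termCount = new HashMap<>(); // total occurrences in the corpus
    Map<String, Integer> docFreq = new HashMap<>();   // number of documents containing the term
    for (List<String> doc : docs) {
      for (String term : doc) {
        termCount.merge(term, 1, Integer::sum);
      }
      // Count each term at most once per document.
      for (String term : new HashSet<>(doc)) {
        docFreq.merge(term, 1, Integer::sum);
      }
    }
    List<String> kept = new ArrayList<>();
    for (String term : new TreeSet<>(termCount.keySet())) {
      // Assumed rule: keep terms seen at least minSupport times overall.
      boolean frequentEnough = termCount.get(term) >= minSupport;
      // Assumed rule: drop terms present in more than maxDFPercent of documents.
      boolean notTooCommon = 100.0 * docFreq.get(term) / docs.size() <= maxDFPercent;
      if (frequentEnough && notTooCommon) {
        kept.add(term);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<List<String>> docs = Arrays.asList(
        Arrays.asList("mahout", "vector", "the"),
        Arrays.asList("mahout", "the", "the"),
        Arrays.asList("hadoop", "the"));
    // Uses the defaults from the help text: minSupport 2, maxDFPercent 99.
    System.out.println(keptTerms(docs, 2, 99)); // prints [mahout]
  }
}
```

In this toy corpus "vector" and "hadoop" fall below the minimum support, while "the" appears in every document and exceeds the 99 percent cap, so only "mahout" remains.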

In the terminal output from yesterday's run, this step was invoked with the following command:

./bin/mahout seq2sparse -i /home/mahout/mahout-work-mahout/20news-seq -o /home/mahout/mahout-work-mahout/20news-vectors -lnorm -nv -wt tfidf

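Looking only at the options actually used: -lnorm means the output vectors are normalized with a log function (true when the flag is set); -nv means the output vectors are NamedVectors (what "named" means here is not yet clear to me); -wt tfidf selects the weighting scheme, for which see http://zh.wikipedia.org/wiki/TF-IDF.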
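As a quick illustration of the idea behind -wt tfidf, here is a minimal sketch of the textbook formula tf * ln(N / df). The class TfIdfSketch and its method are hypothetical, and this is the plain definition from the Wikipedia article rather than Mahout's implementation, which may use a different variant of the formula.

```java
// Hypothetical illustration of textbook TF-IDF; not Mahout's implementation.
class TfIdfSketch {

  // tf: occurrences of the term in one document
  // df: number of documents containing the term
  // numDocs: total number of documents in the corpus
  static double tfidf(int tf, int df, int numDocs) {
    if (tf == 0 || df == 0) {
      return 0.0;
    }
    return tf * Math.log((double) numDocs / df);
  }

  public static void main(String[] args) {
    // A term that occurs 3 times in a document and appears in 10 of 1000 documents.
    System.out.println(tfidf(3, 10, 1000));
    // A term that appears in every document gets weight 0.0.
    System.out.println(tfidf(3, 1000, 1000));
  }
}
```

The second call shows why the weighting is useful for classification: words that occur in every document contribute nothing, while rarer words are boosted.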

Step (1) is at line 253 of SparseVectorsFromSequenceFiles:

DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

Stepping into this call shows that the Mapper is SequenceFileTokenizerMapper and that no Reducer is used. The Mapper code is as follows:

protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  StringTuple document = new StringTuple();
  stream.reset();
  while (stream.incrementToken()) {
    if (termAtt.length() > 0) {
      document.add(new String(termAtt.buffer(), 0, termAtt.length()));
    }
  }
  context.write(key, document);
}

The Mapper's setup function mainly configures the Analyzer. For the Analyzer API, see http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html. The method used in map is reusableTokenStream(String fieldName, Reader reader): Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method.
I wrote the following test program:

package mahout.fansy.test.bayes;

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.common.ClassUtils;
import org.apache.mahout.common.StringTuple;

public class TestSequenceFileTokenizerMapper {

  // Instantiate the same analyzer class that the Mapper's setup would configure.
  private static Analyzer analyzer =
      ClassUtils.instantiateAs("org.apache.mahout.vectorizer.DefaultAnalyzer", Analyzer.class);

  public static void main(String[] args) throws IOException {
    testMap();
  }

  // Repeats the body of SequenceFileTokenizerMapper.map on one hand-written record.
  public static void testMap() throws IOException {
    Text key = new Text("4096");
    Text value = new Text("today is also late.what about tomorrow?");
    TokenStream stream = analyzer.reusableTokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    StringTuple document = new StringTuple();
    stream.reset();
    // Collect every non-empty token the analyzer emits.
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
    System.out.println("key:" + key.toString() + ",document" + document);
  }

}

The result is:

key:4096,document[today, also, late.what, about, tomorrow]

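Here the TokenStream has a stopwords attribute whose value is: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of], so these words are skipped when they are encountered. This is why "is" is missing from the output above.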
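The effect of the stopword list can be reproduced without Lucene. The sketch below is only an approximation under stated assumptions: StopwordSketch and tokenize are hypothetical names, and the tokenizer simply lowercases, splits on whitespace, and trims punctuation at the edges of each token, which is much cruder than Lucene's real tokenizer. It uses the stopword list quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; a rough stand-in for the Lucene analyzer.
class StopwordSketch {

  // The stopword list quoted in the text above.
  static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
      "but", "be", "with", "such", "then", "for", "no", "will", "not", "are", "and",
      "their", "if", "this", "on", "into", "a", "or", "there", "in", "that", "they",
      "was", "is", "it", "an", "the", "as", "at", "these", "by", "to", "of"));

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String raw : text.toLowerCase().split("\\s+")) {
      // Trim punctuation at the edges only, so "late.what" stays one token.
      String token = raw.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
      // Skip empty tokens and stopwords.
      if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // prints [today, also, late.what, about, tomorrow]
    System.out.println(tokenize("today is also late.what about tomorrow?"));
  }
}
```

On the test sentence this gives the same token list as the Mahout run: "is" is dropped as a stopword and "late.what" stays a single token because there is no space after the period.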

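Well, it has gotten too late again. I have been sleepy for a while now; time to floss and go to bed...

Share, enjoy, grow

When reposting, please credit the source: http://blog.csdn.net/fansy1990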
