Splitting a file line by line with a given probability

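While preparing input data for a decision tree recently, I ran into a problem: if the input is a single file, how do you split it into training samples and test samples? Usually there should also be more training data than test data. Here is my approach: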

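Suppose the input file is to be split at a 7:3 ratio. This can be done with a random number: for each line, generate a random integer from 0 to 9 and check whether it is less than 7 (this threshold can be set according to the desired ratio of training data to test data). If it is, the line goes into the training data; otherwise it goes into the test data. Because each line is decided independently, the resulting file sizes are only approximately 7:3.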
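
The rule on its own, separate from any file I/O, can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `SplitSketch`, the holder type `Split`, and the `trainTenths` and `seed` parameters are hypothetical names introduced here, and the fixed seed is an addition that makes a run repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SplitSketch {

    /** Holds the two parts of a split (hypothetical helper type for this sketch). */
    public static final class Split {
        public final List<String> train = new ArrayList<>();
        public final List<String> test = new ArrayList<>();
    }

    /**
     * Sends each line to the training part with probability trainTenths/10,
     * otherwise to the test part. trainTenths = 7 gives roughly a 7:3 split.
     * The seed parameter is an assumption added so that runs can be repeated.
     */
    public static Split split(List<String> lines, int trainTenths, long seed) {
        if (trainTenths < 0 || trainTenths > 10) {
            throw new IllegalArgumentException("trainTenths must be between 0 and 10");
        }
        Random r = new Random(seed);
        Split result = new Split();
        for (String line : lines) {
            // nextInt(10) returns a uniform integer from 0 to 9
            if (r.nextInt(10) < trainTenths) {
                result.train.add(line);
            } else {
                result.test.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            lines.add("row" + i);
        }
        Split s = split(lines, 7, 42L);
        System.out.println("train=" + s.train.size() + ", test=" + s.test.size());
    }
}
```

The printed counts should land near 7000 and 3000 but will rarely hit them exactly, since every line is an independent draw. If an exact ratio is required, a different technique (for example shuffling the lines and cutting at a fixed index) would be needed.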

The full Java implementation, which reads the source file and writes the two output files, is below:

package org.fansy.filesplit.random;

import java.io.*;
import java.util.*;

public class SplitFile {

	/**
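	 * Randomly splits a file, line by line, into two files with a given probability.
	 * Intended to produce one file of training samples and one of test samples.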
	 * @param args
	 * @throws IOException 
	 */
	public static void main(String[] args) throws IOException {
		String sourceStr="/home/fansy/data/forest/car.txt";
		String des1Str="/home/fansy/data/forest/car_train.txt";
		String des2Str="/home/fansy/data/forest/car_test.txt";
		File source =new File(sourceStr);
		File des1=new File(des1Str);
		File des2=new File(des2Str);
		if(!source.exists()){
			System.out.println("source file does not exist");
			return ;
		}
		exist(des1);
		exist(des2);
		// read source file and split it into two files;
		// try-with-resources closes the reader and both writers even if an exception is thrown
		try (BufferedReader bf = new BufferedReader(new FileReader(source));
				FileWriter des1W = new FileWriter(des1, true);
				FileWriter des2W = new FileWriter(des2, true)) {
			String line;
			Random r = new Random();
			while ((line = bf.readLine()) != null) {
				// uniform random integer from 0 to 9
				int temp = r.nextInt(10);
				if (temp < 7) { // '7' can be changed in order to set the probability of train data and test data
					des1W.append(line + "\n");
				} else {
					des2W.append(line + "\n");
				}
			}
		}
		System.out.println("split file done ...");
	}
	
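	// if the output file already exists, delete it and create an empty one so old content is not appended to
	private static void exist(File f){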
		if(f.exists()){
			f.delete();
			boolean flag=false;
			try {
				flag = f.createNewFile();
			} catch (IOException e) {
				e.printStackTrace();
			}
			System.out.println("create file:"+f.getName()+" :"+flag);
		}
	}

}

Share, be happy, grow.

Please credit the source when reposting: http://blog.csdn.net/fansy1990
