A requirement came up recently: computing user portraits (user profiles).

The system has roughly 8 million users, and some data has to be calculated for each of them.

The data volume is large. Computing it in Hive is no problem, but writing the results to Oracle so that they can be served to the front end is hard.

So another solution was chosen:

1. Compute in Hive and write the results to HDFS.

2. Read them out through an API and write them to HBase (the HDFS and HBase versions do not match, so Sqoop cannot load the data directly).
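
As a rough illustration of what the API has to deal with in step 2: when Hive writes a table as plain text with its default settings, the columns of each row are separated by the Ctrl-A character (`\u0001`). The sketch below splits one such line into fields. It assumes the default text format; the class name `HiveLineParser` and the sample row are hypothetical and not taken from the project.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: split one line of a Hive text-format export into its columns. */
final class HiveLineParser {
    // Default Hive field delimiter for text tables: Ctrl-A (code point 1).
    static final String FIELD_DELIMITER = "\u0001";

    /** Splits a line into fields and keeps trailing empty columns. */
    static List<String> parse(String line) {
        // Limit -1 keeps trailing empty strings, so every row keeps its column count.
        return Arrays.asList(line.split(FIELD_DELIMITER, -1));
    }

    public static void main(String[] args) {
        // Hypothetical row: user id, age bucket, and an empty last column.
        String line = "u1001" + FIELD_DELIMITER + "25-30" + FIELD_DELIMITER;
        System.out.println(parse(line).size() + " fields: " + parse(line));
    }
}
```

If the table was created with a different delimiter, only `FIELD_DELIMITER` would need to change.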

Then the problem came.

An API is needed to read the files on HDFS.

Main class: ReadHDFS

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadHDFS {
    public static void main(String[] args) {
        long startLong = System.currentTimeMillis();
        HDFSReadLog.writeLog("start read file");
        String path;
        if (args.length > 1) {
            // Taking the path from the arguments is disabled; Constant.PATH is used.
            // path = args[0];
        }
        HDFSReadLog.writeLog(Constant.PATH);
        try {
            getFile(Constant.URI + Constant.PATH);
        } catch (IOException e) {
            // Log the failure instead of swallowing it silently.
            HDFSReadLog.writeLog("read failed : " + e.getMessage());
        }
        long endLong = System.currentTimeMillis();
        HDFSReadLog.writeLog("cost " + (endLong - startLong) / 1000 + " seconds");
        HDFSReadLog.writeLog("cost " + (endLong - startLong) / 1000 / 60 + " minute");
    }

    private static void getFile(String filePath) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(filePath), HDFSConf.getConf());
        Path path = new Path(filePath);
        if (fs.exists(path) && fs.isDirectory(path)) {
            FileStatus[] stats = fs.listStatus(path);
            byte[] buffer;
            for (FileStatus file : stats) {
                // try-with-resources closes the stream even when reading fails.
                try (FSDataInputStream is = fs.open(file.getPath())) {
                    HDFSReadLog.writeLog("start read : " + file.getPath());
                    int sum = is.available();
                    if (sum == 0) {
                        HDFSReadLog.writeLog("have no data : " + file.getPath());
                        continue;
                    }
                    HDFSReadLog.writeLog("there have : " + sum + " bytes");
                    buffer = new byte[sum];
                    // Be careful: if the file is too large, there may not be enough memory.
                    // In a local test, reading a file of 100+ MB ran out of memory.
                    is.readFully(0, buffer);
                    String result = Bytes.toString(buffer);
                    // Write to HBase.
                    WriteHBase.writeHbase(result);
                    HDFSReadLog.writeLog("read : " + file.getPath() + " end");
                } catch (IOException e) {
                    HDFSReadLog.writeLog("read " + file.getPath() + " error : " + e.getMessage());
                }
            }
            HDFSReadLog.writeLog("Read End");
            fs.close();
        } else {
            HDFSReadLog.writeLog(path + " does not exist or is not a directory");
        }
    }
}
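
The comment in ReadHDFS warns that loading a whole file into one byte array ran out of memory on a file of 100+ MB. One possible way around this, sketched below, is to read the stream line by line and hand the lines on in small batches. This is not the approach used in the post. The class `BatchLineReader` and its parameters are hypothetical; in the real program the `InputStream` would be the one returned by `fs.open(file.getPath())`, and the sink would call whatever batch-capable HBase writer exists.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: stream a text file in fixed-size batches of lines. */
final class BatchLineReader {
    /**
     * Reads the stream line by line and passes the lines to the sink in batches,
     * so at most batchSize lines are buffered at once.
     * Returns the total number of lines read.
     */
    static long readInBatches(InputStream in, int batchSize,
                              Consumer<List<String>> sink) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final, partially filled batch
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HDFS stream; any InputStream works here.
        InputStream in = new ByteArrayInputStream(
            "a\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        long n = readInBatches(in, 2, batch -> System.out.println("batch: " + batch));
        System.out.println("lines: " + n);
    }
}
```

With this shape, memory use depends on the batch size rather than on the file size.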

Configuration class: HDFSConf (it turned out not to be of much use: once the URL and path are set, the files can be read without this configuration)

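import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class HDFSConf {
    public static Configuration conf = null;

    public static Configuration getConf() {
        if (conf == null) {
            conf = new Configuration();
            String path = Constant.getSysEnv("HADOOP_HOME") + "/etc/hadoop/";
            HDFSReadLog.writeLog("Get hadoop home : " + Constant.getSysEnv("HADOOP_HOME"));
            // hdfs conf (assumed: the standard client files under etc/hadoop)
            conf.addResource(new Path(path + "core-site.xml"));
            conf.addResource(new Path(path + "hdfs-site.xml"));
        }
        return conf;
    }
}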

Some constants:

url : hdfs://ip:port

path : the path on HDFS
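
To make the two constants concrete, here is a small sketch of how the url and path combine into the single string that ends up in `URI.create(...)`. The host name, the port 8020 and the table directory are placeholders chosen for illustration, and the class `HdfsUriExample` is hypothetical.

```java
import java.net.URI;

/** Sketch: join the NameNode address and an absolute HDFS path into one URI. */
final class HdfsUriExample {
    static URI build(String host, int port, String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("path must be absolute: " + path);
        }
        return URI.create("hdfs://" + host + ":" + port + path);
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with the real NameNode address and directory.
        URI uri = build("namenode.example.com", 8020, "/user/hive/warehouse/user_portrait");
        System.out.println(uri.getScheme() + " " + uri.getHost() + " "
            + uri.getPort() + " " + uri.getPath());
    }
}
```

Parsing the string with `java.net.URI` first is a cheap way to catch a malformed address, such as a missing `//` after the scheme, before handing it to the file system.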

Note: a table may be stored as more than one file, so the files are read in a loop.
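
When looping over a directory like this, it can help to skip entries that are not data. Hadoop tooling conventionally treats names starting with `_` or `.` as hidden (for example the `_SUCCESS` marker a job may leave behind). The sketch below applies that convention to plain file names; the class `DataFileFilter` is hypothetical, and in the HDFS loop the name would come from `file.getPath().getName()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: keep only data files, skipping marker and hidden files. */
final class DataFileFilter {
    /** True when the name does not start with '_' or '.'. */
    static boolean isDataFile(String name) {
        return !name.isEmpty() && !name.startsWith("_") && !name.startsWith(".");
    }

    /** Returns the data file names in their original order. */
    static List<String> dataFiles(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (isDataFile(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example directory listing with one marker file and one hidden entry.
        List<String> names = Arrays.asList("000000_0", "_SUCCESS", ".staging", "000001_0");
        System.out.println(dataFiles(names));
    }
}
```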

See the next chapter for writing the data to HBase.

