Environment:
Ambari 2.6.1
Spark 2.1
Python 3.6
Oracle 11.2
Sqoop 1.4
After the files collected by Sqoop into HDFS were loaded into the Hive database, the import succeeded, but the Hive tables ended up containing large numbers of small files, which severely slows down data loading in later analysis.
Solution:
# Read the raw records from the staging table
SJTable = spark.sql("select * from " + tablename + "_tmp where att = '1E'")
datanum = SJTable.count()

# Fix the small-file problem: collapse the DataFrame to a single partition
# so the insert writes one file per Hive partition instead of many
SJTable_tmp = SJTable.repartition(1).persist()
SJTable_tmp.createOrReplaceTempView(tablename + "_cpu_tmp")
spark.sql(
    "insert into table " + tablename + "_cpusj PARTITION(area, timdate) "
    "select lcn, pid, tim, tf, fee, bal, epid, etim, card_type, service_code, "
    "is_area_code, use_area_code, clea_day, CURRENT_TIMESTAMP, "
    "use_area_code as area, substr(tim, 1, 6) as timdate "
    "from " + tablename + "_cpu_tmp")
Files after the fix: