Spark: too many small files when importing data into Hive

Environment:

Ambari 2.6.1

Spark 2.1

Python 3.6

Oracle 11.2

Sqoop 1.4

After importing the files that Sqoop had collected into HDFS into a Hive database, the import succeeded, but the Hive tables ended up containing many small files. This severely slows down loading when the data is analyzed later.

Solution:

SJTable = spark.sql("select * from " + tablename + "_tmp where att = '1E'")
datanum = SJTable.count()

# Fix the small-files problem: collapse the result into a single
# partition before writing, so the insert produces one output file.
SJTable_tmp = SJTable.repartition(1).persist()
SJTable_tmp.createOrReplaceTempView(tablename + "_cpu_tmp")

spark.sql("insert into table " + tablename + "_cpusj PARTITION(area,timdate) "
          "select lcn,pid,tim,tf,fee,bal,epid,etim,card_type,service_code,"
          "is_area_code,use_area_code,clea_day,CURRENT_TIMESTAMP,"
          "use_area_code as area,substr(tim,1,6) as timdate "
          "from " + tablename + "_cpu_tmp")

The resulting files after the fix: (screenshot from the original post not reproduced here)


Reposted from blog.csdn.net/qq_39160721/article/details/82387328