Dropping a Hive Table That Holds Several TB of Data Across 200,000 Partitions

Copyright notice: this is an original article by the blogger. Source: http://blog.csdn.net/silentwolfyh https://blog.csdn.net/silentwolfyh/article/details/81224151

Contents

1. Requirement

2. Problem

3. Process

————————————————————————————-

1. Requirement

A Hive table holds several TB of data spread across 200,000 partitions, and the table needs to be dropped.

2. Problem

drop table if exists table_name;

The statement fails with the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out

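3. Process

3.1. Some people suggested the Hive table was locked. Background on table locking is at the link below. (This did not solve it.)

http://www.ericlin.me/2015/05/how-table-locking-works-in-hive/

3.2. Some people suggested a configuration problem. The configuration details are at the link below. (This did not solve it.)

https://stackoverflow.com/questions/34198339/attempt-to-do-update-or-delete-using-transaction-manager-that-does-not-support-t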

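3.3. Deleting a single partition with DELETE:

delete from table_name where dt='2018-07-23'; (Failed with an error. Hive supports DELETE only on transactional (ACID) tables.)

3.4. Dropping a single partition with ALTER TABLE:

ALTER TABLE table_name DROP PARTITION(dt='2018-07-23') (Succeeded.)
https://stackoverflow.com/questions/46307667/how-do-i-drop-all-partitions-at-once-in-hive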
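
Hive's DDL grammar also accepts several comma-separated PARTITION specs in a single ALTER TABLE ... DROP statement, which might reduce the number of hive invocations needed. The sketch below only builds such a statement as a string. The name build_drop_statement and its parameters are hypothetical, and the original post did not test this batched form.

```python
def build_drop_statement(table, dates):
    """Build one ALTER TABLE statement that drops several dt partitions.

    `table` and `dates` are illustrative parameters (assumptions, not from the post).
    """
    if not dates:
        raise ValueError("dates must not be empty")
    # One PARTITION (...) spec per date, separated by commas
    specs = ", ".join("PARTITION (dt='%s')" % d for d in dates)
    return "ALTER TABLE %s DROP IF EXISTS %s;" % (table, specs)


print(build_drop_statement("database.table_name", ["2018-07-22", "2018-07-23"]))
```

The resulting string could be passed to hive -e in place of a single-date statement. How many specs to put in one statement is a tuning choice that would need to be tried against the metastore in question.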

3.5. I dropped the partitions one day at a time with a Python script. The code is as follows:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

'''
Procedure for dropping a Hive table with millions of partitions or several TB of data.
1. List all partitions, then drop them one by one.
2. Finally, run drop table table_name;

My partition layout:
dt=2015-02-23/pkey=20160430
dt=2015-02-23/pkey=47121231E
dt=2015-02-24/pkey=20150620
dt=2015-02-24/pkey=20160430
dt=2015-02-24/pkey=47121231E
dt=2015-02-25/pkey=20150620
dt=2015-02-25/pkey=20151231
dt=2015-02-25/pkey=20160430
'''

import logging
import subprocess
import time

TABLE = 'database.table_name'


def now():
    # Current local time as a printable timestamp
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())


def run_hive(sql):
    # subprocess.run blocks until the hive CLI exits and raises if it fails,
    # so the logged end time is real and later steps never start early.
    result = subprocess.run(['hive', '-e', sql], stdout=subprocess.PIPE,
                            text=True, check=True)
    return result.stdout


if __name__ == '__main__':
    logging.basicConfig(filename='dropHiveTable.log', filemode="w", level=logging.DEBUG)

    start = now()

    # Collect the distinct dt values across all partitions
    dateSet = set()
    for partition in run_hive('show partitions %s' % TABLE).splitlines():
        partition = partition.strip()
        # Skip blank lines and anything that is not a partition spec
        if partition.startswith('dt='):
            dateSet.add(partition.split("=")[1].split("/")[0])

    # Sort the dates
    dateList = sorted(dateSet)
    logging.info("All dates:")
    logging.info(dateList)

    # Drop the partitions in Hive one date at a time
    for hiveDate in dateList:
        logStart = now()
        run_hive("ALTER TABLE %s DROP PARTITION(dt='%s');" % (TABLE, hiveDate))
        logEnd = now()
        logging.info("Partition [%s] dropped. Start time [%s], end time [%s]", hiveDate, logStart, logEnd)

    run_hive('drop table if exists %s;' % TABLE)
    # Log the overall start and end times
    end = now()
    logging.info("Program start time [%s], end time [%s]", start, end)
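
The date-collection step of the script can be isolated and checked against the sample partition listing from its docstring. This is a minimal sketch. distinct_dates is a hypothetical helper name that does not appear in the original script, and it assumes every partition line starts with dt=, as in the layout shown.

```python
def distinct_dates(lines):
    """Return the sorted, de-duplicated dt values from `show partitions` output."""
    dates = set()
    for line in lines:
        line = line.strip()
        if not line.startswith("dt="):
            continue  # skip blank lines or stray non-partition output
        first_key = line.split("/")[0]          # e.g. "dt=2015-02-23"
        dates.add(first_key.split("=", 1)[1])   # e.g. "2015-02-23"
    return sorted(dates)


sample = [
    "dt=2015-02-23/pkey=20160430",
    "dt=2015-02-23/pkey=47121231E",
    "dt=2015-02-24/pkey=20150620",
    "dt=2015-02-25/pkey=20151231",
    "",
]
print(distinct_dates(sample))  # ['2015-02-23', '2015-02-24', '2015-02-25']
```

Because the table is partitioned by both dt and pkey, collapsing the listing to distinct dt values means far fewer DROP PARTITION calls than there are partitions.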
