For a recent project I needed to generate a large volume of test data and benchmark gptext, the Greenplum full-text search extension. Inserting rows directly from Python was far too slow, managing only a few hundred thousand rows per hour. The approach that worked was to split the job: generate the data in Python, write it out to a text file, and then bulk-load that file into Greenplum with the COPY command. This saved a great deal of time and improved throughput dramatically:
1. Generate random data with Python:
The approach is described in detail here: https://blog.csdn.net/weixin_43315211/article/details/87929993
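The linked post (in Chinese) walks through building the record generator. As a rough, self-contained sketch of what such a generator could look like, assuming the column list used in the COPY command in step 3 (id_number, name, birthday, gender, phone, birth_place) — the field formats here are illustrative placeholders, not the linked article's exact logic:

import random
import string

def mkitems():
    # Hypothetical stand-in for the generator from the linked post:
    # returns one random person record as a dict of strings.
    birthday = '%04d-%02d-%02d' % (random.randint(1950, 2010),
                                   random.randint(1, 12),
                                   random.randint(1, 28))
    return {
        'id_number': ''.join(random.choices(string.digits, k=18)),
        'name': ''.join(random.choices(string.ascii_uppercase, k=3)),
        'birthday': birthday,
        'gender': random.choice(['M', 'F']),
        'phone': '1' + ''.join(random.choices(string.digits, k=10)),
        'birth_place': random.choice(['Beijing', 'Shanghai', 'Guangzhou', 'Chengdu']),
    }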
2. Write the records to a text file:
def write_to_csv():
    count = 0
    with open('C:\\Users\\Administrator\\Desktop\\people_info.csv', 'a') as f:
        for i in range(10000):
            count += 1
            items = mkitems()  # mkitems() builds one random record and returns a dict of strings
            j = items.values()  # value order must match the COPY column list (guaranteed on Python 3.7+)
            f.write(','.join(j) + '\n')
            if count % 1000 == 0:
                print(count)  # progress marker every 1000 rows
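One caveat with the join-based writer: a field that itself contains a comma or newline will corrupt its row. If that can happen, the standard-library csv module quotes such fields automatically, and the CSV option of COPY understands that quoting. A minimal variant under the same mkitems() assumption:

import csv

def write_to_csv_quoted(path='people_info.csv', rows=10000):
    # newline='' stops the csv module from emitting extra blank lines on Windows
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for i in range(rows):
            writer.writerow(mkitems().values())
            if (i + 1) % 1000 == 0:
                print(i + 1)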
3. Load the file into Greenplum with COPY:
The COPY syntax:
COPY table [(column [, ...])] FROM {'file' | STDIN}
    [ [WITH]
      [OIDS]
      [HEADER]
      [DELIMITER [ AS ] 'delimiter']
      [NULL [ AS ] 'null string']
      [ESCAPE [ AS ] 'escape' | 'OFF']
      [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
      [CSV [QUOTE [ AS ] 'quote']
           [FORCE NOT NULL column [, ...]]]
      [FILL MISSING FIELDS]
      [ [LOG ERRORS INTO error_table] [KEEP]
        SEGMENT REJECT LIMIT count [ROWS | PERCENT] ] ]
COPY {table [(column [, ...])] | (query)} TO {'file' | STDOUT}
    [ [WITH]
      [OIDS]
      [HEADER]
      [DELIMITER [ AS ] 'delimiter']
      [NULL [ AS ] 'null string']
      [ESCAPE [ AS ] 'escape' | 'OFF']
      [CSV [QUOTE [ AS ] 'quote']
           [FORCE QUOTE column [, ...]]] ]
The command used for this load:

COPY people_info (id_number, name, birthday, gender, phone, birth_place)
FROM '/home/people_info.csv'
WITH HEADER DELIMITER ',' CSV;

Two things to watch: with COPY FROM 'file', the path is read on the Greenplum master host, so the file must exist there; and HEADER tells COPY to skip the first line of the file, so drop it if, as with the writer above, no header row was written.
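If the load is driven from Python end to end, COPY FROM STDIN streams the data through the client connection and avoids placing the file on the master host. Below is a minimal sketch using psycopg2 (Greenplum speaks the PostgreSQL wire protocol, so the standard driver works); the connection parameters are placeholders. For fault-tolerant loads, Greenplum's single-row error isolation clause from the synopsis above (LOG ERRORS ... SEGMENT REJECT LIMIT count ROWS) can be appended to the same COPY statement.

import psycopg2

def bulk_load(csv_path='people_info.csv'):
    # Placeholder connection parameters -- substitute your own.
    conn = psycopg2.connect(host='localhost', port=5432,
                            dbname='testdb', user='gpadmin')
    try:
        with conn.cursor() as cur, open(csv_path, 'r') as f:
            # COPY FROM STDIN streams the file through the client connection,
            # so it does not need to exist on the Greenplum master.
            cur.copy_expert(
                "COPY people_info (id_number, name, birthday, gender, "
                "phone, birth_place) FROM STDIN WITH DELIMITER ',' CSV",
                f)
        conn.commit()
    finally:
        conn.close()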