1. Problem description:
When using Python's hdfs library to operate on HDFS, the client connects successfully, but reading a file raises an error:
from hdfs.client import Client

# Read an HDFS file and return its lines as a list
def read_hdfs_file(client, filename):
    lines = []
    with client.read(filename, encoding='utf-8', delimiter='\n') as reader:
        for line in reader:
            lines.append(line.strip())
    print(lines)
    return lines

client = Client("http://192.168.129.14:50070", root='/')
print('Connection OK')
data = read_hdfs_file(client, '/input/python/data1')
Error message:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='big08', port=50075): Max retries exceeded with url: /webhdfs/v1/input/python/data1?op=OPEN&namenoderpcaddress=ns&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1557e064a8>: Failed to establish a new connection: [Errno 111] Connection refused',))
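The error itself hints at what is going on: the connection that fails is not to the namenode we specified, but to a host called big08 on port 50075 (the datanode WebHDFS port). A small sketch parsing the URL from the traceback makes the relevant pieces visible (the URL is copied from the error above; nothing here talks to a real cluster):

```python
from urllib.parse import urlsplit, parse_qs

# The datanode URL taken verbatim from the traceback above
redirect_url = ("http://big08:50075/webhdfs/v1/input/python/data1"
                "?op=OPEN&namenoderpcaddress=ns&offset=0")

parts = urlsplit(redirect_url)
params = parse_qs(parts.query)

# The target is a bare hostname, not an IP: the client machine must be
# able to resolve it (via DNS or /etc/hosts) for the read to succeed.
print(parts.hostname)   # big08
print(parts.port)       # 50075
print(params["op"][0])  # OPEN
```

So even though we pointed the Client at an IP address, the actual file read is attempted against a cluster-internal hostname.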
2. Solutions
(1) The root path was not specified. Pass root when constructing the Client:
client = Client("http://10.0.30.9:50070", root='/')
(2) The hosts file on the machine running the Python program is missing hostname-to-IP mappings for the cluster nodes.
Explanation: in my setup, a 6-node cluster is accessed from a 7th machine. One might expect that giving the namenode's IP directly would be enough, since the hosts file only exists to resolve hostnames to IPs, yet using the IP alone still fails. The error message explains why: when a file is opened over WebHDFS, the namenode does not serve the data itself but redirects the client to the datanode that holds the block, and that redirect uses the datanode's hostname (here big08). The client machine therefore has to be able to resolve the cluster's hostnames, which is exactly what the hosts mapping provides.
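As a sketch, the client machine's hosts file would gain one line per cluster node (the IPs and most hostnames below are hypothetical placeholders; substitute your cluster's real addresses):

```
# /etc/hosts on the machine running the Python client
# (hypothetical addresses; big08 is the datanode from the error above)
192.168.129.14  big01
192.168.129.21  big08
```

After adding the mappings for all cluster nodes, the redirected datanode connection can resolve and the read succeeds.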