ipfs 切片技术

众所周知，ipfs里面使用的技术，是借鉴了现有的技术方案，当然不是说ipfs团队技术不行，而是如果站在巨人的肩膀上，可以一开始就望的更远。关于文件切片，git做的是非常优秀的，git出生名门，由linux torvalds一己之力实现，后由众多武林高手打磨。

git仓库目录

[root@localhost ipfs-cxx]# ll .git/
total 60
drwxr-xr-x.  2 root root  4096 Sep 19 17:06 branches
-rw-r--r--.  1 root root    54 Sep 19 17:45 COMMIT_EDITMSG
-rw-r--r--.  1 root root   260 Sep 19 17:06 config
-rw-r--r--.  1 root root    73 Sep 19 17:06 description
-rw-r--r--.  1 root root    23 Sep 19 17:06 HEAD
drwxr-xr-x.  2 root root  4096 Sep 19 17:06 hooks
-rw-r--r--.  1 root root 12720 Sep 19 17:43 index
drwxr-xr-x.  2 root root  4096 Sep 19 17:06 info
drwxr-xr-x.  3 root root  4096 Sep 19 17:06 logs
drwxr-xr-x. 25 root root  4096 Sep 19 17:45 objects
-rw-r--r--.  1 root root   107 Sep 19 17:06 packed-refs
drwxr-xr-x.  5 root root  4096 Sep 19 17:06 refs
[root@localhost ipfs-cxx]#

branches是分支目录，记录版本分支的拓扑
hooks是example shell脚本，高手可以看，普通用户一辈子也用不到一次
index文件是版本目录文件，以行为单位，记录文件名，文件sha1()哈希值，提交状态（add or commit），日期
objects/xy 存放切片文件。xy编码范围：00~ff。在ipfs中是x的范围是[2`8,A-Z], y的范围是[A-Z]
refs记录current分支，remote目标分支信息

ipfs仓库目录：

以windows为例：C:\Users\Administrator.ipfs\datastore 存储的是links文件，也就是元数据，使用leveldb数据库
C:\Users\Administrator.ipfs\blocks\xx 存储内容数据，使用普通的文件系统切片

无视golang这种垃圾语言写的版本，优秀的cxx版本的ipfs元数据设计格式为：

"source file name":
"{
	"type":"tree",
	"file_sha1":"5f01473a8c4d050bd2df5dabaca3d5e31e6b52f6",
	"file_name":"document.zip",
	"size":8236956
}"

"file_sha1":
"{
    "links":[
        {
            "hash":"166ec216e3b0848c17cbb323e9db7e4f96d71a44",
            "size":262144
        },
        {
            "hash":"5d0ac4a0daf189f3a1ef60483221a21b5d18cea4",
            "size":262144
        },
        {
            "hash":"92d3ad03c76c5aad4df7325ad9902164f57c249d",
            "size":262144
        },
        {
            "hash":"a721fb84fbff46920debe9b0d05ba0203899e274",
            "size":262144
        },
        {
            "hash":"fd8a834168dc8f7308a9c917ec4a4dd742a24d64",
            "size":110492
        }
    ],
    "slices_num":32,
    "type":"list",
    "file_name":"document.zip",
}
"

cxx版本文件仓库类

namespace ipfs {
	using namespace file_utils;
	using namespace crc8;
	using namespace db;

	namespace file {
		
		class repository final
		{
			/*
			 *注意：
			 *初期适用单库分表，框架全部文件形成后，适用再哈希，分库分表。对于元数据，10T以内文件的元数据，应该是可以撑得住的。
			 *1.0版本暂时不考虑分库分表。
			 *写入list.ldb数据库（leveldb）作为一个原始文件切片后的索引文件，key为list_name，value为root_list
			 *
			 *1.1版本设计思想：
			 文件object links元数据作为一条数据写入数据库，在文件大于100M时，需考虑将object links元数据写成多条数据（
			 例如一条list文件，links最大为256个，可支持索引64M的数据文件。
			 参照minix文件系统的设计方法，同时兼顾文件平均大小为100M的小文件，计划使用两级list后，
			 	第一级"type"="list_top1"
			 	第二级"type"="list_top2"，最大支持256 * 256 * 256kb = 16GB的文件切片存储和索引）
			 */
				bool add_file(const char * filepath)
				{
								if (false == is_file_exist(filepath)) {
									cout << "please source file exist!" << endl;
									return false;
								}
				
								int64_t total_size = file_size(filepath);
								if (-1 == total_size) {
									return false;
								}
								size_t slices_cnt = total_size / block_unit;
								size_t free = total_size % block_unit;
								size_t all_slices_count = slices_cnt;
								if (free) {
									all_slices_count += 1;
								}
				
								cout << filepath << " Number of slices=" << slices_cnt << " last slices size=" << free << endl;
								// ...
				}


			bool get_file_block(const char * path, void * buf, size_t offset, size_t len) {
							// 先打开leveldb目录文件，获取links key
							// 从leveldb获取key对应的links
							// 从links获取block sha1值，进而计算出object在仓库的绝对目录
							// 读取仓库object文件，写入buf返回
			
							// 读取目录文件
							string key(basename(path));
							string links_str;
							if (false == db->get_item(key, links_str)) {
								cout << "tree 元数据不存在" << endl;
								return false;
							}
							
							//cout << "get_file_block() 读取到的目录文件为：" << endl << links << endl; 
			
							Json::Reader reader;
							Json::Value root_tree;
							if (!reader.parse(links_str, root_tree)) {
								return false;
							}
							string type = root_tree["type"].asString();
							string file_sha1 = root_tree["file_sha1"].asString();
							string file_name = root_tree["file_name"].asString();				
							int64_t size = root_tree["size"].asInt64();
			}
			
			bool del_file(const string & sha1);
			bool del_file(const char *path);
			
			static void callback_repo_gc(const boost::system::error_code&,
					boost::asio::deadline_timer* pt)
			{
				std::cout << "定时10 sec 清理仓库垃圾文件!" << std::endl;

				pt->expires_at(pt->expires_at() + boost::posix_time::seconds(10)) ;

				pt->async_wait(boost::bind(callback_repo_gc, boost::asio::placeholders::error, pt));
			}

			static void repo_gc(void)
			{
				// unlink file which markede as removed
				//
				boost::asio::io_service io;
				boost::asio::deadline_timer t(io, boost::posix_time::seconds(10));
				t.async_wait(boost::bind(callback_repo_gc, boost::asio::placeholders::error, &t));
				std::cout << "launch up garbage clean timer !" << std::endl;
				io.run();
			}
			
		public:
			string path_{};
			string block_{};
			string datastore_{};
			size_t const 	block_unit{1024 * 256};
			Cleveldb * db{nullptr};
		}
	}
}

关于本文：

写本文作为技术分析之外，顺带说一点其它方面。

c++是一门及其优秀的开发语言，承上启下，既有高级语言的全部特性，也有c语言这样直接面向api底层的特性。
不像有些语言，如golang，python等所谓的高级语言，本身的库和编译器，解释器就是用c++实现的，反过来说c++不好。

现在有一些不明就里的工程师或者领导或pm，颇具积极无私指导他人的精神，站着说话不腰疼，仅仅是看了几篇误导性的技术博客，自己也不懂golang，也不懂python，就不厌其烦的要求别人用golang/python，是自己已经学会会这种开发语言知道了优点吗？还是把信息当知识，冒充学识，误导观众？

linux操作系统目前还是是纯c写的，很多大型软件和基础组件都是用c++些的。
一个简单项目的开发效率可能与语言有一点点关，一个中大型的项目与语言无关，个人认为主要取决于工程师的水平。
只要linux没有换成golang和python或java去写，c/c++还是不可替代的。

jacky画笔

发布了61 篇原创文章 · 获赞 63 · 访问量 2万+

私信关注

《第四节 ipfs 文件切片设计》

ipfs 切片技术

git仓库目录

ipfs仓库目录：

cxx版本文件仓库类

猜你喜欢