Reading a File with Multiple Threads in C++11

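I once wrote a blog post about processing large data sets, but when it came to reading a large file, I did not use any parallelism. Suppose that, to process a large file in batches, several threads now read it at the same time, each reading its own block (no two threads read the same content). Does this cause any problems?
Answer: if there are only reader threads, there is no problem. Each thread opens the file itself, so each open creates a separate file descriptor that points to a separate file table entry, and each file table entry keeps its own current file offset. A seek or read in one thread therefore does not move the position seen by any other thread, and no lock is needed at all.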
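The independence of the offsets can be checked even without threads. The following is a minimal sketch, not part of the original program: the helper name `offsets_are_independent` and the demo file contents are assumptions made for illustration. It opens the same file through two separate `std::ifstream` objects and confirms that seeking one leaves the other's read position untouched.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: writes a small demo file at `path`, opens it twice,
// and returns true if seeking one stream does not move the other.
bool offsets_are_independent(const std::string& path)
{
    assert(!path.empty());
    {
        // Create 20 bytes of known content; the stream closes at scope exit.
        std::ofstream out(path, std::ios::binary);
        out << "0123456789abcdefghij";
    }

    // Two independent opens of the same file.
    std::ifstream a(path, std::ios::binary);
    std::ifstream b(path, std::ios::binary);
    if (!a || !b) return false;

    a.seekg(10, std::ios::beg);  // move only stream a to byte offset 10

    char ca = 0;
    char cb = 0;
    a.get(ca);  // byte at offset 10 is 'a'
    b.get(cb);  // stream b is still at offset 0, so this is '0'
    return ca == 'a' && cb == '0';
}
```

The same reasoning applies when the two streams live in different threads, provided each thread creates its own stream rather than sharing one object.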

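Here is a practical example.
First, assume the current directory contains a file named 1.txt (note: it must use UNIX line endings, not DOS line endings, because the code seeks by byte offset and each DOS line ending adds an extra \r byte), with the following contents: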

The code that processes it is as follows:

#include <thread>
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <chrono>
using namespace std;


void thread_read_file(int tid, const string& file_path)
{
    ifstream file(file_path.c_str(), ios::in);
    if (!file.good()) {
        stringstream ss;
        ss << "Thread " << tid << " failed to open file: " << file_path << "\n";
        cout << ss.str();
        return;
    }
    
    // Byte offset at which this thread starts reading (0 for thread 0).
    int pos = tid * 10;
    
    file.seekg(pos, ios::beg);
    string line;
    getline(file, line);
    stringstream ss;
    ss << "Thread " << tid << ", pos=" << pos << ": " << line << "\n";
    cout << ss.str();
}

void test_detach(const string& file_path)
{
    for (int i=0; i<10; ++i) {
        std::thread  th(thread_read_file, i, file_path);
        th.detach(); 
    }
}

void test_join(const string& file_path)
{
    vector<std::thread> vec_threads;
    for (int i=0; i<10; ++i) {
        std::thread  th(thread_read_file, i, file_path);
        vec_threads.emplace_back(std::move(th));  // push_back() is also OK
    }
    
    // Wait for every thread to finish before returning.
    for (auto& th : vec_threads) {
        th.join();
    }
}


int main()
{
    string file_path = "./1.txt";
    test_detach(file_path);
    std::this_thread::sleep_for(std::chrono::seconds(1));  // crude wait for the detached threads; a sleep does not guarantee they have finished
    test_join(file_path);
    return 0;
}

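The code above shows two ways of handling threads: detach and join. For this example, a real program should use join. The reason is simple: with detach, the process exits as soon as the main thread finishes, so any detached thread that has not yet printed its output will never print it.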
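With join as the safer choice, the block-wise reading posed in the opening question can be sketched as follows. This is only an illustration under stated assumptions: the function names `read_chunk` and `parallel_read`, and the policy of splitting the file into equal byte ranges, are not from the original program. Each thread opens its own stream and writes into its own slot of a pre-sized vector, so no lock is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: reads up to `len` bytes starting at `offset` into `out`.
// Every call opens its own stream, so the file offset is private to the caller.
void read_chunk(const std::string& path, std::size_t offset, std::size_t len,
                std::string& out)
{
    std::ifstream file(path, std::ios::binary);
    if (!file) return;
    file.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    out.resize(len);
    file.read(&out[0], static_cast<std::streamsize>(len));
    out.resize(static_cast<std::size_t>(file.gcount()));  // keep only what was read
}

// Hypothetical helper: splits the file into n_threads byte ranges, reads them
// in parallel, and returns the chunks in file order.
std::vector<std::string> parallel_read(const std::string& path, std::size_t n_threads)
{
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    if (!probe || n_threads == 0) return {};
    const std::size_t size = static_cast<std::size_t>(probe.tellg());
    const std::size_t chunk = (size + n_threads - 1) / n_threads;  // ceiling division

    std::vector<std::string> parts(n_threads);  // one slot per thread, never resized
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < n_threads; ++i) {
        const std::size_t offset = std::min(i * chunk, size);
        const std::size_t len = std::min(chunk, size - offset);
        assert(offset + len <= size);
        threads.emplace_back(read_chunk, std::cref(path), offset, len,
                             std::ref(parts[i]));
    }
    for (auto& t : threads) t.join();  // all chunks are complete after this loop
    return parts;
}
```

Splitting by byte range can cut a line in half at a chunk boundary. A program that processes the file line by line would still need to stitch or realign those boundaries, which this sketch leaves out.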

(End)
