Scenario:

Logstash, Kibana, and ES version: 6.3.1.

Use Logstash to sync users, together with all of each user's pets, from MySQL into ES.

Desired format:

{
    "register_name": "孟林洁",
    "id": 80469531,
    "pets": [
      {
        "breed_name": "万能梗",
        "birthday": null,
        "pet_id": 999044,
        "name": "一只狗",
        "images": "{\"result\":[\"https://petkit-img3.oss-cn-hangzhou.aliyuncs.com/img/tmp_6f4c8e92de0c53ab355fdb69214d4bf3.jpg\"]}",
        "breed_id": 130
      },
      {
        "breed_name": "万能梗",
        "birthday": null,
        "pet_id": 999097,
        "name": "一只狗2",
        "images": "{\"result\":[\"https://petkit-img3.oss-cn-hangzhou.aliyuncs.com/img/tmp_6f4c8e92de0c53ab355fdb69214d4bf3.jpg\"]}",
        "breed_id": 130
      }
    ],
    "mobile": "*******",
    "avatar": null,
    "pet_list": [
      999044,
      999097
    ]
}

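Problems

1. Syncing the nested type to ES with Logstash.
2. When Logstash syncs an array of nested objects, data is lost during aggregation (a user's pets go missing at random; occasionally nothing is lost).
3. Logstash syncs one record fewer than expected; the missing record is synced only when the Logstash service is stopped.
4. (Update) Multiple MySQL rows end up as a single document in ES.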

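Solutions:

1. Syncing the nested type to ES with Logstash
First create the index, then update its mapping so that the pets field has type nested.

Create the index: PUT /user
Update the index mapping:
PUT /user/_mapping/doc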
{
    "doc": {
      "properties": {
        "avatar": {
          "type": "text"
        },
        "id": {
          "type": "long"
        },
        "mobile": {
          "type": "text"
        },
        "pets": {
          "type": "nested",
          "properties": {
            "birthday": {
              "type": "date"
            },
            "breed_id": {
              "type": "long"
            },
            "breed_name": {
              "type": "text",
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_max_word"
            },
            "images": {
              "type": "text"
            },
            "name": {
              "type": "text",
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_max_word"
            },
            "pet_id": {
              "type": "long"
            }
          }
        },
        "register_name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
}
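The point of mapping pets as nested is that each pet object is indexed on its own, so a query can require several conditions to hold on the same pet. Below is a minimal sketch that builds the body of such a nested query as a Ruby hash and prints it as JSON. The helper name nested_pet_query is my own invention for illustration, and the search values are taken from the sample document above; the body would be sent to something like GET /user/_search.

```ruby
require 'json'

# Hypothetical helper: builds a nested query body that matches users owning
# at least one pet whose breed_id AND name both match on the same pet object.
def nested_pet_query(breed_id, name)
  {
    'query' => {
      'nested' => {
        'path' => 'pets',
        'query' => {
          'bool' => {
            'must' => [
              { 'term'  => { 'pets.breed_id' => breed_id } },
              { 'match' => { 'pets.name' => name } }
            ]
          }
        }
      }
    }
  }
end

puts JSON.pretty_generate(nested_pet_query(130, '一只狗'))
```

If pets were mapped as a plain object field instead, the two conditions could be satisfied by two different pets of the same user, which is usually not what is wanted.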

Use the aggregate plugin, one of Logstash's filter plugins, to aggregate the data.

Configuration file jdbc3.conf

input {
    stdin {
    }
    jdbc {
        jdbc_driver_library => "../mysql-connector-java-6.0.6.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_connection_string => "jdbc:mysql://****.com:3306/food-dev"
        jdbc_user => "****"
        jdbc_password => "****"
        #jdbc_paging_enabled => "true"
        #jdbc_page_size => "50"
        clean_run => true
        use_column_value => true
        record_last_run => "true"
        tracking_column => "id"
        schedule => "*/1 * * * *"
        #last_run_metadata_path => "/Users/menglinjie/ES-node/testdata.text"
        statement => "select u.id,u.register_name,u.mobile,u.avatar,u.status,svp.id  as pet_id,svp.name,svp.images,svp.gender,svp.birthday,pb.id as breed_id,pb.name as breed_name from user u left join store_vip_pet svp on svp.user_id = u.id and svp.pet_status = 1 left join pet_breed pb on svp.breed_id = pb.id order by u.id desc"
    }
}
 
filter {
    # The aggregation happens here
    aggregate {
        task_id => "%{id}"
        code => "
            map['id'] = event.get('id')
            map['register_name'] = event.get('register_name')
            map['mobile'] = event.get('mobile')
            map['avatar'] = event.get('avatar')
            map['pet_list'] ||= []
            map['pets'] ||= []
            if (event.get('pet_id') != nil)
                if !(map['pet_list'].include? event.get('pet_id'))
                    map['pet_list'] << event.get('pet_id')
                    map['pets'] << {
                        'pet_id' => event.get('pet_id'),
                        'name' => event.get('name'),
                        'images' => event.get('images'),
                        'breed_id' => event.get('breed_id'),
                        'breed_name' => event.get('breed_name'),
                        'birthday' => event.get('birthday')
                    }
                end
            end
            event.cancel()
        "
        # When a row with a new task_id arrives, push the previous map as a new event
        push_previous_map_as_event => true
        # Push the last pending map once its task times out (seconds)
        timeout => 5
    }
    json {
        source => "message"
        remove_field => ["message"]
        #remove_field => ["message", "type", "@timestamp", "@version"]
    }
    mutate {
        # Drop the JSON fields that are not needed so they are not stored in ES
        remove_field => ["tags", "@timestamp", "@version"]
    }
}
 
output {
    stdout {
        #codec => json_lines
    }
    elasticsearch {
        hosts => ["127.0.0.1:9200"]
        index => "user"
        document_id => "%{id}"
    }
}
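To make it easier to see what the code block of the aggregate filter does, here is a plain-Ruby imitation that can be run outside Logstash. It is only a sketch: the function name aggregate_rows and the sample rows are made up, only a few of the columns are carried over, and rows are plain hashes rather than Logstash events.

```ruby
# Plain-Ruby imitation of the aggregate code block (not the plugin itself):
# rows sharing the same id are folded into one map with 'pet_list' and 'pets'.
def aggregate_rows(rows)
  maps = {}
  rows.each do |row|
    map = (maps[row['id']] ||= {})
    map['id'] = row['id']
    map['register_name'] = row['register_name']
    map['pet_list'] ||= []
    map['pets'] ||= []
    # The left join yields a nil pet_id for users without pets.
    next if row['pet_id'].nil?
    # Skip a pet that was already collected for this user.
    next if map['pet_list'].include?(row['pet_id'])
    map['pet_list'] << row['pet_id']
    map['pets'] << { 'pet_id' => row['pet_id'], 'name' => row['name'] }
  end
  maps.values
end

# Made-up sample rows, shaped like the result of the SQL statement.
rows = [
  { 'id' => 1, 'register_name' => 'a', 'pet_id' => 10, 'name' => 'dog' },
  { 'id' => 1, 'register_name' => 'a', 'pet_id' => 11, 'name' => 'dog2' },
  { 'id' => 2, 'register_name' => 'b', 'pet_id' => nil, 'name' => nil }
]
aggregate_rows(rows).each { |doc| puts doc.inspect }
```

The user with two pets becomes one document with two entries in pets, and the user without pets gets an empty pets array.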

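2. Fixing the loss of child array objects during aggregation

At first I wondered whether SQL paging was to blame, with a user's pets being aggregated separately instead of together. But the logs and the data showed that different records were lost on every run, with no pattern at all.

The randomness made me think of multithreading, so I went looking through the Logstash configuration. config/logstash.yml contains the following settings: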

# ------------ Pipeline Settings --------------
#
# The ID of the pipeline.
# (note) Pipeline ID
# pipeline.id: test1
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
# (note) Number of threads for output and filter; defaults to the number of CPU cores
# pipeline.workers: 1
#
# How many events to retrieve from inputs before sending to filters+workers
#
# pipeline.batch.size: 1000
#
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
#
# pipeline.batch.delay: 50

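Root cause: when the aggregation runs on multiple threads, the pets of one user can be handed to different threads and aggregated separately. One user then ends up as several documents, each holding a different subset of pets. These documents are written to ES in parallel, and because they share the same id, each write overwrites the document that is already there. That is why my pets went missing, and the randomness of thread scheduling is why the loss was random too.

I tried changing the number of worker threads to check whether this guess was right.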
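The explanation above can be played through with a simplified model in plain Ruby. This is only a sketch of the reasoning, not of Logstash internals: the function names aggregate and index_all are made up, the split of rows between two workers is a hypothetical fixed one (in reality it varies from run to run), and the "index" is just a hash keyed by document id where the last write wins.

```ruby
# Fold rows into one document per user id (same idea as the aggregate code).
def aggregate(rows)
  rows.group_by { |r| r[:id] }.map do |id, group|
    { id: id, pet_list: group.map { |r| r[:pet_id] }.compact.uniq }
  end
end

# Model of indexing with a fixed document_id: a later document with the
# same id replaces the earlier one.
def index_all(docs)
  docs.each_with_object({}) { |doc, index| index[doc[:id]] = doc }
end

rows = [
  { id: 1, pet_id: 10 }, { id: 1, pet_id: 11 }, { id: 1, pet_id: 12 }
]

# One worker sees all rows of the user.
single = index_all(aggregate(rows))

# Hypothetical split of the same rows across two workers.
worker_a, worker_b = rows.partition { |r| r[:pet_id].even? }
multi = index_all(aggregate(worker_a) + aggregate(worker_b))

puts "one worker:  #{single[1][:pet_list].inspect}"
puts "two workers: #{multi[1][:pet_list].inspect}"
```

With one worker the user keeps all three pets. With the rows split, whichever partial document is written last is the one that survives, so the other pets disappear.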

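In my setup, starting Logstash with the config file given on the command line did not pick up the settings I expected from logstash.yml (command-line flags override logstash.yml, and passing -f also makes Logstash ignore pipelines.yml). So I set the config path in logstash.yml and started Logstash without specifying a config file; it then reads logstash.yml on its own.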

# path.config: /Users/menglinjie/ES-node/logstash-6.3.1/conf.d

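The test confirmed the guess.

Thinking back, setting the worker count to 1 is bound to hurt performance, and if I later have data that needs no aggregation, it could perfectly well run on multiple threads. Logstash provides pipelines.yml for configuring multiple pipelines, so that different sync tasks can be bound to different pipeline settings.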

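Here I used pipeline.workers: 4 and pipeline.output.workers: 3, assuming that this leaves one thread for the aggregating filter, so that aggregation is single-threaded while output is multi-threaded. Note, however, that pipeline.workers sets the number of threads that run the filter and output stages together (see the logstash.yml comment above), and the aggregate plugin documentation asks for a single filter worker, so pipeline.workers: 1 is the safe setting for a pipeline that aggregates.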

Several tasks can be configured as several pipelines; pipeline.id uniquely identifies each pipeline.

- pipeline.id: user_pipeline
  pipeline.workers: 4
  pipeline.batch.size: 1000
  # Output
  pipeline.output.workers: 3
  # Location of the config files
  path.config: "/Users/menglinjie/ES-node/logstash-6.3.1/conf.d/*.conf"
  # Use the disk-based persistent queue. The default is memory
  queue.type: persisted

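Update:
The SQL statement also affects the aggregation result!!
The SQL statement must be ordered by the aggregation task_id, so that the rows to be aggregated together are adjacent. Otherwise map['pets'] gets overwritten and data is lost.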
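Why the ordering matters can be seen in a small plain-Ruby model of push_previous_map_as_event. This is a sketch under my own simplifying assumptions, not the plugin's code: whenever the id changes, the map built so far is emitted and a fresh one is started, and the "index" is a hash keyed by document id. The names aggregate_stream and index_all are made up.

```ruby
# Simplified model of push_previous_map_as_event: whenever the task id
# changes, the map built so far is emitted and a fresh one is started.
def aggregate_stream(rows)
  emitted = []
  current = nil
  rows.each do |row|
    if current && current[:id] != row[:id]
      emitted << current
      current = nil
    end
    current ||= { id: row[:id], pet_list: [] }
    current[:pet_list] << row[:pet_id]
  end
  emitted << current if current # final flush (timeout or shutdown)
  emitted
end

# Model of indexing with a fixed document_id: the last write wins.
def index_all(docs)
  docs.each_with_object({}) { |doc, index| index[doc[:id]] = doc }
end

unsorted = [{ id: 1, pet_id: 10 }, { id: 2, pet_id: 20 }, { id: 1, pet_id: 11 }]
sorted = unsorted.sort_by { |r| [r[:id], r[:pet_id]] }

puts "unsorted: #{index_all(aggregate_stream(unsorted))[1][:pet_list].inspect}"
puts "sorted:   #{index_all(aggregate_stream(sorted))[1][:pet_list].inspect}"
```

In the unsorted case user 1 is emitted twice, once per run of adjacent rows, and the second document replaces the first. With the rows sorted by id, all pets of a user land in one map.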

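3. Logstash syncs one record fewer than expected; the missing record is synced only when the Logstash service is stopped

Add the following to the aggregate filter configuration: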

timeout => 3

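While the aggregate filter is building an event map, it has no way of knowing whether the current task should end, that is, which row is the last one. The timeout tells it how many seconds to wait before closing the task and moving on to the next. This is not very rigorous, though, because you cannot be sure how long a task will actually take. The best approach is to give the task an explicit end condition; see the description of the aggregate plugin on the Elastic website.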
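The "one record short" behaviour follows directly from this. Here is a minimal plain-Ruby model of it, again a sketch rather than the plugin's code: the class name PendingAggregator is made up, and its flush method stands in for the timeout or for stopping Logstash.

```ruby
# Minimal model: a map is emitted only when a row with a different id
# arrives, so the last map stays pending until flush is called
# (standing in for the timeout or a shutdown).
class PendingAggregator
  attr_reader :emitted

  def initialize
    @emitted = []
    @current = nil
  end

  # Feed one row; emits the previous map if the id changed.
  def push(row)
    flush if @current && @current[:id] != row[:id]
    @current ||= { id: row[:id], pet_list: [] }
    @current[:pet_list] << row[:pet_id]
  end

  # Emit whatever map is still pending.
  def flush
    @emitted << @current if @current
    @current = nil
  end
end

agg = PendingAggregator.new
[{ id: 1, pet_id: 10 }, { id: 2, pet_id: 20 }].each { |row| agg.push(row) }
puts "before flush: #{agg.emitted.map { |d| d[:id] }.inspect}"
agg.flush
puts "after flush:  #{agg.emitted.map { |d| d[:id] }.inspect}"
```

Before the flush only user 1 has been emitted; user 2, the last one in the result set, appears only after the flush. Without a timeout, that flush happens only at shutdown.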

4. The document id configured for ES must be unique, otherwise documents overwrite each other

Reference links:

https://segmentfault.com/a/1190000016592277

https://segmentfault.com/q/1010000016861266

https://blog.csdn.net/weixin_33910460/article/details/88719101

https://elasticsearch.cn/question/6648
