Home

Jump to bottom

zhonghua edited this page Jun 2, 2020 · 2 revisions

A large-scale data integration solution for ElasticSearch

Introduction

ES-Fastloader is incubated from a real project in DiDi which moving and transforming 30TB data to ElasticSearch everyday. If using original ES update API will take dozens of hours that can't meet our needs. This solution will take advantage of parallel computing ability of Hadoop by splitting ES indexing into multiple reducers, each reducer will generate its own index's data. Finally we gather all these data and merge into indices files and load them into online ES clusters.
This project was launched at June, 2016. After that loading 30TB of data speed up to 90 minutes from dozens of hours, highly improved the efficiency of building ElasticSearch indices.
This solution has horizontal scalability. By adding more reducers, the indices' building time can be further reduced.

Indices Building Procedure

Create index for online ES clusters, pre-defined mapping and number of shards.
Start MapReduce task, assign the terabytes of data to the reducer according to the primary key id, the number of reducers should be equal to the number of production cluster's shards, On shuffle phase the mapping of primary key to reducer should be the same as the routing algorithm for id to shard from ES production environment.
Each reducer will create an ES instance and index which has same mapping as production environment, transforming data to ES indices.
After generating, reducers will upload indices' data to the corresponding partition in HDFS.
SCP indices' data to corresponding shard of online ES clusters.
Reboot online ES indices to active new data and provide services.

Architecture Diagram