周华平,刘光宗,张贝贝.基于索引偏移的MapReduce聚类负载均衡策略[J].计算机科学,2018,45(5):303-309
基于索引偏移的MapReduce聚类负载均衡策略
Load Balancing Strategy of MapReduce Clustering Based on Index Shift
投稿时间:2017-03-06  修订日期:2017-06-04
DOI:10.11896/j.issn.1002-137X.2018.05.053
中文关键词:  MapReduce,数据倾斜,负载均衡,分布式聚类,索引偏移
英文关键词:MapReduce,Data skew,Load balance,Distributed clustering,Index shift
基金项目:本文受国家自然科学基金(51174257),安徽理工大学矿业企业安全管理研究中心招标项目(SK2015A084),安徽省高校优秀青年人才支持计划项目资助
作者单位E-mail
周华平 安徽理工大学计算机科学与工程学院 安徽 淮南232000 hpzhou@aust.edu.com 
刘光宗 安徽理工大学计算机科学与工程学院 安徽 淮南232000 819461419@qq.com 
张贝贝 安徽理工大学计算机科学与工程学院 安徽 淮南232000 976982000@qq.com 
摘要点击次数: 290
全文下载次数: 214
中文摘要:
      MapReduce作为一种分布式编程模型,被广泛应用于大规模和高维度数据集的处理中。其采用原始Hash函数 划分 数据,当数据分布不均匀时,常会出现数据倾斜的问题。基于MapReduce的聚类算法,需要多次迭代且不清楚各阶段Reduce的输入数据分布,因此现有的解决数据倾斜的方法并不适用。为解决数据划分的不均衡问题,提出一种当存在数据倾斜时更改剩余分区索引的策略。该方法在Map运行的过程中统计将要分给各reducer的数据量,由JobTrackcr监控全局的分区信息并根据数据倾斜模型动态修改原分区函数;在接下来的分区过程中,Partitioner把即将导致倾斜的分区索引到其余负载较轻的reducer上,使各节点的负载达到均衡。基于Zipf分布数据集和真实数据集,将所提算法与现有的解决数据倾斜的方法进行对比,结果证明,所提策略解决了MapReduce聚类中的数据倾斜问题,且在稳定性与执行时间上优于Hash和基于采样的动态分区法。
英文摘要:
      MapReduce has been widely used in large-scale and high-dimension datasets as a kind of distributed programming model.Original Hash partition function in MapReduce often occurs data skew when data distribution is not uniform.In the clustering algorithm based on MapReduce,existing solutions for data skew are not applicable because the input data distribution of Reduce is unclear at each stage of multiple iteration.To solve the imbalance problem of data partitioning,this paper proposed a strategy to change the remaining partition index when data is tilted.In Map phase,the amount of data which will be distributed to each reducer is counted,then the global partition information is monitored and the original partition function is dynamically modified according to the data skew model by JobTrackcr,so the Partitioner can change the index of these partitions which will cause data skew to the other reducer that has less load in the next partitioning process,and eventually balance the load of each node.Finally,this method was compared with exi-sting methods on both synthetic datasets and real datasets.The experimental results show this strategy can solve data skew on MapReduce clustering with better stability and efficiency than Hash method and dynamic partitioning method based on sampling.
查看全文  查看/发表评论  下载PDF阅读器