Wang Huajin, Li Jianhui, Shen Zhihong, Zhou Yuanchun. ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries [J]. Computer Science, 2018, 45(3): 158-164
ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries
Received: 2017-01-15  Revised: 2017-06-15
DOI:10.11896/j.issn.1002-137X.2018.03.025
Keywords: Load balancing, MapReduce, Hive, Join, Reducer, ORC
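Funding: This work was supported by the National Key Research and Development Program of China: Scientific Big Data Management System (2016YFB1000600) and Collaborative Precision Positioning Technology (2016YFB0501900).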
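Authors, affiliations, and e-mail:
Wang Huajin: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049
Li Jianhui: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190; lijh@cnic.cn
Shen Zhihong: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
Zhou Yuanchun: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190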
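Abstract (translated from the Chinese):
      Load imbalance ranks first among the factors that affect the performance of large-scale MapReduce clusters, and Hive join queries trigger it very easily. The general solution is to design a key partitioning algorithm that achieves load balance based on the key frequency distribution of the intermediate key-value pairs. Existing work relies on monitoring and sampling the map output to estimate the key frequency distribution, which incurs considerable communication overhead and significantly delays the start of the shuffle. For Hive join queries, this paper proposes a key frequency distribution estimation method based on ORC metadata and a corresponding load-balancing key partitioning method. The method has low computational cost and low communication overhead, and it does not affect the existing shuffle mechanism. Benchmark tests demonstrate a large improvement in the efficiency of key frequency distribution estimation, as well as the improvement in Hive join query performance brought by the corresponding key partitioning method.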
English abstract:
      Load imbalance ranks first among the performance issues in large-scale MapReduce clusters, and Hive join queries are especially prone to triggering it. An effective solution is to design reducer load-balancing partitioning algorithms based on the key frequency histogram estimated from the intermediate key-value pairs. Existing approaches to key histogram estimation rely on monitoring and sampling the map output in a distributed manner, which generates heavy network traffic and notably delays the start of the shuffle. This paper proposes a novel key histogram estimation method based on ORC metadata, together with a corresponding load-balancing partitioning strategy, for Hive join queries. The proposed approach requires only lightweight computation before the job starts and therefore imposes no extra load on the network or the shuffle. Benchmark tests demonstrate significant improvements in both key histogram estimation and reducer load balancing.
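The partitioning idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's partitioning algorithm, so the code below substitutes a generic greedy heuristic (heaviest key first, assigned to the currently lightest reducer) and compares it with plain hash partitioning. The function names, the histogram values, and the reducer count are all hypothetical, and the histogram is simply given as input rather than derived from ORC metadata.

```python
import heapq


def balanced_partition(key_freq, num_reducers):
    """Greedy heuristic (an assumption, not the paper's algorithm):
    assign each key, heaviest first, to the least-loaded reducer."""
    # Min-heap of (current_load, reducer_id); the lightest reducer is on top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending frequency, breaking ties by key for determinism.
    for key, freq in sorted(key_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + freq, reducer))
    return assignment


def reducer_loads(key_freq, assignment, num_reducers):
    """Sum the estimated record count that each reducer would receive."""
    loads = [0] * num_reducers
    for key, freq in key_freq.items():
        loads[assignment[key]] += freq
    return loads


# Hypothetical estimated join-key histogram: key -> number of records.
key_freq = {0: 900, 1: 50, 2: 800, 3: 40, 4: 700, 5: 30}
num_reducers = 2

# Baseline: hash-style partitioning (key modulo the number of reducers).
hash_assignment = {k: k % num_reducers for k in key_freq}
greedy_assignment = balanced_partition(key_freq, num_reducers)

hash_loads = reducer_loads(key_freq, hash_assignment, num_reducers)
greedy_loads = reducer_loads(key_freq, greedy_assignment, num_reducers)

print("hash loads:  ", hash_loads)
print("greedy loads:", greedy_loads)
```

In this toy histogram the heavy keys all share the same hash bucket, so the frequency-aware assignment lowers the maximum reducer load. In a real job, the histogram would have to be known before the shuffle starts, which is the role the abstract assigns to ORC metadata.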