Verifying Approximate Histogram Computation in Hive

    xiaoxiao2021-04-16  294

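    A histogram gives a more intuitive picture of how data is distributed, and with one in hand you can tune execution parameters and execution plans in a much more targeted way. Computing an exact histogram, however, takes a huge amount of computation, so being able to get a reasonably accurate approximation is very valuable. Hive currently implements approximate histogram computation for numeric values. The NumericHistogram class describes its implementation as follows: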

    /**
     * A generic, re-usable histogram class that supports partial aggregations.
     * The algorithm is a heuristic adapted from the following paper:
     * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
     * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
     * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
     * of histogram bins.
     */

    If you are interested, see the paper "A streaming parallel decision tree algorithm".
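    For readers who would rather not go through the paper, the core idea can be sketched roughly as follows: keep at most a fixed number of (centroid, count) bins, and whenever a new value pushes the histogram over that limit, merge the two closest bins into their weighted average. This is a simplified illustration of that heuristic, not Hive's NumericHistogram code. The class name StreamingHistogramSketch, its method names, and the linear interpolation used in quantile are all assumptions made for this example, so its output need not match Hive's.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified streaming histogram (illustrative names, not Hive's API).
public class StreamingHistogramSketch {
    // One bin: a centroid x and the number of values folded into it.
    static final class Bin {
        double x;
        long count;
        Bin(double x, long count) { this.x = x; this.count = count; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // always sorted by centroid

    public StreamingHistogramSketch(int maxBins) {
        if (maxBins < 2) throw new IllegalArgumentException("maxBins must be >= 2");
        this.maxBins = maxBins;
    }

    // Insert a value as its own bin, then shrink back to maxBins if needed.
    public void add(double v) {
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) i++;
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).count++;
            return;
        }
        bins.add(i, new Bin(v, 1));
        if (bins.size() > maxBins) mergeClosestPair();
    }

    // Replace the two adjacent bins with the smallest gap by their weighted average.
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        long total = a.count + b.count;
        a.x = (a.x * a.count + b.x * b.count) / total;
        a.count = total;
    }

    public int binCount() { return bins.size(); }

    public long totalCount() {
        long total = 0;
        for (Bin b : bins) total += b.count;
        return total;
    }

    // Estimate the q-quantile by interpolating between neighbouring centroids.
    public double quantile(double q) {
        if (bins.isEmpty()) throw new IllegalStateException("empty histogram");
        double target = q * totalCount();
        double cum = 0;
        for (int i = 0; i < bins.size(); i++) {
            Bin b = bins.get(i);
            if (cum + b.count >= target) {
                if (i == 0) return b.x;
                Bin prev = bins.get(i - 1);
                return prev.x + (target - cum) / b.count * (b.x - prev.x);
            }
            cum += b.count;
        }
        return bins.get(bins.size() - 1).x;
    }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(10);
        for (double v = 1.0; v <= 50.0; v++) h.add(v);
        for (int k = 1; k <= 10; k++) {
            System.out.println(Math.round(h.quantile(k / 10.0)));
        }
    }
}
```

    Because each bin only stores a centroid and a count, memory stays bounded by the bin limit no matter how many values are added, which is what makes the approach usable on large tables.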

    I ran a quick test:

    package sunwg.test;

    // Assumes Hive's NumericHistogram (from hive-exec) is on the classpath; drop this
    // import if the class has been copied into this package instead.
    import org.apache.hadoop.hive.ql.udf.generic.NumericHistogram;

    public class testHis {
        public static void main(String[] args) {
            NumericHistogram numericHistogram = new NumericHistogram();
            numericHistogram.allocate(10); // num_bins = 10

            // Feed the values 1.0 through 50.0 into the histogram.
            for (double i = 1.0; i <= 50.0; i++) {
                numericHistogram.add(i);
            }

            // Print the estimated deciles, quantile(0.1) through quantile(1.0).
            for (int k = 1; k <= 10; k++) {
                System.out.println(Math.round(numericHistogram.quantile(k / 10.0)));
            }
        }
    }

    The output:

    3 8 12 18 24 29 33 38 42 48

    On the whole the results are fairly reliable. To improve accuracy, increase num_bins, which is the 10 passed to allocate above:

    numericHistogram.allocate(10);
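    To put a number on "fairly reliable", the reported output can be compared with the exact deciles of the same 1..50 input. The sketch below does that under one assumption: "exact" means the nearest-rank quantile, i.e. the ceil(q * n)-th smallest value. Other quantile definitions would shift the comparison slightly. The class and method names are made up for this example.

```java
public class DecileComparison {
    // Nearest-rank decile of a sorted array: the ceil(k * n / 10)-th smallest value, k = 1..10.
    public static double exactDecile(double[] sorted, int k) {
        int rank = (k * sorted.length + 9) / 10; // integer form of ceil(k * n / 10)
        return sorted[rank - 1];
    }

    // Largest absolute gap between the exact deciles and the reported estimates.
    public static double maxAbsDiff(double[] sorted, long[] reported) {
        double worst = 0;
        for (int k = 1; k <= 10; k++) {
            worst = Math.max(worst, Math.abs(exactDecile(sorted, k) - reported[k - 1]));
        }
        return worst;
    }

    public static void main(String[] args) {
        double[] data = new double[50];
        for (int i = 0; i < data.length; i++) data[i] = i + 1; // the same 1..50 input

        long[] reported = {3, 8, 12, 18, 24, 29, 33, 38, 42, 48}; // output quoted above

        for (int k = 1; k <= 10; k++) {
            System.out.printf("q=%.1f exact=%.0f reported=%d%n",
                    k / 10.0, exactDecile(data, k), reported[k - 1]);
        }
        System.out.println("max abs diff = " + maxAbsDiff(data, reported)); // prints 3.0
    }
}
```

    Under this definition the exact deciles are 5, 10, ..., 50, so with 10 bins every reported value is within 3 of the exact one on a range of 50.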

    NumericHistogram also supports merging multiple partial histograms.
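    Merging is what lets each mapper build its own histogram and a reducer combine them. The following is a minimal sketch of how such a merge can work, not Hive's actual merge method: pool the bins of both partial histograms, sort them by centroid, then keep collapsing the closest adjacent pair until the bin limit is met. Representing a bin as a two-element array {centroid, count} and the name HistogramMergeSketch are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HistogramMergeSketch {
    // Each bin is {centroid, count}. Inputs are left unmodified.
    public static List<double[]> merge(List<double[]> left, List<double[]> right, int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");

        // Pool copies of all bins and sort them by centroid.
        List<double[]> bins = new ArrayList<>();
        for (double[] b : left) bins.add(b.clone());
        for (double[] b : right) bins.add(b.clone());
        bins.sort((p, q) -> Double.compare(p[0], q[0]));

        // Collapse the closest adjacent pair until the bin limit is met.
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1)[0] - bins.get(i)[0];
                if (gap < bestGap) { bestGap = gap; best = i; }
            }
            double[] a = bins.get(best);
            double[] b = bins.remove(best + 1);
            double total = a[1] + b[1];
            a[0] = (a[0] * a[1] + b[0] * b[1]) / total; // weighted average centroid
            a[1] = total;
        }
        return bins;
    }

    public static void main(String[] args) {
        List<double[]> partialA = Arrays.asList(new double[]{1, 2}, new double[]{10, 1});
        List<double[]> partialB = Arrays.asList(new double[]{2, 2}, new double[]{20, 1});
        for (double[] bin : merge(partialA, partialB, 3)) {
            System.out.println("centroid=" + bin[0] + " count=" + bin[1]);
        }
    }
}
```

    In the example, the bins at 1 and 2 are the closest pair, so they collapse into a single bin at 1.5 holding 4 values, while the bins at 10 and 20 are kept as they are.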

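    The reason for looking into all this is the hope that data integration can study the data, learn its characteristics, and pick a more suitable splitpk, so that tasks are split more evenly and long-tail tasks are reduced. That would also free users from having to do this tuning themselves.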
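    As a rough illustration of why quantiles help here, the sketch below splits a skewed key column into four ranges twice, once with equal-width boundaries and once with boundaries placed at the quartiles, and counts the rows that land in each range. Everything in it is hypothetical: the key values (the squares of 1..100), the class SplitBoundarySketch, and the idea that a task reads the rows with lower < key <= upper. It also uses exact quantiles of a sorted array where a real job would plug in histogram estimates.

```java
import java.util.Arrays;

public class SplitBoundarySketch {
    // Boundaries at the 1/parts, 2/parts, ... quantiles of a sorted array (nearest rank).
    public static double[] quantileBoundaries(double[] sorted, int parts) {
        double[] cuts = new double[parts - 1];
        for (int k = 1; k < parts; k++) {
            int rank = (k * sorted.length + parts - 1) / parts; // ceil(k * n / parts)
            cuts[k - 1] = sorted[rank - 1];
        }
        return cuts;
    }

    // Rows per range, where range i holds the values v with cuts[i-1] < v <= cuts[i].
    public static int[] rowsPerRange(double[] values, double[] cuts) {
        int[] counts = new int[cuts.length + 1];
        for (double v : values) {
            int i = 0;
            while (i < cuts.length && v > cuts[i]) i++;
            counts[i]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical skewed keys: 1, 4, 9, ..., 10000 (already sorted).
        double[] keys = new double[100];
        for (int i = 0; i < keys.length; i++) keys[i] = (i + 1) * (i + 1);

        double[] equalWidth = {2500, 5000, 7500};
        double[] byQuantile = quantileBoundaries(keys, 4);

        System.out.println("equal width: " + Arrays.toString(rowsPerRange(keys, equalWidth)));
        System.out.println("by quantile: " + Arrays.toString(rowsPerRange(keys, byQuantile)));
    }
}
```

    With these keys the equal-width split puts 50, 20, 16, and 14 rows into the four ranges, while the quantile-based split puts 25 rows into each, which is the kind of balance that avoids a long-tail task.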
