PHP HyperLogLog算法实现

[ 2018-01-01 ] [ 回到首页 ]

相关阅读:HyperLogLog 算法详解


代码实现

class HyperLogLog
{
    public $b = 14;
    public $m;
    public $alpha;
    public $registers;

    public function __construct()
    {
        $this->m = pow(2, $this->b);
        $this->registers = new \SplFixedArray($this->m);
        $this->alpha = $this->_getAlpha();
    }

    /**
     * @param $value
     * @return bool
     */
    public function add($value)
    {
        list($count, $m) = [1, $this->m];
        $hash = hexdec(substr(md5($value), 8, 15));
        $hash |= 1 << 60;
        $index = $hash & ($m - 1);
        $hash >>= $this->b;
        while ($hash) {
            if ($hash & 1) break;
            $count++;
            $hash >>= 1;
        }
        $this->registers[$index] = max($this->registers[$index], $count);
        return TRUE;
    }

    /**
     * @return int
     */
    public function count()
    {
        $est = $numZero = 0;
        for ($i = 0; $i < $this->m; $i++) {
            $est += pow(2, -$this->registers[$i]);
            !$this->registers[$i] && $numZero++;
        }
        if ($numZero == $this->m) return 0;
        $est = $this->alpha * pow($this->m, 2) / $est;
        if ($est <= (2.5 * $this->m) && $numZero != 0) {
            $count = $this->m * log($this->m / $numZero);
        } else {
            $count = $est;
        }
        return round($count);
    }

    /**
     * @param $data
     * @return $this
     */
    public function union($data)
    {
        $data = array_values(unpack('C*', $data));
        foreach ($this->registers as $index => $count) {
            $this->registers[$index] = max($this->registers[$index], $data[$index]);
        }
        return $this;
    }

    /**
     * @param $data
     * @return $this
     */
    public function input($data)
    {
        if ($data) {
            $data = unpack('C*', $data);
            $this->registers = $this->registers->fromArray($data, FALSE);
        }
        return $this;
    }

    /**
     * @return string
     */
    public function output()
    {
        $params = array_merge(['C*'], $this->registers->toArray());
        return call_user_func_array('pack', $params);
    }

    /**
     * @return float
     */
    private function _getAlpha()
    {
        switch ($this->m) {
            case 16: {
                $alpha = 0.673;
                break;
            }
            case 32: {
                $alpha = 0.697;
                break;
            }
            case 64: {
                $alpha = 0.709;
                break;
            }
            default: {
                $alpha = 0.7213 / (1 + 1.079 / $this->m);
            }
        }
        return $alpha;
    }
}

代码说明

  • PHP 没有原生 64 位哈希算法,故采用 md5() 作为哈希算法(效果比 crc32() 要好)
  • PHP 不支持无符号整形(PHP_INT_MAX=pow(2,63)-1),所以哈希只截取 md5() 结果的 15 个字符(转成二进制共 60 位,足够使用)
  • 实测 HyperLogLog 大范围修正效果不理想,此算法未对大基数估计进行偏差修正
  • 由于 pack 函数的原因,b=14 时数据结构需要使用 16 KB 空间(比 redis 的实现要多 4 KB

使用示例

$h1 = new HyperLogLog();
for ($i = 1; $i <= 10; $i++) {
    $h1->add($i); // 添加元素
}
$count1 = $h1->count(); // 计算基数

$h2 = new HyperLogLog();
for ($i = 6; $i <= 15; $i++) {
    $h2->add($i);
}
$count2 = $h2->count();

$output2 = $h2->output(); // 输出二进制数据(可保存至文件或数据库)
$union = $h1->union($output2)->count(); // 合并 + 计算并集基数

echo "count1:$count1   count2:$count2   union:$union"; // count1:10 count2:10 union:15

误差数据

基数值 估算值 误差值 误差率
1000 990 -10 -1.00%
3000 3026 +26 +0.87%
5000 5022 +22 +0.44%
10000 9998 -2 -0.02%
30000 29940 -60 -0.20%
50000 50450 +450 +0.90%
100000000 99805167 -194833 -0.19%
200000000 201155087 +1155087 +0.58%
300000000 297895051 -2104949 -0.70%
400000000 399329392 -670608 -0.17%
500000000 495024031 -4975969 -1.00%
600000000 589187398 -10812602 -1.80%
700000000 690067898 -9932102 -1.42%
800000000 791446928 -8553072 -1.07%
900000000 894686581 -5313419 -0.59%