相关阅读:HyperLogLog 算法详解
代码实现
class HyperLogLog
{
public $b = 14;
public $m;
public $alpha;
public $registers;
public function __construct()
{
$this->m = pow(2, $this->b);
$this->registers = new \SplFixedArray($this->m);
$this->alpha = $this->_getAlpha();
}
/**
* @param $value
* @return bool
*/
public function add($value)
{
list($count, $m) = [1, $this->m];
$hash = hexdec(substr(md5($value), 8, 15));
$hash |= 1 << 60;
$index = $hash & ($m - 1);
$hash >>= $this->b;
while ($hash) {
if ($hash & 1) break;
$count++;
$hash >>= 1;
}
$this->registers[$index] = max($this->registers[$index], $count);
return TRUE;
}
/**
* @return int
*/
public function count()
{
$est = $numZero = 0;
for ($i = 0; $i < $this->m; $i++) {
$est += pow(2, -$this->registers[$i]);
!$this->registers[$i] && $numZero++;
}
if ($numZero == $this->m) return 0;
$est = $this->alpha * pow($this->m, 2) / $est;
if ($est <= (2.5 * $this->m) && $numZero != 0) {
$count = $this->m * log($this->m / $numZero);
} else {
$count = $est;
}
return round($count);
}
/**
* @param $data
* @return $this
*/
public function union($data)
{
$data = array_values(unpack('C*', $data));
foreach ($this->registers as $index => $count) {
$this->registers[$index] = max($this->registers[$index], $data[$index]);
}
return $this;
}
/**
* @param $data
* @return $this
*/
public function input($data)
{
if ($data) {
$data = unpack('C*', $data);
$this->registers = $this->registers->fromArray($data, FALSE);
}
return $this;
}
/**
* @return string
*/
public function output()
{
$params = array_merge(['C*'], $this->registers->toArray());
return call_user_func_array('pack', $params);
}
/**
* @return float
*/
private function _getAlpha()
{
switch ($this->m) {
case 16: {
$alpha = 0.673;
break;
}
case 32: {
$alpha = 0.697;
break;
}
case 64: {
$alpha = 0.709;
break;
}
default: {
$alpha = 0.7213 / (1 + 1.079 / $this->m);
}
}
return $alpha;
}
}
代码说明
- PHP 没有原生
64
位哈希算法,故采用md5()
作为哈希算法(效果比crc32()
要好) - PHP 不支持无符号整形(
PHP_INT_MAX=pow(2,63)-1
),所以哈希只截取md5()
结果的15
个字符(转成二进制共60
位,足够使用) - 实测 HyperLogLog 大范围修正效果不理想,此算法未对大基数估计进行偏差修正
- 由于
pack
函数的原因,b=14
时数据结构需要使用16 KB
空间(比 redis 的实现要多4 KB
)
使用示例
$h1 = new HyperLogLog();
for ($i = 1; $i <= 10; $i++) {
$h1->add($i); // 添加元素
}
$count1 = $h1->count(); // 计算基数
$h2 = new HyperLogLog();
for ($i = 6; $i <= 15; $i++) {
$h2->add($i);
}
$count2 = $h2->count();
$output2 = $h2->output(); // 输出二进制数据(可保存至文件或数据库)
$union = $h1->union($output2)->count(); // 合并 + 计算并集基数
echo "count1:$count1 count2:$count2 union:$union"; // count1:10 count2:10 union:15
误差数据
基数值 | 估算值 | 误差值 | 误差率 |
---|---|---|---|
1000 | 990 | -10 | -1.00% |
3000 | 3026 | +26 | +0.87% |
5000 | 5022 | +22 | +0.44% |
10000 | 9998 | -2 | -0.02% |
30000 | 29940 | -60 | -0.20% |
50000 | 50450 | +450 | +0.90% |
100000000 | 99805167 | -194833 | -0.19% |
200000000 | 201155087 | +1155087 | +0.58% |
300000000 | 297895051 | -2104949 | -0.70% |
400000000 | 399329392 | -670608 | -0.17% |
500000000 | 495024031 | -4975969 | -1.00% |
600000000 | 589187398 | -10812602 | -1.80% |
700000000 | 690067898 | -9932102 | -1.42% |
800000000 | 791446928 | -8553072 | -1.07% |
900000000 | 894686581 | -5313419 | -0.59% |