Date Published: February 8, 2019
Publisher: Public Library of Science
Author(s): Jieying Hong, Zhipeng Wang, Wei Niu, J. Alberto Conejero.
Approximation algorithms with linear complexities are required in the treatments of big data, however, present algorithms cannot output the diameter of a set of points with arbitrary accuracy and near-linear complexity. By introducing the partition technique, we introduce a very simple approximation algorithm with arbitrary accuracy ε and a complexity of O(N + ε−1 log ε−1) for the cases that all points are located in an Euclidean plane. The error bounds are proved strictly, and are verified by numerical tests. This complexity is better than existing algorithms, and the present algorithm is also very simple to be implemented in applications.
Given a finite set of points T in a 2D Euclidean plane R2, its diameter, denoted by dT, is defined as the maximum distance between two points of T. Computing the diameter of a point set is a fundamental problem in computer science. It has been proved that in an Euclidean plane, finding the accurate diameter of a set of N points can be reduced to formulating the convex hull of them, with a lower bound of complexity O(N log N) [1–4].
For the finite set of points T in R2, we choose a point O ∈ T arbitrarily as the origin, and then divide the plane into 6n same regions with n∈Z+*. In each region Si (i = 1, …, 6n), we can find a farthest point from the origin, and let ri denote the distance between the farthest point of Si and the origin. By using the origin O as the the center of a circle and ri as the radius, we can obtain 6n sector regions, as Fig 1. We remark that the number of the regions, 6n, will allow in the following parts to provide analytical error estimation. Let pi (i = 1, …, 6n) be the midpoint of the arc of each sector region, and compute the largest distance dp of these 6n midpoints pis. Then we propose the following main theorem of this paper which shows the relationship between the diameter dT of the point set T and the largest distance dp. Here we note that the virtual points pis can be different with the real points in T.
As illustrated in the previous section, expressions (3), (1) and (2) describe the error bounds of dp, do and do* respectively. However, in practice these upper and lower bounds correspond to the worst cases, while for most situations the error will be even smaller. In this section we show by three different point sets this error distribution, respectively. In the first point set T(1), the Cartesian coordinate of each point is (x, y), where x and y are independent random variables uniformly distributed in [0, 1), leading to a diameter close to the diagonal; in the second point set T(2), the polar coordinate of each point is (r, θ), where θ is a random variable homogeneously distributed in [0, π), and r is a random variable with Gaussian distribution N(0, 1) ; the third point set T(3) is chosen from a real database on the positions of fluid particles . Both T(1) and T(2) have 1 × 106 discrete points, while T(3) have 4 × 105 discrete points. We simply use O(n2) brute force method to calculate the diameter of the 6n virtual points. Indeed, this does not yield any inconvenience in the calculation, while the computational time of the case n = 100 is only 1.01 times of that of the case n = 1. Therefore we can conclude that the calculations are of near-linear complexity.
As a fundamental problem of big data, linear approximation algorithms for the diameter of a set of points will be potentially useful. By introducing the partition technique, we introduce an approximation algorithm with arbitrary accuracy and deterministically linear complexity. The implementation of this algorithm is very simple and does not require any complicated data structure. Note that the lower bound of the proposed algorithm is O(N + n log n) with n of the order of ε−1, while a brute force visiting algorithm for virtual points will increase this to O(N + n2). In practice n will be much smaller than N, therefore O(n2) will be negligible by comparing to O(N). In addition, increasing the number of partition n does not increase any multiple coefficient to O(N), which indicates the robustness of the near-linear complexity of our algorithm. Comparing to existing approximation algorithms, the present algorithm shows a lowest complexity O(N + ε−1 log ε−1). Also, another advantage of the present algorithm is that it is very simple to be implemented, which does not require any complicated data structure or geometry calculation.