Approximating (k,)-Median Clustering for Polygonal Curves
Approximating (k,)-Median Clustering for Polygonal Curves
Buchin, Maike; Driemel, Anne; Rohde, Dennis
2023-02-23 00:00:00
Approximating (�, ℓ)-Median Clustering for Polygonal Curves MAIKE BUCHIN, Faculty of Computer Science, Ruhr University Bochum, Germany ANNE DRIEMEL, Hausdorf Center for Mathematics, University of Bonn, Germany DENNIS ROHDE, Faculty of Computer Science, Ruhr University Bochum, Germany In 2015, Driemel, Krivošija and Sohler introduce(�,dℓ)the -median clustering problem for polygonal curves under the Fréchet distance. Given a set of input curves, the problem asks to� ind median curves of at mostℓ vertices each that minimize the sum of Fréchet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this paper, we present a randomized bicriteria-approximation algorithm that works for polygonal curvRes and in achieves approximation factor (1+ �) with respect to the clustering costs. The algorithm has worst-case running time linear in the number of curves, polynomial in the maximum number of vertices per curve, i.e. their complexity, and exponential �, ℓ, 1/� in and 1/�, i.e., the failure probability. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with similar cost as an optimal median curve of complexity ℓ, but of complexity at most ℓ −22, and whose vertices can be computed eiciently. We combine this lemma with the superset sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest. CCS Concepts: · Theory of computation→ Design and analysis of algorithms; Computational geometry; · Mathe- matics of computing→ Probabilistic algorithms . Additional Key Words and Phrases: clustering, polygonal curves, approximation algorithms, median 1 INTRODUCTION Since the development �of -means ś the pioneer of modern computational clustering ś the last 65 years have brought a diversity of specialize 6, 7, 15 d [, 21, 22, 33, 34] as well as generalized clustering algorithms 2, 5, 26]. [ However, in most cases clustering of point sets was studied. Many clustering problems indeed reduce to clustering of point sets, but for sequential data like time series and trajectories ś which arise in the natural sciences, medicine, sports, inance, ecology, audio/speech analysis, handwriting and many more ś this is not the case. Hence, we need specialized clustering methods for these purposes, c.f. [1, 14, 20, 31, 32]. A promising branch of this active research deals(�,with ℓ)-center and (�, ℓ)-median clustering ś adaptions of the well-known Euclidean �-center and �-median clustering.(�In, ℓ)-center clustering, respectiv(�,eℓ)-median clustering, we are given a set�of polygonal curves in R of complexity (i.e., the number of vertices of the curve) at most � each and want to compute � centers that minimize the objective function ś just as in Euclidean �-clustering. In addition, the complexities of the centers are bounded by aℓconstant to obtain compact aggregates of the clusters, which is enabled by the continuous nature of the employed distance measure. This also prevents over-itting (see the discussions in [8, 17]). A great beneit of regarding the sequential data as polygonal curves is that we introduce an implicit linear interpolation. This does not require any additional storage space since we only need to store the vertices of the curves, which are the sequences at hand. We compare the polygonal curves by their Fréchet distance, that is a continuous distance measure which takes the entire course of the curves into account, not only the pairwise distances among their vertices. Therefore, irregular sampled sequences Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proit or commercial advantage and that copies bear this notice and the full citation on the irst page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2022 Copyright held by the owner/author(s). 1549-6325/2022/8-ART https://doi.org/10.1145/3559764 ACM Trans. Algor. 2 • Maike Buchin, Anne Driemel, and Dennis Rohde are automatically handled by the interpolation, which is desirable in many cases. Moreo.v[12] er, Buchin et al showed, by using heuristics, that(the �, ℓ)-clustering objectives yield promising results on trajectory data. This branch of research formed only recently, about twenty years after Alt and Godau developed an algorithm to compute the Fréchet distance between polygonal curv4es ]. Se [ veral papers have since studied this type of clustering 10[ś 12, 17, 28]. However, all of these clustering algorithms, except the approximation-schemes for polygonal curves R in[17] and the heuristics in 12],[choose a �-subset of the input as centers. (This is also often callediscr d eteclustering.) This �-subset is later simpliied, or all input-curves are simpliied before choosing a �-subset. Either way, using these techniques one cannot achieve an approximation factor of less than 2. This is because there need not be an input curve with distance to its median which is less than the average distance of a curve to its median. Driemel et. al [17], who were the irst to study clustering of polygonal curves under the Fréchet distance in this setting, already overcame this problem in one dimension by deining and �-signatur analyzing es, which are succinct representations of classes of curves that allow synthetic center curves to be constructed. However, it seems that �-signatures are only applicable R. Her ine, we extend their work and obtain the irst randomized bicriteria approximation algorithm (�, ℓ)-me fordian clustering of polygonal curRves . in 1.1 Related Work Driemel et. al [17] introduced the(�, ℓ)-center and(�, ℓ)-median objectives and developed the irst approximation- schemes for these objectives, for curves R in . Furthermore, they proved that (�, ℓ)-center as well as(�, ℓ)-median clustering is NP-hard, wher � is e a part of the input and ℓ is ixed. Also, they showed that the doubling dimension of the metric space of polygonal curves under the Fréchet distance is unbounded, even when the complexity of the curves is bounded. Following this work, Buchin . [10] et al developed a constant-factor approximation algorithm (�, ℓfor )-center clustering R in. Furthermore, they provide improved results on the hardness of approximating (�, ℓ)-center clustering under the Fréchet distance:(the �, ℓ)-center problem is NP-hard to approximate within a factor of (1.5−�) for curves inR and within a factor(of 2.25−�) for curves inR , where � ≥ 2, in both cases even if � = 1. Furthermore, for the(�, ℓ)-median variant, Buchin et . [11] alproved NP-hardness using a similar reduction. Again, the hardness holds even if � = 1. Also, they provide(d1+ �)-approximation algorithms(�for , ℓ)-center, as well as (�, ℓ)-median clustering, under the discrete Fréchet distance. Nath and T[30] aylor give improved algorithms for(1+ �)-approximation(of �, ℓ)-median clustering under discrete Fréchet and Hausdorf distance. Recently, Meintrup et al . [28] introduced a practical (1 + �)-approximation algorithm for discr �-me ete dian clustering under the Fréchet distance, when the input adheres to a certain natural assumption, i.e., the presence of a certain number of outliers. Our algorithms build upon the clustering algorithm of Kumar . [27], which et al was later extended by Ackermann et al . [2]. As a main result they proved that any dissimilarity�measur (·,·) which e satisies a so-called sampling property ś i.e., given an input set � and a constant-size uniform sample � from� , a median of � can be computed in time depending only|�on| and is a(1+�)-approximate median�ofwith high probability ś admits a randomized (1+ �)-approximation algorithm for �-me thedian problem under �. This algorithm also exists under a weak variant of the sampling property, where the median � doof es not need to be computed exactly but rather a constant-size set of candidates(in time depending only|�|on ,� and the probability of failur �), which e contain a (1+ �)-approximate median�ofwith high probability. It is a recursive approximation-scheme that employs two phases in each call. In the so-calle candidate d phaseit computes candidates by taking a sample � from the input set� and running an algorithm on each subset� of of a certain size. Which algorithm to use depends on the dissimilarity measure at hand. The idea behind this is � contains simple:aif cluster � that takes a constant fraction of its size, then a constant fraction � isof from � with high probability. By brute-force enumeration of ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 3 ′ ′ all subsets �of , we ind this subset� ⊆ � and if� is taken uniformly and independently at random � fr then om ′ ′ � is a uniform and independent sample�fr.om Ackermann et al . proved for various metric and non-metric dissimilarity measures, � that can be used for computing candidates that contain(1+a �)-approximate median for� with high probability. The algorithm recursively calls itself for each candidate to eventually evaluate these together with the candidates for the remaining clusters. The second phase of the algorithm is the so-calle pruning d phase, where it partitions its input according to the candidates at hand into two sets of equal size: one with the smaller distances to the candidates and one with the larger distances to the candidates. It then recursively calls itself with the second set as input. The idea behind this is that small clusters now become large enough to ind candidates for these. Furthermore, the partitioning yields a provably small error. Finally it returns � candidates the set of that together evaluated best. Fig. 1. From let to right: symbolic depiction of the operation principle of Algorithms 1, 2 and 4. Among all approximate ℓ-simplifications (depicted in blue) of the input curves (depicted in black), Algorithm 1 returns the one that evaluates best (the solid curve) with respect to a sample of the input. Algorithm 2 does not return a single curve, but a set of candidates. These include the curve returned by Algorithm 1 plus all curves with ℓ vertices from the cubic grids, covering balls of certain radius centered at the vertices of an input curve that is close to a median, w.h.p. Algorithm 4 is similar to Algorithm 2 but does not only cover the vertices of a single curve, but of multiple curves. We depict the best approximate median that can be generated from the grids in solid green. 1.2 Our Contributions We present several algorithms for approximating (1, ℓ)-median clustering of polygonal curves under the Fréchet distance, see Fig. 1 for an illustration of the operation principles of our algorithms. While the irst one, Algorithm 1, yields only a coarse approximation (factor 34), it is suitable as plugin for the following two algorithms, Algorithms 2 and 4, due to its asymptotically fast running time. These algorithms yield a better approximation + �, (factor 3 respectively+1�). Additionally, Algorithms 2 and 4 are not only able to yield an approximation for the input set � , but for a cluster � ⊆ � that takes a constant fraction of � . We would like to use these as plugins to the (1+ �)-approximation algorithm�-me fordian clustering by Ackermann et . [2] al , but that would require our algorithms to comply with one of the sampling properties. Recall that for�anthe input weakset sampling property requires that a constant-size set of candidates that contains with high probability (1+ �)-appraoximate median of� can be computed from a constant-size uniform and independent sample � . Furthermor of e, the running time for computing the candidates depends only on the size of the�sample and the ,failure probability parameter. The strong sampling property is deined similarly, but instead of a candidate set, an approximate median can be computed directly and the running time may only depend on the size of the sample. In our ACM Trans. Algor. 4 • Maike Buchin, Anne Driemel, and Dennis Rohde algorithms however, the running time for computing the candidate set dep�ends which on is a parameter of the input. Additionally, our irst algorithm for computing candidates, which (3+ �)contain -approximate a (1, ℓ)-median with high probability, does not achieve the required approximation-factor (1+ �). Howof ever, looking into the analysis of Ackermann ., any et al approximation algorithm computing candidates can be used in the recursive approximation-scheme. Therefore, we decided to generalize �-methe dian clustering algorithm of Ackermann et al. [2]. Nath and Taylor[30] use a similar approach, but they developed yet another way to compute candidates: they deine and analyze�-coverability, which is a generalization of the notion of doubling dimension and indeed, for the discrete Fréchet distance the proof builds upon the doubling dimension R of.pHo oints wever in , the doubling dimension of polygonal curves under the Fréchet distance is unbounded, even when the complexities of the curves are bounded and it is an open question whether �-coverability holds for the continuous Fréchet distance. We circumvent this by taking a diferent approach using the idea of shortcutting. It is well-known that shortcutting a polygonal curve (that is, replacing a subcurve by the line segment connecting its endpoints) does not increase its Fréchet distance to a line segment. This idea has been used before for a variety of Fréchet-distance related problems3[, 9, 16, 17]. Speciically, we introduce two new shortcutting lemmata. These lemmata guarantee the existence of good approximate medians, with complexity at ℓ −most 2 and 2 whose vertices can be computed eiciently. The irst one enables us to return candidates, which contain (3+ �)-appr a oximate median for any cluster that takes a constant fraction of the input, w.h.p., and we call simpleit shortcutting . The second one enables us to return candidates, which contain(1a+�)-approximate median for any cluster that takes a constant fraction of the input, w.h.p., and we call advance it d shortcutting. All in all, we obtain as our main result, following from Corollary 7.5: Theorem 1.1. Given a set� of� polygonal curves inR , of complexity at most� each, parameter values�∈ (0, 0.158] and � ∈ (0, 1), and constants �, ℓ ∈ N, there exists an algorithm, which computes a set � of � polygonal curves, each of complexity at most2ℓ − 2, such that with probability at least(1− �), it holds that ︁ ︁ cost(�,�) = mind (�,�) ≤ (1+ �) mind (�,�) = (1+ �) cost(�,� ) F F �∈� �∈� �∈� �∈� where � is an optimal(�, ℓ)-median solution for� under the Fréchet distanced (·,·). The algorithm has worst-case running time linear�in , polynomial in� and exponential in1/�,1/�,� and ℓ. We remark that cost(�,�) < cost(�,� ) may happen due to the possible increased center complexity of up to 2ℓ − 2. 1.3 Organization The paper is organized as follows. First we present a simple and fast 34-approximation algorithm (1, ℓ)-median for clustering. Then, we present the(3+ �)-approximation algorithm(1for , ℓ)-median clustering inside a cluster that takes a constant fraction of the input, which builds upon simple shortcutting and the 34-approximation algorithm. Then, we present a more practical modiication(3+ of�)the -approximation algorithm, which achieves a (5+ �)-approximation for (1, ℓ)-median clustering. Following this, we present the similar but more involved (1+ �)-approximation algorithm(1for , ℓ)-median clustering inside a cluster that takes a constant fraction of the input, which builds upon the advanced shortcutting and the 34-approximation algorithm. Finally we present the generalized recursiv �-me e dian approximation-scheme, which leads to our main result. ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 5 2 PRELIMINARIES Here we introduce all necessary deinitions. In the follo � ∈ Nwing is an arbitrary constant. For � ∈ N we denote [�] = {1, . . .,�} and by � ⊎� we denote the union of the disjoint�sets ,� . By ∥·∥ we denote the Euclidean norm � � and for� ∈ R and �∈ R we denote by �(�,�) = {� ∈ R | ∥� − �∥ ≤ �} the closed ball of radius � with center ≥0 �. ByS we denote the symmetric group of degre�e. We give a standard deinition of grids: Deinition 2.1 (grid).Given a number� ∈ R , for� = (� , . . .,� ) ∈ R we deine by� (�,�) = (⌊� /�⌋ · >0 1 � 1 �, . . .,⌊� /�⌋ · �) the �-grid-point �of . Let � be a subset ofR . The grid of cell width � that covers � is the set G(�,�) = {� (�,�) | � ∈ � }. Such a grid partitions the�set into cubic regions and for each �∈ R and� ∈ � we have that∥�−� (�,�)∥ ≤ ��. >0 We give a standard deinition of polygonal curves: Deinition 2.2 (polygonal curve).A (parameterized) curve is a continuous mapping �: [0, 1] → R . A curve � is polygonal, if there exist � , . . .,� ∈ R , no three consecutive on a line, calle �’s vertices d and� , . . .,� ∈ [0, 1] 1 � 1 � with� < ··· < � , � = 0 and � = 1, calle�d’s instants, such that� connects every two consecutive vertices 1 � 1 � � = �(�),� = �(� ) by a line segment. � � �+1 �+1 We call the line segments � � , . . .,� � the edges of� and � the complexity �of , denoted by |�|. Sometimes 1 2 � −1 � we will argue about a sub-cur�vof e a given curv�e. We will then refer� by to restricting the domain�,of denoted by �| , where � ⊆ [0, 1]. Deinition 2.3 (Fréchet distance).LetH denote the set of all continuous bijections ℎ : [0, 1] → [0, 1] withℎ(0) = 0 and ℎ(1) = 1, which we call reparameterizations. The Fréchet distance between�cur and ves� is deined as d (�,�) = inf max ∥�(�) − �(ℎ(�))∥. ℎ∈H �∈[0,1] Sometimes, given two curv�es,�, we will refer toℎ an∈ H as matching between� and �. Note that there may not exist a matching ℎ ∈ H, such that max ∥�(�)−�(ℎ(�))∥ = d (�,�). This is due to the �∈[0,1] F fact that in some cases a matching realizing the Fréchet distance would need to match multiple � , . . .,�points 1 � on � to a single point � on �, which is not possible since matchings need to be bijections, � , .but . .,�thecan 1 � get matched arbitrarily close �, rto ealizing(�d,�) in the limit, which we formalize in the following lemma: � ∞ Lemma 2.4. Let �,�: [0, 1] → R be curves. Let � = d (�,�). There exists a sequence(ℎ ) inH, such that F � �=1 lim max ∥�(�) − �(ℎ (�))∥ = �. �→∞ �∈[0,1] Proof. Deine� : H → R ,ℎ ↦→ max ∥�(�) − �(ℎ(�))∥ with image � = {�(ℎ) | ℎ ∈ H}. Per deinition, we ≥0 �∈[0,1] have d (�,�) = inf� = �. For any non-empty subset � ofR that is bounded from below and for ev�er>y 0 it holds that there exists an � ∈ � withinf� ≤ � < inf� + �, by deinition of the inimum. �Since ⊆ R and inf� exists, for ever�y > 0 ′ ′ there exists an� ∈ � with inf � ≤ � < inf� + �. Now, let� = 1/�be a zero sequence. For every �∈ N there exists an� ∈ � with� ≤ � < �+ � , thus lim � = �. � � � � � �→∞ −1 ′ ′ −1 ′ ′ Let � (� ) = {ℎ ∈ H | �(ℎ) = � } be the preimage of�. Since� is a function, |� (� )| ≥ 1 for each� ∈ �. Now, −1 for�∈ N, letℎ be an arbitrary element from � (�). By deinition it holds that � � lim max ∥�(�) − �(ℎ (�))∥ = lim �(ℎ ) = lim � = � = inf�, � � � �→∞ �→∞ �→∞ �∈[0,1] which proves the claim. □ ACM Trans. Algor. 6 • Maike Buchin, Anne Driemel, and Dennis Rohde We deine the relation � ∼ � ⇐⇒ ∃ℎ ∈ H : � = �◦ ℎ, which is an equivalence relation 4]. Base[d on this we introduce the classes of curves we are interested in. Deinition 2.5 (polygonal curve classes).For � ∈ N, we deine byR the set of equivalence classes with respect � � to∼, of polygonal curves in ambient R space . For � ∈ N we deine byR the subclass of polygonal curves of complexity at most � . Simpliication is a fundamental problem related to curves, which appears as sub-problem in our algorithms. Deinition 2.6 (minimum-errℓor-simpliication).For a polygonal curv�e∈ R we denote by simpl (�,�) an � ′ �-approximate minimum-errℓ-simpliication or �, i.e of., a curve� ∈ R withd (�,�) ≤ � · d (�,� ) for all F F ′ � � ∈ R . Now we deine the(�, ℓ)-median clustering problem for polygonal curves. Deinition 2.7((�, ℓ)-median clustering).The(�, ℓ)-median clustering problem is deined as follo�ws, ,�∈wher N e are ixed (constant) parameters of the problem: given a inite and non-empty � ⊂setR of polygonal curves, ∗ � ∗ ∗ compute a set of� curves � ⊂ R , such that cost(�,� ) = mind (�,�) is minimal. ∗ ∗ � ∈� �∈� We call cost(·,·) the objective function and we often write cost(�,�) as shorthand forcost(�,{�}). Also, for simplicity we call (1, ℓ)the -median problem the ℓ-median problem and a solution toℓit -mean dian. 2.1 Sampling Sampling is the process of repeatedly drawing elements from some non-empty set with a certain probability, c.f. [29]. The result of a sampling process is a multiset �, which we call sample. The following theorem of Indyk 25] [ utilizes uniform sampling and is useful for evaluating the cost of a curve at hand. Theorem 2.8 ([25, Theorem 31]). Let �∈ (0, 1] and � ⊂ R be a non-empty set of polygonal curves. Further let � be a non-empty sample, drawn uniformly and independently at random from� , with replacement. For�,� ∈ � with cost(�,�) > (1+ �) cost(�,�) it holds thatPr[cost(�,� ) ≤ cost(�,� )] < exp −� |� |/64 . 2.1.1 Superset Sampling. This technique was coined by Kumar et . [27] al and also applied by Ackermann et . al [2]. Here, we draw a sample� uniformly and independently with replacement from a non-empty�inite , and set we want � to contain a uniform and independent sample � from a subset� ⊆ � . As can easily be veriied, the elements� ∈ � contained in � form a uniform sample � from� : Proposition 2.9. Let � be a non-empty inite set and� ⊆ � be a non-empty subset. Let � be sampled uniformly and independently with replacement from� and let � be the event that there is a subset� ⊆ � of size at least� ∈ N, ′ ′ ′ ′ 1 with� ⊆ � . For each � ∈ � and � ∈ � it holds thatPr[� = � | �] = . |� | This statement follows from elementary probability theory, for completeness, however, we provide a proof. |�| Proof of Proposition 2.9. The sample space is � , where we neglect the order of the elements of the elemen- |�| |�| � |�|−� tary events, and consists of|� | tuples, each of which occurs with the same probability·|.�In| ·|� | of |�| � |�|−� ( )·|� | ·|� | them are at least� elements from� . Thus, Pr[�] = . |�| |� | ′ |�| �−1 |�|−� Without loss of generality, we assume� that corresponds to the irst draw from� . There are ·1·|�| ·|� | |�| �−1 |�|−� ( )·1·|� | ·|� | ′ ′ � tuples such that� = � and at least� elements are from� . Hence, Pr[(� = �) ∩ �] = . Finally, |�| |� | ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 7 we have |�| �−1 |�|−� ·1·|� | ·|� | ( ) |�| Pr[(� = �) ∩ �] |� | 1 Pr[� = � | �] = = = . |�| � |�|−� Pr[�] ·|� | ·|� | |�| ( ) |�| |� | We often employ the following, which can be seen as a union bound for superset sampling. Lemma 2.10. Let � , . . ., � , � be events, withPr[�] > 0. It holds that 1 � " # ! � � Ù ︁ Pr � ∩ � ≥ 1− Pr[�] + Pr[� | �] . � � �=1 �=1 Proof. We have: " # " # " # " # � � � � � � Ù Ø Ø Ø Ø Ø Pr � ∩ � = 1− Pr � ∪ � = 1− Pr (� ∩ �) ∪ (� ∩ �) ∪ (� ∩ �) = 1− Pr � ∪ (� ∩ �) � � � � � � �=1 �=1 �=1 �=1 �=1 �=1 ! ! � � ︁ ︁ ≥ 1− Pr[�] + Pr[� ∩ �] ≥ 1− Pr[�] + Pr[� | �] . � � �=1 �=1 In the irst equation we use De Morgan’s law, the irst inequality follows from a union bound, and the last inequality follows, since [� | �Pr] ≥ Pr[� ∩ �] for all �∈ [�]. □ � � The following concentration bound also applies to independent Bernoulli trials, which are a special case of Poisson trials where each trial has same probability of success. In particular it can be used to bound the probability that superset sampling is not successful. Lemma 2.11 (Chernoff bound for independent Poisson trials [29, Theorem 4.5]). Let � , . . .,� be indepen- 1 � dent Poisson trials. For� ∈ (0, 1) it holds that " " # # " # ! � � � ︁ ︁ 2 ︁ Pr � ≤ (1− �) E � ≤ exp − E � . � � � �=1 �=1 �=1 3 SIMPLE AND FAST 34-APPROXIMATION FOR ℓ-MEDIAN Here, we present Algorithm 1, a 34-approximation algorithmℓ-me fordian the problem. It is based on the following. We can obtain a 3-approximate solution (in terms of objective value) ℓ-metodian the problem for a given set � = {� , . . .,� } ⊂ R of polygonal curves with high probability by uniformly and independently sampling a 1 � suicient number of curves from � . In detail, when the sample size only depends on the failure probability, it contains with high probability one of the�at/2least input curves (by an averaging argument) that are within ∗ ∗ ∗ distance 2· cost(�,� )/� to an optimal ℓ-median� for� . These curves have cost up to 3 · cost(�,� ) by the triangle-inequality. We call this sample the candidate sample. We can improve on running time by using Theorem 2.8 and evaluate the cost of each curve in the candidate sample against another sample (which we call witness sample) of similar size instead of against the complete input. Though, we have to accept an approximation factor of 5 (if�w=e 1set in Theorem 2.8). That is indeed acceptable, since we only obtain an approximate solution in terms of objective value and ignore the bound on the number of vertices of the center curve, which is a disadvantage of this approach and may even result in the obtained center having cost less than cost(�,� ), ifℓ > � . To ix this, we simplify the candidate curve that evaluated best against the witness sample, using an eicient minimum-err ℓ-simpliication or approximation ACM Trans. Algor. 8 • Maike Buchin, Anne Driemel, and Dennis Rohde algorithm, which downgrades the approximation factor + 7�to, wher 6 e � is the approximation factor of the minimum-error ℓ-simpliication. However, Algorithm 1 is very fast in terms of the input size. Indeed, it has worst-case running time independent of� and sub-quartic in � . Now, Algorithm 1 has the purpose to provide us an approximate median for a given set of polygonal curves: the bicriteria approximation algorithms (Algorithms 2 and 4), which we present afterwards and which are capable of generating center curves with up ℓ −to22vertices, need an approximate median (and the approximation factor) to bound the optimal objective value. Furthermore, there is a case where Algorithms 2 and 4 may fail to provide a good approximation (i.e. when the cost estimation fails and the grids are thus not set up properly), but it can be proven that the result of Algorithm 1 is then a very good approximation, which can be used instead. Algorithm 1 ℓ-Median by Simpliication 1: procedure ℓ-Median-34-Approximation(� = {� , . . .,� },�) 1 � 2: � ← sample⌈2(ln(2) + ln(1/�))⌉ curves from� uniformly and independently with replacement 3: � ← ⌈64(ln(1/�) + ln(⌈4(ln(2) + ln(1/�))⌉))⌉ 4: � ← sample� curves from� uniformly and independently with replacement 5: �← arbitrary element from arg min cost(�,� ) �∈� 6: return simpl (�,�) ⊲ E.g. combining [4, 24] Next, we prove the quality of approximation of Algorithm 1. Theorem 3.1. Given a parameter� ∈ (0, 1) and a set� = {� , . . .,� } ⊂ R of polygonal curves, Algorithm 1 returns 1 � � ∗ ∗ with probability at least 1− � a polygonal curve �∈ R , such that cost(�,� ) ≤ cost(�,�) ≤ (6+ 7�) · cost(�,� ), where� is an optimalℓ-median for� and � is the approximation factor of the utilized minimum-err ℓ-simpliication or approximation algorithm. Proof. First, we know that d(�,simpl (�,�)) ≤ � · d (�,�), for each� ∈ � , by Deinition 2.6. F F 2 cost(�,� ) � ∗ Now, there are at least curves in� that are within distance at most to � . Otherwise, the cost of the 2 � ∗ 1 remaining curves would excecost ed (�,� ), which is a contradiction. Hence,�each ∈ � has probability at least 2 cost(�,� ) to be within distance to � . Since the elements �ofare sampled independently we conclude that the probability that �∈e�ver has y distance 2 cost(�,� ) 2(ln(2)+ln(1/�)) ∗ 1 |�| � to � greater than is at most(1− ) ≤ exp − = . � 2 2 2 2 cost(�,� ) Now, assume there is an� ∈ � withd (�,�) ≤ . We do not want any � ∈ � \ {�} withcost(�,�) > 2 cost(�,�) to have cost(�,� ) ≤ cost(�,� ). Using Theorem 2.8 we conclude that this happens with probability at most 64(ln(1/�) + ln(⌈4(ln(2) + ln(1/�))⌉)) � � exp − ≤ ≤ , 64 ⌈4(ln(2) + ln(1/�))⌉ 2|�| for each� ∈ �\{�}. Using a union bound over all bad events, we conclude that with probability− �at , Algorithm least 1 1 samples a ∗ ∗ curve �∈ �, withd (�,�) ≤ 2 cost(�,� )/� and returns the simpliication � = simpl (�,�) of a curve�∈ �, with cost(�,�) ≤ 2 cost(�,�). The triangle inequality yields ︁ ︁ ︁ ︁ ∗ ∗ ∗ ∗ (d (�,�) − d (�,�)) ≤ d (�,�) ≤ 2 d (�,�) ≤ 2 (d (�,�) + d (� ,�)), F F F F F F �∈� �∈� �∈� �∈� ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 9 which is equivalent to ∗ ∗ 2 cost(�,� ) 7 cost(�,� ) ∗ ∗ ∗ ∗ � · d (�,�) ≤ 2 cost(�,� ) + cost(�,� ) + 2� ⇐⇒ d (�,�) ≤ . F F � � Hence, we have ︁ ︁ cost(�,�) = d (�,simpl (�,�)) ≤ (d (�,�) + d (�,simpl (�,�))) F F F �∈� �∈� ︁ ︁ ∗ ∗ ∗ ∗ ≤ 2 cost(�,�) + � · d (�,�) ≤ 2 (d (�,�) + d (� ,�)) + 7� · cost(�,� ) F F F �∈� �∈� ∗ ∗ ∗ ∗ ≤ 2 cost(�,� ) + 4 cost(�,� ) + 7� · cost(�,� ) = (6+ 7�) cost(�,� ). ∗ ∗ The lower boundcost(�,� ) ≤ cost(�,�) follows from the fact that the returned curveℓhas vertices and that� has minimum cost among all curvesℓwith vertices. □ The following lemma enables us to obtain a concrete approximation-factor and worst-case running time of Algorithm 1. Lemma 3.2 (Buchin et al. [10, Lemma 7.1]). Given a curve� ∈ R , a 4-approximate minimum-error ℓ-simpliication can be computed in� (� log� ) time. The simpliication algorithm used for obtaining this statement is a combination of the algorithm by Imai and Iri [24] and the algorithm by Alt and Godau [4]. Combining Theorem 3.1 and Lemma 3.2, we obtain the following corollary. Corollary 3.3. Given a parameter� ∈ (0, 1) and a set � ⊂ R of polygonal curves, Algorithm 1 returns with probability at least1− � a polygonal curve �∈ R , such that ∗ ∗ cost(�,� ) ≤ cost(�,�) ≤ 34· cost(�,� ), ∗ 2 3 where � is an optimalℓ-median for� , in time� (� log(� ) ln(1/�)+� log� ), when the algorithms by Imai and Iri [24] and Alt and Godau [4] are combined forℓ-simpliication. Proof. We use Lemma 3.2 together with Theorem 3.1, which yields an approximation factor of 34. Now, drawing the samples takes time � (ln(1/�)) each. Evaluating the samples against each other takes time 2 2 3 � (� log(� ) ln(1/�)) and simplifying one of the curves that evaluates best takes � (�time log� ). We conclude 2 2 3 that Algorithm 1 has running time � (� log(� ) ln(1/�) + � log� ). □ 4 (3+ �)-APPROXIMATION FOR ℓ-MEDIAN BY SIMPLE SHORTCUTTING Here, we present Algorithm 2, which returns candidates, containing with high probability (3+ �)-approximate a (1, ℓ)-median of complexity at most ℓ − 22 for any (ixed) subset that takes a constant fraction of the input. Algorithm 2 can be used as plugin in our generalized version (Algorithm 5, Section 7) of the algorithm by Ackermann et al. [2]. In contrast to Nath and Taylor[30] we cannot use the property that the vertices of a median must be found in the ∗ ∗ balls of radius d (�,�), centered at �’s vertices, where� is an optimal (1, ℓ)-median for a given input � , which� is an element of. This is an immediate consequence of using the continuous Fréchet distance. We circumvent this by proving the following shortcutting lemmata. We start with the simplest, which states that we can indeed search the aforementioned balls, if we accept a resulting curve of complexity ℓ − 2.atSemost e 2 Fig. 2 for a visualization. ACM Trans. Algor. 10 • Maike Buchin, Anne Driemel, and Dennis Rohde Fig. 2. Visualization of a simple shortcut. The black curve is an input-curve that is close to an optimal median, which is depicted in red. By inserting the blue shortcut we can find a curve that has the same distance to the black curve as the median but with all vertices contained in the balls centered at the black curves vertices. � � � Lemma 4.1. Let �,� ∈ R be polygonal curves. Let � , . . .,� be the vertices of� and let � = d (�,�). There 1 |�| ′ � ′ exists a polygonal curve� ∈ R withd (� ,�) ≤ d (�,�) and every vertex contained in at least one of F F 2|� |−2 � � �(� ,�), . . ., �(� ,�). |�| � � � � � � Proof. Let � , . . .,� be the vertices of�. Further, let� , . . .,� and � , . . .,� be the instants of� and �, 1 1 1 |� | |� | |�| respectively. Also,ℎfor∈ H (recall thatH is the set of all continuous bijeℎctions : [0, 1] → [0, 1] withℎ(0) = 0 and ℎ(1) = 1), let� = max ∥�(�) − �(ℎ(�))∥ be the distance realizedℎby . We know from Lemma 2.4 that there �∈[0,1] exists a sequence(ℎ ) inH, such that lim � = d (�,�) = �. � ℎ F � =1 �→∞ � � Now, ix an arbitrarℎy∈ H and assume there is a vertex� of�, with instant � , that is not contained in any � � � � � � � � of�(� ,� ), . . ., �(� ,� ). Let �be the maximum of[|�| − 1], such that � ≤ ℎ(� ) ≤ � . So � is matched to ℎ ℎ 1 � � �+1 |�| � � � � �(� )�(� ) by ℎ. We modify� in such a way that� is replaced by two new vertices that are elements �(�of,� ) � �+1 � � and �(� ,� ), respectively. �+1 − � − � + � Namely, let � be the maximum of[0,� ), such that �(� ) ∈ �(� ,� ) and let� be the minimum(of � , 1], such � � � + � � � � that �(� ) ∈ �(� ,� ). These are the instants when� leaves�(� ,� ) before visiting � and � enters �(� ,� ) ℎ ℎ ℎ �+1 � � �+1 � ′ − + after visiting � , respectively. Let � be the piecewise deined curve, deined just like � on [0,� ] and [� , 1], but − − − + − + �−� − �−� + on (� ,� ) it connects�(� ) and �(� ) with the line segment �(�) = 1− �(� ) + �(� ). + − + − � −� � −� − − + + � − + � We know that ∥�(� )−�(ℎ(� ))∥ ≤ � and ∥�(� )−�(ℎ(� ))∥ ≤ � . Note that � ≤ ℎ(� ) and ℎ(� ) ≤ � since ℎ ℎ � �+1 − + � � � �(� ) and �(� ) are the closest points �to on � that have distance� to � and � , respectively, by deinition. � � �+1 − + ′ − − Therefore,� has no vertices between the instants ℎ(� ) andℎ(� ). Now,ℎ can be used to match� | to�| [0,� ) [0,ℎ(� )) ′ ′ + + − + − + and � | to �| with distance at most � . Since� | and �| are just line segments, they can (� ,1] (� ,1] ℎ [� ,� ] [ℎ(� ),ℎ(� )] ℎ ℎ ′ − − ′ + + be linearly matched to each other with distance atmax most{∥� (� ) − �(ℎ(� ))∥,∥� (� ) − �(ℎ(� ))∥} ≤ � . ℎ ℎ We conclude that d(� ,�) ≤ � . F ℎ ′ ′ Because this modiication works for eℎver∈ yH, we have d (� ,�) ≤ � for everyℎ ∈ H. Thus, lim d (� ,�) ≤ F ℎ F ℎ ℎ �→∞ d (�,�) = �. Now, to prove the claim, for everℎ y∈ H we apply this modiication � and to successively to every other vertex ℎ ′ ′ � of the resulting cur�ve, not contained in one of the balls, until every v�erte is x of contained in a ball. ℎ ℎ Note that the modiication is repeated at most |�| − 2 times for everℎy∈ H, since the start and end vertex of � � ′ � must be contained in �(� ,� ) and �(� ,� ), respectively. Therefore, the number of vertices of e�vercan y ℎ ℎ |�| ℎ be bounded by 2· (|�| − 2) + 2 since every other vertex must not lie in a ball and for each such vertex one new vertex is created. Thus,|� | ≤ 2|�| − 2. □ We now present Algorithm 2, which works similar as Algorithm 1, but uses shortcutting instead of simpliication. As a consequence, we can achieve an approximation factor+of� instead 3 of(2+�)(1+�) ś where (2+�) comes ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 11 from the candidate sampling(and 1+ �) comes from simpliication with approximation � ≥ factor 1. Indeed, this factor is the best we can achieve by the previously used techniques in combination with simpliication. Thus, to achieve an approximation factor(4of+ �) one would need to compute the optimal minimum-err ℓ- or simpliications of the input curves and to the best of our knowledge, there is no such algorithm for the continuous Fréchet distance. Algorithm 2 ℓ-Median for Subset by Simple Shortcutting 1: procedure ℓ-Median-(3+ �)-Candidates(� = {� , . . .,� }, �,�,�) 1 � 2: � ← �/3, � ← ∅ ′ −1 3: � ← sample8�(�) (ln(1/�) + ln(4)) curves from� uniformly and independently with replacement |�| ′ ′ 4: for � ⊆ � with|� | = do 2� 5: �← ℓ-Median-34-Approximation(� ,�/4) (Algorithm 1) ′ �� Δ 1 6: Δ ← cost(� ,�), Δ ← , Δ ← Δ, � ← � ∪{�} � � ′ 2|�| 34 � 7: for �∈ � do 8: � ← ∅ 9: for �∈ [|�|] do � ′ � � th 10: � ← � ∪ G � � ,(1+ �)Δ , Δ ⊲ � : � vertex of� � � � � 11: � ← � ∪ set of all polygonal curves with ℓ − 2 v2ertices from� 12: return � In contrast to Algorithm 1, Algorithm 2 utilizes the superset sampling technique (see Section 2.1.1) to fulill the requirements to be used as plugin algorithm for (�, ℓthe )-median approximation algorithm to be presented in Section 7, i.e., it is able to obtain an approℓximate -median for any cluster � that takes a constant fraction of the input� . Therefore, it has running time exponential in the size of the �. Asample further diference is that ∗ ′ we need an upper and a lower bound on the cost of an optimal ℓ-median� for� , to properly set up the grids we use for shortcutting. The lower bound can be obtained by simple estimation, using Markov’s inequality ś ′ ′ ′ with high probability a multiple of the cost of the result of Algorithm 1�run⊆ on � with a subset � ⊆ � ′ ∗ ′ and with respect to� is a lower bound on the cost �ofwith respect to� . For the upper bound we utilize a case distinction, which guarantees us that if we fail to obtain an upper bound on the optimal cost, the result of Algorithm 1 then is a good approximation (factor + �, an2immediate consequence of the distinction) and can be used instead of a best curve obtained by shortcutting. Algorithm 2 has several parameters: � determines the size (in terms of a fraction of the input) of the smallest subset of the input for which an approximate median can be compute � determines d, the probability of failure of the algorithm and � determines the approximation factor. We prove the quality of approximation of Algorithm 2. Theorem 4.2. Given three parameters� ∈ [1,∞), �,�∈ (0, 1) and a set � = {� , . . .,� } ⊂ R of polygonal curves, 1 � with probability at least 1−� the set of candidates that Algorithm 2 returns contains(a3+�)-approximateℓ-median ′ ′ 1 with up to2ℓ − 2 vertices for any (ixed)� ⊆ � , if|� | ≥ |�|. ′ ′ ′ Proof. We assume that |� | ≥ |�|. Let � be the number of sampled curves in � that are elements of � . Clearly, |�| |�| ′ 1 ′ E[� ] ≥ = . Also�, is the sum of independent Bernoulli trials. A Chernof bound (see Lemma 2.11) �=1 � � ACM Trans. Algor. 12 • Maike Buchin, Anne Driemel, and Dennis Rohde yields: |�| 1 1 |�| ln(�) − ln(4) � � ′ ′ ′ Pr � < ≤ Pr � < E[� ] ≤ exp − ≤ exp = ≤ . 2� 2 4 2� � 4 4 |�| ′ ′ In other words, with probability at most �/4 no subset � ⊆ �, of cardinality at least , is a subset of � . We 2� condition the rest of the proof on the contrary event, denotedEby′, namely, that there is a subset � ⊆ � with |�| ′ ′ ′ ′ ′ � ⊆ � and |� | ≥ . Note that � is then a uniform and independent sample � (of see Section 2.1.1). 2� ∗ ′ ′ ′ ∗ Now, let� ∈ arg mincost(� ,�) be an optimal ℓ-median for � . The expected distance between�∈ � and � is �∈R ︁ ′ ∗ 1 cost(� ,� ) ∗ ∗ E[d (�,�) | E ] = d (� ,�) · = . F � F ′ ′ |� | |� | �∈� |� | ′ ∗ ′ ∗ By linearity, we hav[ecost E (� ,� ) | E ′] = cost(� ,� ). Markov’s inequality yields: � ′ |� | �|� | � ′ ∗ ′ ∗ Pr cost(� ,� ) > cost(� ,� ) E ≤ . 4|� | 4 �|� | ′ ∗ ′ ∗ We conclude that with probability at�/most 4 we have cost(� ,� ) > cost(� ,� ). 4|� | Using Markov’s inequality again, for�e∈ver � ywe have ′ ∗ cost(� ,� ) 1 Pr d (�,�) > (1+ �) E ≤ , F � |� | 1+ � therefore by independence ′ ∗ ( ) cost � ,� 1 �|�| Pr mind (�,�) > (1+ �) E ′ ≤ ≤ exp − . F � ′ |� | �∈� |� | 2 2� (1+ �) l m 8�(ln(1/�)+ln(4)) 2 ′ ∗ Hence, with probability at most exp − ≤ � /16 ≤ �/4 there is no� ∈ � withd (�,�) ≤ 4� ′ ∗ cost(� ,� ) (1+ �) · . Also, with probability at�/most 4 Algorithm 1 fails to compute a 34-approximate ℓ-median |� | � ′ �∈ R for� , see Corollary 3.3. Using Lemma 2.10, we conclude that with probability at−least � all 1 the following events occur simultaneously: ′ ′ (1) There is a subset� ⊆ � of cardinality at least |�|/(2�) that is a uniform and independent sample � , of ′ ∗ cost(� ,� ) ′ ∗ (2) there is a curve�∈ � with d(�,�) ≤ (1+ �) , F ′ |� | � ′ ∗ ′ ′ ∗ (3) Algorithm 1 computes a polygonal cur � ∈veR withcost � ,� ≤ cost(� ,�) ≤ 34 cost � ,� , where ′ ′ ℓ � � ∗ � ′ � ∈ R is an optimal ℓ-median for� , � ℓ �|� | ′ ∗ ′ ∗ (4) and it holds that cost(� ,� ) ≤ cost(� ,� ). 4|� | ∗ ′ Since� is an optimal ℓ-median for� we get the following from the last two items: ′ ′ ′ ′ �|� | �|� | �|� | cost(� ,�) ′ ∗ ′ ∗ ′ ∗ cost(� ,� ) ≥ cost(� ,� ) ≥ cost � ,� ≥ . ′ ′ ′ 4|� | 4|� | 4|� | 34 We now distinguish between two cases: ′ ∗ cost(� ,� ) Case 1: d (�,�) ≥ (1+ 2�) F ′ |� | ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 13 The triangle inequality yields ′ ∗ cost(� ,� ) ∗ ∗ ∗ d (�,�) ≥ d (�,�) − d (� ,�) ≥ d (�,�) − (1+ �) F F F F |� | ′ ∗ ′ ∗ ′ ∗ cost(� ,� ) cost(� ,� ) cost(� ,� ) ≥ (1+ 2�) − (1+ �) = � . ′ ′ ′ |� | |� | |� | ′ ∗ ′ ∗ cost(� ,� ) cost(� ,� ) ′ 1 ′ As a consequence, cost(� ,�) ≥ � ⇐⇒ ≤ cost(� ,�). ′ ′ |� | |� | � � � ′ Now, let� , . . .,� be the vertices of�. By Lemma 4.1 there exists a polygonal cur�vewith up toℓ2− 2 vertices, 1 |�| ′ ∗ cost(� ,� ) � ∗ � ∗ ′ ∗ every vertex contained in one�of(� , d (� ,�)), . . ., �(� , d (� ,�)) and d (�,�) ≤ d (�,�) ≤ (1+�) ≤ F F F F ′ 1 |� | |�| cost(� ,�) (1+ �) . ′′ In the set of candidates that Algorithm 2 returns, a cur � vewith up to ℓ2− 2 vertices from the union of the �|� | �� ′ ′ � cost(� ,�) ′ ∗ � cost(� ,�) 2|�| 4|� | cost(� ,� ) grid covers and distance at most ≤ ≤ � between every corresponding pair of ′ ′ � |� | |� | ′ ∗ � cost(� ,� ) ′ ′′ ′ ′′ vertices of� and � is contained. We conclude that(�d,� ) ≤ . F ′ |� | ′′ We can now bound the cost of� as follows: ︁ ︁ ′ ∗ �cost(� ,� ) ′ ′′ ′′ ′ cost(� ,� ) = d (�,� ) ≤ d (�,�) + F F |� | ′ ′ �∈� �∈� ∗ ∗ ′ ∗ ≤ (d (�,�) + d (� ,� )) + �cost(�,� ) F F �∈� ∗ ∗ ′ ′ ∗ ′ ∗ ≤ (d (�,�) + d (� ,�) + d (�,�)) + �cost(� ,� ) ≤ (3+ 3�) cost(� ,� ). F F F �∈� ′ ∗ cost(� ,� ) Case 2: d (�,�) < (1+ 2�) F ′ |� | The cost of� can easily be bounded: ′ ∗ ∗ ′ ∗ ′ ∗ ′ ∗ cost(� ,�) ≤ (d (�,�) + d (� ,�)) < cost(� ,� ) + (1+ 2�) cost(� ,� ) = (2+ 2�) cost(� ,� ). F F �∈� The claim follows by rescaling � by . □ Next we analyse the worst-case running time of Algorithm 2 and the number of candidates it returns. Theorem 4.3. The running time as well as the number of candidates that Algorithm 2 returns can be bounded by � (ℓ�) � (ℓ) � (ℓ� 1/� ln(1/�)) � � . �� ln(1/�)·� ln(1/�)·� Proof. The sample� has size� and obtaining it takes time � . Let � = |�|. The outer � � for-loop runs ! � 2� ln(1/�) � �� � � 2� � ≤ = (2��) = � � � � � 2� 2� 2 2 3 times. In each iteration, we run Algorithm 1, taking � (�time log(� ) ln(1/�) + � log� ) (see Corollary 3.3), ln(1/�) we compute the cost of the returned curve with respect to� , taking time � · � log(� ) , and per curve in ACM Trans. Algor. 14 • Maike Buchin, Anne Driemel, and Dennis Rohde � we build up�to grids of size � ! √ � (1+� )Δ ′ � � ′ 68 �|�|(1+ �) � ln(1/�) © ª ® = ∈ � � �� Δ ′2 3� � √ � � � � � � 2|�|34 « ¬ each. For each curve �∈ � , Algorithm 2 then enumerates all combinations ℓ − 2of points 2 from these up to � grids, resulting in 2ℓ−2 2ℓ�−2� 2ℓ�−2� � � ln(1/�) 6ℓ�−6� 2ℓ�−2� � � candidates per�∈ � , per iteration of the outer for-loop. � (ℓ) � (�ℓ) Thus, Algorithm 2 enumerates � (� · 1/�· 1/�) candidates per iteration of the outer for-loop. All in all, the running time and number of candidates the algorithm returns can be bounded by � (ℓ) � (ℓ�) � (ℓ� 1/� ln(1/�)) � 1/(��) � . 5 MORE PRACTICAL APPROXIMATION FOR (1, ℓ)-MEDIAN BY SIMPLE SHORTCUTTING The following algorithm is a modiication of Algorithm 2. It is more practical since it needs to cover only up to � (small) balls, using grids. Unfortunately, it is not compatible with the superset sampling technique and can therefore not be used as plugin in Algorithm 5. Algorithm 3 ℓ-Median by Simple Shortcutting 1: procedure ℓ-Median-(5+ �)-Approximation(� = {� , . . .,� },�,�) 1 � 2: b�← ℓ-Median-34-Approximation(�,�/2) (Algorithm 1) 3: Δ ← cost(�, b�), Δ ← Δ/34, � ← �/9, � ← ∅ ′ −1 4: � ← sample2(�) (ln(1/�) + ln(4)) curves from� uniformly and independently with replacement ′ −2 ′ −1 5: � ← sample⌈64(�) (ln(1/�) + ln(⌈8(�) (ln(1/�) + ln(4))⌉))⌉ curves from� uniformly and independently with replacement 6: �← arg mincost(�,� ) �∈� 7: for �∈ [|�|] do (3+4� ) � � th 8: � ← � ∪ G � � , Δ , Δ ⊲ � is the� vertex of� � � � � 9: � ← set of all polygonal curves with ℓ − 2 v2ertices from� 10: return arbitrary element from arg min cost(�,�) �∈� We prove the quality of approximation of Algorithm 3. Theorem 5.1. Given two parameters�,� ∈ (0, 1) and a set � = {� , . . .,� } ⊂ R of polygonal curves, with 1 � probability at least1− � Algorithm 3 returns a(5+ �)-approximateℓ-median for� with up to2ℓ − 2 vertices. ∗ ∗ Proof. Let � ∈ arg mincost(�,�) be an optimal ℓ-median for � . The expected distance between�∈ � and � is �∈R 1 cost(�,� ) ∗ ∗ E[d (�,�)] = d (� ,�) · = . F F � � � �=1 ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 15 Now using Markov’s inequality, for � ev∈er�ywe have ∗ −1 cost(�,� )� 1 ∗ ∗ Pr[d (�,�) > (1+ �) cost(�,� )/�] ≤ = , ∗ −1 (1+ �) cost(�,� )� 1+ � therefore by independence 1 �|�| ∗ ∗ Pr mind (�,�) > (1+ �) cost(�,� )/� ≤ ≤ exp − . |�| �∈� 2 (1+ �) l m 2(ln(1/�)+ln(4)) � ∗ cost(�,� ) Hence, with probability at most exp − ≤ �/4 there is no�∈ � withd (�,�) ≤ (1+ �) . 2 � ∗ ∗ Now, assume there is an� ∈ � withd (�,�) ≤ (1 + �) cost(�,� )/�. We do not want any � ∈ � \ {�} with cost(�,�) > (1+�) cost(�,�) to have cost(�,� ) ≤ cost(�,� ). Using Theorem 2.8, we conclude that this happens with probability at most 2 −2 ′ −1 � ⌈64� (ln(1/�) + ln(⌈8(�) (ln(1/�) + ln(4))⌉))⌉ � � exp − ≤ ≤ , ′ −1 64 ⌈−8(�) (ln(�) − ln(4))⌉ 4|�| for each� ∈ �\{�}. Also, with probability at�/most 2 Algorithm 1 fails to compute a 34-approximate ℓ-median b�∈ R for� , see Corollary 3.3. Using a union bound over these bad events, we conclude that with probability− at�least , 1 ∗ ∗ (1) Algorithm 3 samples a cur�v∈e� with d(�,�) ≤ cost(�,� )/�, (2) Algorithm 3 samples a cur�v∈e� with cost(�,�) ≤ (1+ �) cost(�,�) and � ∗ (3) Algorithm 1 computes a 34-approximate ℓ-medianb� ∈ R for� , i.e.,cost(�,� ) ≤ Δ = cost(�, b�) ≤ 34 cost(�,� ). � � ′ Let � , . . .,� be the vertices of�. By Lemma 4.1 there exists a polygonal cur�vewith up to ℓ2− 2 vertices, |�| � ∗ � ∗ ′ ∗ every vertex contained in one �of(� , d (� ,�)), . . ., �(� , d (� ,�)) and d (�,�) ≤ d (�,�). Using the triangle- F F F F |�| inequality yields ︁ ︁ ︁ ︁ ∗ ∗ ∗ ∗ (d (�,�) − d (�,�)) ≤ d (�,�) ≤ (1+ �) d (�,�) ≤ (1+ �) (d (�,�) + d (� ,�)), F F F F F F �∈� �∈� �∈� �∈� which is equivalent to ∗ ∗ ∗ ∗ ∗ � · d (�,�) ≤ (2+ �) cost(�,� ) + (1+ �)�(1+ �) cost(�,� )/� ⇐⇒ d (�,�) ≤ (3+ 4�) cost(�,� )/�. F F ′ ∗ ∗ Hence, we have d (�,�) ≤ d (�,�) ≤ (3+ 4�) cost(�,� )/� ≤ (3+ 4�)Δ/�. F F ′′ In the last step, Algorithm 3 returns a cur�vefrom the set� of all curves with upℓto− 22 vertices from�, �Δ cost(�,� ) ′′ � the union of the grid covers, that evaluates best. We can assume�thathas distance at most ≤ � � � �Δ cost(�,� ) ′ ′′ ′ ′′ � between every corresponding pair of vertices � of and � . We conclude that d(� ,� ) ≤ ≤ � . � � ′′ We can now bound the cost of� as follows: ︁ ︁ ︁ �Δ ′′ ′′ ′ ′ ∗ cost(�,� ) = d (�,� ) ≤ d (�,�) + ≤ (d (�,�) + d (�,�)) + �cost(�,� ) F F F F �∈� �∈� �∈� ≤ (1+ �) cost(�,�) + (3+ 5�) cost(�,� ) ∗ ∗ ∗ ≤ (1+ �) (d (�,�) + d (� ,�)) + (3+ 5�) cost(�,� ) F F �∈� ∗ 2 ∗ ∗ ≤ (1+ �) cost(�,� ) + (1+ �) cost(�,� ) + (3+ 5�) cost(�,� ) ≤ (5+ 9�) cost(�,� ) ACM Trans. Algor. 16 • Maike Buchin, Anne Driemel, and Dennis Rohde The claim follows by rescaling � by . □ We analyse the worst-case running time of Algorithm 3. 2ℓ−1 3 2 �� log(� ) � log(� ) ln(1/�) Theorem 5.2. Algorithm 3 has running time � + . (2ℓ−2)� 3 � � ln(1/�) 2 3 Proof. Algorithm 1 has running time � (� log(� ) ln(1/�) +� log� ). The sample� has size� and 2 2 ln(1/�) � log(� ) ln(1/�) the sample� has size� . Evaluating each curve �ofagainst� takes time� , using 2 3 � � the algorithm of Alt and Godau [4] to compute the distances. (3+�)Δ (3+�) � Now, � has up to � vertices and every grid consists of = ∈ � points. Therefore, ′ ′ 2� Δ � 2� � � we have � points in � and Algorithm 3 enumerates all combinations ℓ − 2ofpoints 2 from� taking 2ℓ−2 time� . Afterwards, these candidates are evaluated, which takes time � (�� log(� )) per candidate (2ℓ−2)� using the algorithm of Alt and Go [4]dau to compute the distances. All in all, we then have running time 2ℓ−1 3 2 �� log(� ) � log(� ) ln(1/�) � + . □ (2ℓ−2)� 6 (1+ �)-APPROXIMATION FOR ℓ-MEDIAN BY ADVANCED SHORTCUTTING Now we present Algorithm 4, which returns candidates, containing with high probability (1+ �)-approximate a (1, ℓ)-median of complexity at most ℓ − 22 for any (ixed) subset that takes a constant fraction of the input. Before we present the algorithm, we present our second shortcutting lemma. Here, we do not introduce shortcuts with respect to a single curve, but with respect to several curves: by introducing shortcuts with�r|esp �| ect to well-chosen curves from the given�set⊂ R of polygonal curves, for a giv � ∈en(0, 1), we preserve the distances to at least(1−�)|�| curves from� . In this context well-chosen means that there exists a certain number of subsets of� , of each we have to pick a curve for shortcutting. This will enable the high quality of approximation of Algorithm 4 and we formalize this in the following lemma. � � Lemma 6.1. Let � ∈ R be a polygonal curve with|�| > 2 vertices and� = {� , . . .,� } ⊂ R be a set of polygonal 1 � ∗ ∗ � th curves. For �∈ [�], let � = d (� ,�) and for �∈ [|�|], let � be the � vertex of � . For any � ∈ (0, 1) there are � F � � � �� ′ 2|�|− 4 subsets� , . . .,� ⊆ � , not necessarily disjoint, and of curves each, such that for every subset� ⊆ � 1 2|� |−4 2|� | ′ � containing at least one curve out of each� ∈ {� , . . .,� }, a polygonal curve � ∈ R exists with every � 1 2|� |−4 2|� |−2 vertex contained in Ø Ø �(� ,�) � ∈� �∈[|� |] � � and d (�,� ) ≤ d (�,�) for each � ∈ � \ (� ∪···∪� ). F F 1 2|� |−4 The idea is the following, see Fig. 3 for a visualization. One can argue that� of ever � ynot verte containe x d in any of the balls centered at the vertices of the curv�es(and in of radius according to their distance �) can to be shortcut by connecting the last point � before� (in terms of the parameter of �) contained in one ball and irst point� after� contained in one ball. This does not increase the Fréchet distances�betw andethe en � ∈ � , because only matchings among line segments are afected by this modiication. Furthermore, most distances are �� preserved when we do not actually use the last and irst ball before and �, but after one of the balls before 2|� | �� and one of the balls after �, which is the key of the following proof. 2|� | �� Proof of Lemma 6.1. Let ℓ = |�|. For the sake of simplicity, we assume that is integral. �For∈ [�], let 2ℓ � � � � � � � � � � � , . . .,� be the vertices of� with instants � , . . .,� and let� , . . .,� be the vertices of� with instants 1 1 1 ℓ |� | |� | � � ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 17 Fig. 3. By using a subset of well-chosen input curves, a shortcut can be constructed that preserves the majority of distances to the input curves: d (� ,�) ≤ d (�,�) for most� ∈ � . F F � � � , . . .,� . Also, for ℎ ∈ H (recall thatH is the set of all continuous bijeℎctions : [0, 1] → [0, 1] withℎ(0) = 0 and 1 ℓ ℎ(1) = 1) and �∈ [�], let� = max ∥�(�) − � (ℎ(�))∥ be the distance realizedℎby with respect to� . We know �,ℎ � � �∈[0,1] from Lemma 2.4 that for each�∈ [�] there exists a sequence(ℎ ) inH, such that lim � = d (�,�) = � . �,� �,ℎ F � � �,� � =1 �→∞ In the following, given arbitrar ℎ , . . .y,ℎ ∈ H, we describe how to modify �, such that its vertices can be found 1 � in the balls around the vertices of � ∈the� , of radii determinedℎ by , . . .,ℎ . Later we will argue that this 1 � modiication can be applied using ℎ ,the . . .,ℎ , for each� ∈ N, in particular. 1,� �,� � � Now, ix arbitrarℎy , . . .,ℎ ∈ H and for an arbitrar�y∈ {2, . . .,|�| − 1}, ix the vertex� of� with instant � . 1 � � � � � � � � � For �∈ [�], let� be the maximum of[|�|− 1], such that � ≤ ℎ (� ) ≤ � . Namely�, is matched to a point on � � � � � +1 � � � � � � � 1 1 � � the line segment � � , . . .,� � , respectively, by ℎ , . . .,ℎ . 1 � � � 1 � +1 � � +1 1 � − � − � + � For �∈ [�], let� be the maximum of[0,� ], such that �(� ) ∈ �(� ,� ) and let� be the minimum of [� , 1], � �,ℎ � � � � � � � � � + � � � such that �(� ) ∈ �(� ,� ). These are the instants when� visits �(� ,� ) before or when it visits � and �,ℎ � �,ℎ � � � � +1 � � � � � visits �(� ,� ) when or after it visits � , respectively. Furthermore, there is a permutation � ∈ S of the �,ℎ � � +1 � � index set[�], such that − − � ≤ ··· ≤ � . (I) −1 −1 � (1) � (�) Also, there is a permutation � ∈ S of the index set[�], such that + + � ≤ ··· ≤ � . (II) −1 −1 � (1) � (�) Additionally, for�each ∈ [�] we have � − � ≤ ℎ (� ) (III) � � and ℎ (� ) ≤ � , (IV) � � +1 � � − + � � � because�(� ) and�(� ) are the closest points�to on� that have distance at most� to� and� , respectively, �,ℎ � � � � � � +1 by deinition. We will now use Eqs. (I) to (IV) to prove that an advanced shortcut only afects matchings among line segments and hence we can easily bound the resulting distances for(1at− �least )� of the curves. Let � � � �� � (ℎ , . . .,ℎ ) = {� −1 , . . .,� −1 }, � (ℎ , . . .,ℎ ) = {� −1 , . . .,� −1 }. � 1 � � ((1− )�+1) � (�) � 1 � � (1) � ( ) � 2ℓ � 2ℓ �� � � (ℎ , . . .,ℎ ) is the set of the lastcurves whose balls are visite�d, by before or when� visits � . Similarly, � 1 � 2ℓ � � �� � (ℎ , . . .,ℎ ) is the set of the irst curves whose balls are visite�d, by when or immediately after � visited � 1 � 2ℓ � � � � . We now modify�, such that � is replaced by two new vertices that are elements of at least �(�one ,� ), �,ℎ � � � � � for a� ∈ � (ℎ , . . .,ℎ ), respectively for� a∈ � (ℎ , . . .,ℎ ), and �∈ [|�|], each. � � 1 � � � 1 � � � � ACM Trans. Algor. 18 • Maike Buchin, Anne Driemel, and Dennis Rohde h i h i ′ − + Let � be the piecewise deined curve, deined just like � on 0,� and � , 1 for arbitrary −1 −1 ℎ ,...,ℎ � (� ) � (� ) 1 � 1 2 � �� − + − + � ∈ {(1− )� + 1, . . .,�} and � ∈ [ ], but on � ,� it connects� � and � � with 1 2 −1 −1 −1 −1 2ℓ 2ℓ � (� ) � (� ) � (� ) � (� ) 1 2 1 2 the line segment − − �− � �− � −1 −1 � (� ) � (� ) 1 1 − + �(�) = 1− � � + � � . −1 −1 + − + − � (� ) � (� ) 1 2 � − � � − � −1 −1 −1 −1 � (� ) � (� ) � (� ) � (� ) 2 1 2 1 � � We now argue that for all � ∈ � \ (� (ℎ , . . .,ℎ ) ∪ � (ℎ , . . .,ℎ )) the Fréchet distance between� and � � 1 � � 1 � � � ℎ ,...,ℎ 1 � � is upper bounded by� . First, note that by deinition ℎ , . . .,ℎ are strictly increasing functions, since they � �,ℎ 1 � are continuous bijections that map 0 to 0 and 1 to 1. As immediate consequence, we have that � − − � ≤ ℎ (� ) ≤ ℎ � (V) � � −1 � � � � (� ) for each� ∈ � \ � (ℎ , . . .,ℎ ) and � � 1 � + + � ℎ � ≤ ℎ (� ) ≤ � (VI) � −1 � � � +1 � (� ) � � � � for each� ∈ � \� (ℎ , . . .,ℎ ), using Eqs. (I) to (IV). Therefore, each� ∈ � \(� (ℎ , . . .,ℎ )∪� (ℎ , . . .,ℎ )) � � 1 � � � 1 � � 1 � � � � − + has no vertex between the instantsℎ � and ℎ � . We also know that for each� ∈ � � � � −1 −1 � (� ) � (� ) 1 2 − − � � − � ℎ �
≤ � (VII) −1 � � −1 �,ℎ � (� ) � (� ) 1 1 and + + � � − � ℎ �
≤ � . (VIII) −1 � � −1 �,ℎ � (� ) � (� ) 2 2 h h i i h − − + + − Let� = 0,� ,� = � ,� and� = � , 1 . Also, for �∈ [�], let� = 0,ℎ � , �,� �,� �,� �,� � −1 −1 −1 −1 � −1 � (� ) � (� ) � (� ) � (� ) � (� ) 1 1 2 2 1 h i i − + + � = ℎ � ,ℎ � and � = ℎ � , 1 . Now, for each� ∈ � \ (� (ℎ , . . .,ℎ ) ∪ �,� � � �,� � � � 1 � � −1 −1 � −1 � (� ) � (� ) � (� ) � 1 2 2 ′ ′ � (ℎ , . . .,ℎ )) we use ℎ to match � | to �| and � | to �| with distance at most � . � 1 � � � � � � � � �,ℎ �,� �,� �,� �,� � ℎ ,...,ℎ � ℎ ,...,ℎ � � 1 � 1 � Since� | and �| are just line segments by Eqs. (V) and (VI), they can be linearly matched to each � � � �,� �,� ℎ ,...,ℎ � 1 � other with distance at most n o − − + + max
� � − � ℎ �
,
� � − � ℎ �
, −1 � � −1 −1 � � −1 � (� ) � (� ) � (� ) � (� ) 1 1 2 2 which is at most � by Eqs. (VII) and (VIII). We conclude that d(� ,�) ≤ � . �,ℎ F � �,ℎ � � ℎ ,...,ℎ 1 � Because this modiication works for eℎver, .y. .,ℎ ∈ H, we conclude thatd (� ,�) ≤ � for every 1 � F � �,ℎ ℎ ,...,ℎ 1 � � � ℎ , . . .,ℎ ∈ H and � ∈ � \ (� (ℎ , . . .,ℎ ) ∪ � (ℎ , . . .,ℎ )). Thus, 1 � � � 1 � � 1 � � � � � lim d (� ,�) ≤ d (�,�) = � for each� ∈ � \ (� (ℎ , . . .,ℎ ) ∪ � (ℎ , . . .,ℎ )). F � F � � � � 1,� �,� � 1,� �,� ℎ ,...,ℎ 1,� �,� � � �→∞ Now, to prove the claim, for each combination ℎ , . . .,ℎ ∈ H, we apply this modiication � and to successively 1 � ′ ′ ′ � � � ℎ ,...,ℎ ℎ ,...,ℎ ℎ ,...,ℎ � � � 1 ′ 1 1 to every other vertex � of the resulting cur�ve , except � and � , since these must be � ℎ ,...,ℎ |� | 1 � ℎ ,...,ℎ 1 � � � � � elements of�(� ,� ) and �(� ,� ), respectively, for each �∈ [�], by deinition of the Fréchet distance. �,ℎ �,ℎ 1 � � |� | Since the modiication is repeated at|�most | − 2 times for each combination ℎ , . . .ℎ ∈ H, we conclude that 1 � the number of vertices of each � can be bounded by 2· (|�| − 2) + 2. ℎ ,...,ℎ 1 � ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 19 � � � , . . .,� are therefore all the � (ℎ , . . .,ℎ ) and � (ℎ , . . .,ℎ ) for� ∈ {2, . . ., 2|�|− 3}, when � → ∞. 1 2ℓ−4 � 1,� �,� � 1,� �,� � � � � Note that every � (ℎ , . . .,ℎ ) and � (ℎ , . . .,ℎ ) is determined by the visiting order of the balls and since � 1,� �,� � 1,� �,� � � their radii converge, these sets do too. □ We now present Algorithm 4, which is nearly identical to Algorithm 2 but uses the advanced shortcutting lemma. Furthermore, like Algorithm 2, it can be used as plugin in the �-me recursiv dian appr e oximation-scheme (Algorithm 5) that we present in Section 7. Algorithm 4 ℓ-Median for Subset by Advanced Shortcutting 1: procedure ℓ-Median-(1+ �)-Candidates(� = {� , . . .,� }, �,�,�) 1 � 2: � ← �/6, � ← ∅ ′ −1 3: � ← sample8�ℓ(�) (ln(1/�) + ln(4(2ℓ − 4))) curves from� uniformly and independently with replacement |�| ′ ′ 4: for � ⊆ � with|� | = do 2� 5: �← ℓ-Median-34-Approximation(� ,�/4) (Algorithm 1) ′ �� Δ 1 6: Δ ← cost(� ,�), Δ ← , Δ ← Δ � � ′ 2|�| 34 � 7: � ← � ∪{�}, � ← ∅ 8: for �∈ � do 9: for �∈ [|�|] do � 4ℓ � � th 10: � ← � ∪ G � � , Δ , Δ ⊲ � : � vertex of� ′ � � � � � � 11: � ← � ∪ set of all polygonal curves with ℓ − 2 v2ertices from� 12: return � We prove the quality of approximation of Algorithm 4. Theorem 6.2. Given three parameters� ∈ [1,∞), � ∈ (0, 1), � ∈ (0, 0.158] and a set � = {� , . . .,� } ⊂ R 1 � of polygonal curves, with probability at least 1 − � the set of candidates that Algorithm 4 returns contains a ′ ′ 1 (1+ �)-approximateℓ-median with up to2ℓ − 2 vertices for any (ixed)� ⊆ � , if|� | ≥ |�|. In the following proof we make use of a case distinction developed by Nath [30 and, Pr Taylor oof of Theorem 10], which is a key ingredient to enable(1+the�)-approximation, though the domain�of has to be restricted to (0, 0.158]. ′ 1 ′ Proof of Theorem 6.2. We assume that |� | ≥ |�| and ℓ > 2. Let � be the number of sampled curves in � that |�| |�| ′ ′ 1 ′ are elements of � . Clearly E,[� ] ≥ = . Also�, is the sum of independent Bernoulli trials. A Chernof �=1 � � bound (see Lemma 2.11) yields: � 1 ′ ℓ ln ℓ © ª 4(2ℓ−4) |�| E[� ] 1 |�| � � ′ ′ ® Pr � < ≤ Pr � < ≤ exp − ≤ exp ≤ ≤ . ® 2� 2 4 2� � 8 8 « ¬ |�| ′ ′ In other words, with probability at most �/8 no subset � ⊆ �, of cardinality at least , is a subset of � . We 2� condition the rest of the proof on the contrary event, denotedEby, namely, that there is a subset � ⊆ � with |�| ′ ′ ′ ′ ′ � ⊆ � and |� | ≥ . Note that � is then a uniform and independent sample � (of see Section 2.1.1). 2� ACM Trans. Algor. 20 • Maike Buchin, Anne Driemel, and Dennis Rohde ∗ � ′ ′ ∗ Now, let� ∈ R be an optimal ℓ-median for � . The expected distance between�∈ � and � is ︁ ′ ∗ 1 cost(� ,� ) ∗ ∗ E[d (�,�) | E ] = d (� ,�) · = . F � F ′ ′ |� | |� | �∈� |� | ′ ∗ ′ ∗ By linearity, we hav[ecost E (� ,� ) | E ] = cost(� ,� ). Markov’s inequality yields: � ′ |� | �|� | � ′ ∗ ′ ∗ Pr cost(� ,� ) > cost(� ,� ) E ≤ . 4|� | 4 �|� | ′ ∗ ′ ∗ We conclude that with probability at�/most 4 we have cost(� ,� ) > cost(� ,� ). 4|� | �|� | ′ ′ ′ Now, from Lemma 6.1 we know that there are 2ℓ − 4 subsets � , . . .,� ⊆ � , of cardinality each and 1 2ℓ−4 2ℓ ′ ′ which are not necessarily disjoint, such that for ev�ery⊆set� that contains at least one curv�e∈ � for each ′ � �∈ [2ℓ − 4], there exists a curve� ∈ R which has all of its vertices contained in 2ℓ−2 Ø Ø � ∗ �(� , d (�,�)) �∈� �∈[|�|] ′ ′ ′ ′ ′ ∗ and for at least(1− �)|� | curves � ∈ � \ (� ∪···∪� ) it holds that(d�,�) ≤ d (�,�). F F 1 2ℓ−4 ′ ′ ∗ �|� | 4ℓ cost(� ,� ) There are up to curves with distance �to at least . Otherwise, the cost of these curves would 4ℓ �|� | ′ ∗ exceed cost(� ,� ), which is a contradiction. Later we will prove that each ball we cover has radius at most ′ ∗ 4ℓ cost(� ,� ) . Therefore, for each�∈ [2ℓ − 4] we have to ignore up to half of the cur�ves∈ � , since we do not �|� | ∗ ′ cover the balls of radius(�,d�) centered at their vertices. For each �∈ [2ℓ − 4] and �∈ � we now have ′ ∗ 4ℓ cost(� ,� ) � ′ ∗ Pr �∈ � ∧ d (�,�) ≤ E ′ ≥ . F � �|� | 4ℓ ′ ′ Therefore, by independence, for each �∈ [2ℓ−4] the probability that�no∈ � is an element�ofand has distance ′ ∗ 4ℓ cost(� ,� ) 4ℓ(ln(1/�)+ln(4(2ℓ−4))) ∗ � |� | � � � to � at most is at most(1− ) ≤ exp − = exp ln = . �|� | 4ℓ 4ℓ � 4(2ℓ−4) 4(2ℓ−4) � ′ Also, with probability at�/most 4 Algorithm 1 fails to compute a 34-approximate ℓ-median� ∈ R for� , see Corollary 3.3. Using Lemma 2.10, we conclude that with probability at−7least /8� all 1 the following events occur simultaneously: ′ ′ (1) There is a subset� ⊆ � of cardinality at least |�|/(2�) that is a uniform and independent sample � , of ′ ∗ 4ℓ cost(� ,� ) ′ ′ ∗ (2) for each�∈ [2ℓ − 4], � contains at least one curve from � with distance�to up to , � �|� | � ′ ∗ ′ ′ ∗ (3) Algorithm 1 computes a polygonal cur � ∈veR withcost � ,� ≤ cost(� ,�) ≤ 34 cost � ,� , where ′ ′ � � ∗ � ′ � ∈ R is an optimal ℓ-median for� , � ℓ �|� | ′ ∗ ′ ∗ (4) and it holds that cost(� ,� ) ≤ cost(� ,� ). 4|� | n o n o ′ ∗ ′ ∗ cost(� ,� ) cost(� ,� ) ′ ∗ ′ ′ ∗ ∗ Let � = � ∈ � | d (�,�) ≤ and � = � ∈ � | d (�,�) ≤ � . First, note that|� \ � | ≤ � F 2 ′ � F ′ � � |� | |� | 2 ′ ′ ∗ ′ ∗ 2 ′ ∗ ∗ � |� |, otherwisecost(� \ � ,� ) > cost(� ,� ), which is a contradiction, and therefor |� |e≥ (1− � )|� |. We � � now distinguish two cases: ∗ ∗ Case 1: |� \ � | > 2�|� | � � � h i ′ ∗ cost(� ,� ) 2 ′ ′ ′ We have 2�|� ∗| ≥ (1 − � )2�|� | ≥ �|� |, hence Pr d (�,�) > � E ′ ≥ � for each� ∈ � . Using � F ′ � |� | independence we conclude that with probability at most 4ℓ ′ 4ℓ(ln(1/�) + ln(4(2ℓ − 4))) � � |� | (1− �) ≤ exp −� ≤ ≤ 4ℓ � 4 8 ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 21 ′ ∗ cost(� ,� ) no �∈ � has distance to� greater than � . Including this bad event, by Lemma 2.10 we conclude that |� | with probability at least − � Items 1 1 to 4 occur simultaneously and at least�one ∈ � has distance to� greater ′ ∗ ′ ∗ ′ ′ ∗ cost(� ,� ) cost(� ,� ) cost(� ,�) cost(� ,� ) than � , hence cost(� ,�) > � ⇐⇒ > and thus we indeed cover the balls of ′ ′ ′ |� | |� | � |� | ′ ∗ ′ ∗ 4ℓ cost(� ,� ) cost(� ,� ) 4ℓ radius at most < . �|� | � � In the last step, Algorithm 4 returns a�set of all curves with upℓto− 22 vertices from the grids that contains one ′′ ′ curve, denoted by � with same number of vertices�as(recall that this is the curve guaranteed from Lemma 6.1) � � ′ ∗ ′ ′′ and distance at most Δ ≤ cost(� ,� ) between every corresponding pair of vertices � and of � . We conclude � ′ � |� | ′ ′′ � ′ ∗ ′ ∗ ′ ′ ′ ′ that d (� ,� ) ≤ cost(� ,� ). Also, recall that d (�,�) ≤ d (�,�) for� ∈ � \ (� ∪···∪� ). Further, � F ′ F F |� | 1 2ℓ−4 ′ ′ ∗ |� | 2 cost(� ,� ) contains at least curves with distance at most to� , otherwise the cost of the remaining curves would 2 |� | ′ ∗ 1 ′ ′ ′ exceed cost(� ,� ), which is a contradiction, and since � < there is at least one curv�e∈ � \(� ∪···∪� ) 2 1 2ℓ−4 ′ ∗ 2 cost(� ,� ) ′ ∗ ′′ withd (�,� ) ≤ d (�,� ) ≤ by the pigeonhole principle. We can now bound the cost � asoffollows: F F ′ |� | ︁ ︁ ′ ′′ ′′ ′ ′ ∗ cost(� ,� ) = d (�,� ) ≤ d (�,�) + cost(� ,� ) + F F |� | ′ ′ ′ �∈� �∈� \(� ∪···∪� ) 1 2ℓ−4 ∗ ∗ ′ ′ ′′ ( ) d (�,�) + d (� ,�) + d (�,� ) + d (� ,� ) F F F F ′ ′ �∈(� ∪···∪� ) 1 2ℓ−4 ︁ ′ ∗ cost(� ,� ) ′ ∗ ( ) ≤ (1+ �) cost � ,� + (2+ 2+ �) |� | ′ ′ �∈(� ∪···∪� ) 1 2ℓ−4 ′ ∗ ′ ∗ ′ ∗ ′ ∗ ≤ cost(� ,� ) + �cost(� ,� ) + 5�cost(� ,� ) = (1+ 6�) cost(� ,� ). ∗ ∗ Case 2: |� \ � | ≤ 2�|� | � � � Again, we distinguish two cases: ′ ∗ cost(� ,� ) Case 2.1: d (�,�) ≤ 4� F ′ |� | We can easily bound the cost �of : ′ ∗ ∗ ′ ∗ cost(� ,�) ≤ (d (�,�) + d (� ,�)) ≤ (1+ 4�) cost(� ,� ). F F �∈� ′ ∗ cost(� ,� ) Case 2.2: d (�,�) > 4� F ′ |� | 2 ′ Recall that|� | ≥ (1− � )|� |. We have ′ ′ ′ ′ 2 ′ ∗ ∗ ∗ |� \ � | ≤ |� \ � | + 2�|� | = |� | − (1− 2�)|� | ≤ |� | − (1− 2�)(1− � )|� | � � � � 2 3 ′ ′ = (2�+ � − 2� )|� | < |� |. 2 3 ′ 2 ′ ∗ Hence, |� | ≥ (1− 2�− � + 2� )|� | > |� |. Assume we assign all curves�to instead of to� . For � ∈ � we � � now have decrease in cost d(�,�) − d (�,�), which can be bounded as follows: F F ′ ∗ ′ ∗ cost(� ,� ) cost(� ,� ) ∗ ∗ ∗ d (�,�) − d (�,�) ≥ d (�,�) − � ≥ d (�,�) − d (�,�) − � F F F F F ′ ′ |� | |� | ′ ∗ cost(� ,� ) 1 ∗ ∗ ≥ d (�,�) − 2� > d (�,�). F F |� | 2 ACM Trans. Algor. 22 • Maike Buchin, Anne Driemel, and Dennis Rohde ′ ∗ ∗ For � ∈ � \ � we have an increase in cost d (�,�) − d (�,�) ≤ d (�,�). Let the overall increase in cost be � F F F denoted by �, which can be bounded as follows: d (�,�) ′ ∗ � < |� \ � | · d (�,�) − |� | · . � F � ′ ∗ By the fact that|� \ � | < |� | for our choice of �, we conclude that� < 0, which is a contradiction because � � � ′ 1 is an optimal ℓ-median for � . Therefore, Case 2.2 can not occur. Rescaling � by proves the claim. □ We analyse the worst-case running time of Algorithm 4 and the number of candidates it returns. Theorem 6.3. The running time as well as the number of candidates that Algorithm 4 returns can be bounded by � (ℓ�) � (ℓ) � (ℓ� 1/� ln(1/�)) � � �� ln(1/�)·� ln(1/�)·� Proof. The sample� has size� and obtaining it takes time � . Let � = |�|. The outer � � for-loop runs ! � 2� � ln(1/�) � �� � � � 2� ≤ = (2��) = � � � � � 2� 2� 2 2 3 times. In each iteration, we run Algorithm 1, taking � (�time log(� ) ln(1/�) + � log� ) (see Corollary 3.3), ln(1/�) we compute the cost of the returned curve with respect to� , taking time � · � log(� ) , and per curve in � we build up�to grids of size � ! √ � (1+�)Δ � � �|�|(1+ �) � ln(1/�) © ª ® = ∈ � 2�2�� Δ 3� � � � � � � � 4|�| « ¬ each. Algorithm 4 then enumerates all combinations ℓ − 2of points 2 from up to|� | · � grids, resulting in 2ℓ−2 2ℓ�−2�+2ℓ−2 2ℓ�−2�+2ℓ−2 � � ln(1/�) 6ℓ�−6�+2ℓ−2 2ℓ�−2� � � � (ℓ) � (�ℓ) candidates per iteration of the outer for-loop. Thus, Algorithm 4 computes � (� · 1/�· 1/�) candidates per iteration of the outer for-loop. All in all, the running time and number of candidates the algorithm returns can be bounded by � (ℓ�) � (ℓ) � (ℓ� 1/� ln(1/�)) � � . �� 7 (1+ �)-APPROXIMATION FOR (�, ℓ)-MEDIAN We generalize the algorithm of Ackermann . [2] et al in the following way: instead of drawing a uniform sample and running a problem-speciic algorithm on this sample in the candidate phase, we only run a problem-speciic łpluginž-algorithm in the candidate phase, thus dropping the framework around the sampling property. We think that the problem-speciic algorithms use 2]ddo innot [ fulill the role of a plugin, since parts of the problem-speciic operations, e.g. the uniform sampling, remain in the main algorithm. Here, we separate the problem-speciic operations from the main algorithm: any algorithm can serve as plugin, if it is able to return candidates for a cluster that takes a constant fraction of the input, where the fraction is an input-parameter of ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 23 the algorithm and some approximation factor is guaranteed (w.h.p.). The calls to the candidate-inder plugin do not even need to be independent (stochastically), allowing adaptive algorithms. Now, letX = (�, �) be an arbitrary (non-)metric space, wher�eis any non-empty (ground-)set and� : � × � → R is a distance function (not necessarily a metric). We introduce a generalized�deinition -median clustering, of ≥0 where the input is restricted to come from a predeined subset � ⊆ � and the medians are restricted to come from a predeined subset� ⊆ � . Deinition 7.1 (generalize�d-median). The generalized�-median clustering problemis deined as follows, where � ∈ N is a ixed (constant) parameter of the problem: given a inite and non-empty � =set{� , . . .,� } ⊆ � , 1 � compute a set � of� elements from� , such that cost(�,�) = min �(�,�) is minimal. �∈� �∈� The following algorithm, Algorithm 5, can approximate �-median every problem compatible with Deinition 7.1, when provided with such a problem-speciic plugin-algorithm for computing candidates. In particular, it can approximate the(�, ℓ)-median problem for polygonal curves under the Fréchet distance, when provided with � � � � � Algorithm 2 or Algorithm 4. Then, we �hav=eR , � = R ⊆ R = � and � = R ⊆ R = � . Note that the ∗ � ∗ ∗ algorithm computes a bicriteria approximation, that is, the solution is approximated inand terms theof the cost number of vertices of the center curves, i.e., the centers come frRom . 2ℓ−2 Algorithm 5 has several parameters. The irst parameter � is the set of centers found so far and � is the number of centers yet to be found. The following parameters concern only the plugin-algorithm used within the algorithm: � determines the size (in terms of a fraction of the input) of the smallest cluster for which an approximate median can be computed,� determines the probability of failure of the plugin-algorithm � determines andthe approximation factor of the plugin-algorithm. Algorithm 5 works as follows: If it has already computed some centers (and there are still centers to compute) it does pruning: some clusters might be too small for the plugin-algorithm to compute approximate medians for them. Algorithm 5 then calls itself recursively with only half of the input: the elements with larger distances to the centers yet found. This way the small clusters will eventually take a larger fraction of the input and can be found in the candidatephase. In this phase Algorithm 5 calls its plugin and for each candidate that the plugin returned, it calls itself recursively: adding the candidate at hand to the set of centers yet found and decrementing � by one. Eventually, all combinations of computed candidates are evaluated against the original input and the centers that together evaluated best are returned. ACM Trans. Algor. 24 • Maike Buchin, Anne Driemel, and Dennis Rohde Algorithm 5 Recursive Approximation-Scheme�for -Median Clustering 1: procedure �-Median(�,�,�, �,�,�) 2: if � = 0 then 3: return � ⊲ Pruning Phase 4: if � ≠ ∅ and |�| > 1 then j k |� | 5: � ← set of elements� ∈ � , such that min�(�,�) ≤ min�(�,�) for each� ∈ � \ � �∈� �∈� 6: � ← �-Median(� \ �,�, min{�,|� \ �|}, �,�,�) 7: else 8: � ← ∅ ⊲ Candidate Phase 9: � ← Median-Candidates(�, �,�/�,�) 10: for �∈ � do 11: � ← �-Median(�,� ∪{�}, min{� − 1,|�|}, �,�,�) 12: C ← {� }∪ {� } �∈� 13: return arg mincost(�,�) � ∈C The quality of approximation and worst-case running time of Algorithm 5 is stated in the following two theorems, which we prove further below. The proofs are adaptations of corresponding pro2ofs ]. Win e pr[ovide them for the sake of completeness. We note that no metric properties are used in the proofs. Theorem 7.2. Let � ∈ [1,∞) and Median-Candidates be an algorithm that, given three parameters� ∈ [1,∞), �,�∈ [0, 1) and a inite set� ⊆ � , returns with probability at least 1− � an (� + �)-approximate1-median for any ′ ′ 1 � ⊆ � , if|� | ≥ |�|. Let � ⊆ � be a inite set. Algorithm 5 called with parameters(�,∅,�, �,�,�), where � ∈ (2�,∞) and �,�∈ (0, 1), 4� ∗ returns with probability at least 1− � a set � = {� , . . .,� } withcost(�,�) ≤ (1+ )(� + �) cost(�,� ), where �−2� � is an optimal set of� medians for� . Theorem 7.3. Let � (�, �,�,�) denote the worst-case running time ofMedian-Candidates for an arbitrary input set � with|�| = � and let �(�, �,�,�) denote the maximum number of candidates it returns. Also, let � denote the worst-case running time needed to compute� for an input element and a candidate. �+2 �+1 If � and � are non-decreasing in�, Algorithm 5 has running time� (�(�, �,�,�) · � · � + �(�, �,�,�) · 1 � � (�, �,�,�)). Our main results, which we state below, follow from Theorems 4.2 and 4.3, respectively Theorems 6.2 and 6.3, and Theorems 7.2 and 7.3. The irst result is(3a+ �)-approximation that follows by using Algorithm 2 as plugin and the second one is a(1+ �)-approximation that follows by using Algorithm 4 as plugin. Corollary 7.4. Given two parameters�,�∈ (0, 1) and a inite set� ⊂ R of polygonal curves, Algorithm 5 endowed 20� with Algorithm 2 asMedian-Candidates and run with parameters(�,∅,�, + 2�,�,�/5) returns with probability at least 1− � a set � ⊂ R that is a(3+ �)-approximate solution to the(�, ℓ)-median for� . Algorithm 5 then has 2ℓ−2 running time � (�ℓ� ) � (�ℓ� 1/� ln(1/�)) 1 � � (�ℓ) �� . �� � ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 25 We note that the following result does not difer in asymptotic running time from the previous one. However, the hidden constants are larger if we combine Algorithm 5 and Algorithm 4. Corollary 7.5. Given two parameters� ∈ (0, 1),� ∈ (0, 0.158] and a inite set� ⊂ R of polygonal curves, 12� Algorithm 5 endowed with Algorithm 4 as Median-Candidates and run with parameters(�,∅,�, + 2�,�,�/3) returns with probability at least 1− � a set � ⊂ R that is a(1+ �)-approximate solution to the(�, ℓ)-median for 2ℓ−2 � . Algorithm 5 then has running time � (�ℓ� ) � (�ℓ� 1/� ln(1/�)) 1 � � (�ℓ) �� . �� � The following proof is an adaption of [2, Theorem 2.2 - Theorem 2.5]. Proof of Theorem 7.2. For � = 1, the claim trivially holds. We now distinguish two cases. In the irst case the principle of the proof is presented in all its detail. In the second case we only show how to generalize the irst case to � > 2. Case 1: � = 2 ∗ ∗ ∗ ∗ ∗ Let � = {� ,� } be an optimal set �ofmedians for � with clusters � and � , respectively, that form a partition 1 2 1 2 ∗ 1 1 of� . For the sake of simplicity, assume� that is a power of 2 and w.l.o.g. assume that |� | ≥ |�| > |�|. Let � 2 � be the set of candidates returned by Median-Candidates in the initial call. With probability − �/�at , least 1 ∗ ∗ ∗ there is a� ∈ � with cost� ,� ≤ (� + �) cost � ,� . We distinguish two cases: 1 1 1 1 1 1 ′ ∗ ′ 1 ′ Case 1.1: There exists a recursive call with parameters (� ,{� }, 1, �,�,�) and |� ∩� | ≥ |� |. 2 � ′ ∗ ′ ′ First, we assume that� is the maximum cardinality input|� with ∩� | ≥ |� |, occurring in a recursive call of the algorithm. Let � be the set of candidates returned by Median-Candidates in this call. With probability at ∗ ′ ∗ ′ least 1− �/�, there is a� ∈ � withcost � ∩� ,� ≤ (� + �) cost � ∩� , e� , where e� is an optimal median 2 2 2 2 2 2 2 ∗ ′ for� ∩� . Let � be the set of elements of � removed in the� ∈ N, � ≤ log(�), pruning phases between obtaining � and th � . Without loss of generality we assume�that ≠ ∅. For �∈ [� ], let� be the elements removed in the � (in 2 � the order of the recursive calls occurring) pruning phase. Note that � arthe e pairwise disjoint, we have that � � ∗ ∗ ′ ∗ � = ∪ � and |� | = . Since� = � ⊎ (� ∩� ) ⊎ (� ∩ �), we have � � � �=1 1 2 2 ∗ ∗ ′ ∗ cost(�,{� ,� }) ≤ cost � ,� + cost � ∩� ,� + cost � ∩ �,� . (I) 1 2 1 2 1 1 2 2 Our aim is now to prove that the number of elements wrongly assigne � , di.e to.,� ∩ �, is small and further, that their cost is a fraction of the cost of the elements correctly assigne � , i.ed.,�to. th We deine� = � and for� ∈ [� ] we deine� = � \ � . The � are the elements remaining after�the 0 � �−1 � � � ′ pruning phase. Note that by deinition |� | = = |� |. Since� = � is the maximum cardinality input, with � � � � 1 1 ∗ ′ ′ ∗ |� ∩� | ≥ |� |, we have that |� ∩ � | < |� | for all �∈ [� − 1]. Also, for each�∈ [� ] we have � ⊆ � , � � � �−1 2 2 � � therefore 1 2 � ∗ ∗ |� ∩ � | ≤ |� ∩ � | < |� | = (II) � �−1 �−1 2 2 � � 2 and as an immediate consequence 1 2 � ∗ ∗ |� ∩ � | = |� | − |� ∩ � | > |� | − |� | = 1− . (III) � � � � �−1 1 2 � � 2 ACM Trans. Algor. 26 • Maike Buchin, Anne Driemel, and Dennis Rohde ∗ ∗ This tells us that mainly the elements � areof removed in the pruning phase and only very few elements � . of 1 2 By deinition, we have for�all ∈ [� − 1], � ∈ � and � ∈ � that �(�,� ) ≤ �(�,�), hence � �+1 1 1 1 1 ∗ ∗ cost � ∩ � ,� ≤ cost � ∩ � ,� . � 1 �+1 1 2 1 ∗ ∗ |� ∩ � | |� ∩ � | � �+1 2 1 Combining this inequality with Eqs. (II) and (III) we obtain �∈ [� −for 1]: � �+1 �2 2 ∗ ∗ cost � ∩ � ,� < cost � ∩ � ,� � 1 �+1 1 2 1 2� (1− 2/�)� �+1 2 2� 4 ∗ ∗ ∗ ⇐⇒ cost � ∩ � ,� < cost � ∩ � ,� = cost � ∩ � ,� . (IV) � 1 �+1 1 �+1 1 2 1 1 (1− 2/�)��2 (� − 2) We still need such a bound�for = � . Since|� | = |� | and also� ⊆ � we can use Eq. (II) to obtain: � � � � −1 2 � ∗ ∗ ∗ |� ∩ � | = |� | − |� ∩ � | ≥ |� | − |� ∩ � | > 1− . (V) � � � � � −1 1 2 2 � 2 Also, we have for all � ∈ � and � ∈ � that �(�,� ) ≤ �(�,�) by deinition, thus � � 1 1 1 1 ∗ ∗ cost � ∩ � ,� ≤ cost � ∩ � ,� . � 1 � 1 2 1 ∗ ∗ |� ∩ � | |� ∩ � | � � 2 1 We combine this inequality with Eq. (II) and Eq. (V) and obtain: � � �2 2 2� ∗ ∗ cost � ∩ � ,� < cost � ∩ � ,� � 1 � 1 2 1 2� (1− 2/�)��2 ∗ ∗ ⇐⇒ cost � ∩ � ,� < cost � ∩ � ,� . (VI) � 1 � 1 2 1 (� − 2) We are now ready to bound the cost of the elements of � wrongly assigned�to. Combining Eq. (IV) and Eq. (VI) yields: � � −1 ︁ ︁ 4 2 ∗ ∗ ∗ ∗ cost � ∩ �,� = cost � ∩ � ,� < cost � ∩ � ,� + cost � ∩ � ,� 1 � 1 �+1 1 � 1 2 2 1 1 � − 2 � − 2 �=1 �=1 < cost � ,� . � − 2 Here, the last inequality holds, be�cause , . . ., � and � are pairwise disjoint. Also, we have 2 � � ∗ ′ ∗ ′ ∗ ′ ∗ ∗ ∗ cost � ∩� ,� ≤ (� + �) cost � ∩� ,� e ≤ (� + �) cost � ∩� ,� ≤ (� + �) cost � ,� . 2 2 2 2 2 2 2 2 Finally, using Eq. (I) and a union bound, with probability− �atthe least follo 1 wing holds: ∗ ∗ ∗ ∗ ∗ ∗ cost(�,{� ,� }) < (� + �) cost � ,� + (� + �) cost � ,� + (� + �) cost � ,� 1 2 1 1 2 2 1 1 � − 2 4 4� ∗ ∗ < 1+ (� + �) cost(�,� ) = 1+ (� + �) cost(�,� ) � − 2 ��− 2� 4� ≤ 1+ (� + �) cost(�,� ). � − 2� ′ ∗ ′ 1 ′ Case 1.2: For all recursive calls with parameters (� ,{� }, 1, �,�,�) it holds that|� ∩� | < |� |. 2 � ′ ∗ ′ ′ Afterlog(�) pruning phases we end up with a singleton {�} = � as input set. Since|� ∩� | < |� |, it must be ∗ ′ 1 ′ 1 ∗ that 0 = |� ∩� | < |� | = < 1 and thus � ∈ � . 2 � � 1 ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 27 Let � be the set of candidates returned by Median-Candidates in this call. With probability−at�/least � 1 there is a� ∈ � withcost({�},� ) ≤ (� +�) cost({�}, e� ) ≤ (� +�) cost {�},� , where e� is an optimal median 2 2 2 2 2 for{�}. Sincecost � ∩ �,� is bounded as in Case 1.1, by a union bound we have with probability− at�least : 1 ∗ ∗ cost(�,{� ,� }) ≤ cost � \{�},� + cost � ∩ �,� + cost({�},� ) 1 2 1 1 2 1 2 ∗ ∗ ∗ ≤ (� + �) cost � ,� + cost � ∩ �,� 1 1 2 ≤ 1+ (� + �) cost(�,� ) � − 2 4� ≤ 1+ (� + �) cost(�,� ). � − 2� Case 2: � > 2 We only prove the generalization of Case �1.1>to2, the remainder of the proof is analogous to the Case 1. Let ∗ ∗ ∗ ∗ ∗ � = {� , . . .,� } be an optimal set �ofmedians for � with clusters � , . . .,� , respectively, that form a partition 1 � 1 � ∗ ∗ of� . For the sake of simplicity, assume� that is a power of 2 and w.l.o.g. assume |� | ≥ ··· ≥ |� |. For �∈ [�] 1 � ∗ ∗ and �∈ [�] \ [�] we deine� = ⊎ � . �,� �=� LetT = � and let(T = T \P ) be the sequence of input sets in the recursive calls�of∈ the N,� ≤ log(�), 0 � �−1 � �=1 th pruning phases, whereP is the set of elements removed in�the(in the order of the recursive calls occurring) pruning phase. LetT = {T} ∪ {T | �∈ [� ]}. For �∈ [�], let� be the maximum cardinality setT ,in with 0 � � ∗ 1 |� ∩� | ≥ |� |. Note that by assumption and since � > 2�, � = � must hold and also � ⊂ � for�∈ [�] \ [�]. � � 1 � � � � Using a union bound, with probability at−least �, for1each�∈ [�] the call of Median-Candidates with input � yields a candidate � with � � ∗ ∗ ∗ ∗ ∗ ∗ cost � ∩� ,� ≤ (� + �) cost � ∩� , e� ≤ (� + �) cost � ∩� ,� ≤ (� + �) cost � ,� , (I) � � � � � � � � � � � where e� is an optimal 1-median� for∩� . Let � = {� , . . .,� } be the set of these candidates and for �∈ [� − 1], � � 1 � let� = � \� denote the set of elements of � removed by the pruning phases between obtaining � and � . � � �+1 � �+1 Note that the � are pairwise disjoint. By deinition, the sets ∗ ∗ ∗ ∗ � ∩� , . . .,� ∩� ,� ∩ � , . . .,� ∩ � 1 � 1 �−1 1 � 2,� �,� form a partition�of , therefore � �−1 ︁ ︁ ∗ ∗ cost(�,{� , . . .,� }) ≤ cost � ∩� ,� + cost � ∩ � ,{� , . . .,�} 1 � � � � 1 � � �+1,� �=1 �=1 � �−1 ︁ ︁ ∗ ∗ ∗ ≤ (� + �) cost � ,� + cost � ∩ � ,{� , . . .,�} . (II) � 1 � � � �+1,� �=1 �=1 Now, it only remains to bound the cost of the wrongly assigned elements � . For of �∈ [�], let� = |� | and � � �+1,� w.l.o.g. assume that� ≠ ∅ for each�∈ [� − 1]. Each � is the disjoint union ⊎ � of� ∈ N sets of elements � � �,� � �=1 of� removed in the interim pruning phases, and it holds |� that | = . We now prove for each�∈ [� − 1] and �,� � ∗ ∗ �∈ [� ] that � contains many elements from � and only a few elements fr�om . � � 1,� �+1,� For �∈ [� − 1], we deine� = � and for�∈ [� ] we deine� = � \ � . By deinition, |� | = = |� |, �,0 � � �,� �,�−1 �,� �,� �,� ∗ 1 � ⊃ � for each � ∈ [� ] and � ∈ [� ] \ [�], also� = � . Thus, |� ∩ � | < |� | for all �,� �,� 1 � 2 � 1 �,� �+1 �,� �,� 1 2 � � �∈ [� − 1], �∈ [� ] and �∈ [�] \ [�]. As an immediate consequence we obtain |� ∩ � | ≤ |� |. Since � �,� �,� �+1,� � ACM Trans. Algor. 28 • Maike Buchin, Anne Driemel, and Dennis Rohde � ⊆ � for all �∈ [� − 1] and �∈ [� ], we have �,� �,�−1 � � 2� � |� ∩ � | ≤ |� ∩ � | ≤ |� | = , (III) �,� �,�−1 �,�−1 �+1,� �+1,� � � 2 which immediately yields 2� � |� ∩ � | = |� | − |� ∩ � | ≥ 1− . (IV) 1,� �,� �,� �+1,� �,� � 2 Now, by deinition we know that for�all ∈ [�− 1], �∈ [� ]\{� }, � ∈ � and � ∈ � that min �(�,�) ≤ � � �,� �,�+1 �∈{� ,...,� } 1 � min �(�,�). Thus, �∈{� ,...,� } 1 � ∗ ∗ cost � ∩ � ,{� , . . .,�} cost � ∩ � ,{� , . . .,�} �,� 1 � �,�+1 1 � �+1,� 1,� ≤ . ∗ ∗ |� ∩ � | |� ∩ � | �,� �,�+1 1,� �+1,� Combining this inequality with Eqs. (III) and (IV) yields �∈ [� − 1for ] and �∈ [� ] \{� }: � � � �+1 �2 2 ∗ ∗ cost � ∩ � ,{� , . . .,�} ≤ cost � ∩ � ,{� , . . .,�} �,� 1 � �,�+1 1 � 1,� �+1,� 2� 2�� � (1− )� 4� ∗ ∗ ⇐⇒ cost � ∩ � ,{� , . . .,�} ≤ cost � ∩ � ,{� , . . .,�} . (V) �,� 1 � �,�+1 1 � �+1,� 1,� � − 2� For each �∈ [� − 1] we still need an upper boundcost on � ∩ � ,{� , . . .,�} . Since|� | = |� | and �,� 1 � �,� �,� � � � �+1,� also� ⊆ � we can use Eq. (III) to obtain �,� �,� −1 � � 2� � ∗ ∗ ∗ |� ∩ � | = |� |−|� ∩ � | ≥ |� |−|� ∩ � | > 1− . (VI) �,� �,� �,� �,� �,� −1 1,� � � �+1,� � � �+1,� � � 2 By deinition, we also know that for �∈all [�−1], � ∈ � and� ∈ � that min �(�,�) ≤ min �(�,�). �,� �,� � � �∈{� ,...,� } �∈{� ,...,� } 1 � 1 � Thus, ∗ ∗ cost � ∩ � ,{� , . . .,�} cost � ∩ � ,{� , . . .,�} �,� 1 � �,� 1 � � � 1,� �+1,� ≤ . ∗ ∗ |� ∩ � | |� ∩ � | �,� �,� � � �+1,� 1,� Combining this inequality with Eqs. (III) and (VI) yields: � � � � �2 2 ∗ ∗ cost � ∩ � ,{� , . . .,�} < cost � ∩ � ,{� , . . .,�} �,� 1 � �,� 1 � �+1,� � 1,� � 2� 2�� � (1− )� 2� ∗ ∗ ⇐⇒ cost � ∩ � ,{� , . . .,�} < cost � ∩ � ,{� , . . .,�} . (VII) �,� 1 � �,� 1 � � 1,� � �+1,� � − 2� We can now give the following bound, combining Eqs. (V) and (VII), �for∈ [each � − 1]: ∗ ∗ cost � ∩ � ,{� , . . .,�} = cost � ∩ � ,{� , . . .,�} � 1 � �,� 1 � �+1,� �+1,� �=1 � −1 4� < cost � ∩ � ,{� , . . .,�} �,�+1 1 � 1,� � − 2� �=1 2� + cost � ∩ � ,{� , . . .,�} �,� 1 � 1,� � � − 2� ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 29 4� < cost � ∩� ,{� , . . .,�} . (VIII) � 1 � 1,� � − 2� Here, the last inequality holds, be�cause , . . ., � and � are pairwise disjoint subsets � . of �,2 �,� �,� � � � ∗ ∗ Now, we plug this bound into Eq. (II). Note that � ∩� ⊆ � ∩� for each�∈ [�] and �∈ [�] by deinition. We � � � � obtain: � �−1 ︁ ︁ ∗ ∗ ∗ cost(�,{� , . . .,� }) ≤ (� + �) cost � ,� + cost � ∩ � ,{� , . . .,�} 1 � 1 � � � �+1,� �=1 �=1 � �−1 ︁ ︁ 4� ∗ ∗ ∗ < (� + �) cost � ,� + cost � ∩� ,{� , . . .,�} � 1 � � � 1,� � − 2� �=1 �=1 � �−1 � ︁ ︁ ︁ 4� ∗ ∗ ∗ ≤ (� + �) cost � ,� + cost � ∩� ,� � � � � � � − 2� �=1 �=1 �=1 � �−1 � ︁ ︁ ︁ 4� ∗ ∗ ∗ ≤ (� + �) cost � ,� + cost � ∩� ,� � � � � � � − 2� �=1 �=1 �=1 � �−1 ︁ 2 ︁ 4� ∗ ∗ ∗ ≤ (� + �) cost � ,� + cost � ∩� ,� � � � � � � − 2� �=1 �=1 2 ︁ 2 4� 4� ∗ ∗ ∗ ≤ 1+ (� + �) cost � ,� = 1+ (� + �) cost(�,� ). � � � − 2� � − 2� �=1 The last inequality follows from Eq. (I). □ The following analysis of the worst-case running time of Algorithm 4 is a slight 2, Theadaption orem 2.8],of [ which is also provided for the sake of completeness. Proof of Theorem 7.3. Let �(�,�, �,�,�) denote the worst-case running time of Algorithm 5 for input � set with|�| = �. For the sake of simplicity, we assume� that is a power of 2. Note that we always hav � e≤ �. If� = 0, Algorithm 5 has running time � ∈ � (1). If� ≥ � ≥ 1, Algorithm 5 has running time at most � · (� ·� + �) ∈ � (� ·� ) to obtain�, �(�/2,�, �,�,�) for the recursive call in the pruning�phase (�, �,,�,�) 2 � � 1 to obtain the candidates, �(�, �,�,�) · �(�,� − 1, �,�,�) for the recursive calls in the candidate phase, one for each candidate, and� · � ·� · �(�, �,�,�) ∈ � (� ·� · �(�, �,�,�)) to eventually evaluate the candidate sets. Let 3 � � � = � + � + � + 1. We obtain the following recurrence relation: 1 2 3 � if� = 0 �(�,�, �,�,�) ≤ �(�, �,�,�) ·�(�,� − 1, �,�,�) +�(�/2,�, �,�,�) . +� (�, �,�,�) + ��·� · �(�, �,�,�)) else 1 � Let �(�, �,�,�) = ·� (�, �,�,�) +� · �(�, �,�,�). 1 � �� � �+1 We prove that �(�,�, �,�,�) ≤ �· 4 · �(�, �,�,�) · � · �(�, �,�,�), by induction on �,�. For � = 0 we have �(�,�, �,�,�) ≤ � ≤ ��≤ �· 4 · �(�, �,�,�) · � · �(�, �,�,�). ′ ′ ′ ′ Now, let� ≥ � ≥ 1 and assume the claim holds�for (� ,� , �,�,�), for each� ∈ {0, . . .,� − 1} and � ∈ [� − 1]. We have: �(�,�, �,�,�) ≤ �(�, �,�,�) ·�(�,� − 1, �,�,�) +�(�/2,�, �,�,�) +� (�, �,�,�) + ��·� · �(�, �,�,�) 1 � ACM Trans. Algor. 30 • Maike Buchin, Anne Driemel, and Dennis Rohde �−1 � ≤ �(�, �,�,�) · �· 4 · �(�, �,�,�) · � · �(�, �,�,�) � �+1 + �· 4 · �(�/2, �,�,�) · · �(�/2, �,�,�) + ��· �(�, �,�,�) 1 1 1 � �+1 ≤ + + �· 4 · �(�, �,�,�) · � · �(�, �,�,�) � �+1 4 2 4 �(�, �,�,�) � �+1 ≤ �· 4 · �(�, �,�,�) · � · �(�, �,�,�). 1 1 The last inequality holds, because ≤ , and the claim follows by induction. □ � �+1 4 � (�,�,�,� ) 8 CONCLUSION We have developed bicriteria approximation algorithms (�, ℓ)-me for dian clustering of polygonal curves under the Fréchet distance. While it showed to be relatively easy to obtain a good approximation where the centers have up to 2ℓ vertices in reasonable time, a way to obtain good approximate centers with ℓ vertices up to in reasonable time is not in sight. This is due to the continuous Fréchet distance: the vertices of a median need not be anywhere near a vertex of an input-curve, resulting in a huge search-space. If we cover the whole search-space by, say grids, the worst-case running time of the resulting algorithms become dependent on the arc-lengths of the input-curves edges, which is not acceptable. We note�that -coverability of the continuous Fréchet distance would imply the existence of sublinear �-corsize esets for(�, ℓ)-center clustering of polygonal curves under the Fréchet distance. It is an interesting open question,�if -covthe erability holds for the continuous Fréchet distance. In contrast to the doubling dimension, which was shown to be ininite even for curves of bounded complexity 17[], the VC-dimension of metric balls under the continuous Fréchet distance is bounded in terms of the complexities ℓ and � of the curves [18]. Whether this bound can be combined with the framework by Feldman and Langberg19 [ ] to achieve faster approximations for(�the , ℓ)-median problem under the continuous Fréchet distance is an interesting open problem, 13]. c.f. The [ general relationship between the VC-dimension of range spaces derived from metric spaces and their doubling properties is a topic of ongoing research, see for example Huang et al. [23]. REFERENCES [1] C. Abraham, P. A. Cornillon, E. Matzner-Lùber, and N. Molinari. 2003. Unsupervised Curve Clustering usingScandinavian B-Splines. Journal of Statistics30, 3 (2003), 581ś595. [2] Marcel R. Ackermann, Johannes Blömer, and Christian Sohler. 2010. Clustering for metric and nonmetric distance ACM measures. Trans. Algorithms6, 4 (2010), 59:1ś59:26. [3] Pankaj K. Agarwal, Sariel Har-Peled, Nabil H. Mustafa, and Yusu Wang. 2002. Near-Linear Time Approximation Algorithms for Curve Simpliication. Algorithms In - ESA, Rolf Möhring and Rajeev Raman (Eds.). Springer, 29ś41. [4] Helmut Alt and Michael Godau. 1995. Computing the Fréchet Distance between two PolygonalInternational Curves. Journal of Computational Geometry & Applications 5 (1995), 75ś91. [5] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. 2005. Clustering with Bregman Div Journal ergences. of Machine Learning Research6 (2005), 1705ś1749. [6] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. 2004. CorrelationMachine Clustering. Learning56, 1-3 (2004), 89ś113. [7] Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. 2001. Support Vector Clustering. Journal of Machine Learning Research 2 (2001), 125ś137. [8] Milutin Brankovic, Kevin Buchin, Koen Klaren, André Nusser, Aleksandr Popov, and Sampson Wong. 2020. (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping. SIGSPIn ATIAL ’20: 28th International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, November 3-6, 2020, Chang-Tien Lu, Fusheng Wang, Goce Trajcevski, Yan Huang, Shawn D. Newsam, and Li Xiong (Eds.). ACM, 99ś110. [9] Kevin Buchin, Maike Buchin, and Carola Wenk. 2008. Computing the Fréchet distance between simpleComput. polygons. Geom. 41, 1-2 (2008), 2ś20. ACM Trans. Algor. Approximating(�, ℓ)-Median Clustering for Polygonal Curves • 31 [10] Kevin Buchin, Anne Driemel, Joachim Gudmundsson, Michael Horton, Irina Kostitsyna, Maarten Löler, and Martijn Struijs. 2019. Approximating (k, l)-center clustering for curPr ves. oceIn edings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms . 2922ś2938. [11] Kevin Buchin, Anne Driemel, and Martijn Struijs. 2020. On the Hardness of Computing an Average 17th CurvScandinavian e. In Symposium and Workshops on Algorithm Theory (LIPIcs, Vol. 162) , Susanne Albers (Ed.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 19:1ś19:19. [12] Kevin Buchin, Anne Driemel, Natasja van de L’Isle, and André Nusser. 2019. klcluster: Center-based Clustering of Trajectories. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems . 496ś499. [13] Maike Buchin and Dennis Rohde. 2022. Coresets(for �, ℓ)-Median Clustering Under the Fréchet Distance. Algorithms In and Discrete th Applied Mathematics - 8 International Conference, CALDAM, Puducherry, India, February 10-12, Proceedings (Lecture Notes in Computer Science, Vol. 13179), Niranjan Balachandran and R. Inkulu (Eds.). Springer, 167ś180. [14] Jeng-Min Chiou and Pai-Ling Li. 2007. Functional clustering and identifying substructures of longitudinal Journal of the Royal data. Statistical Society: Series B (Statistical Methodology 69,)4 (2007), 679ś699. [15] Rudi Cilibrasi and Paul M. B. Vitányi. 2005. Clustering by compr IEEE Tession. rans. Information Theory51, 4 (2005), 1523ś1545. [16] Anne Driemel and Sariel Har-Peled. 2013. Jaywalking Your Dog: Computing the Fréchet Distance with SIAM Shortcuts. J. Comput. 42, 5 (2013), 1830ś1866. [17] Anne Driemel, Amer Krivosija, and Christian Sohler. 2016. Clustering time series under the Fréchet Prodistance ceedings .ofInthe Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms . 766ś785. [18] Anne Driemel, Jef M. Phillips, and Ioannis Psarros. 2019. The VC Dimension of Metric Balls Under Fréchet and Hausdorf Distances. In 35th International Symposium on Computational Geometr. y28:1ś28:16. [19] Dan Feldman and Michael Langberg. 2011. A uniied framework for approximating and clustering Procedata. edings In of the 43rd ACM Symposium on Theory of Computing , Lance Fortnow and Salil P. Vadhan (Eds.). ACM, 569ś578. [20] Luis Angel Garcia-Escudero and Alfonso Gordaliza. 2005. A Proposal for Robust Curve Clustering. Journal of Classiication22, 2 (2005), 185ś201. [21] Sudipto Guha and Nina Mishra. 2016. Clustering Data Streams. DataInStream Management - Processing High-Speed Data Streams , Minos N. Garofalakis, Johannes Gehrke, and Rajeev Rastogi (Eds.). Springer, 169ś187. [22] Sariel Har-Peled and Soham Mazumdar. 2004. On coresets for k-means and k-median clustering. ProceeIn dings of the 36th Annual ACM Symposium on Theory of Computing . 291ś300. [23] Lingxiao Huang, Shaofeng H.-C. Jiang, Jian Li, and Xuan Wu. 2018. Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics. In 59th IEEE Annual Symposium on Foundations of Computer Science . IEEE Computer Society, 814ś825. [24] Hiroshi Imai and Masao Iri. 1988. Polygonal Approximations of a Curve Ð Formulations and Algorithms. Machine Intelligence and Pattern Recognition6 (Jan. 1988), 71ś86. [25] Piotr Indyk. 2000.High-dimensional Computational Geometr . P yh. D. Dissertation. Stanford University, CA, USA. Advisor(s) Motwani, Rajeev. [26] Stephen C. Johnson. 1967. Hierarchical clustering schemes. Psychometrika32, 3 (1967), 241ś254. [27] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. 2004. A Simple Linear Time �)-Appr (1+ oximation Algorithm for k-Means Clustering in Any Dimensions.PrIn oceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science (FOCS ’04) . IEEE Computer Society, 454ś462. [28] Stefan Meintrup, Alexander Munteanu, and Dennis Rohde. 2019. Random Projections and Sampling Algorithms for Clustering of High-Dimensional Polygonal CurAvdvances es. In in Neural Information Processing Systems 32 . 12807ś12817. [29] Michael Mitzenmacher and Eli Upfal. Probability 2017. and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis(2nd ed.). Cambridge University Press, USA. [30] Abhinandan Nath and Erin Taylor. 2020. k-Median Clustering Under Discrete Fréchet and Hausdorf Distances. 36th International In Symposium on Computational Geometry (LIPIcs, Vol. 164) , Sergio Cabello and Danny Z. Chen (Eds.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 58:1ś58:15. [31] François Petitjean and Pierre Gançarski. 2012. Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment. Theoretical Computer Science414, 1 (2012), 76 ś 91. [32] François Petitjean, Alain Ketterlin, and Pierre Gançarski. 2011. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition44, 3 (2011), 678 ś 693. [33] Satu Elisa Schaefer. 2007. Graph clustering. Computer Science Review1, 1 (2007), 27 ś 64. [34] René Vidal. 2011. Subspace Clustering. IEEE Signal Processing Magazine 28, 2 (2011), 52ś68. ACM Trans. Algor.
http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png
ACM Transactions on Algorithms (TALG)
Association for Computing Machinery
http://www.deepdyve.com/lp/association-for-computing-machinery/approximating-k-median-clustering-for-polygonal-curves-slCcccsZ0z