Copyright Notice:
Since most of these papers are published, the copyright has been
transferred to the respective publishers. Therefore, the papers
cannot be duplicated for commercial purposes. The following is
ACM's copyright notice; other
publishers have similar ones.
Copyright © 199x by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that new copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
Random samples are flexible general-purpose synopses of data sets that are too massive to store in full, manipulate, or transmit. Answers to queries posed over the original data can be quickly estimated from the sample, and the same sample can be used for many types of queries. My work on sampling aims to facilitate more effective use of samples as synopses by designing sampling schemes and estimators for fundamental classes of queries. Scalable design of sampling algorithms should depend on how the data is presented, to avoid the time and cost of aggregation. We distinguish between data sets presented as key-value pairs (which can be streamed or distributed), data sets where a vector of values is associated with each key and these values are dispersed across servers or time, and finally unaggregated data sets, where the data contains multiple elements with the same key and the value of the key is the sum of element weights.
The value of our sample hinges on the performance of our estimators. The sampling scheme should be geared to the statistics (queries) we would like to support, and estimators should optimally use the information in the sample. I am also interested in understanding the limits of a sampling scheme: characterizing the queries we can estimate well and providing optimal estimators for a given query.
Common queries over key-value data sets are segment f-statistics, which are the sum, over keys in a selected segment, of a function f applied to the value. This formulation includes segment sum, counts, moments, thresholds, and capping statistics. A weighted sample computed with respect to a particular f provides estimates with statistical guarantees for f-statistics. This is achieved by including each key with probability roughly proportional to f applied to its value. But the estimation quality of g-statistics from a weighted sample computed with respect to f deteriorates when g is very different from f (for example, when f and g are sum and count, or are threshold or capping functions with very different parameters). Here we study multi-objective samples, which provide estimates with statistical guarantees on quality for a set of different functions. In particular, we show how to efficiently construct a small sample which provides estimates with statistical guarantees for all monotone non-decreasing functions f.
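As a rough illustration of sampling geared to a function f (a minimal Python sketch written for this overview, not the multi-objective scheme itself; the threshold-based Poisson design and the names pps_sample, estimate_statistic, and tau are assumptions made for the example), the following draws a sample with inclusion probabilities roughly proportional to f(value) and applies inverse-probability (Horvitz-Thompson) weighting to estimate a segment statistic. Querying a g that is far from f, as in the last lines, illustrates where accuracy degrades.

```python
import random

def pps_sample(data, f, tau):
    """Poisson PPS-by-f sample: include key with probability min(1, f(v)/tau).
    Returns {key: (value, inclusion_probability)} for sampled keys."""
    sample = {}
    for key, value in data.items():
        p = min(1.0, f(value) / tau)
        if random.random() < p:
            sample[key] = (value, p)
    return sample

def estimate_statistic(sample, g, segment):
    """Horvitz-Thompson estimate of sum_{key in segment} g(value)."""
    return sum(g(v) / p for key, (v, p) in sample.items() if key in segment)

# Example: sample geared to sums (f = identity), then query a capping statistic g.
data = {f"key{i}": random.expovariate(1.0) for i in range(10000)}
sample = pps_sample(data, lambda v: v, tau=5.0)
segment = {k for k in data if k.endswith("7")}
est = estimate_statistic(sample, lambda v: min(v, 1.0), segment)   # g = cap at 1
true = sum(min(v, 1.0) for k, v in data.items() if k in segment)
print(est, true)  # the estimate degrades as g departs from f
```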
Data is typically structured and queries conform to the structure. Sampling schemes, however, are oblivious to the structure. We propose sampling schemes that are structure-aware and allow for more accurate approximate range queries while retaining the benefits of sample-based synopses. Part of the work builds on the flexibility within the VarOpt family of distributions to gain structure-awareness.
With classic Poisson Probability Proportional to Size (PPS) samples, the inclusion probability of each key is proportional to its weight and inclusions of different keys are independent. VarOpt samples have distinct advantages over Poisson sampling: they have PPS inclusion probabilities but also maintain a fixed-size sample and variance-optimal subset-sum estimators. We present efficient VarOpt stream-sampling algorithms. We also characterize a VarOpt family of sampling distributions, showing they all satisfy some important properties, including Chernoff-Hoeffding bounds.
Bottom-k sampling (also known in the statistics literature as order sampling) is performed by assigning keys random ranks that may depend on their values and including the k keys with minimum rank. By appropriately choosing rank distributions, bottom-k sampling subsumes successive weighted sampling without replacement and Sequential Poisson (Priority) sampling. Bottom-k samples, like Poisson samples (and, as it seems, unlike VarOpt), can be coordinated. The sample has fixed size k, which is an advantage over Poisson samples and over with-replacement samples. We also explore the application of bottom-k sampling to the analysis of large graphs, expanding on my size-estimation work. We study data structures, sampling algorithms, and subset-sum estimators for bottom-k samples.
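A minimal sketch of bottom-k sampling over a stream of weighted keys, following the definitions above (the heap-based implementation and the helper names bottom_k, exp_rank, and pri_rank are assumptions made for this example): exponential ranks give successive weighted sampling without replacement, and ranks of the form u/w give Sequential Poisson (Priority) sampling.

```python
import heapq, random

def bottom_k(stream, k, rank):
    """Keep the k keys with the smallest random ranks.
    rank(weight) draws a random rank for a key with the given weight."""
    heap = []  # max-heap over ranks, stored as (-rank, key, weight), size <= k
    for key, w in stream:
        r = rank(w)
        if len(heap) < k:
            heapq.heappush(heap, (-r, key, w))
        elif r < -heap[0][0]:               # smaller than the current k-th smallest rank
            heapq.heapreplace(heap, (-r, key, w))
    return [(key, w, -nr) for nr, key, w in heap]

# Exponential ranks: successive weighted sampling without replacement.
exp_rank = lambda w: random.expovariate(w)
# Priority (Sequential Poisson) ranks: u/w with u uniform on (0,1).
pri_rank = lambda w: random.random() / w

stream = [(f"key{i}", 1 + random.random() * 10) for i in range(100000)]
print(bottom_k(stream, 5, exp_rank))
```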
Data sets, such as request and traffic logs and sensor measurements, are repeatedly collected in multiple instances: time periods, locations, or snapshots. The data can be viewed as a matrix, with rows corresponding to different keys (users, sensors, cookies) and columns to different instances. Each entry is the value of a key in a given instance. Queries which span multiple instances, such as distinct counts and distance measures, are used for planning, management, and anomaly or change detection. To scalably summarize such data, the sampling scheme of one instance should not depend on values in other instances. We would like, however, to use the same sample to estimate different statistics, which may depend on multiple instances. Since we are estimating a function from samples which provide only partial information on the set of values, the estimation problem is challenging. We study sampling schemes and estimation when instances are sampled independently and when the samples are coordinated.

Coordinated sampling is a method of sampling different instances so that the sampling of each instance is classic Poisson or bottom-k, but samples of different instances are more similar when the instances are more similar. Statisticians proposed and used coordination as a method of controlling overlap in repeated surveys. Computer scientists, myself included, (re-)discovered coordination for data analysis from samples: aggregates such as distinct counts and Jaccard similarity can be better estimated when samples are coordinated. Coordination generalizes the technique popularly known as MinHash sketches. Sample coordination is a form of Locality Sensitive Hashing (LSH). I first came across coordination when observing that coordinated samples of all neighborhoods (or reachability sets) in a graph can be computed much more efficiently than independent samples. With time, I recognized the potential of coordination for large-scale data analysis and opted to better understand it. We presented tighter estimators for basic queries and, more recently, proposed a model of monotone sampling, which generalizes coordinated sampling, and provided a full characterization of the queries we can successfully answer along with efficient and practical optimal estimators.
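A minimal sketch of coordination (the salted-hash construction and the names shared_u and coordinated_pps are assumptions made for this example): each instance is sampled with the Poisson PPS rule from above, but all instances reuse the same per-key uniform value, so similar instances produce similar samples.

```python
import hashlib

def shared_u(key, salt="coord"):
    """Per-key uniform value in (0,1), shared by all instances (the coordination)."""
    h = hashlib.sha256((salt + key).encode()).hexdigest()
    return (int(h, 16) % 10**9 + 0.5) / 10**9

def coordinated_pps(instance, tau):
    """Include key iff shared_u(key) < min(1, w/tau); keys with similar weights
    in two instances are then included or excluded together."""
    return {k: w for k, w in instance.items() if shared_u(k) < min(1.0, w / tau)}

day1 = {"a": 3.0, "b": 0.4, "c": 7.0}
day2 = {"a": 3.1, "b": 0.5, "c": 6.5}   # similar instances => similar samples
print(coordinated_pps(day1, 5.0), coordinated_pps(day2, 5.0))
```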
In particular, since in general there may not exist one estimator with minimum variance for all possible data, we seek admissibility (Pareto variance optimality), meaning that estimators cannot be improved for one data set in the domain without increasing variance for another. We also propose the stricter order-based optimality, where an order is specified over the data domain and we look for estimators which cannot be improved for one data set without increasing variance for preceding data.
Independent sampling was widely believed to be useless for basic queries, including distinct value counts: information-theoretic lower bounds show that we need to sample almost the full data set in order to obtain meaningful estimates. We show that when sampling with "reproducible randomization," these lower bounds do not apply, and we obtain good estimators when only a small fraction of the data is sampled, for distinct counts, their generalization to the L1 norm of the coordinate-wise maximum, and, in follow-up work, the Lp difference. We also propose a principled method of deriving order-optimal estimators.
For coordinated sampling, which has a simpler structure than independent sampling, we were able to precisely characterize the queries for which good estimators exist. We introduce a notion of variance competitiveness, meaning that for any data, the expectation of the square of the estimate is within a small constant factor of the minimum possible for that data. Moreover, we show that O(1) variance competitiveness is always attainable when an estimator with finite variance exists.
We define a model for monotone sampling and monotone estimation: Suppose we are interested in estimating a nonnegative function f of the data from a "sample" of the data. We define a "sampling scheme" here as a function of the data and a (uniform at random) seed value from [0,1]. Monotonicity of the sampling scheme means that the output provides more information on the data when the seed value is lower. Our goal is to obtain a good estimator for f. We are interested in estimators that are unbiased (because applications sum-aggregate many estimation problems) and nonnegative (because the function f is). We also seek admissible and variance-competitive estimators. This formulation generalizes coordinated sampling. Interestingly, the characterization we provided for this special case extends to monotone sampling.

We make several contributions aimed both at a better theoretical understanding of monotone sampling and at providing practical and effective estimators to be used as tools for data mining and analysis. We explore the space of admissible and variance-competitive estimators. We show how to derive an order-optimal estimator for any specified order on the data domain. This allows us to customize the estimator by choosing an order which reflects patterns present in our data. We present the L* estimator, which has a very compelling and natural set of properties: the L* estimator is defined (and can be constructed) for any monotone estimation problem for which an estimator with finite variance exists. It is guaranteed to be (at most) 4-competitive. Moreover, it is the unique admissible monotone estimator, meaning that the estimate we obtain is non-decreasing with the information we have on the data. In that, it improves over the classic Horvitz-Thompson estimator. Lastly, we explore and tighten the gap between the worst-case upper and lower bounds on variance competitiveness.
A particularly important class of queries is Lp difference queries. The query specifies a domain via a selection predicate over keys, and we are interested in the Lp difference of the values assumed by these keys in the different instances. When the data is sampled, we are interested in approximating the Lp difference from the samples. We derive Lp difference estimators that are applied to sampled instances, considering both independent and coordinated sampling: for independent sampling by extending our technique, and for coordinated sampling by using our monotone estimation framework. We experimentally study these estimators on data sets with different properties and demonstrate accurate estimates with only a small fraction of items sampled. For coordinated samples, we consider the L* estimator, which is optimized for small differences, the U* estimator, which is optimized for large differences, and the optimally competitive estimator, which minimizes the worst ratio over the data domain. We demonstrate the benefit of an informed choice of estimator when we have information on data patterns.
Many data sets occur in an unaggregated form, where multiple data points are associated with each key. In the aggregated view of the data, the weight of a key is the sum of the weights of the data points associated with the key, and queries (such as heavy hitters, sum statistics, distinct counts) are posed over the aggregated view. Examples of unaggregated data sets are IP packet streams, where keys are flow keys; individual requests to resources, where the key is the resource ID; interactions of users with Web services, where keys are cookies; and distributed streams of events registered by sensor networks, where keys are event types. Since data points are scattered in time or across multiple servers, aggregation is subject to resource limitations on storage and transmission. Therefore, we aim for efficient processing of queries and computation of sample-based summaries directly over the unaggregated data. We consider both a model where the algorithm can "see" the complete data set (stream) and another model, where the data is subjected to per-entry fixed-rate sampling before it is made accessible to the summarization algorithm. The second model is motivated by sampled NetFlow, which was deployed in high-speed IP routers. The application of my HIP estimators to approximate distinct counting also belongs in this category (but is listed elsewhere).
Two common statistics over unaggregated data are the distinct count, which is the number of active keys in a specified segment, and the sum, which is the sum of the frequencies of keys in the segment. These are two special cases of frequency cap statistics, defined as the sum of frequencies capped by a parameter T, which are popular in online advertising platforms. We propose the first solution for scalable and accurate estimation of general frequency cap statistics. Our framework brings the benefits of approximate distinct counters to general frequency cap statistics.
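For reference, the exact, aggregation-based computation of a cap statistic is shown below (the scalable estimators in the paper avoid exactly this aggregation step; the name cap_statistic is an assumption made for this example). T=1 recovers the distinct count and a very large T recovers the sum statistic.

```python
from collections import Counter

def cap_statistic(elements, segment, T):
    """Sum over keys in the segment of min(T, frequency of the key).
    T=1 gives the distinct count; a very large T gives the sum statistic."""
    freq = Counter(k for k in elements if k in segment)
    return sum(min(T, c) for c in freq.values())

elements = ["u1", "u2", "u1", "u3", "u1", "u2"]   # unaggregated stream of keys
print(cap_statistic(elements, {"u1", "u2"}, T=2))  # min(2,3) + min(2,2) = 4
```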
The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the distance relation via weighted sampling. Based on that, we present novel algorithms with strong statistical guarantees for estimating average distances from a query point/node or between all pairs of points/nodes. Our results significantly improve over previous work, which was mostly based on uniform sampling, in terms of computation and estimation quality tradeoffs.
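As a point of reference, the uniform-sampling baseline mentioned above can be sketched as follows (not the weighted scheme of the paper; the name estimate_avg_distance and the sample_size parameter are assumptions made for this example): the average distance from a query point is estimated by averaging its distance to a uniform sample of targets.

```python
import random

def estimate_avg_distance(points, query, dist, sample_size=100):
    """Uniform-sampling baseline: estimate the average distance from `query`
    to `points` by averaging over a uniform sample of targets."""
    sample = random.sample(points, min(sample_size, len(points)))
    return sum(dist(query, p) for p in sample) / len(sample)

# Example with points on a line and absolute-difference distance.
pts = [random.gauss(0, 1) for _ in range(10**6)]
print(estimate_avg_distance(pts, 0.0, lambda a, b: abs(a - b), 200))
```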
Graphs with billions of edges, such as social and Web graphs, are common and are a powerful source of information. The graph structure, alone or combined with meta-data, as a static snapshot or evolving, encodes information on centralities, interests, similarities, influence, and communities. To effectively mine this information, algorithms must be highly scalable and models should accurately capture the real-world properties we are after. I use a combination of algorithmic techniques, sketching and estimation techniques, modeling, and data sets to develop effective tools for mining massive graphs.
Influence functions in a graph are defined for subsets of nodes, according to the graph structure. The aim is to find subsets of seed nodes with an (approximately) optimal tradeoff of size and influence. We define a rich class of graph-based influence functions which unifies and extends previous work. Influence is defined through pairwise utility values derived from relations such as reachability, distance, or reverse rank. The utility of a seed set to a node is then defined as a submodular aggregate of the individual seed utilities. Finally, the influence of the seed set is the sum, over nodes, of its utility to the node. Previous work focused on maximum aggregation only. More general submodular aggregation captures fault tolerance of the seed set and, more generally, allows additional seeds to contribute value. We present a meta-algorithm for approximate greedy maximization with strong approximation-quality guarantees and worst-case near-linear computation for all functions in our class. Our meta-algorithm generalizes recent designs specific to maximum aggregation and specific utility functions.
Distances in a network capture relations between nodes and are the basis of centrality, similarity, and influence measures. Often, however, the relevance of a node u to a node v is more precisely measured not by the magnitude of the distance, but by the number of nodes that are closer to v than u is; that is, by the rank of u in an ordering of nodes by increasing distance from v. We identify and address fundamental challenges in rank-based graph mining. We first consider single-source computation of reverse ranks and design a "Dijkstra-like" algorithm which returns nodes in order of increasing approximate reverse rank while only traversing edges adjacent to returned nodes. We then define reverse-rank influence and present a near-linear algorithm for greedy approximate reverse-rank influence maximization. The design relies on our single-source algorithm and on the SKIM sketch-based approximate greedy maximization framework. Our algorithms utilize near-linear preprocessing of the network to compute all-distance sketches. As a contribution of independent interest, we present a novel algorithm for computing these sketches, which have many other applications, on multi-core architectures. We complete this work by establishing the hardness of computing exact reverse ranks for a single source and exact reverse-rank influence, which shows that, with near-linear computation, the small relative errors we obtain are the best we can currently hope for. Finally, we conduct an experimental evaluation on graphs with tens of millions of edges, demonstrating both scalability and accuracy. This project is Eliav's M.Sc. thesis (Tel Aviv University).
A premise at the heart of network analysis is that entities in a network derive utility from their connections. The influence of a seed set S of nodes is defined as the sum, over nodes j, of the utility of S to j. Distance-based utility, which is a decreasing function of the distance from S to j, was explored in several successful research threads from social network analysis and economics: network formation games [Bloch and Jackson 2007]; reachability-based influence [Richardson and Domingos 2002; Kempe et al. 2003]; "threshold" influence [Gomez-Rodriguez et al. 2011]; and closeness centrality [Bavelas 1948]. We formulate a model that unifies and extends this previous work and address the two fundamental computational problems in this domain: influence oracles and influence maximization (IM). An oracle performs some preprocessing, after which influence queries for arbitrary seed sets can be efficiently computed. With IM, we seek a set of nodes of a given size with maximum influence. Since the IM problem is computationally hard, we instead seek a greedy sequence of nodes, with each prefix having influence that is at least 1-1/e of that of the optimal seed set of the same size. We present the first highly scalable algorithms for both problems, providing statistical guarantees on approximation quality and near-linear worst-case bounds on the computation. We perform an experimental evaluation which demonstrates the effectiveness of our designs on networks with hundreds of millions of edges.
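The greedy sequence construction itself is simple; the difficulty is evaluating influence at scale. A minimal sketch (plain greedy against an arbitrary influence oracle, not the scalable algorithms of the paper; greedy_sequence and the toy coverage oracle are assumptions made for this example):

```python
def greedy_sequence(nodes, influence, budget):
    """Greedily append the node whose addition maximizes the influence oracle."""
    seeds = []
    for _ in range(budget):
        best = max((v for v in nodes if v not in seeds),
                   key=lambda v: influence(seeds + [v]))
        seeds.append(best)
    return seeds

# Toy oracle: influence = number of distinct items covered by the seed set.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
print(greedy_sequence(list(cover),
                      lambda S: len(set().union(*(cover[v] for v in S))), 2))
```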
Propagation of contagion is a fundamental process in social, biological, and physical networks. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces. Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and influence maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals and does not scale when there are many queries on very large graphs. The gold standard for influence maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1-1/e) and produces a sequence of nodes, with each prefix having an approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size. We propose a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.
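For context, the expensive baseline that sketch-based designs avoid is Monte-Carlo simulation of the diffusion model. A minimal sketch of an IC influence query (not SKIM; the names ic_spread_once and influence, and the edge-probability representation, are assumptions made for this example):

```python
import random
from collections import deque

def ic_spread_once(graph, seeds):
    """One Independent Cascade simulation; graph[u] = [(v, p_uv), ...]."""
    active, frontier = set(seeds), deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p in graph[u]:
            if v not in active and random.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def influence(graph, seeds, simulations=1000):
    """Monte-Carlo estimate of the expected number of activated nodes."""
    return sum(ic_spread_once(graph, seeds) for _ in range(simulations)) / simulations

g = {0: [(1, 0.5), (2, 0.5)], 1: [(2, 0.3)], 2: []}
print(influence(g, {0}))
```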
In terms of techniques, this work utilizes my 1994 reachability sketching technique for directed graphs and efficient estimators. We extend reachability sketches to be across multiple graphs or a probabilistic model, retaining the same storage requirement and approximation guarantees. For influence maximization, our efficient approximate greedy algorithm SKIM only computes sketches to the point needed to determine the approximate maximum marginal gain with sufficient confidence. It also updates the sketches after each selection to be with respect to the residual problem, so that no work is wasted.
The All-Distances Sketches (ADS), which I introduced back in 1994, turn out to be a very effective tool for mining massive graphs. A small-size ADS can be efficiently computed for each node in the graph. In my original work I used the ADS of each node to estimate neighborhood cardinalities (number of nodes within a certain distance) and also showed how sketches of different nodes can be used to estimate neighborhood relations. We provide a unified view of several ADS definitions which evolved over time and propose the novel HIP estimators. The HIP estimators obtain tighter estimates of neighborhood cardinalities and closeness centralities from the sketch of a node. They also facilitate good estimates for a powerful natural class of queries. We also show how the HIP estimators can be applied to approximate distinct counting on data streams, improving over state-of-the-art practical algorithms.
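To convey the inverse-probability idea behind HIP in the simplest setting, here is a stream distinct counter written for this note (a bottom-k variant under an idealized uniform hash; the names uhash and hip_distinct_count are assumptions, and this is only meant to illustrate the idea, not to reproduce the exact algorithms of the paper): whenever an element enters the sketch, we add the inverse of its inclusion probability at that moment.

```python
import hashlib

def uhash(x):
    """Idealized uniform hash of an element into (0,1)."""
    return (int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % 10**12 + 0.5) / 10**12

def hip_distinct_count(stream, k=64):
    """Approximate distinct count: when an element enters the bottom-k sketch,
    add the inverse of its (current) inclusion probability."""
    sketch, estimate = [], 0.0          # sorted list of the k smallest hashes seen
    for x in stream:
        h = uhash(x)
        if h in sketch:                  # duplicate that is already counted
            continue
        tau = 1.0 if len(sketch) < k else sketch[-1]   # current inclusion threshold
        if h < tau:
            estimate += 1.0 / tau
            sketch.append(h)
            sketch.sort()
            sketch[:] = sketch[:k]
        # elements with h >= tau (including previously seen ones) are ignored
    return estimate

print(hip_distinct_count(i % 5000 for i in range(10**6)))  # roughly 5000
```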
Closeness centrality is an importance measure of nodes in social and other massive graphs which is based on the distances from the node to all other nodes. The classic definition, proposed by Bavelas (1950), Beauchamp (1965), and Sabidussi (1966), is (the inverse of) the average distance to all other nodes. We propose the first highly scalable (near-linear time processing and linear space overhead) algorithm for estimating, within a small relative error, the classic closeness centralities of all nodes in the graph. Our algorithm provides strong probabilistic guarantees on the approximation quality for all nodes of any undirected graph, as well as for centrality computed with respect to round-trip distances in directed graphs. For directed graphs, we also propose algorithms that approximate generalizations of classic closeness centrality to outbound and inbound centralities. These algorithms do not provide worst-case theoretical approximation guarantees, but perform well on the networks we tested. We perform extensive experiments on large networks, demonstrating high scalability and accuracy.
Similarity estimation between nodes based on structural properties of graphs is a basic building block in the analysis of massive networks, used for link prediction, product recommendations, advertisement, attribute completion, and more. Similarity measures that are based on global properties have better recall than local ones but are harder to compute. We make several contributions aiming at both accuracy and scalability. First, we define closeness similarity, a natural measure that compares two nodes based on the similarity of their relations to all other nodes. Second, we show how all-distances sketch (ADS) node labels, which are efficient to compute, can support the estimation of closeness similarity in logarithmic query time. ADSs can also be used as shortest-path (SP) distance oracles, matching the performance of dedicated techniques. Third, we propose the randomized edge lengths (REL) technique and define the corresponding REL distance, which captures both path length and path multiplicity and therefore improves over the SP distance as a similarity measure. The REL distance can also be the basis of closeness similarity and can be estimated using SP computations or using the ADS labels.
My size-estimation scheme is based on a simple but powerful technique: the least-element (LE) method. This method is better known today as min-hash. Suppose we have a subset of a universe of items. We (implicitly) assign a random permutation (or random values) to all items and refer to the permutation rank or random value assigned to an item as its rank. The LE of a subset is then the minimum rank of its members. The LE is, in expectation, smaller when there are more elements in the set, so it is possible to estimate the size of the set from its LE rank. When LE ranks of different subsets are obtained using the same random permutation, they are coordinated. Coordination facilitates many applications. In particular, it means that the LE ranks are mergeable: the LE of a union of subsets is the minimum of the LEs of the subsets. The LE ranks can also be used to estimate the similarity of sets. For example, the probability that two sets have the same LE is equal to their Jaccard similarity, which is the ratio of the intersection size to the union size. Therefore, the Jaccard similarity of the sets can be estimated from their LE ranks.

The accuracy of these size or similarity estimates can be enhanced by repeating this k times (using k permutations), by taking the k smallest ranks in a single permutation, or by hashing elements randomly to k buckets and taking the LE of each of the k buckets. I refer to these sketch flavors, respectively, as k-mins, bottom-k, and k-partition sketches. The three flavors have different advantages, depending on the tradeoffs we want to achieve. Asymptotically, the estimators for all three have the same behavior, but bottom-k sketches carry the most information. In this early work (1994-1997), I studied mostly k-mins estimators (with mention of bottom-k sketches) for cardinality and similarity estimation.
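A minimal sketch of the k-mins flavor for similarity estimation (written for this note; the salted hash stands in for k random permutations, and the names rank, k_mins_sketch, and jaccard_estimate are assumptions made for the example):

```python
import hashlib

def rank(x, i):
    """Random rank of item x under the i-th 'permutation' (a salted hash)."""
    return int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16)

def k_mins_sketch(items, k=32):
    """k-mins sketch: the least-element (LE) rank under each of k permutations."""
    return [min(rank(x, i) for x in items) for i in range(k)]

def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of permutations on which the two sets share the same LE."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

A, B = set(range(0, 1000)), set(range(500, 1500))
print(jaccard_estimate(k_mins_sketch(A), k_mins_sketch(B)))  # Jaccard = 1/3
```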
The first application of the LE technique to size estimation that I am aware of is approximate distinct counting [Flajolet and Martin 1985]. As for similarity estimation, LE sketches can be viewed as a special case of coordinated samples [Brewer, Early, and Joyce 1972]. A very well known application of the LE technique is near-duplicate detection of Web pages [Broder 1997].
I first applied the LE technique in a graph setting (see below): it turned out that (coordinated) sketches of all neighborhoods and all reachability sets of all nodes in a graph can be computed very efficiently, with processing that is nearly linear in the graph size. This resulted in a powerful technique for analyzing very large graphs. Other early applications I explored (see below) are determining the optimal order in which to perform a chain multiplication of sparse matrices and sketch-based tracking of the state of distributed processes for roll-back recovery.
In later papers, I use the term All-Distances Sketch (ADS) for the LE values with respect to all neighborhoods ("balls") of a node in the graph. We consider spatially-decaying aggregation, neighborhood variances, the use of bottom-k sketches, and deriving even better estimators.
These algorithms, which I developed in the early 1990s, consider the idealized PRAM (Parallel Random Access Machine) model of parallel computation. The PRAM model assumes we have as many parallel processors as we can use, and considers the tradeoff between processing time and the total "work" (the product of time and number of processors). The principled exploration of the fundamental chains of dependencies, however, is relevant for the GPU, multi-core, and map-reduce architectures which dominate massive computation today. Some of the structures I used also turn out to be relevant for the design of sketching, streaming, and dynamic sequential algorithms. Finally, the basic ideas are rather simple, aiming not only for improved worst-case bounds but also for simplicity and implementability.
Single-source shortest-paths (SSSP) computation seemed not to parallelize efficiently. We show that for undirected graphs, if we allow a slight fractional stretch (ratio of approximate distance to actual distance), we can perform s approximate SSSP computations in polylogarithmic time using work that is only a little more than that of s sequential applications of Dijkstra's SSSP algorithm. Our chief instrument, which is of independent interest, is an efficient construction of hopsets. A hopset with parameter ε and hop bound d is a small set of auxiliary edges E' such that, for any pair of nodes, the shortest path in E ∪ E' that uses at most d hops (edges) has length at most (1+ε) times the distance between the nodes. Hopsets can also be used as sparse spanners with fractional stretch and small additive error.
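The definition above can be checked directly on small graphs. A small verification sketch written for this note (dense hop-limited Bellman-Ford, suitable only for tiny examples; the names hop_limited_dist and is_hopset are assumptions):

```python
def hop_limited_dist(n, edges, d):
    """All-pairs shortest-path lengths using at most d hops (Bellman-Ford style).
    Nodes are 0..n-1; edges is a list of directed arcs (u, v, length)."""
    INF = float("inf")
    dist = [[0.0 if u == v else INF for v in range(n)] for u in range(n)]
    for _ in range(d):
        new = [row[:] for row in dist]
        for u, v, w in edges:
            for s in range(n):
                if dist[s][u] + w < new[s][v]:
                    new[s][v] = dist[s][u] + w
        dist = new
    return dist

def is_hopset(n, E, E_prime, d, eps):
    """Check: d-hop distances in E ∪ E' are within (1+eps) of true distances in E."""
    true = hop_limited_dist(n, E, n - 1)           # exact shortest-path distances
    hop = hop_limited_dist(n, E + E_prime, d)      # d-hop-limited, with the hopset
    return all(hop[s][t] <= (1 + eps) * true[s][t]
               for s in range(n) for t in range(n) if true[s][t] < float("inf"))

# Undirected path 0-1-2-3 with unit lengths (each edge listed in both directions);
# the auxiliary edge (0, 3) of length 3 lets 2 hops reach everything exactly.
E = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 3, 1.0), (3, 2, 1.0)]
print(is_hopset(4, E, [(0, 3, 3.0), (3, 0, 3.0)], d=2, eps=0.0))  # True
```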