Intersection Calculs
In search engines as Google, for each key-word there is a precomputed list of pages associated to this key-word. When answering a request with many words, to return the list of relevant references, the server must compute the intersection of all sets of references.
Fine analysis of Intersection Calculus
The intersection calculus worst case complexity has an easy tight lower bound of Theta(n) for n fixed. By defining a difficulty measure on instances, we define equivalence classes and study the worst case complexity on each of those class. This gives a finer analysis of the problem, and help to find and argue for a better algorithm.
Worst case complexity for fixed size of instances is the most natural way to analyse algorithms, and problems that these algorithms solve. But for some applications this analysis is not sufficient, as many distinct algorithms are equivalently optimal in this measure, but strongly constrasted in their practical performances.
To solve this problem on some Computational Geometry algorithm, Kirkman and Seidel proposed to give the complexity of their algorithm in function of the size of the input and of the size of the output. The corresponding lower bound they give in this context proves the optimality of their algorithm in a new way, for fixed size of input and output. The same work has been done on different sorting algorithms by many people (Estivill-Castro and Wood wrote a whole survey on this topic), fixing not the output size but an internal characteristic of the instance, the entropy measuring the initial disorder of the array to be sorted. They give lower bounds and corresponding optimal algorithms for different measures, and some reduction between measures.
On computation of Union and Intersection of sorted arrays, Demaine, Lopez-Ortiz, Munro, Kenyon and I separately give several measures of difficulty (new name for the entropy of sorting adaptive algorithm), lower bounds and corresponding algorithm.
t-Threshold set and opt threshold set
When the intersection of the sorted sets is empty, then so is the answer of the search engine. The user then usually send another request, removing a keyword. This protocol have a communication cost, and take usert time, when the relaxation of the query could be automatised with feeble cost. We define the t-threshold set which is constituted by elements present in at least t sorted sets, and the opt-threshold set (also called pertinent set) which is equal to the smallest non empty t-threshold set, for all t. We give a lower bound and an optimal algorithm for the t-threshold set calcul, and an algorithm nearly optimal for the pertinent set calculus.
Combinations of Decision problems
Consider the element version of the intersection problem: for a given element x, decide wether it belongs to the intersection of k sorted arrays. This decision problem is a conjunction of decision problem: x is in the intersection if and only if it is in the first array AND the secund array AND all other arrays. We study the general combination of decision problems for which computational complexity lower bounds are known, to deduce computational complexity lower bounds on the combined problem.
Problems Considered
Intersection Problem:
Given a set of k ordered sets, compute the intersection of these sets.
t-Threshold Set Problem:
Given a set of k ordered sets, compute the set of elements which are in at least t sets.
Optimal Threshold Set:
In the context of database queries, a natural variant of the t-Threshold Problem is the following: Given a set of k ordered sets, compute the minimal non-empty Threshold set.
InterUnion Problem:
So far, in the applications set Ai was considered to be the set of database elements which contain the ith word in the k-word query. However, more sophisticated algorithms, given a query word w, design a short list of all words which are variants of w, then find a set Av for each variant v. Instead of finding the intersection of all the sets, one is then faced with a new problem, finding the intersection of unions of sets: Given a set of k families of ordered sets compute the intersection of the unions of the elements in the families.
Precise Search Problem:
More generally, given k sorted sets and an element x, one can construct a k-bit word such that xi=1 if and only if x is an element of Ai. Take a boolean function f:{0,1}k -> 1, and consider the problem of finding all the elements (or deciding whether there exist any element, in the decision version of the problem) such that f(x)=1.
When f(x1,... ,xk)=(x1 + ... + xk>0), we get the Union Problem.
When f(x1,... ,xk)=x1 * ... * xk, we get the Intersection Problem.
When f(x1,... ,xk)=(x1 + ... + xk> t-1), we get the t-Threshold Problem.
Can anything be said about the problem for general f ?
We think also to apply these results to other models than the integer comparison model, as the interval comparison model. This model will be used in the context of the indexing of shared files on a LAN. The problems cited above can be defined on this model directly.
Applications
This work has been inspired by project FFSS, a search engine integrated to a file sharing protocol, under development and testing. In this indexed search enine arrays are replaced by trees of arrays, and elements are intervals, but lower bounds and algorithms extends to this model.
Publications
"Adaptive Intersection and t-Threshold Problems", Jérémy Barbay and Claire Kenyon (SODA) (ps.gz/ pdf.gz/ ps/ pdf)
( Abstract: Consider the problem of computing the intersection of k sorted arrays. In the comparison model, we prove a new lower bound which depends on the non-deterministic complexity of the instance, and implies that the algorithm of Demaine, López-Ortiz and Munro is optimal in this "adaptive" sense (for k much smaller than n). We extend the lower bound and the algorithm to the t-Threshold Problem, which consists in finding the elements which are in at least t of the k sets. For the t-threshold Problem, the analysis of our algorithm gives a complexity in O(log k) times the lower bound. These problems are motivated by boolean queries in text database systems. )
"Analyse fine: bornes inférieures et algorithmes de calculs d'intersection pour moteurs de recherche" (Fine Analysis: Lower Bounds and Algorithms of Intersection Computation for Search Engines [IN FRENCH]), Jérémy Barbay
Ph. D. Thesis 2002
(Abstract: Indexed search engines solving conjunctive queries need to compute the intersection of sorted sets. The usual worst case analysis is too crude to compare the algorithms for this problem. Ideally, an efficient algorithm should solve easy instances more quickly. This problem was already studied by Demaine, Lopez-Ortiz and Munro. We define a model of finer analysis with a new measure of difficulty based on the non-deterministic complexity of the problem. We prove the optimality of the algorithm of Demaine, Lopez-Ortiz and Munro in this model. We define the t-threshold set problem, a generalization of the problems of finding the intersection or the union of sorted sets, and we prove similar results on this problem. We define the opt-threshold set, the minimal non-empty threshold set, and we give an almost optimal algorithm to compute this set. We study further generalization to combinations of decision problems. This generalizes previous results to the computation of the interunion (computing the intersection of a union of sorted sets), a problem motivated by search engines indexed on factors of keywords. As an application, we present the search engine integrated to FFSS, a file sharing system. )
Slides of my talks on this subject: LIP, LIX, SODA, Thesis Defense .
"Les étudiants ont le sens du protocole" (Students have a new protocol), Jérémy Barbay, (PLEINSUD)
(Abstract: Fleming File Sharing System (FFSS) developped by students from Orsay is a new protocol to share files on a local network, centralized and indexed, so that network trafic is minimized in comparison with Samba, and so that efficient research can be performed. This is a popularisation article in the University journal.)