Proyectos
Approximate Parikh Matching
Resumen
The classic string pattern matching problem is concerned with finding a pattern within a text, where both strings are drawn from an alphabet set Sigma of size alpha. The output of the string matching problem should list all occurrences of the pattern string in the text string. An existing and well-known variant of the exact pattern matching problem is (delta, gamma)-Matching. (delta, gamma)-Matching algorithms are very effective in searching for all similar but not necessarily identical occurrences of a given pattern. In the $(delta, gamma)-Matching problem, two integer strings of the same length match if the corresponding integers differ by at most a fixed bound delta and the total sum of these differences is bounded by gamma. These problems have found numerous applications in the fields of bioinformatics, data mining, computer vision and music information retrieval, to name some. Many kinds of algorithms have been put forward to resolve (delta, gamma)-Matching and its variants. There is another rather interesting variant of the string pattern matching problem called Parikh Matching. This problem searches for similarities in composition rather than the order of characters. The input to this problem is a string S[1..n], defined over a finite ordered alphabet Sigma={a_1,..., a_sigma}, and a Parikh vector q in N^sigma. A Parikh vector is a sigma-component vector such that its i-th component denotes the number of occurrences of the character a_i in a string. Then, the Parikh matching problem consists of finding all the substrings in S whose Parikh vector is q. For instance, let us consider alphabet Sigma=\{a,b,c\}, q=(2, 1, 2), and S=bccbacabbaacbcc; there are just two occurrences of q in S starting at position 3 and 10, namely cbaca and aacbc, respectively. This same problem has also been referred to in the literature as jumbled pattern matching, Abelian pattern matching, Parikh vector, Parikh mapping, compomer, composition, permuted strings, permuted pattern and others. Correspondingly, there are naturally diverse applications in the area of bioinformatics such as alignment, SNP discovery, interpretation of mass spectrometry data and metabolic network analysis. Both (delta, gamma)-Matching and Parikh Matching have important applications in different areas. (delta, gamma)--Matching is useful in cases where it is necessary to look for close instances of a pattern (not necessarily the pattern itself) considering some criteria of approximation established by delta and gamma. As for Parikh Matching, it is relevant when we search for strings that have the same composition of the pattern even though the order of symbols may vary. The essence of these two problems are not mutually exclusive and it would even be really interesting to study a combination that makes the most of the advantages provided by them. We term this new problem as (delta, gamma)-Parikh Matching, which seeks to allow much more flexibility in the search process. We use delta to restrict the local error and gamma to restrict the global error of an occurrence with respect to q. Thus, given a string S=S[1..n] and a Parikh vector q, we are interested in the substrings whose Parikh vector p (delta, gamma)-matches with q. For example, let us consider alphabet Sigma={a,b,c\}, q=(2,2,3), and $S=caaccbaaaabccbb$; there are not occurrences if exact Parikh matching is used. However, if we set delta=1 and gamma=1 we find two occurrences starting at position 1. Notice that if we set gamma=3 and maintain delta=1, we get 19 occurrences with aaabccbb being the longest one and accb one of the shortest ones. In other words, this problem looks for strings that have similar but not necessarily the same composition established by the Parikh vector. It is highly motivating to study these new problems, from an algorithmic and combinatorial perspective. In conclusion, this project aims to define new approximation paradigms for Parikh matching on strings based on delta and
Convocatoria
Nombre de la convocatoria:Registro único de proyectos
Modalidad:Registro único de proyectos
Responsable