cows: Collections for wildcard strings

cows (collections for wildcard strings) is a Python library that provides efficient collection implementations where equality checking allows for wildcards in both the search string and the strings already in the collection.

Motivation

cows was developed for a common problem in bioinformatics: given a set of DNA sequences with the alphabet A, T, C, G, along with a wildcard N (indicating that the base is unknown), find the unique sequences and perform some operation on them. Examples of the operation are: counting how many times each unique sequence occurs and generate a consensus sequence for each unique sequence.

For a simple example, for counting unique sequences consider the following input and desired output:

input           output
-----           ------
ATNG            ATNG 2 # Comprised of ATNG and ATCN
ATCN            ANNT 1
ANNT            GTTC 1
GTTC

Notice this task requires comparing strings with wilcards not just in one string, but in both. For example, matching ATCN to ATNG requires that the third and fourth characters both be considered wildcards.

Naively one could pairwise compare the sequences, ignore the positions where either contains an N, and check if all other positions match. However, this quickly becomes intractable as it scales with the square of the number of sequences.

cows uses a modified implementation of atrie (cows.trie) to reduce this complexity to scale linearly with the number of sequences.

Provided Data Structures

Below are examples for the data structures included with cows. Please see the documentation in Data Structure Reference for detailed API information.

cows.List

A cows.list is a simple list implementation where insertion functions similarly to the builtin list data structure, but accessor methods take into account ambiguity. For example:

l = cows.List(['ABCD', 'ABC*', '****', 'DEFG'])

print(l.index('D***'))

The print statement outputs 2 since the first match for D*** is at position 2 (with a value of ****).

cows.Set

A cows.set stores unique strings similar to the builtin set data structure. Instead of using hashes for equality checks, the underlying cows.trie is used to check if the pattern being inserted matches any existing member of the set, taking into account wildcards in both. For example:

import cows

s = cows.Set(wildcard='*')
s.add('ABCD')
s.add('*EFG')
s.add('T')
s.add('ABC*')  # Matches ABCD, so not added
s.add('HEF*')  # Matches *EFG, so not added

print(s)

Produces:

cows.Set(['*EFG', 'ABCD', 'T'])

cows.Dict

cows dictionaries are similar to the builtin dict type insofar as they are key/value stores. They have a few key differences, however.

First, when setting a value, if there is an existing (potentially ambiguous) match already in the dictionary, you can set an updater function to update the existing value rather than simply overwrite it. Further, when inserting a key/value pair, multiple existing keys may match the new key due to ambiguity. Specifying a selector function at instantiation lets you define to which of the matches the updater should be applied.

See cows.dictionary for more detailed information.

import cows

def increment(match, old_value, new_value):
    return old_value + new_value

my_dict = cows.Dict(updater=increment)
my_dict['ABC'] = 1
my_dict['DEF'] = 2
my_dict['AB*'] = 10

for k, v in sorted(my_dict.items()):
    print('{} --> {}'.format(k, v))

Produces:

ABC --> 11
DEF --> 2

cows.Trie

Note

Generally the cows.trie data structure shouldn’t be used directly. Consider using one of its abstractions.

All other cows data structures are based on the cows.trie class. It allows for ambiguous queries taking into account wildcards both in the query string and elements in the trie.

An example of it’s use:

import cows

t = cows.Trie()
t['ABCD'] = 1
t['DE*G'] = 5

print('Matches for ABC* {}'.format(list(t.get_matches("ABC*"))))
print('Matches for D*FG {}'.format(list(t.get_matches("D*FG"))))

Outputs:

Matches for ABC* [('ABCD', cows.Trie(D, 1))]
Matches for D*FG [('DE*G', cows.Trie(G, 5))]

Performance

cows is performant, requiring O(n) time for insertions and lookups with an input size of n strings. The naive approach which is currently quite common involves pairwise comparing the sequences in a collection resulting in O(n2), quickly becoming intractable.