# cows: Collections for wildcard strings¶

**cows** (**co**llections for **w**ildcard **s**trings) is a Python
library that provides efficient collection implementations where equality
checking allows for wildcards in both the search string and the strings already
in the collection.

## Motivation¶

cows was developed for a common problem in bioinformatics: given a set of DNA
sequences with the alphabet `A`

, `T`

, `C`

, `G`

, along with a wildcard
`N`

(indicating that the base is unknown), find the unique sequences and
perform some operation on them. Examples of the operation are: counting how
many times each unique sequence occurs and generate a consensus sequence for
each unique sequence.

For a simple example, for counting unique sequences consider the following input and desired output:

```
input output
----- ------
ATNG ATNG 2 # Comprised of ATNG and ATCN
ATCN ANNT 1
ANNT GTTC 1
GTTC
```

Notice this task requires comparing strings with wilcards not just in one
string, but in both. For example, matching `ATCN`

to `ATNG`

requires that
the third and fourth characters both be considered wildcards.

Naively one could pairwise compare the sequences, ignore the positions where
either contains an `N`

, and check if all other positions match. However,
this quickly becomes intractable as it scales with the square of the number of
sequences.

cows uses a modified implementation of atrie (`cows.trie`

) to reduce
this complexity to scale linearly with the number of sequences.

## Provided Data Structures¶

Below are examples for the data structures included with cows. Please see the documentation in Data Structure Reference for detailed API information.

`cows.List`

¶

A `cows.list`

is a simple list implementation where insertion functions
similarly to the builtin `list`

data structure, but accessor methods take
into account ambiguity. For example:

```
l = cows.List(['ABCD', 'ABC*', '****', 'DEFG'])
print(l.index('D***'))
```

The print statement outputs `2`

since the first match for `D***`

is at
position 2 (with a value of `****`

).

`cows.Set`

¶

A `cows.set`

stores unique strings similar to the builtin `set`

data
structure. Instead of using hashes for equality checks, the underlying
`cows.trie`

is used to check if the pattern being inserted matches any
existing member of the set, taking into account wildcards in both. For
example:

```
import cows
s = cows.Set(wildcard='*')
s.add('ABCD')
s.add('*EFG')
s.add('T')
s.add('ABC*') # Matches ABCD, so not added
s.add('HEF*') # Matches *EFG, so not added
print(s)
```

Produces:

```
cows.Set(['*EFG', 'ABCD', 'T'])
```

`cows.Dict`

¶

cows dictionaries are similar to the builtin `dict`

type insofar as they are
key/value stores. They have a few key differences, however

First, when setting a value, if there is an existing (potentially ambiguous)
match already in the dictionary, you can set an `updater`

function to update
the existing value, rather than simply overwrite it. Further, when inserting a
key, because of ambiguity, multiple existing keys may match. Providing a
`selector`

function lets you define to which of the matches the `updater`

should be applied.

See `cows.dictionary`

for more detailed information.

```
import cows
def increment(match, old_value, new_value):
return old_value + new_value
my_dict = cows.Dict(updater=increment)
my_dict['ABC'] = 1
my_dict['DEF'] = 2
my_dict['AB*'] = 10
for k, v in sorted(my_dict.items()):
print(f'{k} --> {v}')
```

Produces:

```
ABC --> 11
DEF --> 2
```

### cows.Trie¶

Note

Generally the `cows.trie`

data structure shouldn’t be used
directly. Consider using one of its abstractions.

All other cows data structures are based on the `cows.trie`

class. It
allows for ambiguous queries taking into account wildcards both in the query
string and elements in the trie.

An example of it’s use:

```
import cows
t = cows.Trie()
t['ABCD'] = 1
t['DE*G'] = 5
print(f'Matches for ABC* {list(t.get_matches("ABC*"))}')
print(f'Matches for D*FG {list(t.get_matches("D*FG"))}')
```

Outputs:

```
Matches for ABC* [('ABCD', cows.Trie(D, 1))]
Matches for D*FG [('DE*G', cows.Trie(G, 5))]
```

## Performance¶

cows is performant, requiring *O(n)* time for insertions and lookups with
an input size of *n* strings. The naive approach which is currently quite
common involves pairwise comparing the sequences in a collection resulting in
*O(n*^{2}*)*, quickly becoming intractable.