Ambiguous symbols
The symbols page lists all symbols that the underlying sequence files in FASTA format can contain.
The ambiguous symbols arise from imperfect reads in the sequencer.
While one mostly queries for the symbols A, C, G, T and - to look for specific features and mutations of a sequence,
or N for quality control of the underlying data,
the ambiguous symbols R through V are often too cumbersome to consider in analyses.
LAPIS supports the flexible consideration of these ambiguous symbols through an extension of the boolean logic syntax in the variant queries.
Here we introduce a new expression, MAYBE, which also matches sequences that have an ambiguous code that may represent the queried value.
Example
Consider the following sequences:
12345
AAACG
AARCG
AANCG
AAGCG
AAACG

A filter for the mutation 3G returns only the sequence AAGCG, as it is the only sequence with the symbol G at position 3.
The filter MAYBE(3G), however, also matches the sequences AARCG and AANCG, which may have the symbol G at position 3,
because the symbols R and N can represent Guanine.
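The difference between the two filters can be sketched in a few lines of Python. This is only an illustration of the matching logic, not LAPIS's implementation: the helper names `has_symbol` and `maybe_has_symbol` are hypothetical, the table uses the standard IUPAC nucleotide ambiguity codes, and treating N as "any of A, C, G, T" is an assumption.

```python
# Standard IUPAC nucleotide codes: each symbol maps to the bases it can represent.
# Assumption: N stands for any of A, C, G, T; "-" only represents itself.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "-": "-",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def has_symbol(seq: str, position: int, symbol: str) -> bool:
    """Exact match, like the filter 3G (positions are 1-based)."""
    return seq[position - 1] == symbol


def maybe_has_symbol(seq: str, position: int, symbol: str) -> bool:
    """Ambiguity-aware match, like MAYBE(3G): the stored symbol can represent `symbol`."""
    return symbol in IUPAC[seq[position - 1]]


sequences = ["AAACG", "AARCG", "AANCG", "AAGCG", "AAACG"]

exact = [s for s in sequences if has_symbol(s, 3, "G")]
maybe = [s for s in sequences if maybe_has_symbol(s, 3, "G")]

print(exact)  # ['AAGCG']
print(maybe)  # ['AARCG', 'AANCG', 'AAGCG']
```

Note that every exact match is also a MAYBE match, so MAYBE(3G) always returns a superset of 3G.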
MAYBE and NOT
Combining ambiguous symbols with negation (not or !) can seem unintuitive at first.
When querying for sequences that do not have, for example, a G at position 3, should that include ambiguous sequences or not?
When querying for !3G the result includes sequences that might not have a G, i.e. also sequences with N or R at position 3.
Using !MAYBE(3G) gives us the set of sequences that definitely don’t have a G at 3; they have a symbol that cannot represent G.
Example
Consider the same sequences as above.
3G returns only AAGCG and !3G returns the four other sequences.
!MAYBE(3G) returns only the two AAACG sequences, because only for these do we know that position 3 is definitely not a G.
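The two negations can be compared with a small Python sketch. It illustrates the set logic only and is not how LAPIS evaluates queries: the helpers `is_g` and `maybe_g` are hypothetical names, and the set of symbols that can represent G follows the standard IUPAC codes.

```python
# IUPAC symbols that can represent G: G itself plus R, S, K, B, D, V and N.
CAN_BE_G = set("GRSKBDVN")

sequences = ["AAACG", "AARCG", "AANCG", "AAGCG", "AAACG"]


def is_g(seq: str, position: int) -> bool:
    """Exact match, like the filter 3G (positions are 1-based)."""
    return seq[position - 1] == "G"


def maybe_g(seq: str, position: int) -> bool:
    """Ambiguity-aware match, like MAYBE(3G)."""
    return seq[position - 1] in CAN_BE_G


# !3G: everything that is not exactly G, including ambiguous R and N.
not_g = [s for s in sequences if not is_g(s, 3)]

# !MAYBE(3G): only sequences whose symbol cannot represent G at all.
not_maybe_g = [s for s in sequences if not maybe_g(s, 3)]

print(not_g)        # ['AAACG', 'AARCG', 'AANCG', 'AAACG']
print(not_maybe_g)  # ['AAACG', 'AAACG']
```

In this sketch !MAYBE(3G) is the stricter filter: its result is always contained in the result of !3G.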