three or more apparent haplotypes at repeats #29

ekg · 2016-05-31T13:42:04Z

A lot of our errors look like this:

But when we go to tview, we see the problem. The reads match the reference, but only when they don't fully overlap the locus.

There are a few possible solutions to this.

Detect the repeated sequences and require full overlap (similar to freebayes).

Include the alignment start and end coordinates as a feature.

@nikete thoughts?

nikete · 2016-05-31T14:46:22Z

A mix of the two might be good; I don't see how the coordinates directly would map linearly without exhaustively enumerating that space in the training, which seems unlikely given the sub-1% errors.

Detecting the repeated seq and requiring full overlap seems like it would waste a lot of knowledge, right?

ekg · 2016-05-31T14:48:13Z

The coordinates would be relative to the window, or maybe to any underlying
repeats.

The learner can't seem to figure out that there is a repeat and the
sequence is the same as that in the reads.

We could add a feature which was the length beyond a repeat at which the
read starts and ends.
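A minimal sketch of such a feature, assuming 0-based half-open coordinates for both the read alignment and the repeat. The names `repeat_overhang` and `fully_spans` and the `min_flank` parameter are hypothetical and are not taken from any existing code in this project.

```python
from typing import Tuple


def repeat_overhang(read_start: int, read_end: int,
                    repeat_start: int, repeat_end: int) -> Tuple[int, int]:
    """Return (left, right) overhang of a read relative to a repeat.

    Coordinates are assumed 0-based, half-open. A positive value means the
    read extends that many bases beyond the repeat on that side; a negative
    value means the read starts (or ends) inside the repeat.
    """
    left = repeat_start - read_start
    right = read_end - repeat_end
    return left, right


def fully_spans(read_start: int, read_end: int,
                repeat_start: int, repeat_end: int,
                min_flank: int = 1) -> bool:
    """True if the read covers the repeat plus min_flank bases on each side."""
    left, right = repeat_overhang(read_start, read_end, repeat_start, repeat_end)
    return left >= min_flank and right >= min_flank
```

The two overhang values could be passed to the learner as features directly, or collapsed into the single boolean when only a spanning or non-spanning distinction is wanted.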

ekg · 2016-05-31T14:52:13Z

By the way, we do retain information from the alignments to the graph, so
we're not necessarily throwing these all out. We might just want to mark
which reads don't completely overlap the locus. Maybe we put them in the
incomplete pile. And we should be exposing the repeat structures in some
way. Otherwise it would seem to have no mechanism to learn them.
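One way the repeat structure might be exposed is to annotate tandem repeats in the reference window before building features. The following is only a rough sketch under simple assumptions (exact repeats only, units of up to 6 bp, at least 3 copies); `find_tandem_repeats` and its parameters are hypothetical names, and this is not how freebayes or the graph code detects repeats.

```python
from typing import List, Tuple


def find_tandem_repeats(seq: str, max_unit: int = 6,
                        min_copies: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, unit) for exact tandem repeats in seq.

    Coordinates are 0-based, half-open. At each position the longest run
    over all unit lengths is kept, and scanning resumes after that run.
    """
    hits = []
    i, n = 0, len(seq)
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough sequence left for a unit of this length
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                end = i + copies * k
                if best is None or end > best[1]:
                    best = (i, end, unit)
        if best is not None:
            hits.append(best)
            i = best[1]
        else:
            i += 1
    return hits
```

The resulting intervals could then be used to mark which reads only partially overlap a repeat, rather than discarding them.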

nikete · 2016-05-31T14:54:23Z

Even the relative coordinates, to make them express this linearly, would have to be quadratic in the reference, right?

Marking the reads that don't overlap the locus seems like it is asking a lot of the linear learner. In particular, here those are the ones with all the info, and in other cases it is the opposite; I'm afraid it will average out.

Exposing the repeat structure seems important, agreed.

ekg · 2016-05-31T14:54:43Z

Another example:

In tview it's clear that the reads supporting the reference aren't fully overlapping the repeat.

ekg · 2016-05-31T14:58:46Z

In freebayes we can exclude these. In fact, although that isn't the default, there is discussion and support from folks like @chapmanb for doing so by default, as it improves performance. --no-partial-observations and --min-repeat-entropy are used to change this behavior.

@nikete to explain: freebayes decides on a haplotype window over which it infers the genotype(s) for the samples in the analysis. In cases where there is an exact repeat, or the sequence at the locus is a short repeat followed by low-complexity sequence, we use a haplotype window long enough to reach one shannon per base (--min-repeat-entropy 1), and exclude any reads that only partially overlap the resulting window (--no-partial-observations). These are rather difficult to use correctly, but many people would be interested if we can figure out a nice way to do so.
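To make the idea concrete, here is a rough sketch of entropy-driven window extension followed by exclusion of partial observations. It is an illustration only and not the freebayes implementation: it assumes entropy is measured on base composition within the window, that the window grows by one base on each side per step, and that reads are given as 0-based half-open (start, end) intervals. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the base composition of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def extend_window(ref: str, start: int, end: int,
                  min_entropy: float = 1.0) -> Tuple[int, int]:
    """Grow [start, end) until its entropy reaches min_entropy or ref runs out."""
    while (shannon_entropy(ref[start:end]) < min_entropy
           and (start > 0 or end < len(ref))):
        if start > 0:
            start -= 1
        if end < len(ref):
            end += 1
    return start, end


def full_observations(reads: List[Tuple[int, int]],
                      start: int, end: int) -> List[Tuple[int, int]]:
    """Keep only reads whose alignment covers the whole window."""
    return [(s, e) for s, e in reads if s <= start and e >= end]


# A homopolymer run flanked by more complex sequence.
ref = "GCTT" + "AAAAAAAA" + "CGTG"
window = extend_window(ref, 6, 8)
kept = full_observations([(0, 16), (5, 16), (0, 10)], *window)
```

Here the window has to grow past both ends of the A run before its entropy reaches the threshold, and only the read spanning the whole extended window is kept.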

The graph feature should be capturing even the stuff that doesn't fully overlap the locus.

ekg · 2016-05-31T15:02:39Z

In the last case the graph feature doesn't help us because we've inappropriately broken the site into two. That's another problem that I find a bit confusing... I thought I'd resolved this as well but apparently not enough.

ekg · 2016-05-31T15:03:09Z

It's not the right thing to do to call reference. In these examples we have non-reference genotypes.

nikete · 2016-06-03T08:15:02Z

Just as a note for the future: this will work well on 50X stuff, but for MinION or low-coverage methods it might be better not to take them; to center, we could use the middle window to be the high-entropy region.
