three or more apparent haplotypes at repeats #29

Open
ekg opened this issue May 31, 2016 · 9 comments

Comments

@ekg
Owner

ekg commented May 31, 2016

A lot of our errors look like this:

image

But when we go to tview, we see the problem: the reads match the reference, but only when they don't fully overlap the locus.

image

There are a few possible solutions to this.

Detect the repeated sequences and require full overlap (similar to freebayes).

Include the alignment start and end coordinates as a feature.
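
A minimal sketch of the first option, assuming reads and the locus are half-open intervals on the same reference coordinates. The names `fully_overlaps` and `split_by_overlap` are hypothetical and are not taken from this project or from freebayes:

```python
def fully_overlaps(read_start, read_end, locus_start, locus_end):
    """Return True if the read interval [read_start, read_end) spans
    the whole locus interval [locus_start, locus_end)."""
    return read_start <= locus_start and read_end >= locus_end


def split_by_overlap(reads, locus_start, locus_end):
    """Partition (name, start, end) reads into those that span the
    locus completely and those that only partially overlap it."""
    complete, partial = [], []
    for name, start, end in reads:
        if fully_overlaps(start, end, locus_start, locus_end):
            complete.append(name)
        else:
            partial.append(name)
    return complete, partial


# Hypothetical reads around a locus at [100, 120).
reads = [("r1", 90, 130), ("r2", 105, 140), ("r3", 80, 115)]
complete, partial = split_by_overlap(reads, 100, 120)
```

Only `r1` spans the locus here; `r2` starts inside it and `r3` ends inside it, so both land in the partial pile rather than being discarded.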

@nikete thoughts?

@nikete
Collaborator

nikete commented May 31, 2016

A mix of the two might be good; I don't see how the coordinates would map linearly without exhaustively enumerating that space in the training, which seems unlikely given the sub-1% errors.

Detecting the repeated seq and requiring full overlap seems like it would waste a lot of knowledge, right?

@ekg
Owner Author

ekg commented May 31, 2016

The coordinates would be relative to the window, or maybe to any underlying
repeats.

The learner can't seem to figure out that there is a repeat and the
sequence is the same as that in the reads.

We could add a feature which was the length beyond a repeat at which the
read starts and ends.
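
One way such a feature could look, as a rough sketch only. It assumes half-open intervals and that the repeat boundaries are already known; `repeat_flank_features` is a hypothetical name, not something the code base has:

```python
def repeat_flank_features(read_start, read_end, repeat_start, repeat_end):
    """Return (left_flank, right_flank): how many bases the read extends
    beyond each end of the repeat [repeat_start, repeat_end).

    A negative value means the read starts (or ends) inside the repeat,
    so it does not anchor that side in unique flanking sequence.
    """
    left_flank = repeat_start - read_start
    right_flank = read_end - repeat_end
    return left_flank, right_flank


# Hypothetical repeat at [100, 120).
spanning = repeat_flank_features(90, 130, 100, 120)
starts_inside = repeat_flank_features(105, 140, 100, 120)
```

The sign carries the information that matters: a read with both values non-negative spans the repeat, while a negative value flags the side on which the read cannot distinguish repeat lengths.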

@ekg
Owner Author

ekg commented May 31, 2016

By the way, we do retain information from the alignments to the graph, so
we're not necessarily throwing these all out. We might just want to mark
which reads don't completely overlap the locus. Maybe we put them in the
incomplete pile. And we should be exposing the repeat structures in some
way. Otherwise it would seem to have no mechanism to learn them.

@nikete
Collaborator

nikete commented May 31, 2016

Even with relative coordinates, expressing this linearly would have to be quadratic in the reference, right?

Marking the reads that don't overlap the locus seems like it is asking a lot of the linear learner. In particular, here those are the ones with all the info, and in other cases it is the opposite; I'm afraid it will average out.

Exposing the repeat structure seems important, agreed.

@ekg
Owner Author

ekg commented May 31, 2016

Another example:

image

In tview it's clear that the reads supporting the reference aren't fully overlapping the repeat.

image

@ekg
Owner Author

ekg commented May 31, 2016

In freebayes we can exclude these. In fact, although that isn't the default, there is discussion and support from folks like @chapmanb that we should do so by default, as it improves performance. The --no-partial-observations and --min-repeat-entropy options are used to change this behavior.

@nikete to explain: freebayes decides on a haplotype window over which it infers the genotype(s) for the samples in the analysis. In cases where there is an exact repeat or the sequence at the locus is a short repeat followed by low-complexity sequence, we use a haplotype window long enough to reach one shannon per base (--min-repeat-entropy 1), and exclude any reads that only partially overlap the resulting window (--no-partial-observations). These options are rather difficult to use correctly, but many people would be interested if we can figure out a nice way to do so.
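
To illustrate the idea of growing a window until it reaches a target entropy, here is a small sketch. It is not freebayes' actual implementation: it assumes entropy is measured on the base composition of the window, and that the window is extended one base on each side per step. The names `shannon_entropy` and `extend_window` are hypothetical:

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def extend_window(ref, start, end, min_entropy=1.0):
    """Grow the half-open window [start, end) over ref by one base on
    each side until its entropy reaches min_entropy or the window
    covers the whole reference sequence."""
    while shannon_entropy(ref[start:end]) < min_entropy and (
        start > 0 or end < len(ref)
    ):
        start = max(0, start - 1)
        end = min(len(ref), end + 1)
    return start, end


# Hypothetical reference with a homopolymer run in the middle.
ref = "CGTAAAAAAAACGT"
window = extend_window(ref, 5, 8)
```

Starting inside the run of A's, the window keeps growing until it takes in enough flanking sequence to pass the threshold. Reads that do not span the final window would then be the ones treated as partial observations.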

The graph feature should be capturing even the stuff that doesn't fully overlap the locus.

@ekg
Owner Author

ekg commented May 31, 2016

In the last case the graph feature doesn't help us because we've inappropriately broken the site into two. That's another problem that I find a bit confusing... I thought I'd resolved this as well but apparently not enough.

@ekg
Owner Author

ekg commented May 31, 2016

Calling reference is not the right thing to do. In these examples we have non-reference genotypes.

@nikete
Collaborator

nikete commented Jun 3, 2016

Just as a note for the future: this will work well on 50X stuff, but for MinION or low-coverage methods it might be better not to take them. To center, we could use the middle window to be the high-entropy region.
