randomized svd draft #3008

hanbin973 · 2024-10-03T18:32:08Z

Description

A draft of randomized principal component analysis (PCA) built on TreeSequence.genetic_relatedness_vector. The implementation currently uses scipy.sparse, which should eventually be removed.
This part of the code is only used when collapsing a #sample * #sample GRM into a #individual * #individual matrix.
Therefore, it will not be difficult to replace with pure numpy.
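
As a rough illustration of the collapsing step in pure numpy (a sketch only, not the code in this PR): the function name `collapse_to_individuals` and the toy matrix are made up for the example, and it builds a dense indicator matrix, which a real implementation would probably avoid by working on matrix-vector products.

```python
import numpy as np


def collapse_to_individuals(sample_grm, sample_individuals, num_individuals):
    """Sum a (#sample x #sample) matrix into a (#individual x #individual) one.

    Hypothetical helper for illustration; sample_individuals[i] is the
    individual that sample i belongs to.
    """
    n = sample_grm.shape[0]
    # Z[i, j] = 1 if sample i belongs to individual j, else 0.
    Z = np.zeros((n, num_individuals))
    Z[np.arange(n), sample_individuals] = 1.0
    return Z.T @ sample_grm @ Z


# Toy example: four samples grouped into two diploid individuals.
grm = np.arange(16, dtype=float).reshape(4, 4)
grm = grm + grm.T
ind_grm = collapse_to_individuals(grm, np.array([0, 0, 1, 1]), 2)
```

Each entry of `ind_grm` is the sum of the sample-level entries over all pairs of samples belonging to the two individuals.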

The API was partially taken from scikit-learn.

To add some details, iterated_power is the number of power iterations used by the range finder in the randomized algorithm. The SVD error decreases exponentially as a function of this number.
The effect of power iteration is most pronounced when the eigenspectrum of the matrix decays slowly, which in my experience seems to be the case for tree sequence GRMs.
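
To make the role of iterated_power concrete, here is a minimal numpy sketch of a randomized eigendecomposition with power iterations. It is an illustration under assumptions, not the implementation in this PR: the function `randomized_eigh`, the dense test matrix, and its spectrum are invented for the example, and the operator is assumed to be symmetric positive semi-definite and available only through matrix products.

```python
import numpy as np


def randomized_eigh(matvec, n, n_components, iterated_power=5,
                    num_oversamples=10, random_seed=None):
    """Approximate the top eigenpairs of a symmetric PSD operator.

    matvec(X) must return A @ X for an (n, k) array X.
    """
    rng = np.random.default_rng(random_seed)
    k = n_components + num_oversamples
    # Initial sketch: apply the operator to a random test matrix.
    Q, _ = np.linalg.qr(matvec(rng.normal(size=(n, k))))
    # Power iterations: each pass damps the trailing eigen-directions further.
    for _ in range(iterated_power):
        Q, _ = np.linalg.qr(matvec(Q))
    # Solve the small projected eigenproblem and map back.
    B = Q.T @ matvec(Q)
    evals, evecs = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(evals)[::-1][:n_components]
    return Q @ evecs[:, order], evals[order]


# Synthetic matrix with a slowly decaying spectrum (an assumption for the demo).
rng = np.random.default_rng(0)
n = 200
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
true_evals = 1.0 / np.sqrt(np.arange(1, n + 1))
A = (V * true_evals) @ V.T


def top_error(iterated_power):
    _, evals = randomized_eigh(
        lambda X: A @ X, n, 2, iterated_power=iterated_power, random_seed=1
    )
    return np.abs(evals - true_evals[:2]).max()


err_no_power = top_error(0)
err_with_power = top_error(5)
```

On a spectrum like this one, `err_with_power` should come out well below `err_no_power`, which is the behaviour described above.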

indices specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.

hanbin973 · 2024-10-03T18:34:34Z

@petrelharp Here's the code.

codecov · 2024-10-03T18:35:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.07%. Comparing base (76ab046) to head (587409b).
Report is 64 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   89.82%   87.07%   -2.75%     
==========================================
  Files          29       11      -18     
  Lines       31986    24666    -7320     
  Branches     6192     4556    -1636     
==========================================
- Hits        28730    21478    -7252     
+ Misses       1859     1824      -35     
+ Partials     1397     1364      -33

Flag	Coverage Δ
c-tests	`86.69% <ø> (ø)`
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`89.05% <ø> (ø)`
python-tests	`?`

Flags with carried forward coverage won't be shown.

see 18 files with indirect coverage changes

python/tskit/trees.py

petrelharp · 2024-10-04T01:45:27Z

This looks great! Very elegant. I think probably we ought to include a samples argument, though? For consistency, but also since the tree sequence represents phased data, and so it's actually informative to look at the PCs of maternally- and paternally-inherited chromosomes separately.

So, how about the signature is like

def pca(samples=None, individuals=None, ...)

and:

the default is equivalent to samples=ts.samples(), individuals=None
you can't have both samples and individuals specified
if individuals is a list of individual IDs then it does as in the code currently
otherwise, it just skips the "sum over individuals" step
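
A minimal sketch of how these rules could be checked up front. The helper name `resolve_pca_targets`, its return convention, and the error message are hypothetical and only illustrate the proposed behaviour; `all_samples` stands in for ts.samples().

```python
import numpy as np


def resolve_pca_targets(all_samples, samples=None, individuals=None):
    """Return ("samples" | "individuals", ids) following the rules above."""
    if samples is not None and individuals is not None:
        raise ValueError("Specify at most one of samples and individuals")
    if individuals is not None:
        # Per-individual mode: the "sum over individuals" step is applied.
        return "individuals", np.asarray(individuals)
    if samples is None:
        # Default: equivalent to samples=ts.samples(), individuals=None.
        samples = all_samples
    # Per-sample mode: the summing step is skipped.
    return "samples", np.asarray(samples)


mode, ids = resolve_pca_targets(all_samples=np.arange(6))
```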

Note that we could be getting PCs for non-sample nodes (since individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this?

python/tskit/trees.py

petrelharp · 2024-10-07T15:41:25Z

Ah, sorry - one more thing - does this work with windows? (It looks like not?)

I think the way to do the windows would be something like

drop_windows = windows is None
if drop_windows:
    windows = [0, self.sequence_length]

# then do stuff; with these windows genetic_relatedness will always return an array where the first dimension is "window";
# so you can operate on each slice separately

if drop_windows:
    # get rid of the first dimension in the output

Basically - get it to work in the case where windows are specified (ie not None) and then we can get it to have the right behavior.
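
For reference, the same drop_windows pattern as a small self-contained example. The statistic here (a per-window sum of toy values) and the function name `per_window_sum` are placeholders chosen for illustration; only the handling of windows=None mirrors the suggestion above.

```python
import numpy as np


def per_window_sum(values, positions, sequence_length, windows=None):
    """Toy windowed statistic showing the drop_windows pattern."""
    drop_windows = windows is None
    if drop_windows:
        windows = [0, sequence_length]
    windows = np.asarray(windows, dtype=float)
    # The first dimension of the output is always "window".
    out = np.zeros(len(windows) - 1)
    for w, (left, right) in enumerate(zip(windows[:-1], windows[1:])):
        in_window = (positions >= left) & (positions < right)
        out[w] = values[in_window].sum()
    if drop_windows:
        # Get rid of the first dimension in the output.
        out = out[0]
    return out


values = np.array([1.0, 2.0, 3.0, 4.0])
positions = np.array([0.0, 2.0, 5.0, 9.0])
whole = per_window_sum(values, positions, 10)
halves = per_window_sum(values, positions, 10, windows=[0, 5, 10])
```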

hanbin973 · 2024-10-08T04:00:03Z

A simple test case for the windows feature.

import msprime
import numpy as np

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=1e6,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)
seq_length = ts.sequence_length

U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[:,idx]), np.corrcoef(U[1][:,idx], U1[:,idx])

Because the algorithm is randomized, the correlation is not exactly 1, although it is very close (around 0.99995623).
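
A small numpy illustration of why correlation is used rather than allclose: an eigenvector is only defined up to sign, so two valid answers can differ by a factor of -1. The helper `abs_correlation` and the random matrix are assumptions made up for this example.

```python
import numpy as np


def abs_correlation(u, v):
    """Absolute Pearson correlation, insensitive to an overall sign flip."""
    return abs(np.corrcoef(u, v)[0, 1])


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 50))
A = X @ X.T
# eigh returns eigenvalues in ascending order, so the last column is the top PC.
_, vecs = np.linalg.eigh(A)
pc = vecs[:, -1]
# An equally valid top eigenvector with the opposite sign.
flipped = -pc

same_by_allclose = np.allclose(pc, flipped)
same_by_correlation = np.isclose(abs_correlation(pc, flipped), 1.0)
```

Here `same_by_allclose` is false while `same_by_correlation` is true, even though both vectors describe the same principal axis.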

hanbin973 · 2024-10-11T18:33:00Z

I just noticed that centre doesn't work with the nodes option. The new commit fixes this problem.

hanbin973 · 2024-10-11T18:49:51Z

Check results for two windows.

import msprime
import numpy as np

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
seq_length = 1e6
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=seq_length,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)

# for individuals
U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

# for nodes
U, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

python/tskit/trees.py

jeromekelleher · 2024-10-14T09:01:33Z

Re the result object, I'd imagined something like

@dataclasses.dataclass
class PcaResult:
    descriptive_name1: np.ndarray # Or whatever type hints we can get to work
    descriptive_name2...

hanbin973 · 2024-10-17T19:03:47Z

Now, pca() returns a dataclass of the following

@dataclass
class PCAResult:
    U: np.ndarray
    D: np.ndarray
    Q: np.ndarray
    E: np.ndarray

U and D are as before. Q is the range sketch matrix that is used as the approximate orthonormal basis of the GRM. Computing it is the only expensive part of the algorithm, since it is the only step that involves GRM*matrix operations. E is the error bounds for the singular values. If windows are specified, Q and E take different values for each window.

A user can iteratively improve their estimate through Q. pca now has a range_sketch: np.ndarray = None option that accepts Q from the previous round of pca. This can be done like

pca_result = ts.pca( ... )
pca_result_round_2 = ts.pca( ..., range_sketch = pca_result.Q, ...)

If the first round did q power iterations and the second round did p additional power iterations, the result of the second round has total q+p iterations. By adding additional power iterations in successive rounds, one can improve the accuracy without running the whole process from scratch.
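
A numpy sketch of the idea, assuming a dense symmetric matrix in place of the GRM. The function `range_finder` and its arguments are hypothetical and stand in for the corresponding part of pca(); the point is only that restarting from a previous Q continues the same sequence of power iterations.

```python
import numpy as np


def range_finder(A, rank, iterated_power, random_seed=None, range_sketch=None):
    """Return an orthonormal basis Q approximating the range of A."""
    if range_sketch is None:
        # Cold start: sketch the range with a random test matrix.
        rng = np.random.default_rng(random_seed)
        Q, _ = np.linalg.qr(A @ rng.normal(size=(A.shape[0], rank)))
    else:
        # Warm start: continue from the Q of a previous round.
        Q = range_sketch
    for _ in range(iterated_power):
        Q, _ = np.linalg.qr(A @ Q)
    return Q


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))
A = X @ X.T / 100

q, p = 3, 4
Q_round_1 = range_finder(A, rank=5, iterated_power=q, random_seed=42)
Q_round_2 = range_finder(A, rank=5, iterated_power=p, range_sketch=Q_round_1)
Q_single = range_finder(A, rank=5, iterated_power=q + p, random_seed=42)
```

In this sketch `Q_round_2` matches `Q_single`, because the two-round run performs the same q+p iterations from the same starting sketch.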

jeromekelleher

This looks great, but I would suggest we break the nested functions out to the module level rather than embedding them in the TreeSequence class. The function is currently too long, and it's not clear what needs to be embedded within the function because it's using the namespace, vs what's in there just because. It would be nice to be able to test the bits of this individually, and putting them at the module level will make that possible.

Certainly the return class should be defined at the module level and added to the Sphinx documentation so that it can be linked to.

python/tskit/trees.py

jeromekelleher

Minor nitpick about code organisation!

python/tskit/trees.py

hanbin973 · 2024-11-16T00:06:52Z

It now has a time-resolved feature. You can select branches within the lower and the upper time limits. It is based on decapitate.

petrelharp · 2024-11-16T16:47:11Z

NICE!

python/tskit/trees.py

petrelharp · 2024-11-16T17:30:15Z

I made a pass through the docs. We need to add time_windows to the tests still, and see what's going on with the CI.


jeromekelleher

Looks good to me. I think we need to tidy up the lint and get tests passing next so we can see how coverage is doing?

jeromekelleher · 2024-11-18T13:29:22Z

python/tskit/trees.py

+        samples, sample_individuals = (
+            ij[:, 0],
+            ij[:, 1],
+        )  # sample node index, individual of those nodes


Putting comments at the end of lines is causing them to get broken by Black. Better to put the comments on the line immediately above.
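
For example, the snippet above could be written with the comment on its own line, so the assignment fits on one line and there is nothing for Black to break up. The array `ij` here is a made-up stand-in for the one in the PR.

```python
import numpy as np

# Hypothetical (sample node, individual) pairs, one row per sample.
ij = np.array([[0, 0], [1, 0], [2, 1], [3, 1]])

# sample node index, individual of those nodes
samples, sample_individuals = ij[:, 0], ij[:, 1]
```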

jeromekelleher · 2024-11-18T13:35:58Z

python/tskit/trees.py

+    The principal component factors. Columns are orthogonal, with one entry per sample
+    or individual (see :meth:`pca <.TreeSequence.pca>`).
+    """
+    eigen_values: np.ndarray


eigenvalues is one word, isn't it?

petrelharp · 2025-01-08T23:00:00Z

python/tests/test_relatedness_vector.py

+            ploidy=2,
+            sequence_length=10,
+            random_seed=123,
+        )


maybe a test for n_components=0 and -1 also?

petrelharp · 2025-01-08T23:02:56Z

python/tests/test_relatedness_vector.py

+        if np.allclose(x, 0):
+            r = 1.0
+        else:
+            r = np.mean(x / y)


This is not right, as here we want r to be +/-1, I think?
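
One possible reading of this comment, as a sketch: pick r as the sign of the inner product, so it can only be +1 or -1, and then compare the vectors directly. The helper name `assert_equal_up_to_sign` and the tolerance are assumptions, not code from the test suite.

```python
import numpy as np


def assert_equal_up_to_sign(x, y, atol=1e-8):
    """Check that x == y or x == -y, within tolerance."""
    # r is forced to be exactly +1 or -1, unlike np.mean(x / y).
    r = 1.0 if np.dot(x, y) >= 0 else -1.0
    assert np.allclose(x, r * y, atol=atol), "vectors differ by more than a sign"


v = np.array([0.6, -0.8, 0.0])
assert_equal_up_to_sign(v, -v)
```

Unlike the ratio-based version, this avoids dividing by zero entries of y and rejects vectors that differ by a scale factor other than +/-1.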

petrelharp · 2025-01-08T23:07:46Z

It looks like the things to do here are:

get the tests working (right now they fail with FAILED tests/test_relatedness_vector.py::TestPCA::test_bad_windows - TypeError: pca() got an unexpected keyword argument 'n_components')
either remove for now the time_windows argument or write tests for it
write tests that exercise the individuals argument (or remove it)
write a test that uses range_sketch
write tests that exercise iterated_power and num_oversamples: probably, just something that checks whether setting these to bigger numbers still gets us (nearly) the same answer

randomized svd draft

1f45245

petrelharp reviewed Oct 4, 2024


petrelharp marked this pull request as draft October 4, 2024 01:46

modified api remove scipy

e408ab3

hanbin973 commented Oct 4, 2024


hanbin973 added 3 commits October 7, 2024 10:55

remove scipy

a176132

correct docstring and comments

8c662c8

space remove

aa13613

petrelharp reviewed Oct 7, 2024


hanbin973 added 2 commits October 7, 2024 23:16

rng to random seed

5bf405a

add windows feature

6e415e5

hanbin973 added 2 commits October 11, 2024 14:03

output shape change when windows=wholegnome

45ac61e

make centre work with nodes

2cdb9dd

remove redundant options from internal functions

1f194aa

start at testing

fdc5842

petrelharp reviewed Oct 13, 2024


linting has a bug; when converting lambda to ordinary function definition, it omits return in the end of the function

2f4ce2b

now output is a dataclass

be2f736

docstring change

bcbbcf6

jeromekelleher reviewed Oct 17, 2024


hanbin973 added 2 commits October 17, 2024 23:56

change variable name of PCAResult class

1a8ff7a

move internal function of PCA out

af16340

jeromekelleher reviewed Oct 22, 2024


hanbin973 added 3 commits October 22, 2024 14:22

function rearrangement

e81db15

terminology change loaindgs -> factor

277cfc2

time resolved feature

be715de

petrelharp reviewed Nov 16, 2024


hanbin973 and others added 4 commits November 16, 2024 21:04

Update python/tskit/trees.py

d3e6c89

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Update python/tskit/trees.py

49fe716

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Update python/tskit/trees.py

d217fc7

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

Update python/tskit/trees.py

89e3b27

Co-authored-by: Peter Ralph <petrel.harp@gmail.com>

jeromekelleher reviewed Nov 18, 2024


hanbin973 added 3 commits November 18, 2024 10:25

remove comments; eigen_values -> eigenvalues

81cbecf

support for subset of samples and individuals in a tree

7ca6452

add time assertion

587409b

petrelharp reviewed Jan 8, 2025
