Commit
udapted a bit dendron and added first ideas of random markov field
mvisani committed Oct 31, 2024
1 parent 0a5ce75 commit 5810e78
Showing 5 changed files with 241 additions and 0 deletions.
37 changes: 37 additions & 0 deletions notes/open-notebook.2024.10.01.md
@@ -0,0 +1,37 @@
---
id: 5id34hm0ryz4pxao6y6g66f
title: '2024-10-01'
desc: ''
updated: 1727786275335
created: 1727785957553
traitIds:
- open-notebook-mvisani
---
# This is Marco's daily open-notebook.

Today is 2024.10.01

## First day of PhD


## Notes


## Todo today
- [ ] start presentation for 10th of October
- [ ] organise meeting with Daniel, Madleina and PMA
- [ ] Contact Luca for HyperSketching availability (and see potential approaches to include MS data in the project)
- [ ] Recap on what has been done so far
- [ ] Do a bit of literature review

## Doing


## Done
* Talk with PMA
* Set up computer and keyboards



## Todo tomorrow
- [ ]
32 changes: 32 additions & 0 deletions notes/open-notebook.2024.10.10.md
@@ -0,0 +1,32 @@
---
id: suddbhwk7cqxg76m5c0gumj
title: '2024-10-10'
desc: ''
updated: 1728564801104
created: 1728549399053
traitIds:
- open-notebook-mvisani
---
# This is Marco's daily open-notebook.

Today is 2024.10.10

## Meeting with Madleina



## Notes


## Todo today
- [ ]

## Doing


## Done
*


## Todo tomorrow
- [ ]
92 changes: 92 additions & 0 deletions notes/open-notebook.2024.10.14.md
@@ -0,0 +1,92 @@
---
id: dwyyydw69ivyq9dq9iawtql
title: '2024-10-14'
desc: ''
updated: 1728970884567
created: 1728912731843
traitIds:
- open-notebook-mvisani
---
# This is Marco's daily open-notebook.

Today is 2024.10.14

## Meeting with Madleina
- [ ] Start by creating a script that can fetch the phylogeny and the molecules from LOTUS
- [ ] Use a cache
- [ ] For species, get the full Wikidata taxonomy, then remove all species, together with their parents, that are not in LOTUS
- [ ] The input should be a node, and the script then gets all the children of that node: Plantae --> all the children of Plantae
- [ ] Use that to subset LOTUS
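
One possible reading of the pruning step in the list above, sketched in Python. The input format (a `parent_of` dictionary with a single parent per taxon), the function name `subset_taxonomy`, and the choice to keep every ancestor of a LOTUS taxon are all assumptions made for illustration; the real Wikidata taxonomy can list several parents per taxon.

```python
from collections import defaultdict


def subset_taxonomy(parent_of, lotus_taxa, root):
    """Keep the taxa under `root` that are in LOTUS or are ancestors of a LOTUS taxon.

    parent_of:  dict mapping taxon ID -> parent taxon ID (hypothetical input format).
    lotus_taxa: set of taxon IDs that have at least one molecule in LOTUS.
    root:       taxon ID whose descendants we want (e.g. the ID of Plantae).
    Returns a pruned child -> parent dictionary.
    """
    # Build child lists so that we can walk down from the root.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    # Collect every descendant of the root (the root itself included).
    descendants, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in descendants:
            continue
        descendants.add(node)
        stack.extend(children[node])

    # Walk up from each LOTUS taxon and mark it and its ancestors as kept.
    kept = set()
    for taxon in lotus_taxa & descendants:
        node = taxon
        while node in descendants and node not in kept:
            kept.add(node)
            if node == root:
                break
            node = parent_of.get(node)

    return {t: parent_of[t] for t in kept if t != root}
```

With a toy tree `{"B": "A", "C": "A", "D": "B"}` in which only `D` is in LOTUS, calling the function with root `A` keeps the path `D -> B -> A` and drops `C`.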

## Notes
This is the QLever SPARQL query to get all the mammals and their parent taxa.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?taxon ?taxon_name ?taxon_rank ?taxon_rank_label ?taxon_parent ?parent_name WHERE {
  ?taxon wdt:P225 ?taxon_name;
         wdt:P105 ?taxon_rank;
         wdt:P171* wd:Q7377; # Recursively fetches all taxa with Mammal as an ancestor
         wdt:P171 ?taxon_parent.
  ?taxon_rank rdfs:label ?taxon_rank_label.
  FILTER (lang(?taxon_rank_label) = "en")
  FILTER (?taxon_rank != wd:Q68947) # Exclude taxa with rank "subspecies"
  ?taxon_parent wdt:P225 ?parent_name.
}
```

This query is useful because it lets us write a function that fetches all the children of a node. We will use mammals to test our model, as there are not too many of them.

What I need to do next is write a function that, given a root node, returns all its children (with the option of filtering out subspecies). It should also skip re-running the query if the query and its output are already in the cache.
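
A minimal sketch of such a function, assuming a file-based cache keyed by a hash of the query text. The names `build_children_query` and `cached_query`, the `cache` directory, and the injected `run_query` callable (whatever code actually sends the query to the SPARQL endpoint and returns JSON-serialisable rows) are hypothetical and not part of the notes.

```python
import hashlib
import json
from pathlib import Path

# Same query as above, with the root taxon and the subspecies filter made configurable.
QUERY_TEMPLATE = """\
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?taxon ?taxon_name ?taxon_rank ?taxon_rank_label ?taxon_parent ?parent_name WHERE {{
  ?taxon wdt:P225 ?taxon_name;
         wdt:P105 ?taxon_rank;
         wdt:P171* wd:{root};
         wdt:P171 ?taxon_parent.
  ?taxon_rank rdfs:label ?taxon_rank_label.
  FILTER (lang(?taxon_rank_label) = "en")
{subspecies_filter}  ?taxon_parent wdt:P225 ?parent_name.
}}
"""


def build_children_query(root_qid, exclude_subspecies=True):
    """Return the SPARQL query for all taxa below `root_qid` (e.g. "Q7377")."""
    subspecies_filter = (
        "  FILTER (?taxon_rank != wd:Q68947)\n" if exclude_subspecies else ""
    )
    return QUERY_TEMPLATE.format(root=root_qid, subspecies_filter=subspecies_filter)


def cached_query(query, run_query, cache_dir="cache"):
    """Return `run_query(query)`, reusing the cached result when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key is a hash of the exact query text.
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["result"]
    result = run_query(query)
    # Store the query next to its result so cache files stay inspectable.
    cache_file.write_text(
        json.dumps({"query": query, "result": result}), encoding="utf-8"
    )
    return result
```

Passing the endpoint call in as `run_query` keeps the caching logic independent of any particular HTTP client and makes it easy to test without network access.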

This query:
```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?structure ?structure_inchikey ?taxon ?taxon_name ?reference ?reference_doi ?taxon_rank ?taxon_rank_label ?taxon_parent ?parent_name WHERE {
  ?structure wdt:P235 ?structure_inchikey; # get the inchikey
             p:P703 ?taxon_statement. # find the statement node
  ?taxon_statement ps:P703 ?taxon. # get the taxon from the statement node
  ?taxon_statement prov:wasDerivedFrom ?ref_node. # get the reference node from the statement node
  ?ref_node pr:P248 ?reference. # get the reference item
  ?taxon wdt:P225 ?taxon_name; # get the taxon scientific name
         wdt:P105 ?taxon_rank; # get the taxon rank
         wdt:P171* wd:Q21730; # recursively fetch all taxa with Asterales as an ancestor
         wdt:P171 ?taxon_parent. # get the taxon's immediate parent
  ?taxon_rank rdfs:label ?taxon_rank_label.
  FILTER (lang(?taxon_rank_label) = "en")
  FILTER (?taxon_rank != wd:Q68947) # exclude taxa with rank "subspecies"
  ?taxon_parent wdt:P225 ?parent_name. # get the parent taxon's scientific name
  ?reference wdt:P356 ?reference_doi. # get the reference DOI
}
```
will get all the molecules that are associated with any taxon that is a descendant of Asterales. It is also a quick way to get all the molecules associated with a given taxon.


## Todo today
- [ ]

## Doing


## Done
*


## Todo tomorrow
- [ ]
53 changes: 53 additions & 0 deletions notes/random-markov-field.ideas.md
@@ -0,0 +1,53 @@
---
id: hnirbtrbjzu4siyt8hq7zl6
title: random-markov-field-ideas
desc: ''
updated: 1730359253416
created: 1730356961682
traitIds:
- open-notebook-mvisani
---

# Random Markov Field

## Computational tricks
### Discretization of branch lengths
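The parameters of the rate matrix $\mu_{c1}$ and $\mu_{c2}$ and the branch lengths $b(n)$ are not jointly identifiable: doubling all rate parameters (and hence $\mathbf{\Lambda_c}$) and halving all branch lengths leads to exactly the same transition matrices, since $\exp((2\mathbf{\Lambda_c})(b(n)/2)) = \exp(\mathbf{\Lambda_c}b(n))$. We therefore introduce the constraint: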

$$
\sum_n{b(n)} = 1
$$

such that $0 \leq b(n) \leq 1$ for all branch lengths $b(n)$. We further note that calculating the matrix exponential $\mathbf{P}(n) = \exp(\mathbf{\Lambda_c}b(n))$ for every possible
branch length is computationally prohibitive. To reduce the number of calculations, we bin the
branch lengths to predefined values. Specifically, let there be a regularly spaced grid on the
interval $[a, b]$ consisting of $K$ bins (the interval bound $b$ is not to be confused with the branch length $b(n)$). The width of each bin is given by $\Delta = \frac{b-a}{K}$. Let us further denote by $k(n) \in \{0, \dots, K-1\}$ the bin to which node $n$ is assigned; bin $k$ is represented by the branch length $a + k\Delta$. The transition matrices of the first three bins are given by:

$$
\begin{align}
\mathbf{P}(0) &= \exp(\mathbf{\Lambda_c}a) \\
\mathbf{P}(1) &= \exp(\mathbf{\Lambda_c}(a + \Delta)) \\
\mathbf{P}(2) &= \exp(\mathbf{\Lambda_c}(a + 2\Delta))
\end{align}
$$

More generally, the transition matrix for bin $k$ is given by:
$$
\mathbf{P}(k) = \exp(\mathbf{\Lambda_c}(a + k\Delta))
$$

For all $k = 1,..., K-1$, this term can be calculated efficiently using a recursion:

$$
\begin{align}
\mathbf{P}(k) &= \exp(\mathbf{\Lambda_c}(a + k\Delta)) \\
&= \exp(\mathbf{\Lambda_c}(a+(k-1)\Delta) + \mathbf{\Lambda_c}\Delta) \\
&= \exp(\mathbf{\Lambda_c}(a+(k-1)\Delta))\exp(\mathbf{\Lambda_c}\Delta)
\end{align}
$$

where $\exp(\mathbf{\Lambda_c}(a+(k-1)\Delta))$ corresponds to the transition matrix of the previous bin $k-1$, and $\mathbf{\alpha} = \exp(\mathbf{\Lambda_c}\Delta)$ is a scaling matrix that needs to be calculated only once. The last step of the derivation holds because both exponents are scalar multiples of the same matrix $\mathbf{\Lambda_c}$ and therefore commute. Consequently, the matrix exponential needs to be calculated only twice: once for the first transition matrix $\mathbf{P}(0)$ and once for the scaling matrix $\mathbf{\alpha}$. The transition matrices of all subsequent bins are obtained by recursive matrix multiplication, which is very cheap to calculate.
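
The recursion can be checked numerically with a small sketch. It assumes, purely for illustration, a two-state rate matrix $\mathbf{\Lambda_c} = \begin{pmatrix} -\mu_{c1} & \mu_{c1} \\ \mu_{c2} & -\mu_{c2} \end{pmatrix}$, for which the matrix exponential has a closed form; the notes do not specify the shape of the rate matrix, and the function names are hypothetical.

```python
import math


def transition_matrix(mu1, mu2, t):
    """Closed-form exp(Lambda * t) for the assumed 2-state matrix [[-mu1, mu1], [mu2, -mu2]]."""
    total = mu1 + mu2
    decay = math.exp(-total * t)
    return [
        [(mu2 + mu1 * decay) / total, (mu1 - mu1 * decay) / total],
        [(mu2 - mu2 * decay) / total, (mu1 + mu2 * decay) / total],
    ]


def matmul2(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def binned_transition_matrices(mu1, mu2, a=0.0, b=0.1, K=100):
    """Return [P(0), ..., P(K-1)] using only two matrix exponentials."""
    delta = (b - a) / K
    alpha = transition_matrix(mu1, mu2, delta)  # scaling matrix, computed once
    P = [transition_matrix(mu1, mu2, a)]        # P(0), computed once
    for _ in range(1, K):
        P.append(matmul2(P[-1], alpha))         # P(k) = P(k-1) * alpha
    return P


if __name__ == "__main__":
    P = binned_transition_matrices(2.0, 0.5)
    direct = transition_matrix(2.0, 0.5, 37 * 0.001)
    # Largest absolute difference between the recursive and the direct P(37).
    print(max(abs(P[37][i][j] - direct[i][j]) for i in range(2) for j in range(2)))
```

For larger state spaces the closed form would be replaced by a general matrix exponential routine, but the recursion itself stays the same.
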
Since the sum of all branch lengths is constrained to one, most branch lengths will likely be very small. We therefore set $a = 0$ and $b = 0.1$ by default, assuming that the longest branch of the tree will not exceed 10% of the total length. We further set $K=100$ bins by default. However, all default values can be changed by the user. To respect the sum-to-one constraint, we update the branch lengths in pairs. Specifically, we select two nodes $n_1$ and $n_2$, pick a sign (+ or -), and propose moving each node to an adjacent bin in opposite directions: either $k(n_1)' = k(n_1) + 1$ and $k(n_2)' = k(n_2) - 1$ or $k(n_1)' = k(n_1) - 1$ and $k(n_2)' = k(n_2) + 1$. Because all bins have the same width $\Delta$, the two changes cancel and the sum of the branch lengths is unchanged. If the bin of a node corresponds to the first or the last bin, $k(n) = 0$ or $k(n) = K-1$, there is only one possible direction for proposing.
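
The paired proposal could be sketched as follows. The function name, the list-of-bins representation, and the decision to return `None` when neither direction is valid are assumptions; the sketch only covers the proposal itself and leaves out the acceptance step, which would have to account for the restricted choice of direction at the boundaries.

```python
import random


def propose_pair_update(bins, K, rng=random):
    """Propose moving two nodes to adjacent bins in opposite directions.

    bins: list where bins[n] is the current bin k(n) of node n.
    K:    number of bins, so valid bin indices are 0..K-1.
    Returns (n1, n2, new_k1, new_k2), or None if no direction stays on the grid.
    """
    n1, n2 = rng.sample(range(len(bins)), 2)  # two distinct nodes
    # Keep only the directions for which both nodes remain inside the grid.
    valid_signs = [
        s for s in (+1, -1)
        if 0 <= bins[n1] + s < K and 0 <= bins[n2] - s < K
    ]
    if not valid_signs:
        return None
    sign = rng.choice(valid_signs)
    # One node moves up and the other down, so the sum of bin indices is preserved.
    return n1, n2, bins[n1] + sign, bins[n2] - sign
```

Since every bin has width $\Delta$, preserving the sum of the bin indices is equivalent to preserving the sum of the discretized branch lengths.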

### Discretization of rate parameters
When updating the transition rate parameters $\mu_{c1}$ and $\mu_{c2}$ for a clique $c$, the transition matrix $\mathbf{P}(k)$ must be re-calculated for all $k = 0,..., K-1$ bins. Despite the above approach, this becomes computationally prohibitive when there are millions of cliques (one per molecule). We therefore propose to discretize the values of the rate parameters as well. We will use a ~~logarithmically~~ spaced grid in the interval $[x,y]$ with a total of $M$ bins. It is **still not clear how many bins we will need**. It is then possible to pre-calculate and store the transition matrices for all combinations of the $K$ branch lengths and the $M^2$ values of $(\mu_{c1}, \mu_{c2})$, that is, $K \cdot M^2$ matrices in total, such that the update of these parameters will be very fast.
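
A rough sketch of such a lookup table, under several assumptions that are not in the notes: a two-state rate matrix with a closed-form exponential, a linearly spaced rate grid, and placeholder values for $M$, $x$ and $y$ (the number of bins is still open). All names are hypothetical.

```python
import math


def transition_matrix(mu1, mu2, t):
    """Closed-form exp(Lambda * t) for the assumed 2-state matrix [[-mu1, mu1], [mu2, -mu2]]."""
    total = mu1 + mu2
    decay = math.exp(-total * t)
    return [
        [(mu2 + mu1 * decay) / total, (mu1 - mu1 * decay) / total],
        [(mu2 - mu2 * decay) / total, (mu1 + mu2 * decay) / total],
    ]


def rate_grid(x, y, M):
    """M linearly spaced rate values on [x, y] (the spacing is an assumption; needs x > 0)."""
    return [x + m * (y - x) / (M - 1) for m in range(M)]


def precompute_tables(rates, a=0.0, b=0.1, K=100):
    """Map each pair of rate-bin indices (i, j) to its K transition matrices."""
    delta = (b - a) / K
    return {
        (i, j): [transition_matrix(mu1, mu2, a + k * delta) for k in range(K)]
        for i, mu1 in enumerate(rates)
        for j, mu2 in enumerate(rates)
    }


if __name__ == "__main__":
    M = 5  # placeholder value only
    tables = precompute_tables(rate_grid(0.1, 10.0, M))
    # Number of stored matrices: K * M^2.
    print(sum(len(matrices) for matrices in tables.values()))
```

After this one-off computation, updating the rate parameters of a clique reduces to changing the indices `(i, j)` and reading the stored matrices.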
27 changes: 27 additions & 0 deletions notes/templates.open-notebook.md
@@ -0,0 +1,27 @@
---
id: mz59v1zr7grqghkinjmk034
title: Open Notebook
desc: ''
updated: 1719324421258
created: 1675937670620
---
# This is Marco's daily open-notebook.

Today is {{ CURRENT_YEAR }}.{{ CURRENT_MONTH }}.{{ CURRENT_DAY }}


## Notes


## Todo today
- [ ]

## Doing


## Done
*


## Todo tomorrow
- [ ]
