updated dendron a bit and added first ideas of random markov field

anticipated-chemistry-of-life · Oct 31, 2024 · 5810e78
1 parent 0a5ce75
commit 5810e78
Showing 5 changed files with 241 additions and 0 deletions.
diff --git a/notes/open-notebook.2024.10.01.md b/notes/open-notebook.2024.10.01.md
@@ -0,0 +1,37 @@
+---
+id: 5id34hm0ryz4pxao6y6g66f
+title: '2024-10-01'
+desc: ''
+updated: 1727786275335
+created: 1727785957553
+traitIds:
+  - open-notebook-mvisani
+---
+# This is Marco's daily open-notebook.
+
+Today is 2024.10.01
+
+## First day of PhD
+
+
+## Notes
+
+
+## Todo today
+- [ ] start presentation for 10th of October
+- [ ] organise meeting with Daniel, Madleina and PMA
+- [ ] Contact Luca for HyperSketching availability (and see potential approaches to include MS data in the project)
+- [ ] Recap on what has been done so far
+- [ ] Do a bit of literature review
+
+## Doing
+
+
+## Done
+*  Talk with PMA
+*  set up computer and keyboards
+
+
+
+## Todo tomorrow
+- [ ]
diff --git a/notes/open-notebook.2024.10.10.md b/notes/open-notebook.2024.10.10.md
@@ -0,0 +1,32 @@
+---
+id: suddbhwk7cqxg76m5c0gumj
+title: '2024-10-10'
+desc: ''
+updated: 1728564801104
+created: 1728549399053
+traitIds:
+  - open-notebook-mvisani
+---
+# This is Marco's daily open-notebook.
+
+Today is 2024.10.10
+
+## Meeting with Madleina
+
+
+
+## Notes
+
+
+## Todo today
+- [ ] 
+
+## Doing
+
+
+## Done
+*  
+
+
+## Todo tomorrow
+- [ ]
diff --git a/notes/open-notebook.2024.10.14.md b/notes/open-notebook.2024.10.14.md
@@ -0,0 +1,92 @@
+---
+id: dwyyydw69ivyq9dq9iawtql
+title: '2024-10-14'
+desc: ''
+updated: 1728970884567
+created: 1728912731843
+traitIds:
+  - open-notebook-mvisani
+---
+# This is Marco's daily open-notebook.
+
+Today is 2024.10.14
+
+## Meeting with Madleina
+- [ ] start by creating a script that can fetch the phylogeny and the molecules from LOTUS
+- [ ] use a cache
+- [ ] for species, get the full Wikidata taxonomy, then remove every species (together with its parents) that is not in LOTUS
+- [ ] the input should be a node; the script then gets all the children of that node, e.g. Plantae --> all the children of Plantae
+- [ ] use that to subset LOTUS
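The pruning step in the list above could be sketched as follows. This is a minimal sketch under stated assumptions, not the actual script: it reads the third item as "keep only taxa that occur in LOTUS, plus all of their ancestors", it assumes every taxon has a single parent, and the names `prune_taxonomy`, `parent_of` and `lotus_taxa` are hypothetical.

```python
def prune_taxonomy(parent_of, lotus_taxa):
    """Keep only taxa that occur in LOTUS, plus all of their ancestors.

    parent_of  : dict mapping each taxon to its parent (the root has no entry)
    lotus_taxa : set of taxa that have at least one molecule in LOTUS
    """
    keep = set()
    for taxon in lotus_taxa:
        node = taxon
        # Walk up towards the root; stop early at an ancestor already kept.
        while node is not None and node not in keep:
            keep.add(node)
            node = parent_of.get(node)
    return {child: parent for child, parent in parent_of.items() if child in keep}


# Toy taxonomy (hypothetical data): only the Asterales branch has LOTUS entries.
parent_of = {
    "Asterales": "Plantae",
    "Asteraceae": "Asterales",
    "Bellis perennis": "Asteraceae",
    "Fabales": "Plantae",
    "Fabaceae": "Fabales",
}
pruned = prune_taxonomy(parent_of, {"Bellis perennis"})
```

Here `pruned` keeps the three Asterales entries and drops the Fabales branch, since none of its taxa are in the LOTUS set.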
+
+## Notes
+This is the QLever SPARQL query to get all mammals and their parent taxa.
+
+```sparql
+PREFIX wdt: <http://www.wikidata.org/prop/direct/>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX wd: <http://www.wikidata.org/entity/>
+
+SELECT ?taxon ?taxon_name ?taxon_rank ?taxon_rank_label ?taxon_parent ?parent_name WHERE {
+  ?taxon wdt:P225 ?taxon_name;
+         wdt:P105 ?taxon_rank;
+         wdt:P171* wd:Q7377;      # Recursively fetches all taxa with Mammal as an ancestor
+         wdt:P171 ?taxon_parent.
+
+  ?taxon_rank rdfs:label ?taxon_rank_label.
+  FILTER (lang(?taxon_rank_label) = "en")
+  FILTER (?taxon_rank != wd:Q68947)  # Exclude taxa with rank "subspecies"
+
+  ?taxon_parent wdt:P225 ?parent_name.
+}
+```
+
+This is a nice query because we can then create a function that fetches all the children of a node. We will use the mammals for testing our model, as there are not too many of them.
+
+What I need to do next is create a function that, given a root node, returns all of its children (with the option of filtering out subspecies). It should also avoid rerunning the query if the query and its output are already in the cache.
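A minimal sketch of such a cached function, under stated assumptions: the names `build_query`, `cached_query`, `fetch_children` and the cache layout (one JSON file per query, named by the SHA-256 of the query text) are hypothetical, and the query is a shortened version of the one above. The network call is left to a caller-supplied `run_query` function, so no endpoint or client library is assumed here.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Shortened version of the taxonomy query; {root} is the Wikidata QID of the root node.
QUERY_TEMPLATE = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?taxon ?taxon_name ?taxon_parent WHERE {{
  ?taxon wdt:P225 ?taxon_name;
         wdt:P171* wd:{root};
         wdt:P171 ?taxon_parent.
  {rank_filter}
}}
"""


def build_query(root_qid, exclude_subspecies=True):
    """Fill the template for one root node, optionally dropping subspecies."""
    rank_filter = (
        "?taxon wdt:P105 ?taxon_rank. FILTER (?taxon_rank != wd:Q68947)"
        if exclude_subspecies
        else ""
    )
    return QUERY_TEMPLATE.format(root=root_qid, rank_filter=rank_filter)


def cached_query(query, run_query, cache_dir):
    """Return cached rows for this exact query text, else run it and store the result."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    rows = run_query(query)  # hypothetical callable that talks to the SPARQL endpoint
    cache_file.write_text(json.dumps(rows), encoding="utf-8")
    return rows


def fetch_children(root_qid, run_query, cache_dir="sparql_cache", exclude_subspecies=True):
    """All taxa below root_qid, served from the cache when possible."""
    return cached_query(build_query(root_qid, exclude_subspecies), run_query, cache_dir)


# Demonstration with a fake runner that counts how often the "endpoint" is hit.
calls = []

def fake_run_query(query):
    calls.append(query)
    return [{"taxon": "Q1", "taxon_name": "X", "taxon_parent": "Q7377"}]

with tempfile.TemporaryDirectory() as tmp:
    first = fetch_children("Q7377", fake_run_query, cache_dir=tmp)
    second = fetch_children("Q7377", fake_run_query, cache_dir=tmp)  # served from cache
    n_calls_same_query = len(calls)
    fetch_children("Q7377", fake_run_query, cache_dir=tmp, exclude_subspecies=False)
    n_calls_total = len(calls)
```

Keying the cache on the full query text means that changing the root node or the subspecies option automatically produces a new cache entry.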
+
+This query:
+```sparql
+PREFIX wdt: <http://www.wikidata.org/prop/direct/>
+PREFIX p: <http://www.wikidata.org/prop/>
+PREFIX ps: <http://www.wikidata.org/prop/statement/>
+PREFIX pr: <http://www.wikidata.org/prop/reference/>
+PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
+PREFIX wd: <http://www.wikidata.org/entity/>
+
+PREFIX prov: <http://www.w3.org/ns/prov#>
+SELECT DISTINCT ?structure ?structure_inchikey ?taxon ?taxon_name ?reference ?reference_doi ?taxon_rank ?taxon_rank_label ?taxon_parent ?parent_name WHERE {
+  ?structure wdt:P235 ?structure_inchikey;  # get the inchikey
+    p:P703 ?taxon_statement.                # find the statement node
+  
+  ?taxon_statement ps:P703 ?taxon.          # get the taxon from the statement node
+  ?taxon_statement prov:wasDerivedFrom ?ref_node. # get the reference node from the statement node
+  ?ref_node pr:P248 ?reference.             # get the reference item
+
+  ?taxon wdt:P225 ?taxon_name;              # get the taxon scientific name
+         wdt:P105 ?taxon_rank;              # get the taxon rank
+         wdt:P171* wd:Q21730;                # recursively fetch all taxa with Asterales as an ancestor
+         wdt:P171 ?taxon_parent.            # get the taxon's immediate parent
+
+  ?taxon_rank rdfs:label ?taxon_rank_label.
+  FILTER (lang(?taxon_rank_label) = "en")
+  FILTER (?taxon_rank != wd:Q68947)         # exclude taxa with rank "subspecies"
+
+  ?taxon_parent wdt:P225 ?parent_name.      # get the parent taxon's scientific name
+  ?reference wdt:P356 ?reference_doi.       # get the reference DOI
+}
+```
+gets all the molecules associated with any taxon that is a child of Asterales. This is also a very useful way to quickly get all the molecules associated with a taxon.
+
+
+## Todo today
+- [ ] 
+
+## Doing
+
+
+## Done
+*  
+
+
+## Todo tomorrow
+- [ ]
diff --git a/notes/random-markov-field.ideas.md b/notes/random-markov-field.ideas.md
@@ -0,0 +1,53 @@
+---
+id: hnirbtrbjzu4siyt8hq7zl6
+title: random-markov-field-ideas
+desc: ''
+updated: 1730359253416
+created: 1730356961682
+traitIds:
+  - open-notebook-mvisani
+---
+
+# Random Markov Field
+
+## Computational tricks 
+### Discretization of branch lengths
+The parameters of the rate matrix, $\mu_{c1}$ and $\mu_{c2}$, and the branch lengths $b(n)$ are not identifiable: because only the product of rates and branch lengths enters the transition matrices, doubling all rate parameters and halving all branch lengths leads to exactly the same solution. We therefore introduce the constraint:
+
+$$
+\sum_n{b(n)} = 1
+$$
+
+such that $0 \leq b(n) \leq 1$ for all branch lengths $b(n)$. We further note that calculating the matrix exponential $\mathbf{P}(n) = \exp(\mathbf{\Lambda_c}b(n))$ for every possible
+branch length is computationally prohibitive. To reduce the number of calculations, we bin the
+branch lengths to predefined values. Specifically, let there be a regularly spaced grid on the
+interval $[a, b]$ consisting of $K$ bins, where the interval bounds $a$ and $b$ are not to be confused with the branch lengths $b(n)$. The width of each bin is given by $\Delta = \frac{b-a}{K}$. Let us further denote by $k(n) \in \lbrace 0, \dots, K-1 \rbrace$ the bin to which node $n$ is assigned. The transition matrices of the first three bins are given by:
+
+$$
+\begin{align}
+\mathbf{P}(0) &= \exp(\mathbf{\Lambda_c}a) \\
+\mathbf{P}(1) &= \exp(\mathbf{\Lambda_c}(a + \Delta)) \\
+\mathbf{P}(2) &= \exp(\mathbf{\Lambda_c}(a + 2\Delta))
+\end{align}
+$$
+
+More generally, the transition matrix for bin $k$ is given by: 
+$$
+\mathbf{P}(k) = \exp(\mathbf{\Lambda_c}(a + k\Delta))
+$$
+
+For all $k = 1, \dots, K-1$, this term can be calculated efficiently using a recursion:
+
+$$
+\begin{align}
+\mathbf{P}(k) &= \exp(\mathbf{\Lambda_c}(a + k\Delta)) \\
+&= \exp(\mathbf{\Lambda_c}(a+(k-1)\Delta) + \mathbf{\Lambda_c}\Delta) \\
+&= \exp(\mathbf{\Lambda_c}(a+(k-1)\Delta))\exp(\mathbf{\Lambda_c}\Delta)
+\end{align}
+$$
+
+where $\exp(\mathbf{\Lambda_c}(a+(k-1)\Delta))$ corresponds to the transition matrix of the previous bin $k-1$, and $\mathbf{\alpha} = \exp(\mathbf{\Lambda_c}\Delta)$ is a scaling matrix that needs to be calculated only once. The last step is valid because both exponents are scalar multiples of $\mathbf{\Lambda_c}$ and therefore commute. The matrix exponential thus needs to be calculated only twice: once for the first transition matrix $\mathbf{P}(0)$ and once for the scaling matrix $\mathbf{\alpha}$. The transition matrices of all subsequent bins are obtained by recursive matrix multiplication, $\mathbf{P}(k) = \mathbf{P}(k-1)\mathbf{\alpha}$, which is very cheap to calculate.
+
+Since the sum of all branch lengths is constrained to one, most branch lengths will likely be very small. We therefore set $a = 0$ and $b = 0.1$ by default, assuming that the longest branch of the tree will not exceed 10% of the total length. We further set $K=100$ bins by default. All default values can be changed by the user. To respect the sum-to-one constraint, we update the branch lengths in pairs. Specifically, we select two nodes $n_1$ and $n_2$, pick a sign (+ or -) and propose moving each node to an adjacent bin: either $k(n_1)' = k(n_1) + 1$ and $k(n_2)' = k(n_2) - 1$, or $k(n_1)' = k(n_1) - 1$ and $k(n_2)' = k(n_2) + 1$. If a node is in the first or the last bin, $k(n) = 0$ or $k(n) = K-1$, there is only one possible direction for proposing.
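The recursion over bins can be checked numerically with a small sketch. Everything here is an assumption made for illustration: the two-state form of the rate matrix, the values of the rates, and the helper names `mat_mul`, `mat_exp` and `transition_matrices`. The matrix exponential is a plain-Python truncated Taylor series with scaling and squaring so that the block runs without dependencies; in practice a library routine such as `scipy.linalg.expm` would be the natural choice.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(m, terms=20):
    """exp(m) via scaling and squaring with a truncated Taylor series."""
    n = len(m)
    norm = max(sum(abs(x) for x in row) for row in m)
    # Scale so that the norm is at most 0.5, where the series converges quickly.
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrices(rate_matrix, a=0.0, b=0.1, n_bins=100):
    """P(k) for k = 0..n_bins-1 using only two matrix exponentials."""
    delta = (b - a) / n_bins
    first = mat_exp([[x * a for x in row] for row in rate_matrix])     # P(0)
    step = mat_exp([[x * delta for x in row] for row in rate_matrix])  # alpha
    matrices = [first]
    for _ in range(1, n_bins):
        matrices.append(mat_mul(matrices[-1], step))                   # P(k) = P(k-1) alpha
    return matrices


# Assumed two-state rate matrix with rates mu_c1 = 1.5 and mu_c2 = 0.5.
mu1, mu2 = 1.5, 0.5
rate_matrix = [[-mu1, mu1], [mu2, -mu2]]
matrices = transition_matrices(rate_matrix)

# Compare every bin against a direct matrix exponential.
delta = 0.1 / 100
max_error = max(
    abs(p - q)
    for k, recursive in enumerate(matrices)
    for row_p, row_q in zip(recursive, mat_exp([[x * k * delta for x in row] for row in rate_matrix]))
    for p, q in zip(row_p, row_q)
)
```

The recursion accumulates a small rounding error with every multiplication, so a check like `max_error` is worth keeping if $K$ becomes much larger than the default.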
+
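The paired bin update described above can be sketched as follows. The function name `propose_pair_move`, the choice to return `None` when both directions are blocked, and the example bin assignments are assumptions of this sketch, not part of the notes.

```python
import random


def propose_pair_move(bins, n_bins, rng=random):
    """Propose moving two distinct nodes to adjacent bins in opposite directions.

    bins   : list with the current bin index k(n) of every node
    n_bins : number of bins K, so valid indices are 0..K-1
    Returns a new list of bin indices, or None if neither direction is possible.
    """
    n1, n2 = rng.sample(range(len(bins)), 2)
    # Keep only the signs for which both nodes stay inside the grid.
    directions = [
        d for d in (+1, -1)
        if 0 <= bins[n1] + d < n_bins and 0 <= bins[n2] - d < n_bins
    ]
    if not directions:
        return None  # e.g. both nodes sit in bin 0 (assumed handling)
    d = rng.choice(directions)
    proposal = list(bins)
    proposal[n1] += d
    proposal[n2] -= d
    return proposal


# Exercise the move many times; every proposal is "accepted" here just for the demo.
rng = random.Random(0)
bins = [0, 3, 99, 50]
initial_sum = sum(bins)
for _ in range(1000):
    proposal = propose_pair_move(bins, 100, rng)
    if proposal is not None:
        bins = proposal
```

Because one index goes up by one and the other down by one, the sum of bin indices, and with it the sum of the binned branch lengths $a + k(n)\Delta$, stays constant. Choosing only among the feasible directions makes the proposal asymmetric at the grid boundaries, which the acceptance ratio would probably need to account for.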
+### Discretization of rate parameters
+When updating the transition rate parameters $\mu_{c1}$ and $\mu_{c2}$ for a clique $c$, the transition matrix $\mathbf{P}(k)$ must be recalculated for all $k = 0, \dots, K-1$ bins. Despite the above approach, this becomes computationally prohibitive when there are millions of cliques (one per molecule). We therefore propose to discretize the values of the rate parameters as well. We will use a ~~logarithmically~~ spaced grid on the interval $[x,y]$ with a total of $M$ bins. It is **still not clear how many bins we will need**. It is then possible to pre-calculate and store the transition matrices for all combinations of the $K$ branch lengths and the $M^2$ pairs of values of $\mu_{c1}$ and $\mu_{c2}$, such that updating these parameters will be very fast.
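A minimal sketch of such a lookup table, under several loudly stated assumptions: the rate matrix is taken to be the two-state matrix with off-diagonal rates $\mu_{c1}$ and $\mu_{c2}$, for which the transition matrix has a closed form; the rate grid is linearly spaced because the spacing is still undecided in the notes; and the bounds $x = 0.1$, $y = 10$, the value $M = 10$ and all function names are hypothetical.

```python
import math


def two_state_transition(mu1, mu2, t):
    """Closed-form exp(Lambda * t) for the assumed Lambda = [[-mu1, mu1], [mu2, -mu2]]."""
    s = mu1 + mu2
    e = math.exp(-s * t)
    return [
        [(mu2 + mu1 * e) / s, (mu1 - mu1 * e) / s],
        [(mu2 - mu2 * e) / s, (mu1 + mu2 * e) / s],
    ]


def grid(lower, upper, n_bins):
    """Left edges of n_bins regularly spaced bins on [lower, upper]."""
    width = (upper - lower) / n_bins
    return [lower + i * width for i in range(n_bins)]


def precompute_table(branch_grid, rate_grid):
    """Map (i, j, k) -> P for rate bins i, j and branch-length bin k."""
    return {
        (i, j, k): two_state_transition(mu1, mu2, t)
        for i, mu1 in enumerate(rate_grid)
        for j, mu2 in enumerate(rate_grid)
        for k, t in enumerate(branch_grid)
    }


K, M = 100, 10                      # M is a placeholder; the notes leave it open
branch_grid = grid(0.0, 0.1, K)     # defaults a = 0, b = 0.1 from the notes
rate_grid = grid(0.1, 10.0, M)      # hypothetical bounds x = 0.1, y = 10
table = precompute_table(branch_grid, rate_grid)

# Updating a clique's rates is then a dictionary lookup instead of a matrix exponential.
p = table[(3, 7, 42)]
```

The table holds $K M^2$ matrices and is shared by all cliques, so its size depends on the grid resolution and not on the number of molecules.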
diff --git a/notes/templates.open-notebook.md b/notes/templates.open-notebook.md
@@ -0,0 +1,27 @@
+---
+id: mz59v1zr7grqghkinjmk034
+title: Open Notebook
+desc: ''
+updated: 1719324421258
+created: 1675937670620
+---
+# This is Marco's daily open-notebook.
+
+Today is {{ CURRENT_YEAR }}.{{ CURRENT_MONTH }}.{{ CURRENT_DAY }}
+
+
+## Notes
+
+
+## Todo today
+- [ ] 
+
+## Doing
+
+
+## Done
+*  
+
+
+## Todo tomorrow
+- [ ]