
%--------------------------------------------------------------------------------------------------------------------------------%
% Code and text for "A comparative test of the role of population structure in determining pathogen richness"
% Chapter 2 of thesis "The role of population structure and size in determining bat pathogen richness"
% by Tim CD Lucas
%
% NB The file is numbered Chapter3 as this was previously Chapter 3 in the thesis.
%
%---------------------------------------------------------------------------------------------------------------------------------%





%%begin.rcode settings, echo = FALSE, cache = FALSE, message = FALSE, results = 'hide', eval = TRUE


##################################
### Run web scraping?          ###
##################################

# There are some slow web-scraping functions. Run them?
runPubmedScrape <- FALSE
runScholarScrape <- FALSE
runFstScrape <- FALSE


# Run slow bootstrapping?
subBoots <- FALSE
fstBoots <- FALSE
batclocksBoots <- FALSE

# Run the slow parts of the Fst data wrangling?
fstComb <- FALSE
runIucn <- FALSE

# There are figures created in the data analysis which are not in the final chapter document.
#   This value is passed to the fig.show chunk option.
#   Use 'asis' to include them in the output and 'hide' to remove them.
extraFigs <- 'hide'

#knitr options
opts_chunk$set(cache.path = '.Ch3Cache/')
source('misc/KnitrOptions.R')

# ggplot2 theme.
source('misc/theme_tcdl.R')
theme_set(theme_grey() + theme_tcdl)


# Choose the number of cores to use
nCores <- 4

%%end.rcode


%%begin.rcode libs, cache = FALSE, results = 'hide'

# Data handling
library(dplyr)
library(broom)
library(readxl)
library(sqldf)
library(reshape2)

# phylogenetic regression
library(ape)
library(caper)
library(phytools)
library(nlme)
library(qpcR)
library(car)

# weighted means + var
library(Hmisc)

# Plotting
library(ggplot2)
library(ggtree)
library(palettetown)
library(ggthemes)
library(GGally)
library(cowplot)


# Web scraping.
library(rvest)

# For synonym list
library(taxize)

# Spatial analysis
library(maptools)
library(geosphere)

# Parallel computation
library(parallel)

%%end.rcode



%%begin.rcode parameters


# Define some parameters.
#   This is useful at the top so that it can go in text.

# How many bootstraps for model selection NULL variable
nBoots <- 50

# What proportion of a species range should be covered for an Fst study to count as valid.
rangeUseable <- 0.20

%%end.rcode

\section{Abstract}


%\tmpsection{One or two sentences providing a basic introduction to the field}
% comprehensible to a scientist in any discipline.
\lettr{Z}oonotic diseases make up the majority of human infectious diseases and are a major drain on healthcare resources and economies.
Species that host many pathogen species are more likely to be the source of a novel zoonotic disease than species with few pathogens, all else being equal.
However, the factors that influence pathogen richness in animal species are poorly understood.
%
%
%\tmpsection{Two to three sentences of more detailed background}
% comprehensible to scientists in related disciplines.
% Theory led.
The pattern of contacts between individuals (i.e.\ population structure) can be influenced by habitat fragmentation, sociality and dispersal behaviour.
Epidemiological theory suggests that increased population structure can promote pathogen richness by reducing competition between pathogen species.
Conversely, it is often assumed that, as greater population structure slows the spread of a new pathogen (i.e.\ lowers $R_0$), less structured populations should have greater pathogen richness.
%
%
%\tmpsection{One sentence clearly stating the general problem (the gap)}
% being addressed by this particular study.
Previous comparative studies of pathogen richness and population structure have measured population structure in different ways and have produced contradictory results, complicating their interpretation.
%
%
%\tmpsection{One sentence summarising the main result}
%  (with the words “here we show” or their equivalent).
Here I test whether increased population structure correlates with viral richness using comparative data across 203 bat species, controlling for body mass, geographic range size, study effort and phylogeny.
This is an indirect test between the two competing hypotheses: does increased population structure allow pathogen coexistence by reducing competition, or does increased population structure decrease $R_0$ and therefore cause fewer new pathogens to enter the population?
Bats, as a group, make a useful case study because they have been associated with a number of important, recent zoonotic outbreaks.
Unlike previous studies, I used two measures of population structure: the number of subspecies and effective levels of gene flow.
I find that both measures are positively associated with pathogen richness.
%
%
%\tmpsection{Two or three sentences explaining what the main result reveals in direct comparison to what was thought to be the case previously}
% or how the main result adds to previous knowledge
My results add more robust support to the hypothesis that increased population structure promotes viral richness in bats.
The results support the prediction that increased population structure allows greater pathogen richness by reducing competition between pathogens.
The prediction that factors that decrease $R_0$ should decrease pathogen richness is not supported.
%
%
%\tmpsection{One or two sentences to put the results into a more general context.}
Although my analysis implies that increased population structure does promote pathogen richness in bats, the weakness of the relationship and the difficulty in obtaining some measurements mean that this is probably not a useful predictive factor on its own for optimising zoonotic surveillance.
%However, the relationship has implications for global change, implying that increased habitat fragmentation might promote greater viral richness in bats.





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%#the introduction is not bad and starts very well but i think you need a bit more from studies of other mammals (not bats) to put the study into context as well as explaining why particularly you focus on pop structure, some justification of why bats, and less detail about the specific Fst measures (move to methods) and more stuff on your actual methods and approach you use in this study.

%#Structure could be:
%#1. Zoonotic disease is bad (as you have written it already)
%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. However, pop structure (explain what this means) is of particular interest because of blah blah.
%#3. Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition
%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa).
%#5. Bats are very interesting in this regard because of blah
%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny.
%#7. Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness_ and two measures of population structure. 
%#8. I found ...

\tmpsection{General Intro}

%#1. Zoonotic disease is bad (as you have written it already)
Zoonotic pathogens make up the majority of newly emerging diseases and have profound consequences for public health, economics and international development \cite{jones2008global, smith2014global, ebolaWorldbank}.
Better statistical models for predicting which wild host species are potential reservoirs of zoonotic diseases would allow us to optimise zoonotic disease surveillance and anticipate how the risks of disease spillover might shift under global change.
The chance that a host species will be the source of a zoonotic pathogen depends on a number of factors, such as its proximity to and interactions with humans, the prevalence of its pathogens and the number of pathogen species it carries \cite{wolfe2000deforestation}.
However, the factors that control the number of pathogen species a host species carries remain poorly understood.


\tmpsection{Specific Intro}

%#2. Need to understand why some species have more pathogens than others. Life history variables of the host have been used to explain why some species have more than others, such as blah blah. 
\tmpsection{Theoretical background}


A number of species traits that might control pathogen richness have been studied.
These traits can be at the level of the individual (e.g., body mass and longevity) or the level of the population (e.g., population density, sociality and species range size).
Large-bodied animals have been shown to have high pathogen richness, with large bodies providing more resources for pathogens \cite{kamiya2014determines, arneberg2002host, poulin1995phylogeny, bordes2008bat, luis2013comparison}.
Long-lived species are expected to have high pathogen richness because the number of pathogens a host encounters in its lifetime will be higher \cite{nunn2003comparative, ezenwa2006host, luis2013comparison}.
Animal density \cite{kamiya2014determines, nunn2003comparative, arneberg2002host} and sociality \cite{bordes2007rodent, vitone2004body, altizer2003social, ezenwa2006host} are both predicted to increase pathogen richness by increasing the rate of spread, $R_0$, of a new pathogen.
Finally, widely distributed species have high pathogen richness, potentially because they experience a wider range of environments or because they are sympatric with more species \cite{kamiya2014determines, nunn2003comparative, luis2013comparison}.

%# However, pop structure (explain what this means) is of particular interest because of blah blah.

%#3. Epidemiological theoretical models predict relationship with pop structure and translated into across species patterns as increased structure less pathogen diversity but problem is of inter-pathogen competition


A further population-level factor that may affect pathogen richness is population structure.
Population structure can be defined as the extent to which interactions between individuals in a population are non-random.
The role of population structure in human epidemics has been studied in depth and it has been shown that decreased population structure increases the speed of pathogen spread and makes establishment of a new pathogen more likely \cite{colizza2007invasion, vespignani2008reaction}.
In comparative studies of pathogen richness in wild animals, this relationship with $R_0$ is often taken as a prediction that decreased population structure will increase pathogen richness relative to other host species \cite{nunn2003comparative, morand2000wormy, poulin2014parasite, poulin2000diversity, altizer2003social}. 
However, epidemiological models of highly virulent pathogens have shown that increased population structure can allow persistence of a pathogen where a well-mixed population would experience a single, large epidemic followed by pathogen extinction \cite{blackwood2013resolving, plowright2011urban}.
Furthermore, the assumption that high $R_0$ leads to high pathogen richness ignores inter-pathogen competition.
Simple epidemiological models of competition between multiple pathogens show that, in completely unstructured populations, a competitive exclusion process occurs but that adding population structure makes coexistence possible \cite{qiu2013vector, allen2004sis, nunes2006localized}.


\tmpsection{Previous Studies}

%#4. lack of large across species studies of these relationships - those that have been done have conflicting patterns (examples across different taxa).

There is a lack of large, comparative studies of the role of population structure in determining pathogen richness.
Sociality, which is one constituent part of population structure, has been well studied.
However, in primates only a weak positive association between sociality and pathogen richness was found \cite{vitone2004body}.
Furthermore, a negative association was found in rodents \cite{bordes2007rodent} and in even- and odd-toed hoofed mammals \cite{ezenwa2006host}.
Finally, two studies tested for an association between group size and parasite richness in bats \cite{bordes2008bat, gay2014parasite}.
Amongst 138 bat species, \textcite{bordes2008bat} found no relationship between group size (coded into four classes) and bat fly species richness.
\textcite{gay2014parasite} found a negative relationship between colony size and viral richness but a positive relationship between colony size and ectoparasite richness.
While sociality is an important component of population structure, it does not fully capture how connected the population is globally.


%#5. Bats are very interesting in this regard because of blah

%#6. Bat studies of pathogen richness and population structure are particularly interesting in this area but also are conflicting (examples), due in part to low sample sizes and problems with comparing results using different definitions of population structure and not controlling for effects of phylogeny.


Three studies have used comparative data to test for an association between global population structure and viral richness in bats.
A study on 15 African bat species found a positive relationship between the extent of distribution fragmentation and viral richness \cite{maganga2014bat}.
Conversely, a study on 20 South-East Asian bat species found the opposite relationship \cite{gay2014parasite}. 
Both studies used the ratio between the perimeter and area of the species' geographic range as their measure of population structure.
However, range maps are very coarse for many species.
Furthermore, range maps are likely to be more detailed (and therefore have a greater perimeter) in well studied species.

A global study on 33 bat species found a positive relationship between $F_{ST}$ --- a measure of genetic structure --- and viral richness \cite{turmelle2009correlates}. 
However, this study included measures based on mtDNA, which only reflects female dispersal; this may have biased the results as many bat species show female philopatry \cite{kerth2002extreme, hulva2010mechanisms}.
Furthermore, this study used measures of $F_{ST}$ irrespective of the spatial scale of the underlying study, including studies covering from tens \cite{mccracken1981social} to thousands \cite{petit1999male} of kilometres.
As isolation by distance has been shown in a number of bat species \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}, this could bias results further.
Finally, when a global $F_{ST}$ value was not given, \textcite{turmelle2009correlates} used the mean of all pairwise $F_{ST}$ values between sites.
This is not correct as pairwise and global $F_{ST}$ values have different relationships with effective migration rates. 



\tmpsection{The gap}
\tmpsection{What I did/found}

%#7. Here I use a phylogenetic comparative approach to understand the relationship between pop structure and pathogen richness across the largest study of bats to date. I use a phylogenetic GLM controlling for the other life history characteristics known to impact pathogen richness to quantify the relationship between viral richness (as a proxy for pathogen richness_ and two measures of population structure. 
%#8. I found ...

Here I used a phylogenetic comparative approach to test for a relationship between increased population structure and pathogen richness in the largest study of bats to date. 
I used phylogenetic linear models, controlling for the other life history characteristics known to impact pathogen richness, to quantify the relationship between viral richness (as a proxy for pathogen richness) and two measures of population structure: the number of subspecies and effective gene flow. 
I used two measures of population structure to increase the robustness of the analysis; this is particularly important as previous studies have had contradictory results \cite{maganga2014bat, gay2014parasite, turmelle2009correlates}.

I found that increases in both measures of population structure are positively associated with viral richness and are included as explanatory variables in the best models for describing viral richness.
Furthermore, I found that the role of phylogeny is very weak both in the models and in the distribution of viral richness amongst taxa.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Methods}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Data Collection}

\subsubsection{Pathogen richness}

To measure pathogen richness I used data from \textcite{luis2013comparison}. 
These data simply record known infections of a bat species with a virus species.
I have used viral richness as a proxy for pathogen richness more generally.
Rows with host species that were not identified to species level according to \textcite{wilson2005mammal} were removed.
Many viruses were not identified to species level or their specified species names were not in the ICTV virus taxonomy \cite{ICTV}.
Therefore, I counted an incompletely identified virus only if it was the sole virus recorded for that host species within the lowest taxonomic level to which it was identified (and which is present in the ICTV taxonomy).
For example, if a host is recorded as harbouring an unknown Paramyxoviridae virus, then it is logical to assume that the host carries at least one Paramyxoviridae virus.
If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, it is hard to confirm that the unknown virus is not another record of the known virus.
In this case, the host would be counted as having one virus species.
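
% Illustrative sketch only (not part of the analysis, hence eval = FALSE).
% In this chapter the counting rule above was applied manually and recorded in the `remove` column of the data.
% The chunk below shows one way the same rule could be expressed in R, under simplifying assumptions:
% the toy data frame, its column names and the function countViruses() are hypothetical, the host names are invented,
% and only family-level identification is handled (virusSpecies is NA when a virus was identified to family only).
%%begin.rcode sketchVirusCount, eval = FALSE

# Hypothetical toy records, one row per recorded host-virus association.
records <- data.frame(
  host = c('Bat a', 'Bat a', 'Bat b', 'Bat b', 'Bat b'),
  virusFamily = c('Paramyxoviridae', 'Paramyxoviridae', 'Paramyxoviridae',
                  'Rhabdoviridae', 'Rhabdoviridae'),
  virusSpecies = c(NA, 'Nipah virus', NA, 'Rabies virus', 'Lagos bat virus'),
  stringsAsFactors = FALSE
)

# Count viruses per host.
#   Within each family, unidentified records only count (as a single virus)
#   when no identified species from that family is recorded for the host.
countViruses <- function(d){
  vapply(split(d, d$host), function(h){
    perFamily <- vapply(split(h$virusSpecies, h$virusFamily), function(sp){
      known <- unique(sp[!is.na(sp)])
      if (length(known) > 0) length(known) else 1L
    }, integer(1))
    sum(perFamily)
  }, integer(1))
}

# 'Bat a' has one unknown and one known paramyxovirus, so counts as 1.
# 'Bat b' has one unknown paramyxovirus and two known rhabdoviruses, so counts as 3.
counts <- countViruses(records)

%%end.rcode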


%$F_{ST}$ studies are conducted at a range of spatial scales, but $F_{ST}$ often increases with distance studied \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range}.
%To minimise the effects of this I only used data from studies that cover \rinline{rangeUseable * 100}\% of the diameter of the species range.
%This is a largely arbitrary value that could be considered to reflect a ``global'' estimate of $F_{ST}$ while keeping a reasonable number of data points available.
%I calculated the diameter of the species range by finding the furthest apart points in the IUCN species range \cite{iucn} even if the range is split into multiple polygons.
%The width covered by each study was the distance between the most distant sampling sites.
%When this was not explicit in the paper, the centre of the lowest level of geographic area was used.




%%begin.rcode luis2013virusRead

#read in luis2013virus data
virus2 <- read.csv('data/Chapter3/luis2013comparison.csv', stringsAsFactors = FALSE)


virus2$binomial <- paste(virus2$host.genus, virus2$host.species)


# From methods
#Many viruses were not identified to species level or their identified species was not in the ICTV virus taxonomy \cite{ICTV}.
#I counted a virus if it was the only virus, for that host species, in the lowest taxonomic level identified in the ICTV taxonomy.
#That is, if a host carries an unknown Paramyxoviridae virus, then it must carry at least one Paramyxoviridae virus.
#If a host carries an unknown Paramyxoviridae virus and a known Paramyxoviridae virus, then it is hard to confirm that the unknown virus is not another record of the known virus.
#In this case, this would be counted as one virus species.

# This has been implemented manually and indicated in the column `remove`

virus2 <- virus2[!virus2$remove, ]

%%end.rcode


%%begin.rcode wilsonReaderTaxonomyRead, fig.show = extraFigs, fig.cap = 'Histogram of number of subspecies'

##################################################################
### Subspecies vs Viruses analysis.                            ###
##################################################################


# Read in the Wilson & Reeder taxonomy and use it to calculate the number of subspecies each bat species has.

tax <- read.csv('data/Chapter3/msw3-all.csv', stringsAsFactors = FALSE)

chir <- tax %>%
          filter(Order == 'CHIROPTERA')

# Save some memory.
rm(tax)

# Count the number of subspecies each bat species has.
subs <- sqldf('
  SELECT Family, Genus, Species, COUNT(Subspecies)
  AS NumberOfSubspecies
  FROM chir
  WHERE Species <> ""
  GROUP BY Genus, Species
               ')   



# I think each species has 1 row for species and extra rows for subspecies
#   Check this is true. 
#   If that is correct, then Species with >1 NumberOfSubspecies should be one less.

SpeciesRows <- sqldf('
  SELECT Genus, Species, COUNT(Subspecies)
  AS SpeciesRows
  FROM chir
  WHERE Subspecies == "" AND Species <> ""
  GROUP BY Genus, Species
               ') 

# Both checks should confirm that every species has exactly one species-level row.
(SpeciesRows$SpeciesRows != 1) %>% sum
all(SpeciesRows$SpeciesRows == 1)

# Species with >1 NumberOfSubspecies should be one less
subs$NumberOfSubspecies <- ifelse(subs$NumberOfSubspecies > 1, 
                             subs$NumberOfSubspecies - 1, 
                             subs$NumberOfSubspecies)

# Quick look at species with highest number of subspecies.
subs[order(subs$NumberOfSubspecies, decreasing = TRUE ),] %>% head

# Megaderma spasma is top. It's widespread across the islands of South-East Asia.
#   So this makes sense.

# Quick look at the number of subspecies.
ggplot(subs, aes(x = NumberOfSubspecies)) +
  geom_histogram(binwidth = 2) +
  xlab('Number of Subspecies') +
  ylab('Count')


# Create a combined binomial name column
subs$binomial <- paste(subs$Genus, subs$Species)




# Check overlap of datasets.
sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial))

notInTax <- (virus2$binomial[virus2$host.species != ''])[!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial)]

# Run this to find synonyms of names not in Wilson and Reeder
#   Doesn't find much of use.
# syns <- synonyms(notInTax, db = 'itis')

# Clean some names
#  As taxize::synonyms didn't find most of them, I am using IUCN.
#  And checking that the IUCN name is then in The Wilson & Reeder taxonomy

virus2$binomial[virus2$binomial == 'Myotis pilosus'] <- 'Myotis ricketti'
virus2$binomial[virus2$binomial == 'Tadarida pumila'] <- 'Chaerephon pumilus'
virus2$binomial[virus2$binomial == 'Tadarida condylura'] <- 'Mops condylurus'
virus2$binomial[virus2$binomial == 'Rhinolophus hildebrandti'] <- 'Rhinolophus hildebrandtii'
# Rhinolophus horsfeldi: I can't find this species anywhere. Will exclude.
#   Possibly Megaderma spasma according to http://www.fao.org/3/a-i2407e.pdf
virus2$binomial[virus2$binomial == 'Tadarida plicata'] <- 'Chaerephon plicatus'
virus2$binomial[virus2$binomial == 'Artibeus planirostris'] <- 'Artibeus jamaicensis'

sum(!(virus2$binomial[virus2$host.species != ''] %in% subs$binomial))

%%end.rcode
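
% Illustrative sketch only (not part of the analysis, hence eval = FALSE).
% The subspecies count in the chunk above assumes that each species has one species-level row (empty Subspecies)
% plus one row per listed subspecies.
% The toy taxonomy below, with invented names, shows the resulting counts under that assumption;
% toyTax, rowsPerSpecies and toySubspecies are hypothetical names used only here.
%%begin.rcode sketchSubspeciesCount, eval = FALSE

# Hypothetical toy taxonomy in the assumed layout.
toyTax <- data.frame(
  Genus = c('Aus', 'Aus', 'Aus', 'Aus', 'Bus'),
  Species = c('alpha', 'alpha', 'alpha', 'alpha', 'beta'),
  Subspecies = c('', 'alpha', 'major', 'minor', ''),
  stringsAsFactors = FALSE
)
toyTax$binomial <- paste(toyTax$Genus, toyTax$Species)

# Number of rows per species (species-level row plus subspecies rows).
rowsPerSpecies <- vapply(split(toyTax$Subspecies, toyTax$binomial), length, integer(1))

# Species with more than one row lose the species-level row from the count.
#   Species with no listed subspecies keep a count of one.
toySubspecies <- ifelse(rowsPerSpecies > 1, rowsPerSpecies - 1L, rowsPerSpecies)

# 'Aus alpha' has three listed subspecies; 'Bus beta' has none and is counted as 1.
toySubspecies

%%end.rcode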

%%begin.rcode subsHistsByFam, fig.show = extraFigs, fig.height = 3, fig.cap = 'Histograms of number of subspecies for the families with many species.'

# Compare the histograms of numbers of subspecies over the families with many species.
subs %>%
  filter(Family %in% names(which(table(subs$Family) > 99))) %>%
  ggplot(., aes(x = NumberOfSubspecies, y = ..density..)) + 
    geom_histogram() +
    facet_grid(. ~ Family) +
    xlab('Number of Subspecies') +
    ylab('Density')

%%end.rcode

%%begin.rcode subvsvirusCaption

# Caption for subspecies vs n. viruses plot.
subvsvirus <- '
Number of viruses against number of subspecies.
Points are coloured by family, with families with fewer than 10 species being grouped into "other".
Contours show the 2D density of points and suggest a positive correlation.
'
subvsvirusTitle <- 'Number of viruses against number of subspecies'
%%end.rcode

%%begin.rcode subsDataFrame, fig.show = extraFigs, fig.cap = subvsvirus, fig.scap = subvsvirusTitle, out.width = '\\textwidth'
# create combined dataframe

# Join dataframes
species <- sqldf("
               SELECT subs.binomial, virus2.[virus.species]
               FROM subs
               INNER JOIN virus2
               ON subs.binomial=virus2.binomial;
              ")
                        
# Count number of virus species for each bat species
nSpecies <- species %>%
              unique %>%
              group_by(binomial) %>%
              summarise(virusSpecies = n())
        
# Add other Subspecies data.
nSpecies <- sqldf("
              SELECT nSpecies.binomial, virusSpecies, NumberOfSubspecies, Genus, Family
              FROM nSpecies
              LEFT JOIN subs
              ON nSpecies.binomial=subs.binomial
             ")

# Create another column to make plotting easier.
#   Group families with few rows into 'other'

nSpecies$familyPlotCol <- nSpecies$Family
nSpecies$familyPlotCol[
  nSpecies$Family %in% names(which(table(nSpecies$Family) < 10))] <- 'Other'

table(nSpecies$familyPlotCol)

ggplot(nSpecies, aes(x = log(NumberOfSubspecies), y = log(virusSpecies))) +
  # geom_smooth(method = 'lm') +
  geom_jitter(aes(colour = familyPlotCol), size = 2.5, alpha = 0.8, 
    position = position_jitter(width = .1, height = .1)) +
  scale_colour_hc() +
  geom_density2d() +
  labs(colour = 'Family')
  
%%end.rcode

%%begin.rcode virusHist, fig.show = extraFigs, fig.cap = 'Histogram of known viruses per species'

ggplot(nSpecies, aes(x = virusSpecies)) +
  geom_histogram()

%%end.rcode




%%begin.rcode euthRead

# Read in the PanTHERIA database
pantheria <- read.table(file = 'data/Chapter3/PanTHERIA_1-0_WR05_Aug2008.txt',
  header = TRUE, sep = "\t", na.strings = c("-999", "-999.00"))

mass <- sqldf("
  SELECT [X5.1_AdultBodyMass_g]
  FROM nSpecies
  LEFT JOIN pantheria
  ON nSpecies.binomial=pantheria.MSW05_Binomial
  ")

nSpecies$mass <- mass[, 1]

# Now add additional mass estimates.

additionalMass <- read.csv('data/Chapter3/AdditionalBodyMass.csv', stringsAsFactors = FALSE)
meanAdditionalMass <- additionalMass %>%
                        group_by(binomial) %>% 
                        summarise(mass = mean(Body.Mass.grams))

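# Overwrite masses for species with additional estimates.
#   match() returns NA for species absent from nSpecies; these are skipped.
massIndex <- match(meanAdditionalMass$binomial, nSpecies$binomial)
nSpecies$mass[massIndex[!is.na(massIndex)]] <- meanAdditionalMass$mass[!is.na(massIndex)]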


%%end.rcode



%%begin.rcode IUCNranges, eval = runIucn

# Read in iucn ranges and calculate range sizes for each species.
ranges <- readShapePoly('data/Chapter3/TERRESTRIAL_MAMMALS/TERRESTRIAL_MAMMALS.shp')

ranges <- ranges[ranges$order_name == 'CHIROPTERA', ]

levels(ranges$binomial) <- c(levels(ranges$binomial), 'Myotis ricketti')
ranges$binomial[ranges$binomial == 'Myotis pilosus'] <- 'Myotis ricketti'




nSpecies$binomial[!(nSpecies$binomial %in% ranges$binomial)]

findArea <- function(name){
  #cat(name)
  A <- areaPolygon(ranges[ranges$binomial == name, ])
  sum(A)
}

iucnDistr <- sapply(nSpecies$binomial, findArea)

write.csv(iucnDistr, 'data/Chapter3/iucnDistr.csv')

%%end.rcode

%%begin.rcode readIucnIn

iucnDistr <- read.csv('data/Chapter3/iucnDistr.csv', row.names = 1)

nSpecies$distrSize <- iucnDistr$x

%%end.rcode



%%begin.rcode pubmedScrapeFunc

# Scrape from pubmed

scrapePub <- function(sp){
    
  Sys.sleep(2)

  # Initialise refs
  refs <- NA

  # Find synonyms from taxize
  syns <- synonyms(sp, db = 'itis')
  if(NROW(syns[[1]]) == 1){
    spString <- tolower(gsub(' ', '%20', sp))
  } else {
    spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22')
  }


  url <- paste0('http://www.ncbi.nlm.nih.gov/pubmed/?term=%22', spString, '%22')


  page <- read_html(url)
  
  # Test if PubMed reported that the exact phrase was not found.
  #   Older versions of rvest throw an error when no '.icon' node exists,
  #   newer versions return NA, so both cases are handled here.
  phraseNotFound <- try(page %>% 
                      html_node('.icon') %>%
                      html_text() %>%
                      grepl("The following term was not found in PubMed:", .), silent = TRUE)

  # Only read the result count if the phrase was found; otherwise refs stays NA.
  if (!isTRUE(phraseNotFound)) {
    try({
    refs <- page %>%
              html_node('.result_count') %>%
              html_text() %>%
              strsplit(' ') %>% 
              .[[1]] %>%
              .[length(.)] %>%
              as.numeric()
    })
  }

  return(refs)
}


%%end.rcode


%%begin.rcode pubmedScrape, eval = runPubmedScrape

# Create empty vector
pubmedRefs <- rep(NA, nrow(nSpecies))

for(i in 1:NROW(nSpecies)){
  pubmedRefs[i] <- scrapePub(nSpecies$binomial[i])
}

pubmedScrapeDate <- Sys.Date()

pubmedRefs <- cbind(binomial = nSpecies$binomial, pubmedRefs = pubmedRefs)

# Write out.
write.csv(pubmedRefs, file = 'data/Chapter3/pubmedRefs.csv')

%%end.rcode




%%begin.rcode pubmedRead


pubmedRefs <- read.csv('data/Chapter3/pubmedRefs.csv', stringsAsFactors = FALSE, row.names = 1)

# Function returns NA for none found. Change that to a zero.
pubmedRefs$pubmedRefs[is.na(pubmedRefs$pubmedRefs)] <- 0
nSpecies$pubmedRefs <- pubmedRefs$pubmedRefs

%%end.rcode

%%begin.rcode scholarScrapeFunc

scrapeScholar <- function(sp){
    
  wait <- rnorm(1, 120, 2)
  Sys.sleep(wait)


  syns <- synonyms(sp, db = 'itis')
  if(NROW(syns[[1]]) == 1){
    spString <- tolower(gsub(' ', '%20', sp))
  } else {
    spString <- paste(tolower(gsub(' ', '%20', syns[[1]]$syn_name)), collapse = '%22+OR+%22')
  }

  url <- paste0('https://scholar.google.co.uk/scholar?hl=en&q=%22', 
                spString, '%22&btnG=&as_sdt=1%2C5&as_sdtp=')


  page <- read_html(url)
  
  # Initialise refs so that NA is returned if parsing fails.
  refs <- NA

  try({
  refs <- page %>%
            html_node('#gs_ab_md') %>%
            html_text() %>%
            gsub('About\\s(.*)\\sresults.*', '\\1', .) %>%
            gsub(',', '', .) %>%
            as.numeric
  })
  return(refs)
}

%%end.rcode
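
% Illustrative sketch only (not part of the analysis, hence eval = FALSE).
% The chunk below isolates the text-parsing step of scrapeScholar() so that it can be checked offline.
% It assumes the results text has the form 'About 1,230 results (0.04 sec)'; the example strings are invented
% and parseResultCount() is a hypothetical helper that is not used elsewhere in this chapter.
% Any string that does not match the pattern is returned as NA, which is later recoded to zero.
%%begin.rcode sketchParseCount, eval = FALSE

parseResultCount <- function(txt){
  # Keep only the number between 'About' and 'results'.
  n <- gsub('About\\s(.*)\\sresults.*', '\\1', txt)
  # Remove thousands separators before converting.
  n <- gsub(',', '', n)
  suppressWarnings(as.numeric(n))
}

parsedCount <- parseResultCount('About 1,230 results (0.04 sec)')
unparsedCount <- parseResultCount('no match')

%%end.rcode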

%%begin.rcode scholarScrape, eval = runScholarScrape

# Create empty vector
scholarRefs <- rep(NA, nrow(nSpecies))

for(i in 1:NROW(nSpecies)){
  scholarRefs[i] <- scrapeScholar(nSpecies$binomial[i])
}

scholarScrapeDate <- Sys.Date()

scholarRefs <- cbind(binomial = nSpecies$binomial, scholarRefs = scholarRefs)

# Write out.
write.csv(scholarRefs, file = 'data/Chapter3/scholarRefs.csv')

%%end.rcode




%%begin.rcode scholarRead


scholarRefs <- read.csv('data/Chapter3/scholarRefs.csv', stringsAsFactors = FALSE, row.names = 1)

# Function returns NA for none found. Change that to a zero.
scholarRefs$scholarRefs[is.na(scholarRefs$scholarRefs)] <- 0

nSpecies$scholarRefs <- sqldf('
  SELECT scholarRefs
  FROM nSpecies
  INNER JOIN scholarRefs
  ON scholarRefs.binomial=nSpecies.binomial
  '
  ) %>%
  .$scholarRefs

%%end.rcode







%%begin.rcode subsRemoveNAs

# Remove missing data and sort out the data frame a little.

nSpecies <- nSpecies[complete.cases(nSpecies), ]

# Add number of subspecies as a factor. Might help plotting.
nSpecies$SubspeciesFactor <- factor(nSpecies$NumberOfSubspecies, 
  levels = as.character(1:max(nSpecies$NumberOfSubspecies)))

# Rownames to species names
rownames(nSpecies) <- nSpecies$binomial

%%end.rcode



%%begin.rcode savenSpecies
########################################################
### At this point, nSpecies should be in final form  ###
########################################################

write.csv(nSpecies, file = 'data/Chapter3/nSpecies.csv')

%%end.rcode



%%begin.rcode treeRead

# Read in trees
t <- read.nexus('data/Chapter3/fritz2009geographical.tre')

# Select best supported tree
tr1 <- t[[1]]

# Make names match previous names
tr1$tip.label <- gsub('_', ' ', tr1$tip.label)

# Which tips are not needed
unneededTips <- tr1$tip.label[!(tr1$tip.label %in% nSpecies$binomial)]

# Prune tree down to only needed tips.
pruneTree <- drop.tip(tr1, unneededTips)

rm(t)

%%end.rcode

%%begin.rcode nSpeciesTreePlot, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.show = extraFigs

# Plot tree 
p <- ggtree(pruneTree, layout = 'fan') 

p %<+% nSpecies[, 1:6] +
  geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) +
  scale_size(range = c(0.2, 2)) +
  scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)], pokepal('Carvanha')[c(1,2,4, 13, 12)])) +
  theme_tcdl +
  theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) +
  theme(legend.position = 'right') +
  labs(size = 'Virus Richness') +
  theme(legend.key.size = unit(0.6, "lines"),
              legend.text = element_text(size = 6),
              legend.title = element_text(size = 8))


%%end.rcode



%%begin.rcode scholarvspubmed, fig.show = extraFigs, fig.cap = 'Logged number of references on Google Scholar and PubMed, with a fitted (non-phylogenetic) linear model. Colours indicate family.'

# Check how correlated pubmed and scholar are.


compSubspecies <- comparative.data(data = nSpecies, phy = pruneTree, names.col = 'binomial')

citeCor <- pgls(log(scholarRefs) ~ log(pubmedRefs + 1), data = compSubspecies, lambda = 'ML')

studyEffortCor <- summary(citeCor)
# And plot
ggplot(nSpecies, aes(x = scholarRefs, y = pubmedRefs + 1)) +
  geom_point(aes(colour = familyPlotCol), size = 2.5) +
  geom_smooth(method = 'lm') +
  scale_x_log10() +
  scale_y_log10() +
  scale_colour_hc()

%%end.rcode

%%begin.rcode subsDataCapts
subsDataCapts <- c(
'Unlogged number of virus species against log mass with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.',
'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added. Points are significantly jittered to try and reveal the severe overplotting in the bottom left corner in particular.', 
'Number of virus species against logged number of subspecies (not marginal) with a non-phylogenetic linear model added.', 
'Virus species against study effort (log pubmed references +1)')
%%end.rcode

%%begin.rcode subsDataviz, fig.show = extraFigs, fig.cap = subsDataCapts

# A number of exploratory plots

# Mass against viruses
ggplot(nSpecies, aes(log(mass), virusSpecies)) +
  geom_point(aes(colour = familyPlotCol), size = 2.5) + 
  geom_smooth(method = 'lm')+
  labs(colour = 'Family') +
  scale_colour_hc()



# N Subspecies and against viruses
ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) +
  geom_jitter(aes(colour = familyPlotCol), size = 2.5, 
    position = position_jitter(width = .3, height = .3)) + 
  geom_smooth(method = 'lm')+
  labs(colour = 'Family') +
  scale_colour_hc()


# Log(N Subspecies) and against viruses

ggplot(nSpecies, aes(NumberOfSubspecies, virusSpecies)) +
  geom_jitter(aes(colour = familyPlotCol), size = 2.5, 
    position = position_jitter(width = .05, height = .2)) +
  scale_x_log10() + 
  geom_smooth(method = 'lm')+
  labs(colour = 'Family') +
  scale_colour_hc()


# N. Subspecies against viruses as a boxplot to deal with overplotting.
ggplot(nSpecies, aes(SubspeciesFactor, virusSpecies)) +
  geom_boxplot() +
  scale_x_discrete(limits = levels(nSpecies$SubspeciesFactor), drop=FALSE) +
  geom_smooth(method = 'lm', aes(group = 1)) +
  xlab('# subspecies')


# Study effort against virusSpecies
ggplot(nSpecies, aes(log(pubmedRefs + 1), virusSpecies)) +
  geom_jitter(aes(colour = familyPlotCol), size = 2.5, 
    position = position_jitter(width = .1, height = .1)) + 
  geom_smooth(method = 'lm') +
  labs(colour = 'Family')+
  scale_colour_hc()


# Distribution size against virus species


ggplot(nSpecies, aes(distrSize, virusSpecies)) +
  geom_point(aes(colour = familyPlotCol), size = 2.5) + 
  geom_smooth(method = 'lm') +
  labs(colour = 'Family') +
  scale_colour_hc() +
  scale_x_log10()


# Correlation plot
nSpecies %>%
  dplyr::select(virusSpecies, NumberOfSubspecies, mass, distrSize, pubmedRefs, scholarRefs) %>%
  mutate(mass = log(mass), distrSize = log(distrSize), pubmedRefs = log(pubmedRefs + 1), scholarRefs = log(scholarRefs)) %>%
  ggpairs(.)

%%end.rcode



%%begin.rcode subsAnalysis, fig.show = extraFigs

##################################################################################
## N Virus ~ log(cites) + subs + log(mass)

subspeciesJointUnlog <- pgls(
  virusSpecies ~ log(scholarRefs) + NumberOfSubspecies +  log(mass), 
  data = compSubspecies, lambda = 'ML')



## N Virus ~ log(mass) + subs * log(cites)

subspeciesInter <- pgls(
  virusSpecies ~ log(mass) + 
  NumberOfSubspecies*log(scholarRefs), 
  data = compSubspecies, lambda = 'ML')

#subInter.summary <- summary(subspeciesInter)




## Look at Variance inflation factors.
##   Couple of help messages imply lm vif is fine.

#sqrt(vif(lm(virusSpecies ~ log(scholarRefs) + NumberOfSubspecies +  log(mass) + log(distrSize), data = nSpecies)))

%%end.rcode







		

%%begin.rcode ITanalysis

varList <- c('scholarRefs', 'NumberOfSubspecies', 'mass', 'distrSize', 'rand')

findCombs <- function(k, vars, longest){
  x <- t(combn(vars, k))
  nas <- matrix(NA, ncol = longest - NCOL(x), nrow = nrow(x))
  mat <- cbind(x, nas)
  return(mat)
}

modelList <- lapply(0:5, function(k) findCombs(k, varList, 6))
modelMat <- do.call(rbind, modelList)

interMat <- modelMat[apply(modelMat, 1, function(x) "scholarRefs" %in% x & "NumberOfSubspecies" %in% x), ]
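# Shift all five predictor columns right by one to make room for the interaction term.
#   Shifting only four columns would drop 'rand' from the full model.
interMat[, 2:6] <- interMat[, 1:5]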
interMat[, 1] <- "scholarRefs:NumberOfSubspecies"

allModelMat <- rbind(modelMat, interMat)


allFormulae <- apply(allModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + '))))

allFormulae <- c(as.formula('virusSpecies ~ 1'), allFormulae)



modelSelect <- function(allForm, data, phy, boot, allModelMat, varList){
  
  set.seed(paste0('123', boot))
  bootData <- cbind(data, rand = runif(nrow(data)))

  # log some predictors
  bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')])
  
  # scale
  bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'NumberOfSubspecies')])
  
  coefs <- matrix(NA, ncol = length(varList) + 2, nrow = nrow(allModelMat), 
             dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList, 'scholarRefs:NumberOfSubspecies'))))

  results <- apply(allModelMat, 1, function(x) sapply(c(varList, "scholarRefs:NumberOfSubspecies"), function(y) y %in% x)) %>%
               t %>%
               data.frame %>%
               cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs)

  # Fit each model 
  # I'm having problems with convergence so sometimes have to try different starting values.
  for(m in 1:length(allForm)){
    # Reset the model so that a fit from a previous iteration (or from the
    #   global environment) is never reused.
    model <- NULL

    # Try successively lower starting values of lambda until gls converges.
    startLambda <- c(0.4, 0.3, 0.2, 0.1)
    for(a in seq_along(startLambda)){
      if(is.null(model)){
        try({
          model <- gls(allForm[[m]], correlation = corPagel(value = startLambda[a], phy = phy), data = bootData, method = 'ML')  
          results$attempt[m] <- a
        }) 
      }
    }

    # If gls never converges, fall back to a non-phylogenetic linear model.
    if(is.null(model)){
      try({
        model <- lm(allForm[[m]], data = bootData) 
        results$attempt[m] <- 5
        message('Running lm')
      }) 
    }
    #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML')
    results$AIC[m] <- AICc(model)

    if(inherits(model, 'gls')){
        results$lambda[m] <- model$modelStruct$corStruct[1]
    }

    results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3]


    results[m, paste0('beta.', names(coef(model)))] <- coef(model)

    message(paste('Boot:', boot, ', m:', m, '\n'))
  }

  results$dAIC <- results$AIC - min(results$AIC)
  results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC))


  return(results)

}




%%end.rcode

%%begin.rcode modelSelectBoots, eval = subBoots

fitModelsBootStrap <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies, pruneTree, b, allModelMat, varList), mc.cores = nCores)

allResults <- do.call(rbind, fitModelsBootStrap)

write.csv(allResults, file = 'data/Chapter3/modelSelectSubspecies.csv')


%%end.rcode

%%begin.rcode analyseModelSelect, fig.show = extraFigs

allResults <- read.csv('data/Chapter3/modelSelectSubspecies.csv', row.names = 1)

#varWeights <- sapply(names(allResults)[1:6], function(x) sum(allResults$weight[allResults[, x]])/nBoots)

sepVarWeights <- lapply(1:nBoots, function(b) 
                      sapply(names(allResults)[1:6], 
                        function(x) 
                          sum(allResults[allResults$boot == b, 'weight'][allResults[allResults$boot == b, x]])
                      )
                     )      

sepVarWeights <- do.call(rbind, sepVarWeights) %>%
                      data.frame(., boot = 1:nBoots) %>%
                      reshape2::melt(., value.name = 'estimate', id.vars = 'boot')

sepVarWeights$col <- 'Other Variables'
sepVarWeights$col[grep('NumberOf', sepVarWeights$variable)] <- 'Population Structure'
sepVarWeights$col[sepVarWeights$variable == 'rand'] <- 'Null'



modelWeights <- allResults %>%
                  group_by(predictors) %>%
                  summarise(AICc = mean(AIC)) %>%
                  mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>%
                  arrange(desc(modelWeight)) %>%
                  mutate(cumulativeWeight = cumsum(modelWeight)) %>%
                  mutate(string = predictors)


# Calculate variable weights based on mean(AIC) rather than raw AIC.
varWeights <- sapply(names(allResults)[1:6], 
  function(x) sum(modelWeights$modelWeight[grep(x, as.character(modelWeights$predictors))]))



allResults %>% 
  filter(rand, !`scholarRefs.NumberOfSubspecies`, NumberOfSubspecies) %>%
ggplot(., aes(x = lambda, colour = predictors)) + 
  geom_density() +
  scale_colour_hc()

ggplot(allResults, aes(x = lambda)) + 
  geom_density() 

allResults %>% 
  filter(boot == 1) %>%
  dplyr::select(predictors, lambda)

%%end.rcode



%%begin.rcode ITPlots

# reorder factors to get structure vars at beginning.
sepVarWeights$variable <- factor(sepVarWeights$variable, levels(sepVarWeights$variable)[c(2, 6, 1, 3, 4, 5)])

ITPlot <- ggplot(sepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) +
  geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) +
  scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) +
  scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) +
  theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'),
    panel.grid.major.x = element_blank(),
    axis.text.y = element_text(size = 8)) +
  scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) +
  scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1), limits = c(0, 1)) +
  ylab('P(in best model)') +
  xlab('')


%%end.rcode

%%begin.rcode nSpeciesCoef, fig.show = extraFigs

ggplot(allResults, aes(x = beta.NumberOfSubspecies, colour = scholarRefs)) +
  geom_density()



mean(allResults$NumberOfSubspecies, na.rm = TRUE)


varCoefMeans <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.mean(x, allResults$weight, na.rm = TRUE))
varCoefVars  <- apply(allResults[, grep('beta', names(allResults))], 2, function(x) wtd.var(x, allResults$weight, na.rm = TRUE))

nSpeciesCoefMean <- wtd.mean(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies], 
                             allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
nSpeciesCoefMeanI <- wtd.mean(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies], 
                             allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
nSpeciesInterMean <- wtd.mean(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE)


nSpeciesCoefVar <- wtd.var(allResults$beta.NumberOfSubspecies[!allResults$scholarRefs.NumberOfSubspecies], 
                             allResults$weight[!allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
nSpeciesCoefVarI <- wtd.var(allResults$beta.NumberOfSubspecies[allResults$scholarRefs.NumberOfSubspecies], 
                             allResults$weight[allResults$scholarRefs.NumberOfSubspecies], na.rm = TRUE)
nSpeciesInterVar <- wtd.var(allResults$`beta.scholarRefs.NumberOfSubspecies`, allResults$weight, na.rm = TRUE)



# Direction of interaction models

min(nSpecies$NumberOfSubspecies)

max(nSpecies$NumberOfSubspecies)

# Slope of NumberOfSubspecies at minimum, maximum and median study effort
nSpeciesInterMean*log(min(nSpecies$scholarRefs)) + nSpeciesCoefMeanI
nSpeciesInterMean*log(max(nSpecies$scholarRefs)) + nSpeciesCoefMeanI
nSpeciesInterMean*log(median(nSpecies$scholarRefs)) + nSpeciesCoefMeanI

mean(nSpeciesInterMean*log(nSpecies$scholarRefs) + nSpeciesCoefMeanI > 0)



%%end.rcode



%%begin.rcode familyMeans

familyMeans <- nSpecies %>%
  group_by(Family) %>%
  summarise(mean = mean(virusSpecies), n = n()) 

%%end.rcode


%%begin.rcode univariatePGLS

#orderedNSpecies <- nSpecies[sapply(pruneTree$tip.label, function(x) which(nSpecies$binomial == x)),]


sspLambda <- summary(pgls(NumberOfSubspecies ~ 1, data = compSubspecies, lambda = 'ML'))
massLambda <- summary(pgls(log(mass) ~ 1, data = compSubspecies, lambda = 'ML'))
scholarLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compSubspecies, lambda = 'ML'))
virusLambda <- summary(pgls(virusSpecies ~ 1, data = compSubspecies, lambda = 'ML'))
distrLambda <- summary(pgls(log(distrSize) ~ 1, data = compSubspecies, lambda = 'ML'))

sspUni <- summary(pgls(virusSpecies ~ NumberOfSubspecies, data = compSubspecies, lambda = 'ML'))


%%end.rcode





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% FST ANALYSIS                                                                                                                                  %%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%begin.rcode fstRead, eval = fstComb

# Read in Fst data.
# Then add extra columns needed.

fst <- read.csv('data/Chapter3/FstDataCompData.csv')

# Check overlap of datasets.
sum(!(fst$binomial %in% virus2$binomial[virus2$host.species != '']))

notInFst <- fst$binomial[!(fst$binomial %in% virus2$binomial)]
# Many species are not in virus2. These may be included as having 0 virus species.



#########################################################################################
#### Get distribution size and width                                                 ####
#########################################################################################




fst$binomial[!(fst$binomial %in% ranges$binomial)]

fst <- fst[(fst$binomial %in% ranges$binomial), ]

unique(fst$binomial) %>% length




findAreaFst <- function(name){
  #cat(name)
  A <- areaPolygon(ranges[ranges$binomial == as.character(name), ])
  sum(A)
}

fstIucnDistr <- sapply(fst$binomial, findAreaFst)


fst$distrSize <- fstIucnDistr


#### Now get distribution width

findWidth <- function(name){
  #print(name)
  distr <- ranges[ranges$binomial == as.character(name), ]

  coords <- list()
  # Get coordinates from all polygons into one matrix.
  for(i in 1:length(distr@polygons)){
    coords[[i]] <- distr@polygons[[i]]@Polygons[[1]]@coords
  }
  coords <- do.call(rbind, coords)

  # Take the convex hull of coordinates to speed up last step.
  hullCoords <- coords[chull(coords), ]

  maxDist <- max(apply(hullCoords, 1, function(x) distGeo(coords, x)))/1000
  return(maxDist)
  
}

# Calculate widest part of all species distributions.
#   This is slow but also RAM heavy.
#   3 cores doesn't crash my computer with 16GB RAM.
rangeWidth <- mclapply(fst$binomial, findWidth, mc.cores = 3) %>% do.call(c, .)

#rangeWidth <- sapply(fst$binomial, findWidth)

fst$rangeWidth <- rangeWidth
fst$rangeCoverage <- fst$Dmax..km. / fst$rangeWidth



fst$Useable <- fst$rangeCoverage > rangeUseable
sum(fst$Useable, na.rm = TRUE)
fst$binomial[fst$Useable] %>% unique %>% .[!is.na(.)] %>% length

# Need to go back and check the data, but for now if fst$Useable is NA it is set to FALSE (i.e. the row is not useable).
fst$Useable[is.na(fst$Useable)] <- FALSE


%%end.rcode



%%begin.rcode fstStudyEffort, eval = fstComb

# First take what data we can from nSpecies analysis.
fstStudy <- sqldf("	
                SELECT fst.binomial, nSpecies.scholarRefs, nSpecies.pubmedRefs
                FROM fst
                LEFT JOIN nSpecies
                ON nSpecies.binomial=fst.binomial
              ")

%%end.rcode

%%begin.rcode fstScrape, eval = runFstScrape

########################################################
#### Slow bit that might get you blocked by Google  ####
########################################################

fstNewStudy <- fstStudy[is.na(fstStudy[,2]),1] %>%
                 lapply(., function(x) c(x, scrapeScholar(x), scrapePub(x))) %>%
                 do.call(rbind, .)

# rbind returns a matrix, so set column names with colnames() rather than names().
colnames(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs')

write.csv(fstNewStudy, file = 'data/Chapter3/fstScrape.csv')

%%end.rcode


%%begin.rcode fstCombine, eval = fstComb

fstNewStudy <- read.csv('data/Chapter3/fstScrape.csv', row.names = 1)
names(fstNewStudy) <- c('binomial', 'scholarRefs', 'pubmedRefs')

# NAs are from searches with 0 references.
fstNewStudy$pubmedRefs[is.na(fstNewStudy$pubmedRefs)] <- 0

whichRows <- lapply(fstNewStudy$binomial, function(x) which(fstStudy$binomial == x))
for(i in 1:length(whichRows)){
  fstStudy[whichRows[[i]], 2:3] <- fstNewStudy[i, 2:3]
}


fst <- cbind(fst, fstStudy[, 2:3])

# Remove rows whose scale is too small
fst <- fst[fst$Useable, ]


# Don't want rows using mtDNA due to sex-biased dispersal
fst <- fst[fst$Marker != 'mtDNA', ]

%%end.rcode

%%begin.rcode convertFst, eval = fstComb

calcNm <- function(Fst){ (1 - Fst)/(4 * Fst) }


fst$Nm <- calcNm(fst$Value)



fst <- fst[!is.na(fst$Nm) & !(fst$Nm == Inf), ]

fstFinal <- fst

# Take means of species with multiple measurements

fstFinal <- fstFinal[!duplicated(fstFinal$binomial), ]
fstFinal$Nm <- sapply(fstFinal$binomial, function(x) mean(fst$Nm[fst$binomial == x]))

# Add number of viruses to fst dataset
#   Includes zeros for species with no known viruses.

fstFinal$virusSpecies <- sapply(fstFinal$binomial, function(x) sum(virus2$binomial == x))




# Add mass data.


mass <- sqldf("
  SELECT [X5.1_AdultBodyMass_g]
  FROM fstFinal
  LEFT JOIN pantheria
  ON fstFinal.binomial=pantheria.MSW05_Binomial
  ")

# Don't need pantheria data anymore
rm(pantheria)

fstFinal$mass <- mass[, 1]

fstFinal$mass[fstFinal$binomial == 'Myotis ricketti'] <- meanAdditionalMass$mass[meanAdditionalMass$binomial == 'Myotis ricketti']

fstFinal$mass[fstFinal$binomial == 'Myotis macropus'] <- 9.8

fstFinal <- fstFinal[!is.na(fstFinal$mass), ]


#############################
### fst data is finished  ###
#############################

write.csv(fstFinal, 'data/Chapter3/fstFinal.csv')
%%end.rcode


%%begin.rcode

#### Read in full fstFinal dataframe

fstFinal <- read.csv('data/Chapter3/fstFinal.csv', row.names = 1)

%%end.rcode

%%begin.rcode fstCors, fig.show = extraFigs

fstFinal[, c('mass', 'scholarRefs', 'rangeWidth', 'Nm')] %>%
  log %>%
  cbind(virusSpecies = fstFinal$virusSpecies) %>%
  ggpairs(.)


%%end.rcode



%%begin.rcode compareNm, fig.show = extraFigs

ggplot(fstFinal, aes(x = Marker, y = Nm)) +
  geom_point() +
  scale_y_log10()

lm(fstFinal$Nm ~ fstFinal$Marker) %>% aov %>% summary


%%end.rcode


%%begin.rcode fstTree

# Prune the tree for the fst data.

# Which tips are not needed
fstUnneededTips <- tr1$tip.label[!(tr1$tip.label %in% fstFinal$binomial)]

# Prune tree down to only needed tips.
fstTree <- drop.tip(tr1, fstUnneededTips)



%%end.rcode


%%begin.rcode fstTreePlot, fig.show = extraFigs, out.width = '\\textwidth', fig.cap = 'Pruned phylogeny with dot size showing number of pathogens and colour showing family.', fig.height = 3.6

# Plot tree 
p <- ggtree(fstTree) 


fstFinal$lengthNames <- fstFinal$binomial %>%
                          as.character %>%
                          paste0('  ', .)


p %<+% fstFinal[, c('binomial', 'virusSpecies')] + 
  #geom_tiplab(family = 'lato light', align = FALSE) +
  geom_text2(aes(x = x + 15, label = as.character(label), subset = isTip), 
    family = 'Lato light', hjust = 0, size = 3.3) +
  #geom_text(aes(x = x + 15, label = as.character(label)), subset=.(isTip), 
  #  family = 'Lato light', hjust = 0, size = 3.3) +
  ggplot2::xlim(0, 210) +
  theme_tcdl +
  geom_point2(aes(x = x + 8, size = virusSpecies, subset = isTip)) +
  scale_size(range = c(0, 4)) +
  theme(legend.key.size = unit(0.8, "lines"),
              legend.text = element_text(size = 9),
              legend.title = element_text(size = 8),
              legend.position = "right",  
              text = element_text(colour = 'darkgrey'),
              legend.key = element_blank()) +
  labs(size = 'Virus Richness') 



%%end.rcode



%%begin.rcode fstITanalysis

fstVarList <- c('scholarRefs', 'Nm', 'mass', 'distrSize', 'rand')


fstModelList <- lapply(0:5, function(k) findCombs(k, fstVarList, 5))
fstModelMat <- do.call(rbind, fstModelList)

fstAllFormulae <- apply(fstModelMat[-1, ], 1, function(x) as.formula(paste('virusSpecies ~', paste(x[!is.na(x)], collapse = ' + '))))

fstAllFormulae <- c(as.formula('virusSpecies ~ 1'), fstAllFormulae)

%%end.rcode

%%begin.rcode fstModelSelectFun


fstModelSelect <- function(allForm, data, phy, boot, allModelMat, varList){
  
  set.seed(paste0('2388', boot))
  bootData <- cbind(data, rand = runif(nrow(data)))
  row.names(bootData) <- bootData$binomial


  # log some predictors
  bootData[, c('mass', 'scholarRefs', 'distrSize')] <- log(bootData[, c('mass', 'scholarRefs', 'distrSize')])
  
  # scale
  bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')] <- base::scale(bootData[, c('mass', 'scholarRefs', 'distrSize', 'rand', 'Nm')])
  

  coefs <- matrix(NA, ncol = length(varList) + 1, nrow = nrow(allModelMat), 
             dimnames = list(NULL, paste0('beta.', c('(Intercept)', varList))))

  results <- apply(allModelMat, 1, function(x) sapply(varList, function(y) y %in% x)) %>%
               t %>%
               data.frame %>%
               cbind(AIC = NA, boot = boot, lambda = NA, attempt = NA, predictors = NA, coefs)

  # Fit each model 
  # I'm having problems with convergence so sometimes have to try different starting values.
  for(m in 1:length(allForm)){
    # Reset the model so that a fit from a previous iteration (or from the
    #   global environment) is never reused.
    model <- NULL

    # Try successively lower starting values of lambda until gls converges.
    startLambda <- c(0.4, 0.3, 0.2, 0.1)
    for(a in seq_along(startLambda)){
      if(is.null(model)){
        try({
          model <- gls(allForm[[m]], correlation = corPagel(value = startLambda[a], phy = phy), data = bootData, method = 'ML')  
          results$attempt[m] <- a
        }) 
      }
    }

    # If gls never converges, fall back to a non-phylogenetic linear model.
    if(is.null(model)){
      try({
        model <- lm(allForm[[m]], data = bootData) 
        results$attempt[m] <- 5
        message('Running lm')
      }) 
    }
    #model <- pgls(allForm[[m]], data = compBootData, lambda = 'ML')
    results$AIC[m] <- AICc(model)

    if(inherits(model, 'gls')){
        results$lambda[m] <- model$modelStruct$corStruct[1]
    }

    results$predictors[m] <- allForm[[m]] %>% as.character %>% .[3]


    results[m, paste0('beta.', names(coef(model)))] <- coef(model)

    message(paste('Boot:', boot, ', m:', m, '\n'))
  }

  results$dAIC <- results$AIC - min(results$AIC)
  results$weight <- exp(- 0.5 * results$dAIC) / sum(exp(- 0.5 * results$dAIC))


  return(results)

}

%%end.rcode

%%begin.rcode fstModelSelectBoots, eval = fstBoots



fstModelsBootStrap <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree, b, fstModelMat, fstVarList), mc.cores = nCores)

fstAllResults <- do.call(rbind, fstModelsBootStrap)

write.csv(fstAllResults, file = 'data/Chapter3/fstModelSelectSubspecies.csv')


%%end.rcode

%%begin.rcode fstAnalyseModelSelect, fig.show = extraFigs

fstAllResults <- read.csv('data/Chapter3/fstModelSelectSubspecies.csv', row.names = 1)

fstSepVarWeights <- lapply(1:nBoots, function(b) 
                      sapply(names(fstAllResults)[1:5], 
                        function(x) 
                          sum(fstAllResults[fstAllResults$boot == b, 'weight'][fstAllResults[fstAllResults$boot == b, x]])
                      )
                     )      

fstSepVarWeights <- do.call(rbind, fstSepVarWeights) %>%
                      data.frame(., boot = 1:nBoots) %>%
                      reshape2::melt(., value.name = 'estimate', id.vars = 'boot')

fstSepVarWeights$col <- 'Other Variables'
fstSepVarWeights$col[fstSepVarWeights$variable == 'Nm'] <- 'Population Structure'
fstSepVarWeights$col[fstSepVarWeights$variable == 'rand'] <- 'Null'





fstModelWeights <- fstAllResults %>%
                  group_by(predictors) %>%
                  summarise(AICc = mean(AIC)) %>%
                  mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>%
                  arrange(desc(modelWeight)) %>%
                  mutate(cumulativeWeight = cumsum(modelWeight)) 

# Calculate variable weights based on mean(AIC) rather than raw AIC.
fstVarWeights <- sapply(names(fstAllResults)[1:5], 
  function(x) sum(fstModelWeights$modelWeight[grep(x, as.character(fstModelWeights$predictors))]))

%%end.rcode




%%begin.rcode fstITlambda, fig.show = extraFigs, fig.cap = 'Values of $\\lambda$ found in $F_{ST}$ analysis.', fig.height = 3

ggplot(fstAllResults, aes(x = lambda)) + 
  geom_histogram() +
  ylab('Count') +
  xlab(expression(paste('Phylogenetic Signal, ', lambda)))

%%end.rcode


%%begin.rcode fstITlambdaFacets, fig.show = extraFigs, fig.height = 4


transform(fstAllResults, mass = c('Other', 'Mass' )[factor(mass)]) %>%
ggplot(aes(x = lambda)) + 
  facet_grid(. ~ mass) +
  geom_histogram() +
  ylab('Count') +
  xlab(expression(paste('Phylogenetic Signal, ', lambda)))


transform(fstAllResults, Nm = c('Other', 'Nm' )[factor(Nm)]) %>%
ggplot(aes(x = lambda)) + 
  facet_grid(. ~ Nm) +
  geom_histogram() +
  ylab('Count') +
  xlab(expression(paste('Phylogenetic Signal, ', lambda)))


transform(fstAllResults, distrSize = c('Other', 'distrSize' )[factor(distrSize)]) %>%
ggplot(aes(x = lambda)) + 
  facet_grid(. ~ distrSize) +
  geom_histogram() +
  ylab('Count') +
  xlab(expression(paste('Phylogenetic Signal, ', lambda)))


transform(fstAllResults, scholarRefs = factor(c('Scholar Refs', 'Other')[factor(!scholarRefs)], levels = c('Scholar Refs', 'Other'))) %>%
ggplot(aes(x = lambda)) + 
  facet_grid(. ~ scholarRefs) +
  geom_histogram() +
  ylab('Count') +
  xlab(expression(paste('Phylogenetic Signal, ', lambda)))

transform(fstAllResults, rand = c('Other', 'Rand' )[factor(rand)]) %>%
ggplot(aes(x = lambda)) + 
  facet_grid(. ~ rand) +
  geom_histogram() +
  ylab('Count') +
  xlab(expression(paste('Phylogenetic Signal, ', lambda)))


%%end.rcode

%%begin.rcode lookAtLambda, fig.show = extraFigs

fstComp <- comparative.data(fstTree, fstFinal, 'binomial')

fullFst <- pgls(virusSpecies ~ log(Nm) + log(mass) + log(distrSize) + log(scholarRefs), fstComp, lambda = 'ML')

fst.lambda.profile <- pgls.profile(fullFst, "lambda")
plot(fst.lambda.profile)

data.frame(x = fst.lambda.profile$x, L = fst.lambda.profile$logLik) %>%
ggplot(aes(x, L)) +
  geom_line() +
  geom_vline(xintercept = fst.lambda.profile$ci$ci.val, col = 'steelblue')


%%end.rcode


%%begin.rcode fstCoef, fig.show = extraFigs

ggplot(fstAllResults, aes(x = beta.Nm)) +
  geom_histogram()


ggplot(fstAllResults, aes(x = beta.Nm, colour = scholarRefs)) +
  geom_density()



ggplot(fstAllResults, aes(x = beta.Nm, colour = distrSize)) +
  geom_density()


fstCoefMeans <- apply(fstAllResults[, grep('beta', names(fstAllResults))], 2, function(x) wtd.mean(x, fstAllResults$weight, na.rm = TRUE))
fstCoefVars  <- apply(fstAllResults[, grep('beta', names(fstAllResults))], 2, function(x) wtd.var(x, fstAllResults$weight, na.rm = TRUE))

pcCoefLzero <- 100*sum(na.omit(fstAllResults$beta.Nm) < 0) / length(na.omit(fstAllResults$beta.Nm))

%%end.rcode



%%begin.rcode univariateFstPGLS

#orderedFst <- fstFinal[sapply(fstTree$tip.label, function(x) which(fstFinal$binomial == x)),]

compFst <- comparative.data(data = fstFinal, phy = fstTree, names.col = 'binomial')

nmFstLambda <- summary(pgls(log(Nm) ~ 1, data = compFst, lambda = 'ML'))
massFstLambda <- summary(pgls(log(mass) ~ 1, data = compFst, lambda = 'ML'))
scholarFstLambda <- summary(pgls(log(scholarRefs) ~ 1, data = compFst, lambda = 'ML'))
virusFstLambda <- summary(pgls(virusSpecies ~ 1, data = compFst, lambda = 'ML'))
distrFstLambda <- summary(pgls(distrSize ~ 1, data = compFst, lambda = 'ML'))

nmFstUni <- summary(pgls(virusSpecies ~ log(Nm), data = compFst, lambda = 'ML'))

massFstUni <- summary(pgls(virusSpecies ~ log(mass), data = compFst, lambda = 'ML'))
fstDistrStudyEffort <- summary(pgls(log(scholarRefs) ~ log(distrSize), data = compFst, lambda = 'ML'))

fstMassStudyEffort <- summary(pgls(log(scholarRefs) ~ log(mass), data = compFst, lambda = 'ML'))

%%end.rcode








\subsubsection{Population structure data}

I used two measures of population structure: the number of subspecies and the effective level of gene flow.
The number of subspecies was counted using the taxonomy from \textcite{wilson2005mammal}.
The effective level of gene flow was calculated from estimates of $F_{ST}$ collated from the literature.
The studies were from a wide range of spatial scales, from local ($\sim\!\SI{10}{\kilo\metre}$) to continental.
As $F_{ST}$ often increases with spatial scale \cite{burland1999population, hulva2010mechanisms, o2015genetic, vonhof2015range} I controlled for this by only using data from studies where a large proportion of the species range was studied.
I used the ratio of the furthest distance between $F_{ST}$ samples (taken from the paper or measured with \url{http://www.distancefromto.net/} if not stated) to the length of the IUCN species range \cite{iucn} and only used studies if this ratio was greater than \rinline{rangeUseable}.
This is an arbitrary value that was a compromise between retaining a reasonable number of data points and controlling for the bias in spatial scale.
I only used global $F_{ST}$ estimates as the mean of pairwise $F_{ST}$ values is not necessarily equal to the global $F_{ST}$ value.
I converted all $F_{ST}$ values to effective migration rates using $M = (1-F_{ST})/(4F_{ST})$.
This transforms the data from being bounded in $(0, 1)$ to the range $(0, \infty)$ and makes them easier to interpret.
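As an illustrative calculation (the values are hypothetical and not taken from the data set), a study reporting $F_{ST} = 0.2$ would give
\[
M = \frac{1 - 0.2}{4 \times 0.2} = \frac{0.8}{0.8} = 1,
\]
while a more strongly structured population with $F_{ST} = 0.5$ would give $M = 0.5 / 2 = 0.25$.
Higher values of $F_{ST}$ therefore correspond to lower effective migration rates.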

The two measures of population structure were analysed separately because the number of subspecies data set had \rinline{nrow(nSpecies)} data points but there was only $F_{ST}$ data for \rinline{nrow(fstFinal)} bat species.
For the subspecies analysis, all bat species in \textcite{luis2013comparison} were used (i.e.\ all species with at least one known virus species).
This was to avoid using the very large number of bat species that have simply never been sampled for viruses.
However, for the gene flow analysis, all bat species with suitable $F_{ST}$ estimates were used.
As some bat species had suitable $F_{ST}$ estimates but were not present in \textcite{luis2013comparison}, some bat species with zero known virus species were included. 
These bat species with no known viruses were included to make the greatest use of the $F_{ST}$ data available and because the number of species with no known virus species was not unduly large (\rinline{sum(fstFinal$virusSpecies == 0)} species).

After data cleaning there was data for \rinline{nrow(nSpecies)} bat species in \rinline{length(unique(nSpecies$Family))} families for the subspecies analysis.
Due to the limited number of studies and the restrictive requirements imposed on study design, there was only data for \rinline{nrow(fstFinal)} bat species in \rinline{length(unique(fstFinal$Family))} families for the effective gene flow analysis.
The raw data are included in Table~\ref{A-rawData}.




\subsubsection{Other explanatory variables}



To control for study bias I collected the number of PubMed and Google Scholar citations for each bat species name including synonyms from ITIS \cite{itis}.
This was performed in \emph{R} \cite{R} using the \emph{rvest} package \cite{rvest}, with ITIS synonyms being accessed with the \emph{taxize} package \cite{chamberlain2013taxize}.
I log transformed these variables as they were strongly right skewed.
I tested for correlation between these two proxies for study effort using phylogenetic generalised least squares regression (pgls), using the best-supported phylogeny from \textcite{fritz2009geographical}, and likelihood ratio tests using the \emph{caper} package \cite{caper} (Figures~\ref{fig:treePlot} and \ref{fig:scholarvspubmedPlot}).
The log numbers of citations on PubMed and Google Scholar were highly correlated (pgls: $t$ = \rinline{studyEffortCor$coefficients['log(pubmedRefs + 1)', 't value']}, df = \rinline{studyEffortCor$df[2]}, $p < 10^{-5}$).
As the correlation between citation counts was strong, I only used Google Scholar reference counts in subsequent analyses.
%See the appendix for analyses run using PubMed citations.

Two factors that have previously been found to be important were included as additional explanatory variables: body mass \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat, han2015infectious, bordes2008bat} and range size \cite{kamiya2014determines, turmelle2009correlates, maganga2014bat}.		
These other factors were included to avoid spurious positive results occurring simply due to correlations between pathogen richness and a different, causal factor.
Despite commonly being associated with pathogen richness \cite{arneberg2002host, kamiya2014determines, nunn2003comparative}, population density was not included in the analysis as there is very little data for bat densities.
Measurements of body mass were taken from Pantheria \cite{jones2009pantheria} and primary literature \cite{canals2005relative, arita1993rarity, lopez2014echolocation, orr2013does, lim2001bat, aldridge1987turning, ma2003dietary, owen2003home, henderson2008movements, heaney2012nyctalus, oleksy2015high, zhang2009recent}. 
\emph{Pipistrellus pygmaeus} was assigned the same mass as \emph{P.\ pipistrellus} as they are indistinguishable by mass.
Body mass measurements were log transformed as they were strongly right skewed.
Distribution size was estimated by downloading range maps for all species from IUCN \cite{iucn} and was also log transformed due to right skew.




\subsection{Statistical analysis}

Statistical analysis for both measures of population structure (number of subspecies and effective level of gene flow) was conducted using an information-theoretic approach \cite{burnham2002model}, specifically following \textcite{whittingham2005habitat, whittingham2006we}.
All analyses were performed in \emph{R} \cite{R} and all code is available at \url{/~https://github.com/timcdlucas/PhDThesis/blob/master/comparative-test-of-pop-structure.Rtex}. %latex link issue. URL is correct.
I chose a credible set of models including all combinations of explanatory variables and a model with just an intercept.
In the analysis using the number of subspecies I also modelled the interaction between study effort and number of subspecies by including their product.
This interaction was included because I believed \emph{a priori} that it may be important, as subspecies are more likely to be identified in well-studied species.
The interaction was only included in models with both study effort and number of subspecies as individual terms.
Following \textcite{whittingham2005habitat} I included a uniformly distributed random variable.
This variable can be used to benchmark how important other explanatory variables are.
The whole analysis was run \rinline{nBoots} times, resampling the random variable each time.


To control for phylogenetic non-independence of data points I used the best-supported phylogeny from \textcite{fritz2009geographical} which is the supertree from \textcite{bininda2007delayed} with names updated to match the taxonomy by \textcite{wilson2005mammal}.
This tree was pruned to include only the species I had data for (Figure~\ref{fig:treePlot}).
Phylogenetic manipulation was performed using the \emph{ape} package \cite{ape}.
I also performed the analysis using the phylogeny from \textcite{jones2005bats} as this has some broad topological differences including the Rhinolophoidea being sister to the Pteropodidae rather than being related to the other insectivorous bats (Figure~\ref{fig:treePlot2}). 




%%begin.rcode treeCapt

treeCapt <- '
The phylogenetic distribution of viral richness.
The phylogeny is from \\cite{fritz2009geographical} pruned to include all species used in either the number of subspecies or gene flow analysis.
Dot size shows the number of known viruses for that species and colour shows family.
The red scale bar shows 25 million years.'



treeTitle <- 'Pruned phylogeny showing number of pathogens and family'

%%end.rcode

%%begin.rcode treePlot, out.width = '1\\textwidth', out.extra = 'trim = 0cm 0cm 0cm 0cm', fig.height = 5, fig.height = 5.5, fig.cap = treeCapt, fig.scap = treeTitle

combUneeded <- tr1$tip.label[!(tr1$tip.label %in% c(as.character(fstFinal$binomial), nSpecies$binomial))]

# Prune tree down to only needed tips.
combTree <- drop.tip(tr1, combUneeded)

combdf <- nSpecies %>%
            dplyr::select(binomial, virusSpecies, Family) %>%
            rbind(fstFinal %>% dplyr::select(binomial, virusSpecies, Family)) %>%
            distinct(binomial, .keep_all = TRUE) # keep virusSpecies and Family; needed in dplyr >= 0.5

# Plot tree 
p <- ggtree(combTree, layout = 'fan') 

p %<+% combdf +
  geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) +
  scale_size(range = c(0.1, 3)) +
  scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)],    pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) +
  theme_tcdl +
  theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) +
  theme(legend.position = c(0.5, -0.04)) +
  geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) +
  labs(size = 'Virus Richness') +
#  guides(size = guide_legend(override.aes = list(shape = 1))) +
  theme(legend.key.size = unit(0.8, "lines"),
        legend.text = element_text(size = 10),
        legend.margin = unit(c(0.05), "cm"),
        legend.title = element_text(size = 12),
        legend.direction = "horizontal") +
  guides(colour = guide_legend(ncol=3))


# Attempt at concentric circle time bar.
#scale <- data.frame(x = c(0, 0), y = c(0, 0), l = c(1200, 2400))

#p %<+% combdf +
#  geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) +
#  scale_size(range = c(0.1, 100), breaks = c(1, 5, 10)) +
#  scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,7,9,10)],    pokepal('Carvanha')[c(1,2,4, 13, 12, 9)])) +
#  theme_tcdl +
#  theme(plot.margin = unit(c(-2, -0, +3, -0), "lines")) +
#  theme(legend.position = c(0.5, -0.04)) +
#  geom_point(data = scale, aes(x = x, y = y, size = l), alpha = 0.2) +
#  geom_treescale(x = 0, y = 152, width = 25, color = pokepal(17)[3], offset = 9) +
#  labs(size = 'Virus Richness') +
##  guides(size = guide_legend(override.aes = list(shape = 1)), alpha = 0.9) +
#  theme(legend.key.size = unit(0.8, "lines"),
#        legend.text = element_text(size = 10),
#        legend.margin = unit(c(0.05), "cm"),
#        legend.title = element_text(size = 12),
#        legend.direction = "horizontal") +
#  guides(colour = guide_legend(ncol=3))

# Or using bars

#scale2 <- data.frame(x = c(1, 1), y = c(10, 200), w = c(1, 1))

#p %<+% combdf +
#  geom_point2(aes(size = virusSpecies, colour = Family, subset = isTip)) +
#  geom_bar(data = scale2, aes(x = x, y = y, size = w), alpha = 0.3, stat = 'identity', position = 'identity')

%%end.rcode



The importance of the phylogeny on each variable separately was examined by estimating the $\lambda$ parameter when regressing the variable against an intercept using the \emph{pgls} function in \emph{caper} \cite{caper}.
The parameter $\lambda$ usually takes values between zero and one and \emph{pgls} constrains $\lambda$ within these bounds. 
$\lambda = 0$ implies no autocorrelation while a trait evolving by Brownian motion along the tree would have $\lambda = 1$.
I tested fitted $\lambda$ values against the null hypothesis of $\lambda = 0$ (no correlation between species) with log-likelihood ratio tests using \emph{caper} \cite{caper}.
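As a sketch of what $\lambda$ does (using standard notation that is introduced here for illustration only), let $\mathbf{V}$ be the phylogenetic variance--covariance matrix expected under Brownian motion, whose off-diagonal elements are the branch lengths shared by each pair of species.
The $\lambda$ transformation multiplies the off-diagonal elements by $\lambda$ and leaves the diagonal unchanged,
\[
\mathbf{V}_\lambda = \lambda \mathbf{V} + (1 - \lambda)\,\text{diag}(\mathbf{V}),
\]
so that $\lambda = 0$ gives independent species and $\lambda = 1$ recovers $\mathbf{V}$.
The likelihood ratio statistic for the test against $\lambda = 0$ is then
\[
D = 2\left[\ell(\hat{\lambda}) - \ell(0)\right],
\]
where $\ell(\cdot)$ is the maximised log-likelihood at the given value of $\lambda$, and $D$ is conventionally compared to a $\chi^2$ distribution with one degree of freedom.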

I fitted phylogenetic regressions for all models in the credible set using the function \emph{gls} in the package \emph{nlme} \cite{nlme}.
The explanatory variables were centred and scaled to allow direct comparison of the coefficients \cite{schielzeth2010simple}.
For each regression model I simultaneously fitted the $\lambda$ parameter as this avoids misspecifying the model \cite{revell2010phylogenetic}.
Unlike the \emph{pgls} function, \emph{gls} does not constrain $\lambda$ to be in the range $\lbrack 0, 1\rbrack$.
$\lambda < 0$ indicates that residuals from the fitted model are distributed on the phylogeny more uniformly than expected by chance.
$\kappa$ and $\delta$ parameters were constrained to one as they are more concerned with when evolution occurs along a branch than the importance of the phylogeny.
Further, fitting multiple parameters makes interpretation difficult. 



To establish the importance of variables I calculated the probability, $Pr$, that each variable would be in the best model amongst those examined (under the assumption that all models are \emph{a priori} equally likely).
This value can more generally, and with fewer assumptions, be considered as simply the relative weight of evidence for each variable being in the best model amongst those examined.
I calculated AICc for each model.
As each model was fitted \rinline{nBoots} times, I calculated the mean AICc, $\bar{\text{AICc}}$, across fits for each model.
$\Delta\text{AICc}$ was calculated as $\bar{\text{AICc}} - \text{min}(\bar{\text{AICc}})$, not the mean of the individual $\Delta\text{AICc}$ scores, to guarantee that the best model has $\Delta\text{AICc} = 0$.
From these $\Delta\text{AICc}$ values I calculated Akaike weights, $w$.
This value can be interpreted as the probability that a model is the best model, given the data, amongst those examined.
For each variable, the Akaike weights of all models containing that variable are summed to give $Pr$.
This value can be interpreted as the probability that the given variable is in the best model.
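Written out, in notation introduced here for illustration, for a credible set of $R$ models where $\Delta_i$ is the $\bar{\text{AICc}}$ of model $i$ minus $\text{min}(\bar{\text{AICc}})$, the Akaike weight of model $i$ is
\[
w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{j=1}^{R} \exp(-\Delta_j / 2)},
\]
and the weight of evidence for a variable $x$ is
\[
Pr_x = \sum_{i \,:\, x \in \text{model } i} w_i .
\]
For example, for three hypothetical models with $\Delta = (0, 2, 4)$, the unnormalised terms are $1$, $e^{-1} \approx 0.37$ and $e^{-2} \approx 0.14$, giving $w \approx (0.67, 0.24, 0.09)$.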

To determine the direction and strength of the effect of each variable the mean of its regression coefficient, $b$, in all models that contained that variable, weighted by the model's Akaike weight, was also calculated.
In the subspecies analysis the inclusion of an interaction term between number of subspecies and study effort makes interpretation of this mean coefficient more difficult, particularly because the interaction term greatly affects the estimated value of $b$.
To aid interpretation, the mean coefficient for the number of subspecies was calculated for: \emph{i}) all models containing the number of subspecies, \emph{ii}) only models with the interaction term and \emph{iii}) only models with the number of subspecies but not the interaction term.
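
As a sketch, in notation introduced here for illustration, let $b_{x,i}$ be the estimated coefficient of variable $x$ in model $i$ and $w_i$ the weight given to that model.
Assuming the weights are renormalised over the subset of models being averaged, as a weighted mean does, the mean coefficient is
\[
\bar{b}_x = \frac{\sum_{i \,:\, x \in \text{model } i} w_i\, b_{x,i}}{\sum_{i \,:\, x \in \text{model } i} w_i}.
\]
The three versions of the mean coefficient for the number of subspecies differ only in which models are included in these sums.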



%%begin.rcode boxplotCapt	
		
# Caption for the main boxplot of subspecies vs virus		
		
boxplotCapt <- paste(
'The relationship between number of subspecies and viral richness for',
nrow(nSpecies),
'bat species.
The area of the circle shows the number of bat species at each discrete value.
48 bat species have one subspecies and one known virus species.
The red line represents a phylogenetic simple regression between the two variables.
'
)

boxplotTitle <- paste(
'The relationship between number of subspecies and viral richness for',
nrow(nSpecies),
'bat species'
)

%%end.rcode		
		
%%begin.rcode boxplot, fig.cap = boxplotCapt, fig.scap = boxplotTitle,	fig.height = 2.3

nSpeciesCounts <- nSpecies %>%
                    group_by(NumberOfSubspecies, virusSpecies) %>%
                    dplyr::summarize(n = n())

ggplot(nSpeciesCounts, aes(NumberOfSubspecies, virusSpecies, size = n)) + 
  geom_point() +
  scale_size(range = c(0.5, 4.3), breaks = c(1, 20, 40)) +
  scale_y_continuous(breaks = c(1, 5, 10, max(nSpecies$virusSpecies))) +
  scale_x_continuous(breaks = c(1,  4, 8, 12, 16)) +
  xlab('Number of Subspecies') +
  ylab('Viral Richness') +
  geom_abline(slope = sspUni$coef[2, 1], intercept = sspUni$coef[1,1], lwd = 0.7, colour = pokepal('nidorina')[10])

%%end.rcode




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Results}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



\subsection{Number of Subspecies}
\tmpsection{More descriptive}

The number of described virus species for a bat host ranged up to \rinline{max(nSpecies$virusSpecies)} viruses in \emph{\rinline{nSpecies$binomial[which.max(nSpecies$virusSpecies)]}}.
There appears to be a positive relationship between the number of subspecies and viral richness (Figure~\ref{fig:boxplot}) though few species have more than five subspecies. 
Out of \rinline{nrow(modelWeights)} fitted models, the top seven models all had $\Delta\text{AICc} < 4$ meaning there was no clear best model (Table~\ref{t:models} and Table~\ref{A-modelWeights}).
However, these top seven models all contained study effort, number of subspecies and the interaction between these two variables.
The explanatory variables log(Mass), log(Range Size) and the uniformly random variable are each in three of the top seven models.
These top seven models had a combined weight of \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))} meaning that there is a \rinline{sprintf("%.0f", round(100 * modelWeights[7, 5]))}\% chance that one of these models is the best model amongst those examined.

Summing the Akaike weights of all models that contain a given variable gives a probability, $Pr$, that the variable would be in the best model amongst those in the plausible set \cite{whittingham2006we}.
The number of subspecies is very likely in the best model ($Pr > $ \rinline{substring(as.character( varWeights['NumberOfSubspecies']), 1, 4)}) as is the interaction term between the number of subspecies and study effort ($Pr = $ \rinline{varWeights['scholarRefs.NumberOfSubspecies']}) compared to the benchmark random variable which has $Pr = $ \rinline{varWeights['rand']} (Figure~\ref{fig:fstITPlots}A and Table~\ref{t:variables}).
When models with the interaction term are removed there is, on average (mean weighted by Akaike weights), a positive relationship between the number of subspecies and viral richness ($b = $ \rinline{nSpeciesCoefMean}, variance = \rinline{nSpeciesCoefVar}).
Models with an interaction term between the number of subspecies and study effort have a positive regression slope for the interaction term ($b = $ \rinline{nSpeciesInterMean}, variance = \rinline{nSpeciesInterVar}) and linear term ($b = $ \rinline{nSpeciesCoefMeanI}, variance = \rinline{nSpeciesCoefVarI}).
At median and high values of study effort, this gives a positive relationship between the number of subspecies and viral richness (Figure~\ref{fig:plotInter}).
At low values of study effort, the relationship between the number of subspecies and viral richness becomes flat or even negative.
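This pattern follows from the form of the interaction model.
As a sketch, writing $S$ for the number of subspecies and $E$ for log study effort (both centred and scaled; these symbols are introduced here for illustration), and $b_S$ and $b_{SE}$ for the linear and interaction coefficients, the slope of predicted viral richness, $\hat{y}$, with respect to the number of subspecies is
\[
\frac{\partial \hat{y}}{\partial S} = b_S + b_{SE} E .
\]
When both coefficients are positive, this slope is positive for $E > -b_S / b_{SE}$ and negative for $E < -b_S / b_{SE}$.
Because $E$ is centred, this threshold lies below the mean of log study effort.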



\afterpage{ % use after page to make sure this whole table is at the end of a page.
\begin{landscape}
\begin{table}[p!]
\centering
%\rowcolors{2}{gray!25}{white}
\caption[Model selection results]{
Model selection results for number of subspecies and effective level of gene flow analysis. 
Models are ranked according to $\bar{\text{AICc}}$ and only the best nine and three models, respectively, are shown.
Models were fitted to all combinations of variables (in total \rinline{nrow(modelWeights)} number of subspecies models and \rinline{nrow(fstModelWeights)} effective gene flow models).
$\bar{\text{AICc}}$ is the mean AICc score across \rinline{nBoots} resamplings of the null random variable. 
$\Delta$AICc is the model's $\bar{\text{AICc}}$ score minus $\text{min}(\bar{\text{AICc}})$. 
$w$ is the Akaike weight and can be interpreted as the probability that the model is the best model (of those in the plausible set).
$\sum w$ is the cumulative sum of the Akaike weights.
log(Scholar)*NSubspecies indicates the interaction term between study effort and number of subspecies.
%In the number of subspecies analysis there are many models with low $\Delta$AICc scores suggesting there there is no single `best model'.
%In the gene flow analysis, only the top model is supported.
}


\begin{tabular}{@{}>{\footnotesize}lrrrr@{}}

\toprule
\normalsize{Model} & $\bar{\text{AICc}}$ & $\Delta$AICc & $w$ & $\sum w$\\
\midrule
&&&&\\[-3mm]
\textit{\small{Number of Subspecies}} &&&&\\
%1
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies  + log(Mass) + log(RangeSize) & 
\rinline{round(modelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[1, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[1, 5], 2))}\\
%2
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies  + log(Mass) & 
\rinline{round(modelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[2, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[2, 5], 2))}\\
%3
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies + Random + log(Mass) & 
\rinline{round(modelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[3, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[3, 5], 2))}\\
%4
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies  & 
\rinline{round(modelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[4, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[4, 5], 2))}\\
%5
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies  + log(RangeSize) & 
\rinline{round(modelWeights[5 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[5, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[5, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[5, 5], 2))}\\
%6
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies  + Random + log(RangeSize) & 
\rinline{round(modelWeights[6 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[6, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[6, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[6, 5], 2))}\\
%7
log(Scholar) + NSubspecies + log(Scholar)*NSubspecies  + Random & 
\rinline{round(modelWeights[7 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[7, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[7, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[7, 5], 2))}\\
%8
log(Scholar) + NSubspecies + log(Mass) + Random & 
\rinline{round(modelWeights[8 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[8, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[8, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[8, 5], 2))}\\
%9
log(Scholar) + NSubspecies + log(Mass) + log(RangeSize) + Random & 
\rinline{round(modelWeights[9 ,2])} & \rinline{sprintf("%.2f", round(modelWeights[9, 3], 2))} &
\rinline{sprintf("%.2f", round(modelWeights[9, 4], 2))} & \rinline{sprintf("%.2f", round(modelWeights[9, 5], 2))}\\[5mm]
\textit{\small{Gene flow}} &&&&\\
log(Scholar) + log(Gene flow) + log(Mass) & 
\rinline{round(fstModelWeights[1 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 3], 2))} &
\rinline{sprintf("%.2f", round(fstModelWeights[1, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[1, 5], 2))}\\
log(Range size) & 
\rinline{round(fstModelWeights[2 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 3], 2))} &
\rinline{sprintf("%.2f", round(fstModelWeights[2, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[2, 5], 2))}\\
log(Mass) & 
\rinline{round(fstModelWeights[3 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 3], 2))} &
\rinline{sprintf("%.2f", round(fstModelWeights[3, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[3, 5], 2))}\\
%log(Scholar) + log(Gene flow) + log(Mass) + Random &
%\rinline{round(fstModelWeights[4 ,2])} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 3], 2))} &
%\rinline{sprintf("%.2f", round(fstModelWeights[4, 4], 2))} & \rinline{sprintf("%.2f", round(fstModelWeights[4, 5], 2))}\\
\bottomrule
\end{tabular}

\label{t:models}
\end{table}
\end{landscape}
}




When using the phylogeny from \textcite{jones2005bats} the results are broadly similar (Figure~\ref{f:A-itplots} and Tables~\ref{A-modelWeights2} and~\ref{t:variables2}).
Study effort, the number of subspecies and the interaction between the number of subspecies and study effort have strong support while range size and mass have intermediate support.
However, mass, range size and the interaction between number of subspecies and study effort have slightly weaker support than in the analysis using the phylogeny from \textcite{fritz2009geographical}.



\tmpsection{Model results}


\begin{table}[t!]
\centering
\caption[Estimated variable weights and coefficients for number of subspecies and gene flow analyses]{
Estimated variable weights (probability that a variable is in the best model) and their estimated coefficients for both number of subspecies and gene flow analyses.
The coefficients for the number of subspecies variable are given for models with and without the interaction term because this term strongly changes the coefficient and because the coefficient can only be usefully interpreted when estimated without the interaction. 
However, there are no weights for these separated terms as they are not directly compared in the model selection framework.
}
%\rowcolors{2}{gray!25}{white}
\begin{tabular}{@{}>{\small}l rrrr@{}}
\toprule
& \multicolumn{2}{c}{\textit{Number of Subspecies}} & \multicolumn{2}{c}{\textit{Gene flow}}\\\cmidrule(rl){2-3}\cmidrule(rl){4-5}
\normalsize{Variable} & $Pr$ & Coefficient & $Pr$ & Coefficient\\
\midrule
Number of subspecies &&&&\\
\hspace{3mm}Total & \rinline{sprintf('%.2f', varWeights['NumberOfSubspecies'])} & \rinline{varCoefMeans['beta.NumberOfSubspecies']} &&\\
\hspace{3mm}Models without interaction term &&  \rinline{nSpeciesCoefMean} &&\\
\hspace{3mm}Models with interaction term &&  \rinline{nSpeciesCoefMeanI} &&\\
Number of subspecies*log(Scholar) &  \rinline{varWeights['scholarRefs.NumberOfSubspecies']} &  \rinline{sprintf('%.2f', varCoefMeans['beta.scholarRefs.NumberOfSubspecies'])} && \\[2.5mm]  
Gene flow & & &  \rinline{sprintf('%.2f', fstVarWeights['Nm'])} &  \rinline{fstCoefMeans['beta.Nm']}\\[2.5mm]  
log(Scholar) &  \rinline{sprintf('%.2f', varWeights['scholarRefs'])} &  \rinline{varCoefMeans['beta.scholarRefs']} & 
   \rinline{sprintf('%.2f', fstVarWeights['scholarRefs'])} &  \rinline{fstCoefMeans['beta.scholarRefs']}\\
log(Mass) &  \rinline{sprintf('%.2f', varWeights['mass'])} &  \rinline{varCoefMeans['beta.mass']} & 
   \rinline{sprintf('%.2f', fstVarWeights['mass'])} &  \rinline{fstCoefMeans['beta.mass']}\\
log(Range size) &  \rinline{sprintf('%.2f', varWeights['distrSize'])} &  \rinline{varCoefMeans['beta.distrSize']}& 
   \rinline{fstVarWeights['distrSize']} &  \rinline{fstCoefMeans['beta.distrSize']}\\
Random &  \rinline{sprintf('%.2f', varWeights['rand'])} &  \rinline{varCoefMeans['beta.rand']}& 
   \rinline{fstVarWeights['rand']} &  \rinline{fstCoefMeans['beta.rand']}\\
\bottomrule
\end{tabular}

\label{t:variables}
\end{table}




\subsection{Gene Flow}

\tmpsection{More Descriptive}

%Figure~\ref{fig:fstTreePlot} shows the phylogeny used and the number of viruses for each species.
The number of described virus species for a bat host ranged up to \rinline{max(fstFinal$virusSpecies)} viruses in \emph{\rinline{fstFinal$binomial[which.max(fstFinal$virusSpecies)]}} (Figure~\ref{fig:fstRawData}).
Only the model with study effort, gene flow and body mass was well supported, with the second model having a $\Delta\text{AICc}$ of \rinline{round(fstModelWeights[2, 3])} (Table~\ref{t:models} and Table~\ref{A-modelWeights}).
The effective level of gene flow was likely in the best model ($Pr > 0.99$, see Figure~\ref{fig:fstITPlots}B and Table~\ref{t:variables}).
On average (mean weighted by Akaike weights) there was a negative relationship between gene flow and viral richness ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) despite the non-significant positive relationship (Figure~\ref{fig:fstRawData}) estimated by the single-predictor model (pgls: $b$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Estimate']}, $t$ = \rinline{nmFstUni$coefficients['log(Nm)', 't value']}, df = \rinline{nmFstUni$df[2]}, $p$ = \rinline{nmFstUni$coefficients['log(Nm)', 'Pr(>|t|)']}).
Possibly due to the smaller sample size, or a weaker relationship, this coefficient was much more variable than the number of subspecies coefficient, with \rinline{round(100 - pcCoefLzero)}\% of multiple-regression models estimating a positive relationship.





%%begin.rcode ITCombPlotCapt

ITPlotCapts <- "
The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness.
The probability that each variable is in the best model (amongst the models tested) is shown for A) the number of subspecies analysis and B) the effective gene flow analysis.
The boxplots show the variation of the results over 50 resamplings of the uniformly random ``null'' variable. 
The thick bar of the boxplot shows the median value, the interquartile range is represented by a box, vertical lines represent range, and outliers are shown as filled circles.
The red ``Random'' box is the uniformly random variable. 
Population structure (number of subspecies and effective gene flow), shown in yellow, is likely to be in the best model in both analyses."

ITPlotTitle <- "The relative weight of evidence that each explanatory variable is in the best model for explaining viral richness"

%%end.rcode


%%begin.rcode fstITPlots, fig.cap = ITPlotCapts, fig.height = 2.5, fig.scap = ITPlotTitle, out.width = '\\textwidth', cache = FALSE

# Reorder var levels to get structure at beginning.
fstSepVarWeights$variable <- factor(fstSepVarWeights$variable, levels(fstSepVarWeights$variable)[c(2, 1, 3, 4, 5)])

# Draw the fst model selection plot
fstIT <- ggplot(fstSepVarWeights, aes(x = variable, y = estimate, colour = col, fill = col)) +
  geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) +
  scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) +
  scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) +
  theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'),
    panel.grid.major.x = element_blank(),
    axis.text.y = element_text(size = 8)) +
  scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) +
  # Set the limits here: a separate ylim() call would replace this scale and drop its labels.
  scale_y_continuous(labels = c('0.00','0.25','0.50','0.75','1.00'), breaks = c(0, 0.25, 0.5, 0.75, 1), limits = c(0, 1)) +
  ylab('P(in best model)') +
  xlab('')


#plot_grid(ITPlot, fstIT, labels = c("A", "B"), align = 'h', label_size = 10)


# Combine and print the plots.
ggdraw() +
  draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + 
  draw_plot(ITPlot, 0, 0, 0.5, 1) +
  draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + 
  draw_plot(fstIT, 0.5, 0.164, 0.5, 0.855) +
  draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12)


%%end.rcode




Study effort was very likely in the best model ($Pr > 0.99$) as was body mass ($Pr > 0.99$).
However, body mass had a negative average coefficient ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}). % which is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model (pgls: $b$ = \rinline{massFstUni$coefficients['log(mass)', 'Estimate']}, $t$ = \rinline{massFstUni$coefficients['log(mass)', 't value']}, df = \rinline{massFstUni$df[2]}, $p$ = \rinline{massFstUni$coefficients['log(mass)', 'Pr(>|t|)']}).
In contrast to the number of subspecies analysis, range size was almost certainly not in the best model with $Pr = $ \rinline{fstVarWeights['distrSize']}.
%This variable being less supported than the random variable may be because range size is closely correlated with study effort (pgls: $b$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Estimate']}, $t$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 't value']}, df = \rinline{fstDistrStudyEffort$df[2]}, $p$ = \rinline{fstDistrStudyEffort$coefficients['log(distrSize)', 'Pr(>|t|)']}).
Of the three explanatory variables in the best model, study effort had the largest effect ($b = $ \rinline{fstCoefMeans['beta.scholarRefs']}, variance = \rinline{fstCoefVars['beta.scholarRefs']}).
The effect size of gene flow ($b = $ \rinline{fstCoefMeans['beta.Nm']}, variance = \rinline{fstCoefVars['beta.Nm']}) was approximately twice that of body mass ($b = $ \rinline{fstCoefMeans['beta.mass']}, variance = \rinline{fstCoefVars['beta.mass']}).




%%begin.rcode fstRawCapt

fstRawDataCapt <- 
paste(
'Relationship between viral richness and log effective gene flow per generation for',
nrow(fstFinal),
'bat species.
Green points are studies that estimated effective gene flow using allozymes and blue points are studies using microsatellites.
The red line represents a phylogenetic simple regression between the two variables.
')



fstRawDataTitle <- 
paste(
'Relationship between viral richness and log effective gene flow per generation for',
nrow(fstFinal),
'bat species
')

%%end.rcode



%%begin.rcode fstRawData, fig.height = 2.3, fig.cap = fstRawDataCapt, fig.scap = fstRawDataTitle

# Plot raw fst data

ggplot(fstFinal, aes(x = Nm, y = virusSpecies, colour = Marker)) +
  geom_point(size = 2) +
  scale_colour_poke(pokemon = 'oddish', spread = 3) +
  scale_x_log10() +
  geom_abline(intercept = nmFstUni$coef[1, 1], slope = nmFstUni$coef[2, 1], lwd = 0.7, colour = pokepal('nidorina')[10]) +
  xlab('Gene Flow (per gen.)') +
  ylab('Viral Richness') 


%%end.rcode



When using the phylogeny from \textcite{jones2005bats}, the analysis became very unstable (Figure~\ref{f:A-itplots}).
The support for each variable changed dramatically with each resampling of the random variable.
On average, however, only the model containing mass and range size was supported (Tables~\ref{A-fstModelWeights} and~\ref{t:variables2}).




\subsection{Phylogenetic Analysis}

\subsubsection{Number of subspecies}

Figure~\ref{fig:treePlot} shows the phylogeny used and the number of viruses for each species.
The mean number of viruses is fairly constant across families, with \rinline{familyMeans$Family[which.min(familyMeans$mean)]} having the smallest mean (\rinline{min(familyMeans$mean)}).
The highest mean is \rinline{familyMeans$Family[which.max(familyMeans$mean)]} with \rinline{max(familyMeans$mean)} virus species per bat species, but this is based on only \rinline{familyMeans$n[which.max(familyMeans$mean)]} species.
The \rinline{familyMeans$Family[order(familyMeans$mean, decreasing = TRUE)[2]]} have the second highest mean of \rinline{familyMeans$mean[order(familyMeans$mean, decreasing = TRUE)[2]]} ($n$ = \rinline{familyMeans$n[order(familyMeans$mean, decreasing = TRUE)[2]]}).



The small change in mean pathogen richness across families and the lack of a clear pattern in Figure~\ref{fig:treePlot} imply that viral richness does not have a strong phylogenetic signal.
This is corroborated by the small estimated size of $\lambda$ ($\lambda$ = \rinline{virusLambda$param['lambda']}, $p$ = \rinline{virusLambda$param.CI$lambda$bounds.p[1]}).
%This fact implies that other factors must control pathogen richness.
%It also implies that pathogens are not directly inherited down the phylogeny, although this is to be expected by the fast evolution of viruses.

Of the explanatory variables, the number of subspecies had no phylogenetic autocorrelation ($\lambda$ = \rinline{sspLambda$param['lambda']}, $p > 0.99$), study effort and distribution size had weak but significant autocorrelation (study effort: $\lambda$ = \rinline{scholarLambda$param['lambda']}, $p$ = \rinline{scholarLambda$param.CI$lambda$bounds.p[1]}; distribution size: $\lambda$ = \rinline{distrLambda$param['lambda']}, $p < 10^{-5}$), and body mass was strongly phylogenetic ($\lambda$ = \rinline{massLambda$param['lambda']}, $p < 10^{-5}$).
Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(allResults$lambda))} which implied that the residuals from the models were very weakly phylogenetic.
A small number of models (\rinline{mean(na.omit(allResults$lambda < 0))*100}\%) had negative estimates of $\lambda$, that is, negatively phylogenetically distributed residuals.
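
For interpretation, and assuming the standard definition of Pagel's $\lambda$ (the matrix notation $V$ is introduced here only for illustration), $\lambda$ multiplies the off-diagonal elements of the phylogenetic covariance matrix $V$ expected under Brownian motion while leaving the diagonal unchanged:
\[
V_{ij}(\lambda) = \lambda V_{ij} \quad (i \neq j), \qquad V_{ii}(\lambda) = V_{ii}.
\]
Under this definition, $\lambda = 1$ corresponds to Brownian motion on the phylogeny, $\lambda = 0$ to no phylogenetic covariance, and $\lambda < 0$ to closely related species being less similar than expected by chance.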




\subsubsection{Effective gene flow}

There was no phylogenetic signal in the number of virus species ($\lambda$ = \rinline{virusFstLambda$param['lambda']}, $p > 0.99$).
Gene flow also had no phylogenetic autocorrelation ($\lambda$ = \rinline{nmFstLambda$param['lambda']}, $p > 0.99$).
Due to the limited sample size, significance tests are unlikely to have much power.
There was little evidence of phylogenetic autocorrelation in study effort ($\lambda$ = \rinline{scholarFstLambda$param['lambda']}, $p$ = \rinline{scholarFstLambda$param.CI$lambda$bounds.p[1]}).
However, there was some weak evidence of phylogenetic signal in range size: the estimate of $\lambda$ was large but $p$ was also large, potentially because of a lack of statistical power ($\lambda$ = \rinline{distrFstLambda$param['lambda']}, $p$ = \rinline{distrFstLambda$param.CI$lambda$bounds.p[1]}).
Body mass showed significant phylogenetic autocorrelation ($\lambda$ = \rinline{massFstLambda$param['lambda']}, $p$ = \rinline{massFstLambda$param.CI$lambda$bounds.p[1]}).


Across all multiple regression models the mean value of $\lambda$ was \rinline{mean(na.omit(fstAllResults$lambda))}.
A large number of individual models (\rinline{round(mean(na.omit(fstAllResults$lambda < 0))*100)}\%) had negative estimates of $\lambda$, implying that the residuals from these models were spread more uniformly on the phylogeny than expected by chance.
Given the small sample size, this was probably caused by a small number of data points with large residuals being distant from one another on the tree.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Discussion}  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\tmpsection{Discuss results in more detail}


\tmpsection{Pop structure relates to pathogen richness}

% It does so here
% I hope this study is more robust.
In this study I have used known viral richness in bats as a case study for the more general hypothesis that increased population structure promotes pathogen richness.
In both analyses I found that a positive effect of increasing population structure (a positive effect of the number of subspecies and a negative effect of gene flow) is likely to be in the best model for explaining viral richness.
Only the effective gene flow analysis, when performed using the phylogeny from \textcite{jones2005bats}, does not support this hypothesis.
Therefore my study supports the broader hypothesis that increased population structure promotes pathogen richness.
The positive relationship between increased population structure and pathogen richness implies that direct or indirect competitive mechanisms are acting: increased population structure allows pathogens to escape competition, which in turn promotes pathogen richness.
Furthermore, my study contradicts the assumption that factors that promote high $R_0$ will automatically promote high pathogen richness by increasing the rate of spread of new pathogens entering into the population \cite{nunn2003comparative, morand2000wormy}.



% It does so in some lit
This analysis is in agreement with two studies that have specifically tested this same hypothesis \cite{turmelle2009correlates, maganga2014bat}.
These two studies measured population structure using $F_{ST}$ \cite{turmelle2009correlates} and fragmentation of species distributions \cite{maganga2014bat}.
Combined with the analysis here using the number of subspecies, three different measures of population structure have been shown to correlate with pathogen richness in bats.
By analysing data on two measures of population structure, and using larger data sets than previous studies, it is hoped that the results here may be more robust than in previous analyses \cite{gay2014parasite, turmelle2009correlates, maganga2014bat}.



% The pattern is reversed in other lit
In contrast, \textcite{gay2014parasite} found the opposite relationship using fragmentation of species distribution.
Furthermore, \textcite{bordes2008bat} found no relationship between increased colony size and pathogen richness, while \textcite{gay2014parasite} found relationships in opposite directions for virus and ectoparasite richness.
However, the study by \textcite{gay2014parasite} used relatively few species, while the study by \textcite{bordes2008bat} used group size, which is a measure of local rather than global population structure.
The overall weight of evidence suggests that population structure and pathogen richness are associated.




\tmpsection{There is an interaction between study effort and number of subspecies}

% interpretations
% Biases are known in the lit. gippoliti2007problem % maybe should add to methods?

There was strong support for a positive interaction between the number of subspecies and study effort.
The support for this interaction implies that increased population structure has a stronger relationship with known pathogen richness when study effort is not very low.
One interpretation of this is that increased population structure alone does not predict high known viral richness; reasonable study effort is also needed to turn the expected high viral richness into known and recorded viral richness.
Biases in identification of subspecies have been noted before \cite{gippoliti2007problem}.
The number of subspecies is more commonly used as a variable in comparative analyses of birds than of mammals, but its association with study effort is often not taken into account \cite{phillimore2007biogeographical, belliure2000dispersal}.

\tmpsection{Other explanatory vars}


% study effort is important. Never forget.
% body mass behaves weirdly.
% Range size is very marginal

Of the other explanatory variables considered, study effort and body mass were selected as being in the best model while there was marginal evidence for range size being associated with viral richness.
Study effort positively predicted pathogen richness, confirming the expectation that additional study of a bat species yields more known viruses infecting that host species.
Therefore, this bias cannot be ignored in studies using known pathogen richness as a proxy for total pathogen richness \cite{nunn2003comparative, gregory1990parasites}.
While body mass was selected as being in the best model in both the number of subspecies analysis and the effective gene flow analysis, the estimated coefficients have opposite signs in the two analyses.
In the number of subspecies analysis, body mass has a positive relationship with pathogen richness which is in agreement with previous studies \cite{kamiya2014determines, bordes2008bat, turmelle2009correlates, gay2014parasite, maganga2014bat}.
However, in the effective gene flow analysis, body mass has a negative estimated coefficient.
This is in contrast to the number of subspecies analysis, previous studies in the literature and the single-predictor model.
This result is probably due to correlations with other variables in the analysis and exacerbated by the small sample size in this analysis.


\tmpsection{phylogeny}
% Phylogeny is not very important
% phylogeny is weird in Fst study?



%Another interpretation is that having few subspecies does not predict low viral richness unless the species has been adequately studied as otherwise the low number of subspecies is probably due to a lack of study rather than an accurate measurement.

%Another potential mechanism by which structure might be promoting increased richness is by slowing the spread of highly virulent viruses such as rabies and preventing them from having short, intense epidemics followed by extinction.
%This mechanism has interesting parallel to metapopulation theory in ecology in which a metapopulation structure can allow persistence of species that would otherwise go extinct.

\subsection{Broader implications}

The relationship between increased population structure and pathogen richness suggests that population structure has at least some potential to predict high pathogen richness and therefore a species' likelihood of being a reservoir of a potentially zoonotic pathogen.
However, given that population structure is difficult to measure and that the relationship appears to be weak at best, this trait on its own is unlikely to be useful in predicting zoonotic risk.
Nevertheless, a number of other factors are also associated with pathogen richness, such as body mass and, to a lesser extent, range size, as shown here, as well as other traits studied elsewhere \cite{turmelle2009correlates, luis2013comparison}.
Therefore, using a combination of traits in a predictive (i.e.\ machine learning) framework has potential for use in prioritising zoonotic disease surveillance.
The main hurdle in this approach is finding a way to validate models; due to the study effort bias in current data, predictive models will also be biased.
As unbiased pathogen surveys such as that of \textcite{anthony2013strategy} become more common, good validation may become possible.
Alternatively, predictive models could be trained on all available --- and therefore biased --- data and validated by predicting smaller, unbiased data sets such as the data collected in \textcite{maganga2014bat}.

The relationship between increased population structure and pathogen richness also has implications for habitat fragmentation and range shifts due to global change.
In short, habitat fragmentation and range shifts that reduce movement between populations would be predicted to increase pathogen richness.
However, depending on the mechanisms by which increased population structure increases pathogen richness this may not be a cause for concern.
If the main mechanism is one that reduces pathogen extinction rates, a newly fragmented population is unlikely to increase its pathogen richness over short- to medium-term timescales.
If, however, increased population structure actively promotes the evolution of new pathogen strains or allows the persistence of more virulent strains \cite{blackwood2013resolving, pons2014insights, plowright2011urban}, this could have important public health implications.
Therefore, further studies on the exact mechanisms by which increased population structure affects pathogen richness are needed.


\subsection{Study limitations}

Although I have used measures of study effort to try to control for biases in the viral richness data, this bias could still make the results here unreliable --- this is especially true as study effort is by far the strongest predictor of viral richness in both data sets.
It is hoped that as untargeted sequencing of viral genetic material becomes cheaper and more common, this bias can be reduced \cite{anthony2013strategy}.
The strength of the relationship between study effort and known viral richness also highlights the number of bat-virus host-pathogen relationships yet to be discovered and the number of virus species that are yet to be described.

I have included a number of explanatory variables to avoid spurious correlations.
However, there are few data on bat density or population size.
Given that studies in other mammalian groups have found relationships between host density and pathogen richness, this would be a useful variable to include in further analyses \cite{kamiya2014determines, nunn2003comparative, arneberg2002host}.
Acoustic monitoring is becoming cheaper and less labour intensive and may provide suitable data for estimating population densities or population sizes for more bat species.
However, it is not clear whether host population density or host population size is the more appropriate measure with respect to disease dynamics \cite{begon2002clarification}.
Given the importance of geographic range size found here and elsewhere \cite{lindenfors2007parasite, nunn2003comparative, turmelle2009correlates, huang2015parasite, kamiya2014determines}, comparative studies may struggle to distinguish between these three related factors: host population size, population density and geographic range size.

I have used two measures of population structure and the number of subspecies data set is larger than those used in previous studies.
However, it is clear that the gene flow data set is small ($n$ = \rinline{nrow(fstFinal)}).
This may explain some unexpected results.
While the model averaging approach has given a negative model-averaged coefficient for gene flow, the single-predictor model of gene flow against viral richness gave a positive coefficient.
Furthermore, body mass has a negative average coefficient.
This is in contrast to the number of subspecies analysis, many studies in the literature \cite{kamiya2014determines, turmelle2009correlates, gay2014parasite, maganga2014bat} and the single-predictor model.
It is not easy to interpret these contradictions but it is clear that the results from the gene flow analysis alone should not be considered strong evidence for a relationship between increased population structure and pathogen richness.
These contradictions also reiterate the need to use large data sets where possible and the need to use multiple measures of population structure to promote robust conclusions.

Finally, while comparative studies are a useful tool for examining broad trends of pathogen richness across large taxonomic groups, they cannot examine the specific mechanisms that may be underpinning the correlations found.
Therefore, further work is needed to test which mechanisms are actually causing the relationship between increased population structure and pathogen richness that I have identified here.
A number of mechanisms might be involved.
A reduced rate of pathogen extinction might be caused by a reduction in competition due to the slow dispersal of competing pathogens.
Alternatively, increased population structure may promote the invasion of new pathogens, by creating localised areas of low competition or host immunity.
One method for testing these mechanisms would be through mechanistic epidemiological models.

\subsection{Conclusions}


I have used phylogenetic linear models to identify positive relationships between viral richness in bats and increased population structure, as measured by two variables (the number of subspecies and effective levels of gene flow).
This study adds to the evidence that increased population structure may promote pathogen richness.
It does not support the view that factors that increase $R_0$ will increase pathogen richness.
Using larger data sets and multiple measurements makes the weight of the evidence here stronger than in previous studies.
However, caution must still be taken in interpreting these results as the data are biased and are particularly sparse in the gene flow analysis.





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Repeat analysis with bat clocks and rocks                  %%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%\section{Appendix}



%%begin.rcode treeRead2

# Read in trees
t2 <- read.nexus('data/Chapter3/BatST2BL.nex')

# Make names match previous names
t2$tip.label <- gsub('_', ' ', t2$tip.label)

#missing <- nSpecies$binomial[!nSpecies$binomial %in% pruneTree2$tip.label ]

## Copy binomial column. binomial will be changed to fit t2.
#nSpecies$oldBinomial <- nSpecies$binomial

## Replace with agrep where possible
#closeMatch <- sapply(missing, function(i) t2$tip.label[agrep(i, t2$tip.label, max.distance = 0.11)])

#closeMatch <- closeMatch[sapply(closeMatch, function(i) length(i) > 0)]




unneededTips2 <- t2$tip.label[!(t2$tip.label %in% nSpecies$binomial)]

# Prune tree down to only needed tips.
pruneTree2 <- drop.tip(t2, unneededTips2)


nSpecies2 <- sapply(pruneTree2$tip.label, function(x) which(nSpecies$binomial == x)) %>%
               nSpecies[., ]


################
## Fst tree   ##
################


# Which tips are not needed
fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)]

# Prune tree down to only needed tips.
fstTree2 <- drop.tip(t2, fstUnneededTips2)

# Which tips in Fst analysis are not in bats clocks tree.
fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)]


# Hacky way of placing the missing tips into the tree. Should end up with genus-level polytomies in the trimmed tree.
# Just replacing some of the unneeded tips with the ones I need.

t2$tip.label[t2$tip.label == 'Miniopterus pusillus'] <- 'Miniopterus natalensis'
t2$tip.label[t2$tip.label == 'Miniopterus schreibersi'] <- 'Miniopterus schreibersii'
t2$tip.label[t2$tip.label == 'Rousettus celebensis'] <- 'Rousettus leschenaultii'
t2$tip.label[t2$tip.label == 'Myotis oxyotus'] <- 'Myotis macropus'
t2$tip.label[t2$tip.label == 'Myotis leibii'] <- 'Myotis ciliolabrum'

#Re prune tree
# Which tips are not needed
fstUnneededTips2 <- t2$tip.label[!(t2$tip.label %in% fstFinal$binomial)]

# Prune tree down to only needed tips.
fstTree2 <- drop.tip(t2, fstUnneededTips2)

# Check we now have all the tips.
fstFinal$binomial[!(fstFinal$binomial %in% fstTree2$tip.label)]

rm(t2)




%%end.rcode
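
% A minimal, self-contained sketch of the prune-and-match step used in the chunk above, kept for reference only.
% It assumes the ape functions read.tree(text = ...) and drop.tip(), and uses match() in place of the sapply/which lookup.
% All object names (toyTree, toyData, toyPruned, toyMatched) and the species names are hypothetical and are not data from this chapter.
% The chunk is not evaluated or echoed, so it does not affect the compiled document.
%%begin.rcode pruneTreeSketch, eval = FALSE, echo = FALSE

library(ape)

# Toy four-tip tree; tip labels use underscores, as in many nexus files.
toyTree <- read.tree(text = '((Genus_a:1,Genus_b:1):1,(Genus_c:1,Genus_d:1):1);')
toyTree$tip.label <- gsub('_', ' ', toyTree$tip.label)

# Toy trait data, deliberately in a different order from the tree and
# including one species that is absent from the tree.
toyData <- data.frame(binomial = c('Genus c', 'Genus a', 'Genus z'),
                      virusSpecies = c(3, 1, 5),
                      stringsAsFactors = FALSE)

# Drop tips that have no data, then reorder the data rows to follow the tip order.
toyPruned <- drop.tip(toyTree, setdiff(toyTree$tip.label, toyData$binomial))
toyMatched <- toyData[match(toyPruned$tip.label, toyData$binomial), ]

# Species with data that are missing from the tree (these need handling by hand).
toyMissing <- setdiff(toyData$binomial, toyPruned$tip.label)

%%end.rcode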


%%begin.rcode treePlot2, fig.show = 'hide', out.width = '\\textwidth', fig.cap = 'Pruned phylogeny \\cite{jones2005bats} with dot size showing number of pathogens and colour showing family.'



## Plot tree 
#p2 <- ggtree(pruneTree2, layout = 'fan') 

#p2 %<+% nSpecies2[, 1:6] +
#  geom_point(aes(size = virusSpecies, colour = Family), subset=.(isTip)) +
#  scale_size(range = c(0.8, 3)) +
#  scale_colour_manual(values = c(pokepal('oddish')[c(1,3,5,6,9,10)],    pokepal('Carvanha')[c(1,2,4, 13, 12)])) +
#  theme_tcdl +
#  theme(plot.margin = unit(c(-1, 3, -2.5, -2), "lines")) +
#  theme(legend.position = 'right') +
#  labs(size = 'Virus Richness') +
#  theme(legend.key.size = unit(0.6, "lines"),
#              legend.text = element_text(size = 6),
#              legend.title = element_text(size = 8))



%%end.rcode



%%begin.rcode runBatClocks, eval = batclocksBoots


fitModelsBootStrap2 <- mclapply(1:nBoots, function(b) modelSelect(allFormulae, nSpecies2, pruneTree2, b, allModelMat, varList), mc.cores = nCores)

allResults2 <- do.call(rbind, fitModelsBootStrap2)

write.csv(allResults2, file = 'data/Chapter3/modelSelectSubspeciesBatClocks.csv')


## FST analysis

fstModelsBootStrap2 <- mclapply(1:nBoots, function(b) fstModelSelect(fstAllFormulae, fstFinal, fstTree2, b, fstModelMat, fstVarList), mc.cores = nCores)

fstAllResults2 <- do.call(rbind, fstModelsBootStrap2)

write.csv(fstAllResults2, file = 'data/Chapter3/fstModelSelectSubspeciesBatClocks.csv')


%%end.rcode


%%begin.rcode batClocksAnalyse

allResults2 <- read.csv('data/Chapter3/modelSelectSubspeciesBatClocks.csv', row.names = 1)

varWeights2 <- sapply(names(allResults2)[1:6], function(x) sum(allResults2$weight[allResults2[, x]])/nBoots)


sepVarWeights2 <- lapply(1:nBoots, function(b) 
                      sapply(names(allResults2)[1:6], 
                        function(x) 
                          sum(allResults2[allResults2$boot == b, 'weight'][allResults2[allResults2$boot == b, x]])
                      )
                     )      

sepVarWeights2 <- do.call(rbind, sepVarWeights2) %>%
                      data.frame(., boot = 1:nBoots) %>%
                      reshape2::melt(., value.name = 'estimate', id.vars = 'boot')

sepVarWeights2$col <- 'Other Variables'
sepVarWeights2$col[grep('NumberOf', sepVarWeights2$variable)] <- 'Population Structure'
sepVarWeights2$col[sepVarWeights2$variable == 'rand'] <- 'Null'



modelWeights2 <- allResults2 %>%
                  group_by(predictors) %>%
                  summarise(AICc = mean(AIC)) %>%
                  mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>%
                  arrange(desc(modelWeight)) %>%
                  mutate(cumulativeWeight = cumsum(modelWeight)) %>%
                  mutate(string = as.character(predictors))  # works whether read.csv returns factors or characters


#### FST


fstAllResults2 <- read.csv('data/Chapter3/fstModelSelectSubspeciesBatClocks.csv', row.names = 1)

fstSepVarWeights2 <- lapply(1:nBoots, function(b) 
                      sapply(names(fstAllResults2)[1:5], 
                        function(x) 
                          sum(fstAllResults2[fstAllResults2$boot == b, 'weight'][fstAllResults2[fstAllResults2$boot == b, x]])
                      )
                     )      

fstSepVarWeights2 <- do.call(rbind, fstSepVarWeights2) %>%
                      data.frame(., boot = 1:nBoots) %>%
                      reshape2::melt(., value.name = 'estimate', id.vars = 'boot')

fstSepVarWeights2$col <- 'Other Variables'
fstSepVarWeights2$col[fstSepVarWeights2$variable == 'Nm'] <- 'Population Structure'
fstSepVarWeights2$col[fstSepVarWeights2$variable == 'rand'] <- 'Null'


fstVarWeights2 <- sapply(names(fstAllResults2)[1:5], function(x) sum(fstAllResults2$weight[fstAllResults2[, x]])/nBoots)


fstModelWeights2 <- fstAllResults2 %>%
                  group_by(predictors) %>%
                  summarise(AICc = mean(AIC)) %>%
                  mutate(dAIC = AICc - min(AICc), modelWeight = exp(- 0.5 * dAIC) / sum(exp(- 0.5 * dAIC))) %>%
                  arrange(desc(modelWeight)) %>%
                  mutate(cumulativeWeight = cumsum(modelWeight)) 


%%end.rcode
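
% A minimal sketch of the Akaike weight calculations used in the chunk above, kept for reference only.
% It assumes the usual definition of Akaike weights, exp(-dAIC / 2) normalised across the candidate models.
% The AICc values and the names toyModels, hasMass, hasNm and toyVarWeights are hypothetical and are not results from this chapter.
% The chunk is not evaluated or echoed, so it does not affect the compiled document.
%%begin.rcode akaikeWeightsSketch, eval = FALSE, echo = FALSE

# Hypothetical AICc values for three candidate models, with logical columns
# recording which explanatory variables each model contains.
toyModels <- data.frame(predictors = c('mass', 'mass + Nm', 'Nm'),
                        AICc = c(102, 100, 106),
                        hasMass = c(TRUE, TRUE, FALSE),
                        hasNm = c(FALSE, TRUE, TRUE),
                        stringsAsFactors = FALSE)

# Model weight: exp(-dAIC / 2), normalised to sum to one across models.
toyModels$dAIC <- toyModels$AICc - min(toyModels$AICc)
toyModels$weight <- exp(-0.5 * toyModels$dAIC) / sum(exp(-0.5 * toyModels$dAIC))

# Variable weight: summed weight of every model that contains the variable.
toyVarWeights <- sapply(c('hasMass', 'hasNm'), function(x) sum(toyModels$weight[toyModels[, x]]))

%%end.rcode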


%% ------------------------------------------- %%
%% plot bat clocks rocks
%% ------------------------------------------- %%


%%begin.rcode ITPlots2

# reorder factors to get structure vars at beginning.
sepVarWeights2$variable <- factor(sepVarWeights2$variable, levels(sepVarWeights2$variable)[c(2, 6, 1, 3, 4, 5)])

ITPlot2 <- ggplot(sepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) +
  geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.99, outlier.size = 1, lwd = 0.4) +
  scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) +
  scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) +
  theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'),
    panel.grid.major.x = element_blank(),
    axis.text.y = element_text(size = 8)) +
  scale_x_discrete(labels = c('NSubspecies', 'NSubspecies*Scholar', 'Scholar', 'Mass', 'Range size', 'Random')) +
  # Set limits, breaks and labels in a single y scale; a later ylim() call would replace this scale.
  scale_y_continuous(limits = c(0, 1), breaks = c(0, 0.25, 0.5, 0.75, 1), labels = c('0.00', '0.25', '0.50', '0.75', '1.00')) +
  ylab('P(in best model)') +
  xlab('')


%%end.rcode


%%begin.rcode fstITPlots2, fig.show = extraFigs, fig.cap = "Akaike variable weights for both analyses using the phylogeny from \\textcite{jones2005bats}. The probability that each variable is in the best model (amongst the models tested) is shown, with the boxplots showing the variation amongst the models over 50 resamplings of the uniformly random ``null'' variable. The three bars of the boxplot show the median values and upper and lower quartiles of the data, vertical lines show the range and points display outliers. The red ``Random'' box is the uniformly random variable.", fig.height = 2.5, fig.scap = 'Akaike variable weights', out.width = '\\textwidth', out.extra = 'trim = 0 1cm 0 0'


# Reorder var levels to get structure at beginning.
fstSepVarWeights2$variable <- factor(fstSepVarWeights2$variable, levels(fstSepVarWeights2$variable)[c(2, 1, 3, 4, 5)])

# Draw the fst model selection plot
fstIT2 <- ggplot(fstSepVarWeights2, aes(x = variable, y = estimate, colour = col, fill = col)) +
  geom_boxplot(outlier.colour = grey(0.3), notch = FALSE, width = 0.7, outlier.size = 1, lwd = 0.4) +
  scale_colour_manual(values = pokepal('kingdra')[c(11, 1, 9)]) +
  scale_fill_manual(values = pokepal('kingdra')[c(12, 4, 8)]) +
  theme(legend.position = 'none', axis.text.x = element_text(size = 10, angle = 40, hjust = 1, colour = 'black', family = 'lato light'),
    panel.grid.major.x = element_blank(),
    axis.text.y = element_text(size = 8)) +
  scale_x_discrete(labels = c('Gene flow', 'Scholar', 'Mass', 'Range size', 'Random')) +
  # Set limits, breaks and labels in a single y scale; a later ylim() call would replace this scale.
  scale_y_continuous(limits = c(0, 1), breaks = c(0, 0.25, 0.5, 0.75, 1), labels = c('0.00', '0.25', '0.50', '0.75', '1.00')) +
  ylab('P(in best model)') +
  xlab('')


# Combine and print.
ggdraw() +
  draw_label("A)", 0.02, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + 
  draw_plot(ITPlot2, 0, 0, 0.5, 1) +
  draw_label("B)", 0.52, 0.96, size = 10, fontface = 'plain', fontfamily = 'lato light') + 
  draw_plot(fstIT2, 0.5, 0.164, 0.5, 0.855) +
  draw_label('Explanatory variable', 0.5, 0.1, fontfamily = 'lato light', size = 12)


%%end.rcode