Phylogenetic methods

Modelling constraints on physicochemical amino acid properties

Most codon substitution models treat all nonsynonymous changes as equivalent, regardless of the physicochemical properties of the amino acids involved. “CoRa” is a parametric codon model that allows constraints on radical and conservative amino acid substitutions to be considered separately. For protein-coding sequences from organisms with large populations and effective selection, it describes sequence evolution significantly better than the standard model.

Repository	Description	Citation
https://github.com/claudia-c-weber/CoRa	Instructions for running the model	Weber CC and Whelan S (2019). Physicochemical Amino Acid Properties Better Describe Substitution Rates in Large Populations. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msz003
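
To make the radical/conservative distinction concrete, here is a minimal sketch that labels an amino acid replacement according to whether it changes a physicochemical class. The charge-based grouping and the function name `classify_substitution` are assumptions chosen for illustration; they are not necessarily the property classification or the code used by CoRa.

```python
# Hypothetical charge-based property classes (an assumption for illustration;
# not necessarily the classification used by CoRa).
CHARGE_CLASS = {
    **dict.fromkeys("RKH", "positive"),
    **dict.fromkeys("DE", "negative"),
    **dict.fromkeys("ACFGILMNPQSTVWY", "neutral"),
}


def classify_substitution(aa_from, aa_to):
    """Label a change between two one-letter amino acid codes."""
    if aa_from == aa_to:
        return "synonymous"    # no amino acid change
    if CHARGE_CLASS[aa_from] == CHARGE_CLASS[aa_to]:
        return "conservative"  # property class is preserved
    return "radical"           # property class changes


if __name__ == "__main__":
    for pair in [("K", "R"), ("K", "E"), ("L", "L")]:
        print(pair, classify_substitution(*pair))
```

A model that separates the two classes can then assign each one its own rate parameter, rather than a single rate for all nonsynonymous changes.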

Ancestral protein sequence reconstruction

I contributed to benchmarking ProtASR 1.0 (Arenas et al., 2017). The tool addresses a common issue with reconstruction methods based on empirical substitution models, which tend to reconstruct proteins that are more stable than those found in nature.

Unsupervised learning for genome assembly

Identifying sample contamination in long-read sequencing data with VAEs

Samples collected for sequencing often contain genetic material from non-target organisms, and gaps in reference databases can make it difficult to identify the source of a sequence. However, two-dimensional representations of sequence composition learned by a Variational Autoencoder (VAE) can help separate sequences from different sources, even when taxonomic labels are unavailable.

The example below shows HiFi reads from a buff-tip moth sample, which was infected with Wolbachia strains:

Repository	Description	Citation
https://github.com/CobiontID/read_VAE	Code for training the VAE and visualising the embeddings	Weber CC (2024). Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition. G3 Genes, Genomes, Genetics, https://doi.org/10.1093/g3journal/jkae187

K-mer statistics

Along with the VAE workflow, I provide a set of standalone k-mer counting tools suitable for high-quality long-read data:

Tool	Description	Application
kmer-counter	Fast k-mer counter for large read sets	Get tetranucleotide counts
unique-kmers	Count distinct k-mers in sequences	Calculate k-mer diversity
fastk-medians	For each sequence, calculate the median number of occurrences of its k-mers across the whole read set	Approximate k-mer coverage

Further details are provided under https://cobiontid.github.io/
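
The three statistics in the table can be illustrated with a short pure-Python sketch. This is not the implementation of the tools listed above: the function names are hypothetical, k-mers are counted on the given strand only, and k = 4 is used throughout just to keep the example small (coverage estimates would normally use a larger k).

```python
from collections import Counter
from statistics import median

VALID_BASES = set("ACGT")


def iter_kmers(seq, k=4):
    """Yield every k-mer in seq that contains only A, C, G or T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            yield kmer


def count_kmers(seq, k=4):
    """Counts of each k-mer in one sequence (tetranucleotides when k=4)."""
    return Counter(iter_kmers(seq, k))


def distinct_kmers(seq, k=4):
    """Number of different k-mers seen in one sequence."""
    return len(count_kmers(seq, k))


def median_occurrence(seq, global_counts, k=4):
    """Median, over the k-mers of seq, of their counts in the whole read set."""
    values = [global_counts[kmer] for kmer in iter_kmers(seq, k)]
    return median(values) if values else 0


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTTTTT"]
    global_counts = Counter()
    for read in reads:
        global_counts.update(iter_kmers(read))
    for read in reads:
        print(read, distinct_kmers(read), median_occurrence(read, global_counts))
```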

Extracting genomes from mixed samples with chromatin network embeddings

Examining chromatin interactions can be helpful for separating distinct genomes in contaminated assemblies. Though this is often done manually, automated clustering of highly interconnected scaffolds based on graph embeddings provides a convenient approach.

Code for learning network embeddings and visualising Hi-C maps: https://github.com/CobiontID/HiC_network
Preprint: Kudoa genomes from contaminated hosts reveal extensive gene order conservation and rapid sequence evolution https://www.biorxiv.org/content/10.1101/2024.11.01.621499v1
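
As a rough illustration of why contact data can separate genomes, the sketch below groups scaffolds that are linked by many Hi-C contacts. It deliberately swaps in a much simpler technique than the graph embeddings used in the repository: contacts below a threshold are dropped and the connected components of the remaining graph are reported. The input format, the function name `cluster_scaffolds` and the threshold value are all assumptions made for this example.

```python
from collections import defaultdict


def cluster_scaffolds(contacts, min_contacts=10):
    """Group scaffolds joined by at least min_contacts Hi-C contacts.

    contacts maps (scaffold_a, scaffold_b) pairs to contact counts
    (a hypothetical input format chosen for this sketch).
    """
    graph = defaultdict(set)
    for (a, b), n in contacts.items():
        # Register both scaffolds so unlinked ones become singleton clusters.
        graph[a]
        graph[b]
        if a != b and n >= min_contacts:
            graph[a].add(b)
            graph[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search to collect one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    example = {("s1", "s2"): 50, ("s2", "s3"): 40,
               ("s3", "s4"): 2, ("s4", "s5"): 30}
    print(cluster_scaffolds(example))
```

Thresholded components are sensitive to the cutoff and to a few spurious links between genomes, which is one reason an embedding followed by clustering can be a more robust choice on real data.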

Software