Phylogenetic methods

Modelling constraints on physicochemical amino acid properties

Most codon substitution models treat all nonsynonymous changes as equivalent, regardless of the physicochemical properties of the amino acids involved. “CoRa” is a parametric codon model that allows constraints on radical and conservative amino acid substitutions to be considered separately. For organisms with large populations and effective selection, it describes the evolution of protein-coding sequences significantly better than the standard model.
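To make the radical/conservative distinction concrete, here is a minimal Python sketch that labels a codon change as synonymous, conservative, or radical. It assumes, purely for illustration, that the property of interest is side-chain charge and that histidine counts as positive; the `CHARGE_CLASS` grouping and the `classify_change` function are hypothetical names and are not taken from the CoRa code, which should be consulted for the classification actually used.

```python
# Standard genetic code, with codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumption: amino acids are grouped by side-chain charge only.
# Other properties (polarity, volume, ...) would give different groupings.
CHARGE_CLASS = {aa: "neutral" for aa in "ACFGILMNPQSTVWY"}
CHARGE_CLASS.update({aa: "positive" for aa in "KRH"})
CHARGE_CLASS.update({aa: "negative" for aa in "DE"})


def classify_change(codon_from, codon_to):
    """Label a codon change under the illustrative charge grouping."""
    aa_from = CODON_TABLE[codon_from]
    aa_to = CODON_TABLE[codon_to]
    if "*" in (aa_from, aa_to):
        return "stop"  # changes to or from stop codons are set aside
    if aa_from == aa_to:
        return "synonymous"
    if CHARGE_CLASS[aa_from] == CHARGE_CLASS[aa_to]:
        return "conservative"  # property class is preserved
    return "radical"  # property class changes


if __name__ == "__main__":
    for pair in [("AAA", "AAG"), ("GAT", "GAA"), ("AAA", "GAA")]:
        print(pair, classify_change(*pair))
```

A standard codon model would apply one nonsynonymous rate parameter to both of the last two categories; treating them separately is what allows the two kinds of constraint to differ.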

| Repository | Description | Citation |
| --- | --- | --- |
| https://github.com/claudia-c-weber/CoRa | Instructions for running the model | Weber CC and Whelan S (2019). Physicochemical Amino Acid Properties Better Describe Substitution Rates in Large Populations. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msz003 |

Ancestral protein sequence reconstruction

I contributed to benchmarking ProtASR 1.0 (Arenas et al., 2017), which addresses a common issue with methods based on empirical substitution models: they tend to reconstruct proteins that are more stable than those found in nature.

Unsupervised learning for genome assembly

Identifying sample contamination in long-read sequencing data with VAEs

Samples collected for sequencing often contain genetic material from non-target organisms, and gaps in reference databases can make it difficult to identify the source of a sequence. However, two-dimensional representations of sequence composition learned by a Variational Autoencoder (VAE) can help separate sequences from different sources, even when taxonomic labels are unavailable.

The example below shows HiFi reads from a buff-tip moth sample that was infected with Wolbachia strains:

| Repository | Description | Citation |
| --- | --- | --- |
| https://github.com/CobiontID/read_VAE | Code for training the VAE and visualising the embeddings | Weber CC (2024). Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition. G3 Genes, Genomes, Genetics. https://doi.org/10.1093/g3journal/jkae187 |

K-mer statistics

Along with the VAE workflow, I provide a set of standalone k-mer counting tools suitable for high-quality long-read data:

| Tool | Description | Application |
| --- | --- | --- |
| kmer-counter | Fast k-mer counter for large read sets | Get tetranucleotide counts |
| unique-kmers | Count distinct k-mers in sequences | Calculate k-mer diversity |
| fastk-medians | Calculate median number of k-mer occurrences across the whole set for each sequence | Approximate k-mer coverage |

Further details are provided at https://cobiontid.github.io/
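The three statistics listed above can be illustrated with a small pure-Python sketch. It is not the implementation used by the tools, which are built for large read sets; the function names are hypothetical, k-mers containing characters other than A, C, G, T are skipped, and reverse complements are not merged, all of which are simplifying assumptions.

```python
from collections import Counter
from statistics import median

VALID_BASES = set("ACGT")


def kmers(seq, k):
    """Yield every k-mer of seq that contains only A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            yield kmer


def kmer_counts(seq, k=4):
    """Count occurrences of each k-mer (tetranucleotides by default)."""
    return Counter(kmers(seq, k))


def distinct_kmers(seq, k):
    """Number of different k-mers in seq, a simple diversity measure."""
    return len(set(kmers(seq, k)))


def median_occurrence(seq, k, global_counts):
    """Median, over the k-mers of seq, of their counts in the whole read set."""
    values = [global_counts[kmer] for kmer in kmers(seq, k)]
    return median(values) if values else 0


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "TTTTTTTT", "ACGTNACGT"]  # toy data
    global_counts = Counter()
    for read in reads:
        global_counts.update(kmers(read, 4))
    for read in reads:
        print(read, distinct_kmers(read, 4), median_occurrence(read, 4, global_counts))
```

The per-read median is less sensitive than the mean to a few highly repeated k-mers, which is one reason it can serve as a rough coverage proxy.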

Extracting genomes from mixed samples with chromatin network embeddings

Examining chromatin interactions can be helpful for separating distinct genomes in contaminated assemblies. Though this is often done manually, automated clustering of highly interconnected scaffolds based on graph embeddings provides a convenient approach.
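As a rough illustration of grouping highly interconnected scaffolds, the Python sketch below uses a much simpler stand-in than graph embeddings: it keeps only scaffold pairs with at least `min_links` contacts and reports the connected components of the resulting graph. The `cluster_scaffolds` function, the contact counts, and the threshold are all hypothetical, and real contact data would normally need normalising (for example by scaffold length) first.

```python
def cluster_scaffolds(contacts, min_links):
    """Group scaffolds joined by at least min_links contacts.

    contacts maps (scaffold_a, scaffold_b) pairs to a contact count.
    Returns a list of sets, one per connected component.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), links in contacts.items():
        root_a, root_b = find(a), find(b)  # registers both scaffolds
        if links >= min_links and root_a != root_b:
            parent[root_a] = root_b  # merge strongly linked scaffolds

    clusters = {}
    for scaffold in list(parent):
        clusters.setdefault(find(scaffold), set()).add(scaffold)
    return sorted(clusters.values(), key=sorted)


if __name__ == "__main__":
    # Toy contact counts: two tightly linked groups joined by a weak edge.
    contacts = {
        ("s1", "s2"): 50,
        ("s2", "s3"): 40,
        ("s3", "w1"): 2,
        ("w1", "w2"): 30,
    }
    print(cluster_scaffolds(contacts, min_links=10))
```

A hard threshold is brittle when contact density varies between genomes, which is part of the motivation for clustering on learned embeddings instead.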