Quantitative Biology - Genomics Publications (50)


Quantitative Biology - Genomics Publications

High throughput immune repertoire sequencing is promising to lead to new statistical diagnostic tools for medicine and biology. Successful implementations of these methods require a correct characterization, analysis and interpretation of these datasets. We present IGoR -- a new comprehensive tool that takes B or T-cell receptors sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. Read More

Essential genes constitute the core of genes which cannot be mutated too much nor lost along the adaptive evolutionary history of a species. Natural selection is expected to be stricter on essential genes and on conserved (highly shared) genes, than on genes that are either nonessential or peculiar to a single or a few species. In order to further assess this expectation, we study here how essentiality of a gene is connected with its degree of conservation among several unrelated bacterial species, each one characterised by its own codon usage bias. Read More

The edit distance under the DCJ model can be computed in linear time for genomes with equal content or with Indels. But it becomes NP-Hard in the presence of duplications, a problem largely unsolved especially when Indels are considered. In this paper, we compare two mainstream methods to deal with duplications and associate them with Indels: one by deletion, namely DCJ-Indel-Exemplar distance; versus the other by gene matching, namely DCJ-Indel-Matching distance. Read More

Understanding the genetic basis of phenotypic plasticity is crucial for predicting and managing climate change effects on wild plants and crops. Here, we combined crop modeling and quantitative genetics to study the genetic control of oil yield plasticity for multiple abiotic stresses in sunflower. First we developed stress indicators to characterize 14 environments for three abiotic stresses (cold, drought and nitrogen) using the SUNFLO crop model and phenotypic variations of three commercial varieties. Read More

Genetic alterations initiate tumors and enable the evolution of drug resistance. The pro-cancer view of mutations is however incomplete, and several studies show that mutational load can reduce tumor fitness. Given its negative effect, genetic load should make tumors more sensitive to anticancer drugs. Read More

To cluster sequences given only their read-set representations, one may try to reconstruct each one from the corresponding read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. This approach is however problematic and we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases. Read More

A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. Read More

Life science is entering a new era of petabyte-level sequencing data. Converting such big data to biological insights represents a huge challenge for computational analysis. To this end, we developed DeepMetabolism, a biology-guided deep learning system to predict cell phenotypes from transcriptomics data. Read More

Motivation: We here present OncoScore, an open-source tool and R package that implements a novel text-mining method capable of ranking genes according to their association to cancer, based on available biomedical literature on PubMed. OncoScore can scan the biomedical literature with dynamically updatable web queries and measure the association to cancer of each gene by considering their citations. The output of the tool is a score that is measuring the strength of the association of the genes to cancer at the time of the analysis. Read More

RNA sequencing allows one to study allelic imbalance of gene expression, which may be due to genetic factors or genomic imprinting. It is desirable to model both genetic and parent-of-origin effects simultaneously to avoid confounding and to improve the power to detect either effect. In a study of experimental cross, separation of genetic and parent-of-origin effects can be achieved by studying reciprocal cross of two inbred strains. Read More

The interaction between proteins and DNA is a key driving force in a significant number of biological processes such as transcriptional regulation, repair, recombination, splicing, and DNA modification. The identification of DNA-binding sites and the specificity of target proteins in binding to these regions are two important steps in understanding the mechanisms of these biological activities. A number of high-throughput technologies have recently emerged that try to quantify the affinity between proteins and DNA motifs. Read More

Technical signs of progress during the last decades has led to a situation in which the accumulation of genome sequence data is increasingly fast and cheap. The huge amount of molecular data available nowadays can help addressing new and essential questions in Evolution. However, reconstructing evolution of DNA sequences requires models, algorithms, statistical and computational methods of ever increasing complexity. Read More

A central goal in cancer genomics is to identify the somatic alterations that underpin tumor initiation and progression. This task is challenging as the mutational profiles of cancer genomes exhibit vast heterogeneity, with many alterations observed within each individual, few shared somatically mutated genes across individuals, and important roles in cancer for both frequently and infrequently mutated genes. While commonly mutated cancer genes are readily identifiable, those that are rarely mutated across samples are difficult to distinguish from the large numbers of other infrequently mutated genes. Read More

Transforming error-prone immunosequencing datasets into antibody repertoires is a fundamental problem in immunogenomics, and a prerequisite for studies of immune responses. Although various repertoire reconstruction algorithms were released in the last three years, it remains unclear how to benchmark them and how to assess the accuracy of the reconstructed repertoires. We describe a novel IgReC algorithm for constructing antibody repertoires from high-throughput immunosequencing datasets and a new framework for assessing the quality of reconstructed repertoires. Read More

Multiplex and multi-directional control of metabolic pathways is crucial for metabolic engineering to improve product yield of fuels, chemicals, and pharmaceuticals. To achieve this goal, artificial transcriptional regulators such as CRISPR-based transcription regulators have been developed to specifically activate or repress genes of interest. Here, we found that by deploying guide RNAs to target on DNA sites at different locations of genetic cassettes, we could use just one synthetic CRISPR-based transcriptional regulator to simultaneously activate and repress gene expressions. Read More

The medical research facilitates to acquire a diverse type of data from the same individual for particular cancer. Recent studies show that utilizing such diverse data results in more accurate predictions. The major challenge faced is how to utilize such diverse data sets in an effective way. Read More

High throughput sequencing is a technology that allows for the generation of millions of reads of genomic data regarding a study of interest, and data from high throughput sequencing platforms are usually count compositions. Subsequent analysis of such data can yield information on tran- scription profiles, microbial diversity, or even relative cellular abundance in culture. Because of the high cost of acquisition, the data are usually sparse, and always contain far fewer observations than variables. Read More

The ability to measure the transcriptome of single-cells has only been feasible for a few years, but is becoming an extremely popular assay. While many types of analysis and questions can be answered using single cell RNA-sequencing, of prime interest is the ability to investigate what cell types occur in nature. Unbiased and reproducible cataloging of distinct cell types require large numbers of cells to be sampled. Read More

Aims: Ischaemic cardiomyopathy (ICM) leads to impaired contraction and ventricular dysfunction causing high rates of morbidity and mortality. Epigenomics allows the identification of epigenetic signatures in human diseases. We analyse the differential epigenetic patterns of ASB gene family in ICM patients and relate these alterations to their haemodynamic and functional status. Read More

The colonic mucus layer is a dynamic and complex structure formed by secreted and transmembrane mucins, which are high-molecular-weight and heavily glycosylated proteins. Colonic mucus consists of a loose outer layer and a dense epithelium-attached layer. The outer layer is inhabited by various representatives of the human gut microbiota (HGM). Read More

While many short read assemblers attempt to simplify the de Brujin graph by identifying and resolving variant-induced bubbles to produce a haploid mosaic result, this approach is only viable when variants are relatively rare and the bubbles are well defined in a graph context. We observed that diploid genomes with very high levels of heterozygosity fail to display well-resolved bubble structures in a typical assembly graph and thus result in highly fragmented and incomplete assemblies. Here we present an enhancement of Meraculous2 algorithm, called Meraculous-2D, which preserves haplotypes across variant sites and generates accurate assembly of highly heterozygous diploid genomes. Read More

Among several quantitative invariants found in evolutionary genomics, one of the most striking is the scaling of the overall abundance of proteins, or protein domains, sharing a specific functional annotation across genomes of given size. The size of these functional categories change, on average, as power-laws in the total number of protein-coding genes. Here, we show that such regularities are not restricted to the overall behavior of high-level functional categories, but also exist systematically at the level of single evolutionary families of protein domains. Read More

Alignment of large genomic sequences is a fundamental task in computational genome analysis. Most methods for genomic alignment use high-scoring local alignments as {\em anchor points} to reduce the search space of the alignment procedure. Speed and quality of these methods therefore depend on the underlying anchor points. Read More

Background: Mustard aphid is a major pest of Brassica oilseeds. No source for aphid resistance is presently available in Brassica juncea . A wild crucifer, Brassica fruticulosa is known to be resistant to mustard aphid. Read More

The diversity revealed by large scale genomics in microbiology is calling into question long held beliefs about genome stability, evolutionary rate, even the definition of a species. MacArthur and Wilson's theory of insular biogeography provides an explanation for the diversity of macroscopic animal and plant species as a consequence of the associated hierarchical web of species interdependence. We report a large scale study of microbial diversity that reveals that the cumulative number of genes discovered increases with the number of genomes studied as a simple power law. Read More

Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better a visualization. Read More

Long-read sequencing has enabled the de novo assembly of several mammalian genomes, but with high cost in computing. Here, we demonstrated de novo assembly of mammalian genome using long reads in an efficient and inexpensive workstation. Read More

Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections, are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. Read More

The day we understand the time evolution of subcellular elements at a level of detail comparable to physical systems governed by Newton's laws of motion seems far away. Even so, quantitative approaches to cellular dynamics add to our understanding of cell biology, providing data-guided frameworks that allow us to develop better predictions about and methods for control over specific biological processes and system-wide cell behavior. In this paper we describe an approach to optimizing the use of transcription factors in the context of cellular reprogramming. Read More

Changepoint detection is a central problem in time series and genomic data. For some applications, it is natural to impose constraints on the directions of changes. One example is ChIP-seq data, for which adding an up-down constraint improves peak detection accuracy, but makes the optimization problem more complicated. Read More

The complicated, evolving landscape of cancer mutations poses a formidable challenge to identify cancer genes among the large lists of mutations typically generated in NGS experiments. The ability to prioritize these variants is therefore of paramount importance. To address this issue we developed OncoScore, a text-mining tool that ranks genes according to their association with cancer, based on available biomedical literature. Read More

The epigenome, i.e. the whole of chromatin modifications, is transferred from mother to daughter cells during cell differentiation. Read More

We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. Read More

We introduce an improved version of RECKONER, an error corrector for Illumina whole genome sequencing data. By modifying its workflow we reduce the computation time even 10 times. We also propose a new method of determination of $k$-mer length, the key parameter of $k$-spectrum-based family of correctors. Read More

While we once thought of cancer as single monolithic diseases affecting a specific organ site, we now understand that there are many subtypes of cancer defined by unique patterns of gene mutations. These gene mutational data, which can be more reliably obtained than gene expression data, help to determine how the subtypes develop, evolve, and respond to therapies. Different from dense continuous-value gene expression data, which most existing cancer subtype discovery algorithms use, somatic mutational data are extremely sparse and heterogeneous, because there are less than 0. Read More

Optimal subset selection is an important task that has numerous algorithms designed for it and has many application areas. STPGA contains a special genetic algorithm supplemented with a tabu memory property (that keeps track of previously tried solutions and their fitness for a number of iterations), and with a regression of the fitness of the solutions on their coding that is used to form the ideal estimated solution (look ahead property) to search for solutions of generic optimal subset selection problems. I have initially developed the programs for the specific problem of selecting training populations for genomic prediction or association problems, therefore I give discussion of the theory behind optimal design of experiments to explain the default optimization criteria in STPGA, and illustrate the use of the programs in this endeavor. Read More

The advent of rapid and inexpensive DNA sequencing has led to an explosion of data waiting to be transformed into knowledge about genome organization and function. Gene prediction is customarily the starting point for genome analysis. This paper presents a bioinformatics study of the oil palm genome, including comparative genomics analysis, database and tools development, and mining of biological data for genes of interest. Read More

When analyzing the genome, researchers have discovered that proteins bind to DNA based on certain patterns of the DNA sequence known as "motifs". However, it is difficult to manually construct motifs due to their complexity. Recently, externally learned memory models have proven to be effective methods for reasoning over inputs and supporting sets. Read More

Boolean matrix factorisation aims to decompose a binary data matrix into an approximate Boolean product of two low rank, binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for Boolean matrix factorisation and derive a Metropolised Gibbs sampler that facilitates efficient parallel posterior inference. On real world and simulated data, our method outperforms all currently existing approaches for Boolean matrix factorisation and completion. Read More

In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. Read More

DNA read mapping is a ubiquitous task in bioinformatics, and many tools have been developed to solve the read mapping problem. However, there are two trends that are changing the landscape of readmapping: First, new sequencing technologies provide very long reads with high error rates (up to 15%). Second, many genetic variants in the population are known, so the reference genome is not considered as a single string over ACGT, but as a complex object containing these variants. Read More

Inverse problems in statistical physics are motivated by the challenges of `big data' in different fields, in particular high-throughput experiments in biology. In inverse problems, the usual procedure of statistical physics needs to be reversed: Instead of calculating observables on the basis of model parameters, we seek to infer parameters of a model based on observations. In this review, we focus on the inverse Ising problem and closely related problems, namely how to infer the coupling strengths between spins given observed spin correlations, magnetisations, or other data. Read More

Motivation: Epigenetic heterogeneity within a tumour can play an important role in tumour evolution and the emergence of resistance to treatment. It is increasingly recognised that the study of DNA methylation (DNAm) patterns along the genome -- so-called `epialleles' -- offers greater insight into epigenetic dynamics than conventional analyses which examine DNAm marks individually. Results: We have developed a Bayesian model to infer which epialleles are present in multiple regions of the same tumour. Read More

The past decade has seen a rapid growth in omics technologies. Genome-wide association studies (GWAS) have uncovered susceptibility variants for a variety of complex traits. However, the functional significance of most discovered variants are still not fully understood. Read More

Genome replication, a key process for a cell, relies on stochastic initiation by replication origins, causing a variability of replication timing from cell to cell. While stochastic models of eukaryotic replication are widely available, the link between the key parameters and overall replication timing has not been addressed systematically.We use a combined analytical and computational approach to calculate how positions and strength of many origins lead to a given cell-to-cell variability of total duration of the replication of a large region, a chromosome or the entire genome. Read More

Summary: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. Read More

Knowledge about the clonal evolution of each tumor can inform driver-alteration discovery by pointing out initiating genetic events as well as events that contribute to the selective advantage of proliferative, and potentially drug-resistant tumor subclones. A necessary building block to the reconstruction of clonal evolution from tumor profiles is the estimation of the cellular composition of each tumor subclone cellularity, and these, in turn, are based on estimates of the relative abundance frequency of subclone-specific genetic alterations in tumor biopsies. Estimating the frequency of genetic alterations is complicated by the high genomic instability that characterizes many tumor types. Read More

Amino-acid substitutions are implicated in a wide range of human diseases, many of which are lethal. Distinguishing such mutations from polymorphisms without significant effect on human health is a necessary step in understanding the etiology of such diseases. Computational methods can be used to select interesting mutations within a larger set, to corroborate experimental findings and to elucidate the cause of the deleterious effect. Read More

Joint quantification of genetic and epigenetic effects on gene expression is important for understanding the establishment of complex gene regulation systems in living organisms. In particular, genomic imprinting and maternal effects play important roles in the developmental process of mammals and flowering plants. However, the influence of these effects on gene expression are difficult to quantify because they act simultaneously with cis-regulatory mutations. Read More

We call change-point problem (CPP) the identification of changes in the probabilistic behavior of a sequence of observations. Solving the CPP involves detecting the number and position of such changes. In genetics the study of how and what characteristics of a individual's genetic content might contribute to the occurrence and evolution of cancer has fundamental importance in the diagnosis and treatment of such diseases and can be formulated in the framework of chage-point analysis. Read More