Quantitative Biology - Genomics Publications (50)

Background: Mustard aphid is a major pest of Brassica oilseeds. No source for aphid resistance is presently available in Brassica juncea. A wild crucifer, Brassica fruticulosa, is known to be resistant to mustard aphid. Read More


The diversity revealed by large scale genomics in microbiology is calling into question long held beliefs about genome stability, evolutionary rate, even the definition of a species. MacArthur and Wilson's theory of insular biogeography provides an explanation for the diversity of macroscopic animal and plant species as a consequence of the associated hierarchical web of species interdependence. We report a large scale study of microbial diversity that reveals that the cumulative number of genes discovered increases with the number of genomes studied as a simple power law. Read More


Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via a better visualization. Read More


The day we understand the time evolution of subcellular elements at a level of detail comparable to physical systems governed by Newton's laws of motion seems far away. Even so, quantitative approaches to cellular dynamics add to our understanding of cell biology, providing data-guided frameworks that allow us to develop better predictions about and methods for control over specific biological processes and system-wide cell behavior. In this paper we describe an approach to optimizing the use of transcription factors in the context of cellular reprogramming. Read More


Changepoint detection is a central problem in time series and genomic data. For some applications, it is natural to impose constraints on the directions of changes. One example is ChIP-seq data, for which adding an up-down constraint improves peak detection accuracy, but makes the optimization problem more complicated. Read More


The complicated, evolving landscape of cancer mutations poses a formidable challenge to identify cancer genes among the large lists of mutations typically generated in NGS experiments. The ability to prioritize these variants is therefore of paramount importance. To address this issue we developed OncoScore, a text-mining tool that ranks genes according to their association with cancer, based on available biomedical literature. Read More


The epigenome, i.e. the whole of chromatin modifications, is transferred from mother to daughter cells during cell differentiation. Read More


We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is essentially deterministic without specifying initial centers, etc. Read More


We introduce an improved version of RECKONER, an error corrector for Illumina whole genome sequencing data. By modifying its workflow we reduce the computation time by up to a factor of 10. We also propose a new method for determining the $k$-mer length, the key parameter of the $k$-spectrum-based family of correctors. Read More


While we once thought of cancer as a single monolithic disease affecting a specific organ site, we now understand that there are many subtypes of cancer defined by unique patterns of gene mutations. These gene mutational data, which can be more reliably obtained than gene expression data, help to determine how the subtypes develop, evolve, and respond to therapies. Unlike the dense, continuous-valued gene expression data that most existing cancer subtype discovery algorithms use, somatic mutational data are extremely sparse and heterogeneous, because there are less than 0. Read More


Optimal subset selection is an important task that has numerous algorithms designed for it and has many application areas. STPGA contains a special genetic algorithm for searching for solutions of generic optimal subset selection problems. The algorithm is supplemented with a tabu memory property (which keeps track of previously tried solutions and their fitness for a number of iterations) and with a look-ahead property (a regression of the fitness of the solutions on their coding, used to form the ideal estimated solution). I initially developed the programs for the specific problem of selecting training populations for genomic prediction or association problems; I therefore discuss the theory behind optimal design of experiments to explain the default optimization criteria in STPGA, and illustrate the use of the programs in this endeavor. Read More


The advent of rapid and inexpensive DNA sequencing has led to an explosion of data waiting to be transformed into knowledge about genome organization and function. Gene prediction is customarily the starting point for genome analysis. This paper presents a bioinformatics study of the oil palm genome, including comparative genomics analysis, database and tools development, and mining of biological data for genes of interest. Read More


When analyzing the genome, researchers have discovered that proteins bind to DNA based on certain patterns of the DNA sequence known as "motifs". However, it is difficult to manually construct motifs due to their complexity. Recently, externally learned memory models have proven to be effective methods for reasoning over inputs and supporting sets. Read More


Boolean matrix factorisation aims to decompose a binary data matrix into an approximate Boolean product of two low rank, binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for Boolean matrix factorisation and derive a Metropolised Gibbs sampler that facilitates efficient parallel posterior inference. On real world and simulated data, our method outperforms all currently existing approaches for Boolean matrix factorisation and completion. Read More


In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. Read More
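
The two ingredients named in this abstract, the reversed complement of a word and the distribution of distances between successive occurrences of a word, can be sketched as follows. This is only an illustration: the function names are hypothetical, overlapping occurrences are counted (an assumption), and the paper's new dissimilarity measure is not reproduced because the excerpt does not define it.

```python
from collections import Counter

# Translation table for DNA complementation (assumes an ACGT alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reversed complement of a DNA word, e.g. AAC -> GTT."""
    return word.upper().translate(_COMPLEMENT)[::-1]


def inter_word_distances(seq, word):
    """Count distances between successive occurrences of `word` in `seq`.

    Overlapping occurrences are included; that choice is an assumption
    made for this sketch.
    """
    seq, word = seq.upper(), word.upper()
    positions = [
        i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)
    ]
    return Counter(b - a for a, b in zip(positions, positions[1:]))


# Compare the distance distribution of a word with that of its
# reversed complement on a toy sequence.
toy_sequence = "AACGAACAACGTTAGTT"
word = "AAC"
word_distances = inter_word_distances(toy_sequence, word)
partner_distances = inter_word_distances(toy_sequence, reverse_complement(word))
```

Any measure of dissimilarity between the two resulting histograms could then be applied; which one is appropriate is exactly what the abstract says the paper proposes.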


DNA read mapping is a ubiquitous task in bioinformatics, and many tools have been developed to solve the read mapping problem. However, there are two trends that are changing the landscape of read mapping: First, new sequencing technologies provide very long reads with high error rates (up to 15%). Second, many genetic variants in the population are known, so the reference genome is not considered as a single string over ACGT, but as a complex object containing these variants. Read More


Inverse problems in statistical physics are motivated by the challenges of 'big data' in different fields, in particular high-throughput experiments in biology. In inverse problems, the usual procedure of statistical physics needs to be reversed: Instead of calculating observables on the basis of model parameters, we seek to infer parameters of a model based on observations. In this review, we focus on the inverse Ising problem and closely related problems, namely how to infer the interactions between spins given observed spin correlations, magnetisations, or other data. Read More


Motivation: Epigenetic heterogeneity within a tumour can play an important role in tumour evolution and the emergence of resistance to treatment. It is increasingly recognised that the study of DNA methylation (DNAm) patterns along the genome -- so-called 'epialleles' -- offers greater insight into epigenetic dynamics than conventional analyses which examine DNAm marks individually. Results: We have developed a Bayesian model to infer which epialleles are present in multiple regions of the same tumour. Read More


The past decade has seen a rapid growth in omics technologies. Genome-wide association studies (GWAS) have uncovered susceptibility variants for a variety of complex traits. However, the functional significance of most discovered variants is still not fully understood. Read More


Genome replication, a key process for a cell, relies on stochastic initiation by replication origins, causing a variability of replication timing from cell to cell. While stochastic models of eukaryotic replication are widely available, the link between the key parameters and overall replication timing has not been addressed systematically. We use a combined analytical and computational approach to calculate how the positions and strengths of many origins lead to a given cell-to-cell variability in the total duration of the replication of a large region, a chromosome or the entire genome. Read More


Summary: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. Read More
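
For readers unfamiliar with the task, k-mer counting itself can be illustrated with a naive in-memory sketch. This is not KMC3's algorithm and makes no attempt to reproduce its approach or performance; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across an iterable of sequences.

    Naive illustration only: all counts are held in one in-memory Counter.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Toy input standing in for sequencing reads.
reads = ["ACGTAC", "GTACGT"]
kmer_counts = count_kmers(reads, 3)
```

A dictionary-based counter like this works for small inputs; dedicated tools exist because memory use grows with the number of distinct k-mers in the dataset.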


Knowledge about the clonal evolution of each tumor can inform driver-alteration discovery by pointing out initiating genetic events as well as events that contribute to the selective advantage of proliferative, and potentially drug-resistant tumor subclones. A necessary building block to the reconstruction of clonal evolution from tumor profiles is the estimation of the cellular composition of each tumor subclone (cellularity), and these, in turn, are based on estimates of the relative abundance (frequency) of subclone-specific genetic alterations in tumor biopsies. Estimating the frequency of genetic alterations is complicated by the high genomic instability that characterizes many tumor types. Read More


Amino-acid substitutions are implicated in a wide range of human diseases, many of which are lethal. Distinguishing such mutations from polymorphisms without significant effect on human health is a necessary step in understanding the etiology of such diseases. Computational methods can be used to select interesting mutations within a larger set, to corroborate experimental findings and to elucidate the cause of the deleterious effect. Read More


Joint quantification of genetic and epigenetic effects on gene expression is important for understanding the establishment of complex gene regulation systems in living organisms. In particular, genomic imprinting and maternal effects play important roles in the developmental process of mammals and flowering plants. However, the influence of these effects on gene expression is difficult to quantify because they act simultaneously with cis-regulatory mutations. Read More


We call change-point problem (CPP) the identification of changes in the probabilistic behavior of a sequence of observations. Solving the CPP involves detecting the number and position of such changes. In genetics, the study of how and what characteristics of an individual's genetic content might contribute to the occurrence and evolution of cancer has fundamental importance in the diagnosis and treatment of such diseases and can be formulated in the framework of change-point analysis. Read More


Plants rarely occur in isolated systems. Bacteria can inhabit either the endosphere, the region inside the plant root, or the rhizosphere, the soil region just outside the plant root. Our goal is to understand whether genomic data combined with media-dependent metabolic model information is better for training machine learning models to predict bacterial ecological niche than media-independent models or pure genome-based species trees. Read More


Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU), which targets changes in the relative within-gene expression of a transcript. The contribution of this paper is to: (a) extend to the DTU context the use of cjBitSeq, a previously introduced Bayesian model which was originally designed for identifying changes in overall expression levels, and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. Read More


Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching for and clustering protein sequence motifs is computationally intensive. Read More


Motif finding in DNA, RNA and proteins plays an important role in life science research. Recent patents concerning motif finding in the biomolecular data are recorded in the DNA Patent Database which serves as a resource for policy makers and members of the general public interested in fields like genomics, genetics and biotechnology. In this paper we present a computational approach to mining for RNA tertiary motifs in genomic sequences. Read More


Computational prediction of origin of replication (ORI) has been of great interest in bioinformatics and several methods including GC Skew, Z curve, auto-correlation etc. have been explored in the past. In this paper, we have extended the auto-correlation method to predict ORI location with much higher resolution for prokaryotes. Read More
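
Of the baseline methods named in this abstract, GC skew is the simplest to illustrate. The sketch below computes a cumulative GC skew (+1 for each G, -1 for each C) and reports where it reaches its minimum, a position commonly used as a rough origin estimate in many bacterial genomes. It is not the extended auto-correlation method the paper describes; the function names and the toy sequence are hypothetical.

```python
def cumulative_gc_skew(seq):
    """Return the running value of (#G - #C) after each prefix of `seq`.

    The list has len(seq) + 1 entries; entry 0 is the empty prefix.
    """
    total = 0
    skew = [0]
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        skew.append(total)
    return skew


def min_skew_positions(seq):
    """Return every prefix length at which the cumulative skew is minimal."""
    skew = cumulative_gc_skew(seq)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


# Toy sequence: C-rich first half, G-rich second half.
toy_genome = "CCCGGG"
candidates = min_skew_positions(toy_genome)
```

On real genomes the curve is noisy and circular chromosomes need a chosen starting coordinate, which is part of why the abstract stresses higher-resolution alternatives.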


Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step to understand molecular mechanisms of carcinogenesis. However, systematically investigating commonalities and specificities of driver gene sets among multiple cancer types is still a great challenge, but this investigation will undoubtedly benefit deciphering cancers and will be helpful for personalized therapy and precision medicine in cancer treatment. Read More


We propose a model for the formation of chromatin loops based on the diffusive sliding of a DNA-bound factor which can dimerise to form a molecular slip-link. Our slip-links mimic the behaviour of cohesin-like molecules, which, along with the CTCF protein, stabilize loops which organize the genome. By combining 3D Brownian dynamics simulations and 1D exactly solvable non-equilibrium models, we show that diffusive sliding is sufficient to account for the strong bias in favour of convergent CTCF-mediated chromosome loops observed experimentally. Read More


Transcriptional profiling on microarrays to obtain gene expressions has been used to facilitate cancer diagnosis. We propose a deep generative machine learning architecture (called DeepCancer) that learns features from unlabeled microarray data. These models have been used in conjunction with conventional classifiers that perform classification of the tissue samples as either being cancerous or non-cancerous. Read More


Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of treatment plans tailored for specific individuals, likely resulting in better clinical outcomes for patients with bacterial infections. Read More


Cancer has become one of the most widespread diseases in the world. Specifically, breast cancer is diagnosed more often than any other type of cancer. However, breast cancer patients and their individual tumors are often unique. Read More


Accurately predicting drug responses to cancer is an important problem hindering oncologists' efforts to find the most effective drugs to treat cancer, which is a core goal in precision medicine. The scientific community has focused on improving this prediction based on genomic, epigenomic, and proteomic datasets measured in human cancer cell lines. Real-world cancer cell lines contain noise, which degrades the performance of machine learning algorithms. Read More


Alloreactivity following stem cell transplantation (SCT) is difficult to predict in patients undergoing transplantation from HLA matched donors. In this study we performed whole exome sequencing of SCT donor-recipient pairs (DRP). This allowed determination of the entire library of alloreactive peptide sequences which would bind HLA class I molecules in each DRP. Read More


The notion that transcription factors bind DNA only through specific, consensus binding sites has been recently questioned. In a pioneering study by Pugh and Venters, no specific consensus motif for the positioning of the human pre-initiation complex (PIC) was identified. Here, we reveal that a nonconsensus, statistical DNA triplet code provides specificity for the positioning of the human PIC. Read More


This paper focuses on pattern matching in DNA sequences. It was inspired by a previously reported method that proposes encoding both pattern and sequence using prime numbers. Although fast, the method is limited to rather small pattern lengths, due to computing precision problems. Read More


We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\mathbf{x} = \mathbf{a}\mathbf{b}\mathbf{c} \to \mathbf{y} = \mathbf{a}\mathbf{b}\mathbf{b}\mathbf{c}$, where $\mathbf{x}$ and $\mathbf{y}$ are sequences and $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. Read More
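
The tandem duplication operation defined in this abstract, and its inverse, can be sketched on binary strings as follows. The function names are hypothetical, and the step count returned by the greedy deduplication is only an upper bound on the distance to the root, not the minimum the paper studies.

```python
def tandem_duplicate(x, start, length):
    """Apply x = a b c -> y = a b b c, where b = x[start:start + length]."""
    if length <= 0 or start < 0 or start + length > len(x):
        raise ValueError("b must be a non-empty substring of x")
    a, b, c = x[:start], x[start:start + length], x[start + length:]
    return a + b + b + c


def deduplicate_once(s):
    """Remove one copy of the first square b b found (shortest b first).

    Returns the shortened string, or None if s is square-free.
    """
    n = len(s)
    for length in range(1, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            if s[start:start + length] == s[start + length:start + 2 * length]:
                return s[:start + length] + s[start + 2 * length:]
    return None


def square_free_root(s):
    """Deduplicate greedily until square-free; return (root, steps taken)."""
    steps = 0
    while True:
        shorter = deduplicate_once(s)
        if shorter is None:
            return s, steps
        s, steps = shorter, steps + 1


# One duplication of the middle symbol of 010, then its reversal.
y = tandem_duplicate("010", 1, 1)
root, greedy_steps = square_free_root(y)
```

For the input `0000` this greedy route takes three deduplications, although two duplications (`0` to `00` to `0000`) suffice, which is why the count is only an upper bound.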


Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. Read More


Background: Cardiovascular diseases (CVD) represent a major health issue in patients with schizophrenia and bipolar disorder (BD), but the underlying mechanisms remain unclear. Psychiatric medications are associated with metabolic syndrome and CVD. Yet metabolic abnormalities have been reported in drug-naïve patients, leading to our hypothesis that the two types of disorders may share a common genetic basis. Read More


The ability of the adaptive immune system to respond to arbitrary pathogens stems from the broad diversity of immune cell surface receptors (TCRs). This diversity originates in a stochastic DNA editing process (VDJ recombination) that acts each time a new immune cell is created from a stem cell. By analyzing T cell sequence repertoires taken from the blood and thymus of mice of different ages, we quantify the significant changes in this process that occur in development from embryo to young adult. Read More


A timely immunization can be effective against certain diseases and can save thousands of lives. However, for some diseases it has been difficult, so far, to develop an efficient vaccine. Malaria, a tropical disease caused by a parasite of the genus Plasmodium, is one example. Read More


Natural genetic variation between individuals in a population leads to variations in gene expression that are informative for the inference of gene regulatory networks. Particularly, genome-wide genotype and transcriptome data from the same samples allow for causal inference between gene expression traits using the DNA variations in cis-regulatory regions as causal anchors. However, existing causal inference programs are not efficient enough for contemporary datasets, and unrealistically assume the absence of hidden confounders affecting the coexpression of causally related gene pairs. Read More


The advent of large scale, high-throughput genomic screening has introduced a wide range of tests for diagnostic purposes. Prominent among them are tests using miRNA expression levels. Genomics and proteomics now provide expression levels of hundreds of miRNAs at a time. Read More


Frameshift translation is an important phenomenon that contributes to the appearance of novel Coding DNA Sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of gene coding regions. Frameshift translations can be identified by aligning two CDS, from the same gene or from homologous genes, while accounting for their codon structure. Two main classes of algorithms have been proposed to solve the problem of aligning CDS, either by amino acid sequence alignment back-translation, or by simultaneously accounting for the nucleotide and amino acid levels. Read More


Standard models assign disease progression to discrete categories or stages based on well-characterized clinical markers. However, such a system is potentially at odds with our understanding of the underlying biology, which in highly complex systems may support a (near-)continuous evolution of disease from inception to terminal state. To learn such a continuous disease score one could infer a latent variable from dynamic "omics" data such as RNA-seq that correlates with an outcome of interest such as survival time. Read More


Corynebacterium glutamicum is a Gram-positive, anaerobic, rod-shaped soil bacterium able to grow on a diversity of carbon sources like sugars and organic acids. It is a biotechnologically relevant organism because of its highly efficient ability to biosynthesize amino acids, such as L-glutamic acid and L-lysine. Here, we reconstructed the most complete C. Read More


Determination of protein concentration is often an absolute prerequisite in preparing samples for biochemical and proteomic analyses. However, current protein assay methods are not compatible with both reducers and detergents, which are nevertheless present simultaneously in most denaturing extraction buffers used in proteomics and electrophoresis, and in particular in SDS electrophoresis. We found that inclusion of cyclodextrins in a Coomassie blue-based assay made it compatible with detergents, as cyclodextrins complex detergents in a 1:1 molecular ratio. Read More