Quantitative Biology - Genomics Publications (50)


Multiplex and multi-directional control of metabolic pathways is crucial for metabolic engineering to improve product yield of fuels, chemicals, and pharmaceuticals. To achieve this goal, artificial transcriptional regulators such as CRISPR-based transcription regulators have been developed to specifically activate or repress genes of interest. Here, we found that by deploying guide RNAs to target DNA sites at different locations of genetic cassettes, we could use just one synthetic CRISPR-based transcriptional regulator to simultaneously activate and repress gene expression. Read More

Medical research makes it possible to acquire diverse types of data from the same individual for a particular cancer. Recent studies show that utilizing such diverse data results in more accurate predictions. The major challenge is how to use such diverse data sets effectively. Read More

High throughput sequencing is a technology that allows for the generation of millions of reads of genomic data regarding a study of interest, and data from high throughput sequencing platforms are usually count compositions. Subsequent analysis of such data can yield information on transcription profiles, microbial diversity, or even relative cellular abundance in culture. Because of the high cost of acquisition, the data are usually sparse, and always contain far fewer observations than variables. Read More

The ability to measure the transcriptome of single cells has only been feasible for a few years, but is becoming an extremely popular assay. While many types of analysis and questions can be answered using single cell RNA-sequencing, of prime interest is the ability to investigate what cell types occur in nature. Unbiased and reproducible cataloging of distinct cell types requires large numbers of cells to be sampled. Read More

Aims: Ischaemic cardiomyopathy (ICM) leads to impaired contraction and ventricular dysfunction causing high rates of morbidity and mortality. Epigenomics allows the identification of epigenetic signatures in human diseases. We analyse the differential epigenetic patterns of ASB gene family in ICM patients and relate these alterations to their haemodynamic and functional status. Read More

The colonic mucus layer is a dynamic and complex structure formed by secreted and transmembrane mucins, which are high-molecular-weight and heavily glycosylated proteins. Colonic mucus consists of a loose outer layer and a dense epithelium-attached layer. The outer layer is inhabited by various representatives of the human gut microbiota (HGM). Read More

While many short read assemblers attempt to simplify the de Bruijn graph by identifying and resolving variant-induced bubbles to produce a haploid mosaic result, this approach is only viable when variants are relatively rare and the bubbles are well defined in a graph context. We observed that diploid genomes with very high levels of heterozygosity fail to display well-resolved bubble structures in a typical assembly graph and thus result in highly fragmented and incomplete assemblies. Here we present an enhancement of the Meraculous2 algorithm, called Meraculous-2D, which preserves haplotypes across variant sites and generates accurate assemblies of highly heterozygous diploid genomes. Read More

Among several quantitative invariants found in evolutionary genomics, one of the most striking is the scaling of the overall abundance of proteins, or protein domains, sharing a specific functional annotation across genomes of given size. The sizes of these functional categories change, on average, as power laws in the total number of protein-coding genes. Here, we show that such regularities are not restricted to the overall behavior of high-level functional categories, but also exist systematically at the level of single evolutionary families of protein domains. Read More

Alignment of large genomic sequences is a fundamental task in computational genome analysis. Most methods for genomic alignment use high-scoring local alignments as anchor points to reduce the search space of the alignment procedure. Speed and quality of these methods therefore depend on the underlying anchor points. Read More

Background: Mustard aphid is a major pest of Brassica oilseeds. No source of aphid resistance is presently available in Brassica juncea. A wild crucifer, Brassica fruticulosa, is known to be resistant to mustard aphid. Read More

The diversity revealed by large scale genomics in microbiology is calling into question long held beliefs about genome stability, evolutionary rate, even the definition of a species. MacArthur and Wilson's theory of insular biogeography provides an explanation for the diversity of macroscopic animal and plant species as a consequence of the associated hierarchical web of species interdependence. We report a large scale study of microbial diversity that reveals that the cumulative number of genes discovered increases with the number of genomes studied as a simple power law. Read More

Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Read More

Long-read sequencing has enabled the de novo assembly of several mammalian genomes, but at a high computational cost. Here, we demonstrate de novo assembly of a mammalian genome from long reads on an efficient and inexpensive workstation. Read More

The day we understand the time evolution of subcellular elements at a level of detail comparable to physical systems governed by Newton's laws of motion seems far away. Even so, quantitative approaches to cellular dynamics add to our understanding of cell biology, providing data-guided frameworks that allow us to develop better predictions about and methods for control over specific biological processes and system-wide cell behavior. In this paper we describe an approach to optimizing the use of transcription factors in the context of cellular reprogramming. Read More

Changepoint detection is a central problem in time series and genomic data. For some applications, it is natural to impose constraints on the directions of changes. One example is ChIP-seq data, for which adding an up-down constraint improves peak detection accuracy, but makes the optimization problem more complicated. Read More

The complicated, evolving landscape of cancer mutations poses a formidable challenge to identify cancer genes among the large lists of mutations typically generated in NGS experiments. The ability to prioritize these variants is therefore of paramount importance. To address this issue we developed OncoScore, a text-mining tool that ranks genes according to their association with cancer, based on available biomedical literature. Read More

The epigenome, i.e. the whole of chromatin modifications, is transferred from mother to daughter cells during cell differentiation. Read More

We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. Read More

We introduce an improved version of RECKONER, an error corrector for Illumina whole genome sequencing data. By modifying its workflow we reduce the computation time by up to a factor of 10. We also propose a new method for determining the $k$-mer length, the key parameter of the $k$-spectrum-based family of correctors. Read More

While we once thought of cancer as a single monolithic disease affecting a specific organ site, we now understand that there are many subtypes of cancer defined by unique patterns of gene mutations. These gene mutational data, which can be more reliably obtained than gene expression data, help to determine how the subtypes develop, evolve, and respond to therapies. Unlike the dense continuous-value gene expression data that most existing cancer subtype discovery algorithms use, somatic mutational data are extremely sparse and heterogeneous, because there are less than 0. Read More

Optimal subset selection is an important task that has numerous algorithms designed for it and has many application areas. STPGA contains a special genetic algorithm supplemented with a tabu memory property (that keeps track of previously tried solutions and their fitness for a number of iterations), and with a regression of the fitness of the solutions on their coding that is used to form the ideal estimated solution (look ahead property) to search for solutions of generic optimal subset selection problems. I have initially developed the programs for the specific problem of selecting training populations for genomic prediction or association problems, therefore I give discussion of the theory behind optimal design of experiments to explain the default optimization criteria in STPGA, and illustrate the use of the programs in this endeavor. Read More

The advent of rapid and inexpensive DNA sequencing has led to an explosion of data waiting to be transformed into knowledge about genome organization and function. Gene prediction is customarily the starting point for genome analysis. This paper presents a bioinformatics study of the oil palm genome, including comparative genomics analysis, database and tools development, and mining of biological data for genes of interest. Read More

When analyzing the genome, researchers have discovered that proteins bind to DNA based on certain patterns of the DNA sequence known as "motifs". However, it is difficult to manually construct motifs due to their complexity. Recently, externally learned memory models have proven to be effective methods for reasoning over inputs and supporting sets. Read More

Boolean matrix factorisation aims to decompose a binary data matrix into an approximate Boolean product of two low rank, binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for Boolean matrix factorisation and derive a Metropolised Gibbs sampler that facilitates efficient parallel posterior inference. On real world and simulated data, our method outperforms all currently existing approaches for Boolean matrix factorisation and completion. Read More

In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. Read More
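The two ingredients named in this abstract, a word's reversed complement and its inter-word distance distribution, can be illustrated with a minimal Python sketch. The function names are hypothetical, overlapping occurrences are counted (an assumption), and the dissimilarity measure proposed by the authors is not reproduced here.

```python
from collections import Counter

# Watson-Crick complement table for the DNA alphabet.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reversed complement of a DNA word, e.g. AC -> GT."""
    return word.upper().translate(COMPLEMENT)[::-1]

def inter_word_distances(seq, word):
    """Distances between consecutive (possibly overlapping) occurrences of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]

def distance_distribution(seq, word):
    """Empirical frequency of each inter-word distance (hypothetical helper)."""
    distances = inter_word_distances(seq, word)
    total = len(distances)
    return {d: c / total for d, c in Counter(distances).items()} if total else {}

seq = "ACGTACGTAC"
word = "AC"
partner = reverse_complement(word)
dist_word = distance_distribution(seq, word)
dist_partner = distance_distribution(seq, partner)
```

Any dissimilarity measure between `dist_word` and `dist_partner` could then be applied to rank symmetric word pairs.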

DNA read mapping is a ubiquitous task in bioinformatics, and many tools have been developed to solve the read mapping problem. However, there are two trends that are changing the landscape of read mapping: First, new sequencing technologies provide very long reads with high error rates (up to 15%). Second, many genetic variants in the population are known, so the reference genome is not considered as a single string over ACGT, but as a complex object containing these variants. Read More

Inverse problems in statistical physics are motivated by the challenges of `big data' in different fields, in particular high-throughput experiments in biology. In inverse problems, the usual procedure of statistical physics needs to be reversed: Instead of calculating observables on the basis of model parameters, we seek to infer parameters of a model based on observations. In this review, we focus on the inverse Ising problem and closely related problems, namely how to infer the interactions between spins given observed spin correlations, magnetisations, or other data. Read More

Motivation: Epigenetic heterogeneity within a tumour can play an important role in tumour evolution and the emergence of resistance to treatment. It is increasingly recognised that the study of DNA methylation (DNAm) patterns along the genome -- so-called `epialleles' -- offers greater insight into epigenetic dynamics than conventional analyses which examine DNAm marks individually. Results: We have developed a Bayesian model to infer which epialleles are present in multiple regions of the same tumour. Read More

The past decade has seen rapid growth in omics technologies. Genome-wide association studies (GWAS) have uncovered susceptibility variants for a variety of complex traits. However, the functional significance of most discovered variants is still not fully understood. Read More

Genome replication, a key process for a cell, relies on stochastic initiation by replication origins, causing a variability of replication timing from cell to cell. While stochastic models of eukaryotic replication are widely available, the link between the key parameters and overall replication timing has not been addressed systematically. We use a combined analytical and computational approach to calculate how positions and strength of many origins lead to a given cell-to-cell variability of total duration of the replication of a large region, a chromosome or the entire genome. Read More

Summary: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. Read More
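For readers unfamiliar with the task, k-mer counting itself can be sketched in a few lines of Python. This is a naive in-memory illustration only, not the KMC3 algorithm; the function name `count_kmers` is hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across the given sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

counts = count_kmers(["ACGTACGT", "CGTA"], 3)
```

A hash-map counter like this holds every distinct k-mer in memory, which is why dedicated tools are used for real sequencing datasets.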

Knowledge about the clonal evolution of each tumor can inform driver-alteration discovery by pointing out initiating genetic events as well as events that contribute to the selective advantage of proliferative, and potentially drug-resistant, tumor subclones. A necessary building block in the reconstruction of clonal evolution from tumor profiles is the estimation of the cellular composition of each tumor subclone (cellularity), and these estimates, in turn, are based on estimates of the relative abundance (frequency) of subclone-specific genetic alterations in tumor biopsies. Estimating the frequency of genetic alterations is complicated by the high genomic instability that characterizes many tumor types. Read More

Amino-acid substitutions are implicated in a wide range of human diseases, many of which are lethal. Distinguishing such mutations from polymorphisms without significant effect on human health is a necessary step in understanding the etiology of such diseases. Computational methods can be used to select interesting mutations within a larger set, to corroborate experimental findings and to elucidate the cause of the deleterious effect. Read More

Joint quantification of genetic and epigenetic effects on gene expression is important for understanding the establishment of complex gene regulation systems in living organisms. In particular, genomic imprinting and maternal effects play important roles in the developmental process of mammals and flowering plants. However, the influence of these effects on gene expression is difficult to quantify because they act simultaneously with cis-regulatory mutations. Read More

We call the change-point problem (CPP) the identification of changes in the probabilistic behavior of a sequence of observations. Solving the CPP involves detecting the number and position of such changes. In genetics, the study of how and which characteristics of an individual's genetic content might contribute to the occurrence and evolution of cancer has fundamental importance in the diagnosis and treatment of such diseases and can be formulated in the framework of change-point analysis. Read More

Plants rarely occur in isolated systems. Bacteria can inhabit either the endosphere, the region inside the plant root, or the rhizosphere, the soil region just outside the plant root. Our goal is to understand whether genomic data combined with media-dependent metabolic model information is better for training machine learning models to predict bacterial ecological niche than media-independent models or pure genome-based species trees. Read More

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. Read More

Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching for and clustering protein sequence motifs is computationally intensive. Read More

Motif finding in DNA, RNA and proteins plays an important role in life science research. Recent patents concerning motif finding in the biomolecular data are recorded in the DNA Patent Database which serves as a resource for policy makers and members of the general public interested in fields like genomics, genetics and biotechnology. In this paper we present a computational approach to mining for RNA tertiary motifs in genomic sequences. Read More

Computational prediction of origin of replication (ORI) has been of great interest in bioinformatics and several methods including GC Skew, Z curve, auto-correlation etc. have been explored in the past. In this paper, we have extended the auto-correlation method to predict ORI location with much higher resolution for prokaryotes. Read More
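Of the methods listed in this abstract, GC skew is the simplest to illustrate. The Python sketch below computes the cumulative GC skew (G counted as +1, C as -1) and reports where it is lowest, a position that in many bacterial genomes tends to lie near the origin of replication. This is an assumption-laden illustration of the classical heuristic, not the extended auto-correlation method the paper describes; the function names are hypothetical.

```python
def cumulative_gc_skew(seq):
    """Running total of (#G - #C) along the sequence, starting at 0."""
    skew = [0]
    for base in seq.upper():
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0  # A, T and ambiguous bases leave the skew unchanged
        skew.append(skew[-1] + step)
    return skew

def min_skew_positions(seq):
    """All prefix lengths at which the cumulative skew reaches its minimum."""
    skew = cumulative_gc_skew(seq)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]

candidates = min_skew_positions("CCCGGG")
```

On a real genome the sequence would be read from a FASTA file, and the candidate positions would be treated as rough estimates to refine with other evidence.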

Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step to understand molecular mechanisms of carcinogenesis. However, systematically investigating commonalities and specificities of driver gene sets among multiple cancer types is still a great challenge, but this investigation will undoubtedly benefit deciphering cancers and will be helpful for personalized therapy and precision medicine in cancer treatment. Read More

We propose a model for the formation of chromatin loops based on the diffusive sliding of a DNA-bound factor which can dimerise to form a molecular slip-link. Our slip-links mimic the behaviour of cohesin-like molecules, which, along with the CTCF protein, stabilize loops which organize the genome. By combining 3D Brownian dynamics simulations and 1D exactly solvable non-equilibrium models, we show that diffusive sliding is sufficient to account for the strong bias in favour of convergent CTCF-mediated chromosome loops observed experimentally. Read More

Transcriptional profiling on microarrays to obtain gene expression has been used to facilitate cancer diagnosis. We propose a deep generative machine learning architecture (called DeepCancer) that learns features from unlabeled microarray data. These models have been used in conjunction with conventional classifiers that classify tissue samples as either cancerous or non-cancerous. Read More

Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of treatment plans tailored for specific individuals, likely resulting in better clinical outcomes for patients with bacterial infections. Read More

Cancer has become one of the most widespread diseases in the world. Specifically, breast cancer is diagnosed more often than any other type of cancer. However, breast cancer patients and their individual tumors are often unique. Read More

Accurately predicting drug responses to cancer is an important problem hindering oncologists' efforts to find the most effective drugs to treat cancer, which is a core goal in precision medicine. The scientific community has focused on improving this prediction based on genomic, epigenomic, and proteomic datasets measured in human cancer cell lines. Real-world cancer cell lines contain noise, which degrades the performance of machine learning algorithms. Read More

Alloreactivity following stem cell transplantation (SCT) is difficult to predict in patients undergoing transplantation from HLA matched donors. In this study we performed whole exome sequencing of SCT donor-recipient pairs (DRP). This allowed determination of the entire library of alloreactive peptide sequences which would bind HLA class I molecules in each DRP. Read More

The notion that transcription factors bind DNA only through specific, consensus binding sites has recently been questioned. In a pioneering study by Pugh and Venters, no specific consensus motif for the positioning of the human pre-initiation complex (PIC) was identified. Here, we reveal that a nonconsensus, statistical DNA triplet code provides specificity for the positioning of the human PIC. Read More

This paper focuses on pattern matching in the DNA sequence. It was inspired by a previously reported method that proposes encoding both pattern and sequence using prime numbers. Although fast, the method is limited to rather small pattern lengths, due to computing precision problem. Read More

We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\mathbf{x} = \mathbf{a}\mathbf{b}\mathbf{c} \to \mathbf{y} = \mathbf{a}\mathbf{b}\mathbf{b}\mathbf{c}$, where $\mathbf{x}$ and $\mathbf{y}$ are sequences and $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. Read More
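The tandem duplication operation and the notion of a root can be made concrete with a short Python sketch. The function names are hypothetical, and the code only applies duplications and collapses repeats until the sequence is square-free; it does not compute the minimum number of operations, which is the quantity the paper studies.

```python
def tandem_duplicate(x, start, length):
    """Duplicate the block x[start:start+length] in place: abc -> abbc."""
    if length <= 0 or start < 0 or start + length > len(x):
        raise ValueError("block out of range")
    block = x[start:start + length]
    return x[:start + length] + block + x[start + length:]

def root(x):
    """Collapse tandem repeats (bb -> b) until the sequence is square-free."""
    changed = True
    while changed:
        changed = False
        n = len(x)
        for length in range(1, n // 2 + 1):
            for i in range(n - 2 * length + 1):
                if x[i:i + length] == x[i + length:i + 2 * length]:
                    # Remove the second copy of the repeated block.
                    x = x[:i + length] + x[i + 2 * length:]
                    changed = True
                    break
            if changed:
                break
    return x

y = tandem_duplicate("010", 1, 1)  # duplicates the middle "1"
r = root(y)
```

For binary input the result always falls in the square-free set listed in the abstract, since each collapse shortens the sequence and the loop stops only when no repeated block remains.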