Quantitative Biology - Genomics Publications (50)


We call the change-point problem (CPP) the identification of changes in the probabilistic behavior of a sequence of observations. Solving the CPP involves detecting the number and position of such changes. In genetics, the study of how and what characteristics of an individual's genetic content might contribute to the occurrence and evolution of cancer has fundamental importance in the diagnosis and treatment of such diseases and can be formulated in the framework of change-point analysis. Read More

Plants rarely occur in isolated systems. Bacteria can inhabit either the endosphere, the region inside the plant root, or the rhizosphere, the soil region just outside the plant root. Our goal is to understand whether using genomic data and media-dependent metabolic model information is better for training machine learning models to predict bacterial ecological niche than media-independent models or pure genome-based species trees. Read More

Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU), which targets changes in the relative within-gene expression of a transcript. The contribution of this paper is to: (a) extend to the DTU context the use of cjBitSeq, a previously introduced Bayesian model originally designed for identifying changes in overall expression levels, and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. Read More

Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching for and clustering protein sequence motifs is computationally intensive. Read More

Motif finding in DNA, RNA and proteins plays an important role in life science research. Recent patents concerning motif finding in the biomolecular data are recorded in the DNA Patent Database which serves as a resource for policy makers and members of the general public interested in fields like genomics, genetics and biotechnology. In this paper we present a computational approach to mining for RNA tertiary motifs in genomic sequences. Read More

Computational prediction of origin of replication (ORI) has been of great interest in bioinformatics and several methods including GC Skew, Z curve, auto-correlation etc. have been explored in the past. In this paper, we have extended the auto-correlation method to predict ORI location with much higher resolution for prokaryotes. Read More
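
For orientation, the GC skew baseline mentioned above can be sketched in a few lines. This is only an illustration of the classic heuristic, under the common assumption that the minimum of the cumulative GC skew marks a candidate origin in many bacterial genomes; it is not the extended auto-correlation method of the paper, and the function names are hypothetical.

```python
def cumulative_gc_skew(seq):
    """Return the running sum of +1 for each G and -1 for each C along seq."""
    total = 0
    skew = []
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        skew.append(total)
    return skew


def candidate_origin(seq):
    """Return the 0-based index of the first minimum of the cumulative skew.

    Assumption: the skew minimum is used as the candidate ORI position.
    Raises ValueError on an empty sequence.
    """
    skew = cumulative_gc_skew(seq)
    return min(range(len(skew)), key=skew.__getitem__)
```

On a real genome the skew is usually computed over sliding windows rather than single bases, and the circular chromosome has to be handled explicitly; both are omitted here.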

Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step to understand molecular mechanisms of carcinogenesis. However, systematically investigating commonalities and specificities of driver gene sets among multiple cancer types is still a great challenge, but this investigation will undoubtedly benefit deciphering cancers and will be helpful for personalized therapy and precision medicine in cancer treatment. Read More

We propose a model for the formation of chromatin loops based on the diffusive sliding of a DNA-bound factor which can dimerise to form a molecular slip-link. Our slip-links mimic the behaviour of cohesin-like molecules, which, along with the CTCF protein, stabilize loops which organize the genome. By combining 3D Brownian dynamics simulations and 1D exactly solvable non-equilibrium models, we show that diffusive sliding is sufficient to account for the strong bias in favour of convergent CTCF-mediated chromosome loops observed experimentally. Read More

Transcriptional profiling on microarrays to obtain gene expressions has been used to facilitate cancer diagnosis. We propose a deep generative machine learning architecture (called DeepCancer) that learns features from unlabeled microarray data. These models have been used in conjunction with conventional classifiers that perform classification of the tissue samples as either being cancerous or non-cancerous. Read More

Antimicrobial resistance is an important public health concern that has implications in the practice of medicine worldwide. Accurately predicting resistance phenotypes from genome sequences shows great promise in promoting better use of antimicrobial agents, by determining which antibiotics are likely to be effective in specific clinical cases. In healthcare, this would allow for the design of treatment plans tailored for specific individuals, likely resulting in better clinical outcomes for patients with bacterial infections. Read More

Cancer has become one of the most widespread diseases in the world. Specifically, breast cancer is diagnosed more often than any other type of cancer. However, breast cancer patients and their individual tumors are often unique. Read More

Accurately predicting drug responses to cancer is an important problem hindering oncologists' efforts to find the most effective drugs to treat cancer, which is a core goal in precision medicine. The scientific community has focused on improving this prediction based on genomic, epigenomic, and proteomic datasets measured in human cancer cell lines. Real-world cancer cell lines contain noise, which degrades the performance of machine learning algorithms. Read More

Alloreactivity following stem cell transplantation (SCT) is difficult to predict in patients undergoing transplantation from HLA matched donors. In this study we performed whole exome sequencing of SCT donor-recipient pairs (DRP). This allowed determination of the entire library of alloreactive peptide sequences which would bind HLA class I molecules in each DRP. Read More

The notion that transcription factors bind DNA only through specific, consensus binding sites has been recently questioned. In a pioneering study by Pugh and Venters, no specific consensus motif for the positioning of the human pre-initiation complex (PIC) was identified. Here, we reveal that a nonconsensus, statistical DNA triplet code provides specificity for the positioning of the human PIC. Read More

This paper focuses on pattern matching in DNA sequences. It was inspired by a previously reported method that proposes encoding both pattern and sequence using prime numbers. Although fast, the method is limited to rather small pattern lengths, due to computing precision problems. Read More

We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. Read More
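
The duplication operation and the notion of a square-free root described above can be made concrete with a small sketch. The helper names and the index convention (the substring b is `x[i:j]`) are assumptions made for illustration; the code only demonstrates the operation itself and does not compute the duplication distance studied in the paper.

```python
def tandem_duplicate(x, i, j):
    """Apply x = a b c -> a b b c, where b = x[i:j] (assumed 0 <= i < j <= len(x))."""
    if not 0 <= i < j <= len(x):
        raise ValueError("need 0 <= i < j <= len(x)")
    return x[:i] + x[i:j] + x[i:j] + x[j:]


def is_square_free(s):
    """Return True if s contains no substring of the form bb with b non-empty."""
    n = len(s)
    for length in range(1, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            if s[start:start + length] == s[start + length:start + 2 * length]:
                return False
    return True


# The six binary square-free sequences listed in the abstract.
BINARY_ROOTS = {"0", "1", "01", "10", "010", "101"}
```

For example, `tandem_duplicate("010", 0, 2)` duplicates the block `01` and yields `01010`, which is no longer square-free.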

Several modern genomic technologies, such as DNA-Methylation arrays, measure spatially registered probes that number in the hundreds of thousands across multiple chromosomes. The measured probes are by themselves less interesting scientifically; instead scientists seek to discover biologically interpretable genomic regions comprised of contiguous groups of probes which may act as biomarkers of disease or serve as a dimension-reducing pre-processing step for downstream analyses. In this paper, we introduce an unsupervised feature learning technique which maps technological units (probes) to biological units (genomic regions) that are common across all subjects. Read More

Background: Cardiovascular diseases (CVD) represent a major health issue in patients with schizophrenia and bipolar disorder (BD), but the underlying mechanisms remain unclear. Psychiatric medications are associated with metabolic syndrome and CVD. Yet metabolic abnormalities have been reported in drug-naïve patients, leading to our hypothesis that the two types of disorders may share a common genetic basis. Read More

The ability of the adaptive immune system to respond to arbitrary pathogens stems from the broad diversity of immune cell surface receptors (TCRs). This diversity originates in a stochastic DNA editing process (VDJ recombination) that acts each time a new immune cell is created from a stem cell. By analyzing T cell sequence repertoires taken from the blood and thymus of mice of different ages, we quantify the significant changes in this process that occur in development from embryo to young adult. Read More

A timely immunization can be effective against certain diseases and can save thousands of lives. However, for some diseases it has been difficult, so far, to develop an efficient vaccine. Malaria, a tropical disease caused by a parasite of the genus Plasmodium, is one example. Read More

Natural genetic variation between individuals in a population leads to variations in gene expression that are informative for the inference of gene regulatory networks. Particularly, genome-wide genotype and transcriptome data from the same samples allow for causal inference between gene expression traits using the DNA variations in cis-regulatory regions as causal anchors. However, existing causal inference programs are not efficient enough for contemporary datasets, and unrealistically assume the absence of hidden confounders affecting the coexpression of causally related gene pairs. Read More

The advent of large scale, high-throughput genomic screening has introduced a wide range of tests for diagnostic purposes. Prominent among them are tests using miRNA expression levels. Genomics and proteomics now provide expression levels of hundreds of miRNAs at a time. Read More

Frameshift translation is an important phenomenon that contributes to the appearance of novel Coding DNA Sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of genes' coding regions. Frameshift translations can be identified by aligning two CDS, from the same gene or from homologous genes, while accounting for their codon structure. Two main classes of algorithms have been proposed to solve the problem of aligning CDS, either by amino acid sequence alignment back-translation, or by simultaneously accounting for the nucleotide and amino acid levels. Read More

Standard models assign disease progression to discrete categories or stages based on well-characterized clinical markers. However, such a system is potentially at odds with our understanding of the underlying biology, which in highly complex systems may support a (near-)continuous evolution of disease from inception to terminal state. To learn such a continuous disease score one could infer a latent variable from dynamic "omics" data such as RNA-seq that correlates with an outcome of interest such as survival time. Read More

Corynebacterium glutamicum is a Gram-positive, anaerobic, rod-shaped soil bacterium able to grow on a diversity of carbon sources like sugars and organic acids. It is a biotechnologically relevant organism because of its highly efficient ability to biosynthesize amino acids, such as L-glutamic acid and L-lysine. Here, we reconstructed the most complete C. Read More

Determination of protein concentration is often an absolute pre-requisite in preparing samples for biochemical and proteomic analyses. However, current protein assay methods are not compatible with both reducers and detergents, which are nevertheless present simultaneously in most denaturing extraction buffers used in proteomics and electrophoresis, and in particular in SDS electrophoresis. We found that inclusion of cyclodextrins in a Coomassie blue-based assay made it compatible with detergents, as cyclodextrins complex detergents in a 1:1 molecular ratio. Read More

Understanding the evolutionary relationship among species is of fundamental importance to the biological sciences. The location of the root in any phylogenetic tree is critical as it gives an order to evolutionary events. None of the popular models of nucleotide evolution used in likelihood or Bayesian methods are able to infer the location of the root without exogenous information. Read More

Pan-genome analysis is a standard procedure to decipher genome heterogeneity and diversification of bacterial species. Species evolution is traced by defining and comparing the core (conserved), accessory (dispensable) and unique (strain-specific) gene pool with other strains of interest. Here, we present pan-genome analysis of the genus Serratia, comprising a dataset of 100 genomes. Read More

The RNA-sequencing (RNA-seq) is becoming increasingly popular for quantifying gene expression levels. Since the RNA-seq measurements are relative in nature, between-sample normalization of counts is an essential step in differential expression (DE) analysis. The normalization of existing DE detection algorithms is ad hoc and performed once for all prior to DE detection, which may be suboptimal since ideally normalization should be based on non-DE genes only and thus coupled with DE detection. Read More

For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Read More

MicroRNAs (miRNAs) are non-coding RNAs with approximately 22 nucleotides (nt) that are derived from precursor molecules. These precursor molecules or pre-miRNAs often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpins can be found in genomes. Read More

Accurate computational identification of promoters remains a challenge as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained the same CNN architecture on promoters of four very distant organisms: human, plant (Arabidopsis), and two bacteria (Escherichia coli and Mycoplasma pneumoniae). Read More

DNA can be used as a high-density medium for data storage and transmission; however, an important DNA process -- replication -- is noisy. This paper presents an error analysis for DNA as a data transmission medium, analyzing how deletion errors increase in a collection of replicated DNA strands over time. Read More

Motivation: New long read sequencers promise to transform sequencing and genome assembly by producing reads tens of kilobases long. However, their high error rate significantly complicates assembly and requires expensive correction steps to lay out the reads using standard assembly engines. Results: We present an original and efficient spectral algorithm to lay out the uncorrected nanopore reads, and its seamless integration into a straightforward overlap/layout/consensus (OLC) assembly scheme. Read More

The aim of this study is to investigate the relation between the phylogeny of a large set of complete chloroplast genomes and the evolution of gene content inside these sequences. Core and pan genomes have been computed on de novo annotation of these 845 genomes, the former being used for producing a well-supported phylogenetic tree while the latter provides information regarding the evolution of gene content over time. It also details the specificity of some branches of the tree, when specificity is obtained on accessory genes. Read More

Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. Read More

Transposable elements, or transposons, are DNA sequences that can jump from site to site in the genome during the life cycle of a cell, usually encoding the very enzymes which perform their excision. However, some transposons are parasitic, relying on the enzymes produced by the regular transposons. In this case, we show that a stochastic model, which takes into account the small copy numbers of the transposons in a cell, predicts noise-induced predator-prey oscillations with a characteristic time scale that is much longer than the cell replication time, indicating that the state of the predator-prey oscillator is stored in the genome and transmitted to successive generations. Read More

Genome assembly from the high-throughput sequencing (HTS) reads is a fundamental yet challenging computational problem. An intrinsic challenge is the uncertainty caused by the widespread repetitive elements. Here we get around the uncertainty using the notion of uniquely mapped (UM) reads, which motivated the design of a new assembler BAUM. Read More

In this paper we consider the problem of learning the genetic-interaction-map, i.e., the topology of a directed acyclic graph (DAG) of genetic interactions from noisy double knockout (DK) data. Read More

Increasing accessibility of data to researchers makes it possible to conduct massive amounts of statistical testing. Rather than follow a carefully crafted set of scientific hypotheses with statistical analysis, researchers can now test many possible relations and let P-values or other statistical summaries generate hypotheses for them. The field of genetic epidemiology is an illustrative case of this paradigm shift. Read More

Increased availability of data and accessibility of computational tools in recent years have created unprecedented opportunities for scientific research driven by statistical analysis. Inherent limitations of statistics impose constraints on the reliability of conclusions drawn from data, but misuse of statistical methods is a growing concern. Significance, hypothesis testing and the accompanying P-values are being scrutinized as representing the most widely applied and abused practices. Read More

Extracting associations that recur across multiple studies while controlling the false discovery rate is a fundamental challenge. Here, we consider an extension of Efron's single-study two-groups model to allow joint analysis of multiple studies. We assume that given a set of p-values obtained from each study, the researcher is interested in associations that recur in at least $k>1$ studies. Read More

RNA-Seq is a widely-used method for studying the behavior of genes under different biological conditions. An essential step in an RNA-Seq study is normalization, in which raw data are adjusted to account for factors that prevent direct comparison of expression measures. Errors in normalization can have a significant impact on downstream analysis, such as inflated false positives in differential expression analysis. Read More
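
To make "adjusting raw data so that samples become comparable" concrete, here is a minimal sketch of the simplest library-size normalization, counts per million (CPM). It is offered only as an illustration of what between-sample scaling does, on the assumption that sequencing depth is the only factor being corrected; it is not the normalization method of any paper listed here, and the function name is hypothetical.

```python
def counts_per_million(counts):
    """Scale one sample's raw read counts so that they sum to one million.

    counts: list of non-negative raw counts, one per gene, for one sample.
    Raises ValueError if the sample has no reads at all.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has zero total counts")
    return [c * 1_000_000 / total for c in counts]


# Two samples with the same composition but different sequencing depth
# give the same CPM values after scaling.
shallow = counts_per_million([10, 30, 60])
deep = counts_per_million([100, 300, 600])
```

CPM corrects only for total depth. When a few highly expressed genes dominate one sample, it can distort the remaining genes, which is one way normalization errors propagate into differential expression calls.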

The number of completely sequenced chloroplast genomes increases rapidly every day, making it possible to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred from their core genes is not necessarily well supported, due to the possible occurrence of problematic genes (i.e. Read More

In this paper, we explain why the chaotic model (CM) of Bahi and Michel (2008) accurately simulates gene mutations over time. First, we demonstrate that the CM model is a truly chaotic one, as defined by Devaney. Then, we show that mutations occurring in gene mutations have the same chaotic dynamic, thus making the use of chaotic models relevant for genome evolution. Read More

The majority of mammalian genomic transcripts do not directly code for proteins and it is currently believed that most of these are not under evolutionary constraint. However, given the abundance of non-coding RNA (ncRNA) and its strong affinity for inter-RNA binding, these molecules have the potential to regulate proteins in a highly distributed way, similar to artificial neural networks. We explore this analogy by devising a simple architecture for a biochemical network that can function as an associative memory. Read More

We perform differential expression analysis of high-throughput sequencing count data under a Bayesian nonparametric framework, removing sophisticated ad-hoc pre-processing steps commonly required in existing algorithms. We propose to use the gamma (beta) negative binomial process, which takes into account different sequencing depths using sample-specific negative binomial probability (dispersion) parameters, to detect differentially expressed genes by comparing the posterior distributions of gene-specific negative binomial dispersion (probability) parameters. These model parameters are inferred by borrowing statistical strength across both the genes and samples. Read More

Characterizing genes with semantic information is an important process regarding the description of gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of all of their genes are still unknown. Since experimentally studying the functions of those genes, one by one, would be unfeasible, new computational methods for gene function inference are needed. Read More