# Stephane Robin

## Contact Details

Name: Stephane Robin

## Pub Categories

- Statistics - Methodology (13)
- Statistics - Theory (5)
- Mathematics - Statistics (5)
- Statistics - Machine Learning (5)
- Statistics - Computation (4)
- Statistics - Applications (4)
- Computer Science - Learning (2)
- Quantitative Biology - Populations and Evolution (2)
- Quantitative Biology - Quantitative Methods (1)
- Computer Science - Artificial Intelligence (1)
- Quantitative Biology - Genomics (1)

## Publications Authored By Stephane Robin

Many application domains such as ecology or genomics have to deal with multivariate non-Gaussian observations. A typical example is the joint observation of the respective abundances of a set of species in a series of sites, aiming to understand the co-variations between these species. The Gaussian setting provides a canonical way to model such dependencies, but does not apply in general.

We consider the problem of change-point detection in multivariate time-series. The multivariate distribution of the observations is supposed to follow a graphical model, whose graph and parameters are affected by abrupt changes throughout time. We demonstrate that it is possible to perform exact Bayesian inference whenever one considers a simple class of undirected graphs called spanning trees as possible structures.

Comparative and evolutionary ecologists are interested in the distribution of quantitative traits among related species. The classical framework for these distributions consists of a random process running along the branches of a phylogenetic tree relating the species. We consider shifts in the process parameters, which reveal fast adaptation to changes of ecological niches.

Logistic regression is a natural and simple tool to understand how covariates contribute to explaining the topology of a binary network. Once the model is fitted, the practitioner is interested in the goodness-of-fit of the regression in order to check whether the covariates are sufficient to explain the whole topology of the network and, if they are not, to analyze the residual structure. To address this problem, we introduce a generic model that combines logistic regression with a network-oriented residual term.

The degrees are a classical and relevant way to study the topology of a network. They can be used to assess the goodness-of-fit for a given random graph model. In this paper we introduce goodness-of-fit tests for two classes of models.

Probabilistic graphical models offer a powerful framework to account for the dependence structure between variables, which can be represented as a graph. The dependence between variables may render inference tasks such as computing a normalizing constant, marginalization or optimization intractable. The objective of this paper is to review techniques exploiting the graph structure for exact inference borrowed from optimization and computer science.

We consider the segmentation of a set of time-series that are correlated due, e.g., to some spatial structure.

Motivation: Detecting local correlations in expression between neighboring genes along the genome has proved to be an effective strategy to identify possible causes of transcriptional deregulation in cancer. It has been successfully used to illustrate the role of mechanisms such as copy number variation (CNV) or epigenetic alterations as factors that may significantly alter expression in large chromosomal regions (gene silencing or gene activation). Results: The identification of correlated regions requires segmenting the gene expression correlation matrix into regions of homogeneously correlated genes and assessing whether the observed local correlation is significantly higher than the background chromosomal correlation.

We propose to learn the structure of an undirected graphical model by computing exact posterior probabilities for local structures in a Bayesian framework. This task would be intractable without any restriction on the considered graphs. We limit our exploration to the spanning trees and define priors on tree structures and parameters that allow fast and exact computation of the posterior probability for an edge to belong to the random tree thanks to an algebraic result called the Matrix-Tree theorem.
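To illustrate the role of the Matrix-Tree theorem mentioned in this abstract, here is a minimal sketch, not the authors' implementation. It assumes a distribution over spanning trees proportional to the product of symmetric non-negative edge weights; in the Bayesian setting the weights would combine prior and likelihood terms, but here they are arbitrary inputs. The function names `weighted_laplacian`, `spanning_tree_partition` and `edge_probabilities` are hypothetical. The normalizing constant is a cofactor of the weighted Laplacian, and the probability that an edge belongs to the random tree is its weight times the effective resistance between its endpoints, a classical consequence of the theorem.

```python
import numpy as np

def weighted_laplacian(W):
    """Laplacian L = D - W of a symmetric weight matrix with zero diagonal."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def spanning_tree_partition(W):
    """Sum over all spanning trees of the product of their edge weights.

    By the Matrix-Tree theorem this equals any cofactor of the Laplacian;
    here we delete the first row and column.
    """
    L = weighted_laplacian(W)
    return np.linalg.det(L[1:, 1:])

def edge_probabilities(W):
    """P(edge (i, j) is in the tree) when P(tree) is proportional to the
    product of its edge weights (assumes a connected weighted graph)."""
    W = np.asarray(W, dtype=float)
    L_pinv = np.linalg.pinv(weighted_laplacian(W))
    d = np.diag(L_pinv)
    # Effective resistance between i and j from the Laplacian pseudo-inverse.
    resistance = d[:, None] + d[None, :] - 2.0 * L_pinv
    return W * resistance

# Toy example: complete graph on 4 nodes with unit weights.
W = np.ones((4, 4)) - np.eye(4)
Z = spanning_tree_partition(W)
P = edge_probabilities(W)
print(round(Z), np.round(P, 3))
```

For the complete graph on four nodes with unit weights, Cayley's formula gives 16 spanning trees and, by symmetry, every edge has probability 1/2; the edge probabilities of any connected graph sum to the number of nodes minus one.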

Next-generation sequencing technologies now constitute a method of choice to measure gene expression. Data to analyze are read counts, commonly modeled using Negative Binomial distributions. A relevant issue associated with this probabilistic framework is the reliable estimation of the overdispersion parameter, reinforced by the limited number of replicates generally observable for each gene.

We consider the estimation of the total number $N$ of species based on the abundances of species that have been observed. We adopt a non parametric approach where the true abundance distribution $p$ is only supposed to be convex. From this assumption, we propose a definition for convex abundance distributions.

Dynamic extinction colonisation models (also called contact processes) are widely studied in epidemiology and in metapopulation theory. Contacts are usually assumed to be possible only through a network of connected patches. This network accounts for a spatial landscape or a social organisation of interactions.

Conditional Gaussian graphical models (cGGM) are a recent reparametrization of the multivariate linear regression model which explicitly exhibits $i)$ the partial covariances between the predictors and the responses, and $ii)$ the partial covariances between the responses themselves. Such models are particularly suitable for interpretability since partial covariances describe strong relationships between variables. In this framework, we propose a regularization scheme to enhance the learning strategy of the model by driving the selection of the relevant input features by prior structural information.

We consider the problem of multiple change-point estimation in the mean of a Gaussian AR(1) process. Taking into account the dependence structure does not allow us to use the dynamic programming algorithm, which is the only algorithm giving the optimal solution in the independent case. We propose a robust estimator of the autocorrelation parameter, which is consistent and satisfies a central limit theorem.
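For context, the dynamic programming algorithm referred to in this abstract can be sketched for the independent case only. The code below is an illustrative assumption, not the paper's method: it does not handle AR(1) dependence or estimate the autocorrelation parameter. The function name `segment_mean_dp` is hypothetical. It finds the segmentation of a series into a fixed number of segments that minimizes the residual sum of squares around segment means.

```python
import numpy as np

def segment_mean_dp(y, n_segments):
    """Optimal least-squares segmentation of y into n_segments pieces.

    Returns the sorted change-point positions (start indices of segments
    2..K) and the minimal residual sum of squares.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Cumulative sums give each segment cost in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Residual sum of squares of y[i:j] around its mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    # best[k, j]: minimal cost of splitting y[:j] into k segments.
    best = np.full((n_segments + 1, n + 1), np.inf)
    arg = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    arg[k, j] = i

    # Backtrack from the last segment to recover the change-points.
    change_points = []
    j = n
    for k in range(n_segments, 1, -1):
        j = int(arg[k, j])
        change_points.append(j)
    return sorted(change_points), float(best[n_segments, n])

y = [0, 0, 0, 5, 5, 5, 1, 1, 1]
cps, rss = segment_mean_dp(y, 3)
print(cps, rss)
```

The recursion costs on the order of K n² operations for K segments and n observations, which is what pruned variants of the algorithm aim to reduce.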

**Affiliations:** LaMME, LPMA

We present a selective review on probabilistic modeling of heterogeneity in random graphs. We focus on latent space models and more particularly on stochastic block models and their extensions that have undergone major developments in the last five years.

We are interested in the comparison of transcript boundaries from cells which originated in different environments. The goal is to assess whether this phenomenon, called differential splicing, is used to modify the transcription of the genome in response to stress factors. We address this question by comparing the change-point locations in the individual segmentation of each profile, which corresponds to the RNA-Seq data for a gene in one growth condition.

In this paper, we prove that finite state space non parametric hidden Markov models are identifiable as soon as the transition matrix of the latent Markov chain has full rank and the emission probability distributions are linearly independent. We then propose several non parametric likelihood based estimation methods, which we apply to models used in applications. We finally show on examples that the use of non parametric modeling and estimation may improve the classification performances.

In unsupervised classification, Hidden Markov Models (HMM) are used to account for a neighborhood structure between observations. The emission distributions are often supposed to belong to some parametric family. In this paper, a semiparametric modeling where the emission distributions are a mixture of parametric distributions is proposed to get a higher flexibility.

Genome annotation is an important issue in biology which has long been addressed with gene prediction methods and manual experiments requiring biological expertise. The expanding Next Generation Sequencing technologies and their enhanced precision allow a new approach to the domain: the segmentation of RNA-Seq data to determine gene boundaries. Because of its almost linear complexity, we propose to use the Pruned Dynamic Programming Algorithm, whose performance has been acknowledged for CGH arrays, for Seq-experiment outputs.

Non-parametric estimation of a convex discrete distribution may be of interest in several applications, such as the estimation of species abundance distribution in ecology. In this paper we study the least squares estimator of a discrete distribution under the constraint of convexity. We show that this estimator exists and is unique, and that it always outperforms the classical empirical estimator in terms of the $\ell_{2}$-distance.

The Stochastic Block Model (Holland et al., 1983) is a mixture model for heterogeneous network data. Unlike the usual statistical framework, new nodes give additional information about the previous ones in this model.

We consider a binary unsupervised classification problem where each observation is associated with an unobserved label that we want to retrieve. More precisely, we assume that there are two groups of observations: normal and abnormal. The 'normal' observations come from a known distribution whereas the distribution of the 'abnormal' observations is unknown.

Tiling arrays make possible a large-scale exploration of the genome thanks to probes which cover the whole genome at very high density, with up to 2,000,000 probes. Biological questions usually addressed are either the expression difference between two conditions or the detection of transcribed regions. In this work we propose to consider both questions simultaneously as an unsupervised classification problem by modeling the joint distribution of the two conditions.

As more and more network-structured data sets are available, the statistical analysis of valued graphs has become commonplace. Looking for a latent structure is one of the many strategies used to better understand the behavior of a network. Several methods already exist for the binary case.

In segmentation problems, inference on change-point position and model selection are two difficult issues due to the discrete nature of change-points. In a Bayesian context, we derive exact, non-asymptotic, explicit and tractable formulae for the posterior distribution of variables such as the number of change-points or their positions. We also derive a new selection criterion that accounts for the reliability of the results.

In the multiple testing context, a challenging problem is the estimation of the proportion $\pi_0$ of true-null hypotheses. A large number of estimators of this quantity rely on identifiability assumptions that either appear to be violated on real data, or may be at least relaxed. Under independence, we propose an estimator $\hat{\pi}_0$ based on density estimation using both histograms and cross-validation.