# Ji Zhu

## Contact Details

NameJi Zhu |
||

Affiliation |
||

Location |
||

## Pubs By Year |
||

## Pub CategoriesStatistics - Methodology (11) Statistics - Theory (10) Mathematics - Statistics (10) Statistics - Machine Learning (9) Statistics - Applications (7) Physics - Data Analysis; Statistics and Probability (4) Computer Science - Learning (3) Physics - Physics and Society (3) Computer Science - Performance (2) Computer Science - Information Theory (2) Mathematics - Information Theory (2) High Energy Physics - Experiment (1) Computer Science - Data Structures and Algorithms (1) Computer Science - Networking and Internet Architecture (1) Mathematics - Probability (1) Computer Science - Cryptography and Security (1) |

## Publications Authored By Ji Zhu

The stochastic block model is widely used for detecting community structures in network data. How to test the goodness-of-fit of the model is one of the fundamental problems and has gained growing interests in recent years. In this paper, we propose a new goodness-of-fit test based on the maximum entry of the centered and re-scaled observed adjacency matrix for the stochastic block model in which the number of communities can be allowed to grow linearly with the number of nodes ignoring a logarithm factor. Read More

Many models and methods are now available for network analysis, but model selection and tuning remain challenging. Cross-validation is a useful general tool for these tasks in many settings, but is not directly applicable to networks since splitting network nodes into groups requires deleting edges and destroys some of the network structure. Here we propose a new network cross-validation strategy based on splitting edges rather than nodes, which avoids losing information and is applicable to a wide range of network problems. Read More

Although much progress has been made in classification with high-dimensional features \citep{Fan_Fan:2008, JGuo:2010, CaiSun:2014, PRXu:2014}, classification with ultrahigh-dimensional features, wherein the features much outnumber the sample size, defies most existing work. This paper introduces a novel and computationally feasible multivariate screening and classification method for ultrahigh-dimensional data. Leveraging inter-feature correlations, the proposed method enables detection of marginally weak and sparse signals and recovery of the true informative feature set, and achieves asymptotic optimal misclassification rates. Read More

The Expectation-Maximization (EM) algorithm is an iterative method that is often used for parameter estimation in incomplete data problems. Despite much theoretical endeavors devoted to understand the convergence behavior of the EM algorithm, some ubiquitous phenomena still remain unexplained. As observed in both numerical experiments and real applications, the convergence rate of the optimization error of an EM sequence is data-dependent: it is more spread-out when the sample size is smaller while more concentrated when the sample size is larger; and for a fixed sample size, one observes random fluctuations of the convergence rate of EM sequences constructed from different sets of i. Read More

Prediction problems typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects' social networks is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account in prediction models should allow us to improve their performance. Read More

The problem of estimating probabilities of network edges from the observed adjacency matrix has important applications to predicting missing links and network denoising. It has usually been addressed by estimating the graphon, a function that determines the matrix of edge probabilities, but is ill-defined without strong assumptions on the network structure. Here we propose a novel computationally efficient method based on neighborhood smoothing to estimate the expectation of the adjacency matrix directly, without making the strong structural assumptions graphon estimation requires. Read More

We consider the problem of jointly estimating a collection of graphical models for discrete data, corresponding to several categories that share some common structure. An example for such a setting is voting records of legislators on different issues, such as defense, energy, and healthcare. We develop a Markov graphical model to characterize the heterogeneous dependence structures arising from such data. Read More

Many methods have been proposed for community detection in networks, but most of them do not take into account additional information on the nodes that is often available in practice. In this paper, we propose a new joint community detection criterion that uses both the network edge information and the node features to detect community structures. One advantage our method has over existing joint detection approaches is the flexibility of learning the impact of different features which may differ across communities. Read More

Community detection is a fundamental problem in network analysis which is made more challenging by overlaps between communities which often occur in practice. Here we propose a general, flexible, and interpretable generative model for overlapping communities, which can be thought of as a generalization of the degree-corrected stochastic block model. We develop an efficient spectral algorithm for estimating the community memberships, which deals with the overlaps by employing the K-medians algorithm rather than the usual K-means for clustering in the spectral domain. Read More

Although asymptotic analyses of undirected network models based on degree sequences have started to appear in recent literature, it remains an open problem to study statistical properties of directed network models. In this paper, we provide for the first time a rigorous analysis of directed exponential random graph models using the in-degrees and out-degrees as sufficient statistics with binary as well as continuous weighted edges. We establish the uniform consistency and the asymptotic normality for the maximum likelihood estimate, when the number of parameters grows and only one realized observation of the graph is available. Read More

The primary motivation and application in this article come from brain imaging studies on cognitive impairment in elderly subjects with brain disorders. We propose a regularized Haar wavelet-based approach for the analysis of three-dimensional brain image data in the framework of functional data analysis, which automatically takes into account the spatial information among neighboring voxels. We conduct extensive simulation studies to evaluate the prediction performance of the proposed approach and its ability to identify related regions to the outcome of interest, with the underlying assumption that only few relatively small subregions are truly predictive of the outcome of interest. Read More

This paper presents an asynchronous distributed algorithm to manage multiple trees for peer-to-peer streaming in a flow level model. It is assumed that videos are cut into substreams, with or without source coding, to be distributed to all nodes. The algorithm guarantees that each node receives sufficiently many substreams within delay logarithmic in the number of peers. Read More

While graphical models for continuous data (Gaussian graphical models) and discrete data (Ising models) have been extensively studied, there is little work on graphical models linking both continuous and discrete variables (mixed data), which are common in many scientific applications. We propose a novel graphical model for mixed data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible graph structures. We develop a computationally efficient regression-based algorithm for fitting the model by focusing on the conditional log-likelihood of each variable given the rest. Read More

Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. We develop a new method which treats the observed network as a sample of the true network with different sampling rates for positive and negative examples. Read More

In this paper we study the effective degrees of freedom of a general class of reduced rank estimators for multivariate regression in the framework of Stein's unbiased risk estimation (SURE). We derive a finite-sample exact unbiased estimator that admits a closed-form expression in terms of the singular values or thresholded singular values of the least squares solution and hence readily computable. The results continue to hold in the high-dimensional scenario when both the predictor and response dimensions are allowed to be larger than the sample size. Read More

There has been a lot of work fitting Ising models to multivariate binary data in order to understand the conditional dependency relationships between the variables. However, additional covariates are frequently recorded together with the binary data, and may influence the dependence relationships. Motivated by such a dataset on genomic instability collected from tumor samples of several types, we propose a sparse covariate dependent Ising model to study both the conditional dependency within the binary data and its relationship with the additional covariates. Read More

In this paper, we show the Wilks type of results for the Bradley-Terry model. Specifically, for some simple and composite null hypotheses of interest, we show that the likelihood ratio test statistic $\Lambda$ enjoys a chi-square approximation in the sense that $(2p)^{-1/2}(-2\log \Lambda -p)\stackrel{L}{\rightarrow}N(0,1)$ as $p$ goes to infinity, where $p$ is the corresponding degrees of freedom. Simulation studies and an application to NBA data illustrate the theoretical results. Read More

Community detection is a fundamental problem in network analysis, with applications in many diverse areas. The stochastic block model is a common tool for model-based community detection, and asymptotic tools for checking consistency of community detection under the block model have been recently developed. However, the block model is limited by its assumption that all nodes within a community are stochastically equivalent, and provides a poor fit to networks with hubs or highly varying node degrees within communities, which are common in practice. Read More

This paper focuses on the stationary portion of file download in an unstructured peer-to-peer network, which typically follows for many hours after a flash crowd initiation. The model includes the case that peers can have some pieces at the time of arrival. The contribution of the paper is to identify how much help is needed from the seeds, either fixed seeds or peer seeds (which are peers remaining in the system after obtaining a complete collection) to stabilize the system. Read More

**Category:**Statistics - Applications

We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates. Read More

**Category:**Statistics - Applications

In many organisms the expression levels of each gene are controlled by the activation levels of known "Transcription Factors" (TF). A problem of considerable interest is that of estimating the "Transcription Regulation Networks" (TRN) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, greatly increasing the difficulty of the problem. Read More

Information flow analysis is a powerful technique for reasoning about the sensitive information exposed by a program during its execution. While past work has proposed information theoretic metrics (e.g. Read More

In many engineering and scientific applications, prediction variables are grouped, for example, in biological applications where assayed genes or proteins can be grouped by biological roles or biological pathways. Common statistical analysis methods such as ANOVA, factor analysis, and functional modeling with basis sets also exhibit natural variable groupings. Existing successful group variable selection methods such as Antoniadis and Fan (2001), Yuan and Lin (2006) and Zhao, Rocha and Yu (2009) have the limitation of selecting variables in an "all-in-all-out" fashion, i. Read More

Analysis of networks and in particular discovering communities within networks has been a focus of recent work in several fields, with applications ranging from citation and friendship networks to food webs and gene regulatory networks. Most of the existing community detection methods focus on partitioning the entire network into communities, with the expectation of many ties within communities and few ties between. However, many networks contain nodes that do not fit in with any of the communities, and forcing every node into a community can distort results. Read More

Typical protocols for peer-to-peer file sharing over the Internet divide files to be shared into pieces. New peers strive to obtain a complete collection of pieces from other peers and from a seed. In this paper we investigate a problem that can occur if the seeding rate is not large enough. Read More

Regression models to relate a scalar $Y$ to a functional predictor $X(t)$ are becoming increasingly common. Work in this area has concentrated on estimating a coefficient function, $\beta(t)$, with $Y$ related to $X(t)$ through $\int\beta(t)X(t) dt$. Regions where $\beta(t)\ne0$ correspond to places where there is a relationship between $X(t)$ and $Y$. Read More

In this paper we propose a new regression interpretation of the Cholesky factor of the covariance matrix, as opposed to the well known regression interpretation of the Cholesky factor of the inverse covariance, which leads to a new class of regularized covariance estimators suitable for high-dimensional problems. Regularizing the Cholesky factor of the covariance via this regression interpretation always results in a positive definite estimator. In particular, one can obtain a positive definite banded estimator of the covariance matrix at the same computational cost as the popular banded estimator proposed by Bickel and Levina (2008b), which is not guaranteed to be positive definite. Read More

Fisher-consistent loss functions play a fundamental role in the construction of successful binary margin-based classifiers. In this paper we establish the Fisher-consistency condition for multicategory classification problems. Our approach uses the margin vector concept which can be regarded as a multicategory generalization of the binary margin. Read More

In this paper, we propose a new method remMap -- REgularized Multivariate regression for identifying MAster Predictors -- for fitting multivariate response regression models under the high-dimension-low-sample-size setting. remMap is motivated by investigating the regulatory relationships among different biological molecules based on multiple types of high dimensional genomic data. Particularly, we are interested in studying the influence of DNA copy number alterations on RNA transcript levels. Read More

In this paper, we propose a computationally efficient approach -- space(Sparse PArtial Correlation Estimation)-- for selecting non-zero partial correlations under the high-dimension-low-sample-size setting. This method assumes the overall sparsity of the partial correlation matrix and employs sparse regression techniques for model fitting. We illustrate the performance of space by extensive simulation studies. Read More

The paper proposes a new covariance estimator for large covariance matrices when the variables have a natural ordering. Using the Cholesky decomposition of the inverse, we impose a banded structure on the Cholesky factor, and select the bandwidth adaptively for each row of the Cholesky factor, using a novel penalty we call nested Lasso. This structure has more flexibility than regular banding, but, unlike regular Lasso applied to the entries of the Cholesky factor, results in a sparse estimator for the inverse of the covariance matrix. Read More

The Support Vector Machine (SVM) is a popular classification paradigm in machine learning and has achieved great success in real applications. However, the standard SVM can not select variables automatically and therefore its solution typically utilizes all the input variables without discrimination. This makes it difficult to identify important predictor variables, which is often one of the primary goals in data analysis. Read More

The paper proposes a method for constructing a sparse estimator for the inverse covariance (concentration) matrix in high-dimensional settings. The estimator uses a penalized normal likelihood approach and forces sparsity by using a lasso-type penalty. We establish a rate of convergence in the Frobenius norm as both data dimension $p$ and sample size $n$ are allowed to grow, and show that the rate depends explicitly on how sparse the true concentration matrix is. Read More

We consider the generic regularized optimization problem $\hat{\mathsf{\beta}}(\lambda)=\arg \min_{\beta}L({\sf{y}},X{\sf{\beta}})+\lambda J({\sf{\beta}})$. Efron, Hastie, Johnstone and Tibshirani [Ann. Statist. Read More

Comment on [math.ST/0612817] Read More

In this paper, we compare the performance, stability and robustness of Artificial Neural Networks (ANN) and Boosted Decision Trees (BDT) using MiniBooNE Monte Carlo samples. These methods attempt to classify events given a number of identification variables. The BDT algorithm has been discussed by us in previous publications. Read More

Boosted decision trees are applied to particle identification in the MiniBooNE experiment operated at Fermi National Accelerator Laboratory (Fermilab) for neutrino oscillations. Numerous attempts are made to tune the boosted decision trees, to compare performance of various boosting algorithms, and to select input variables for optimal performance. Read More

The efficacy of particle identification is compared using artificial neutral networks and boosted decision trees. The comparison is performed in the context of the MiniBooNE, an experiment at Fermilab searching for neutrino oscillations. Based on studies of Monte Carlo samples of simulated data, particle identification with boosting algorithms has better performance than that with artificial neural networks for the MiniBooNE experiment. Read More

Discussion of ``Least angle regression'' by Efron et al. [math.ST/0406456] Read More