H. Zha - Peking University

H. Zha
Are you H. Zha?

Claim your profile, edit publications, add additional information:

Contact Details

H. Zha
Peking University

Pubs By Year

External Links

Pub Categories

Computer Science - Learning (25)
Statistics - Machine Learning (11)
Computer Science - Computer Vision and Pattern Recognition (5)
Physics - Physics and Society (4)
Mathematics - Numerical Analysis (3)
Computer Science - Information Retrieval (2)
Mathematics - Optimization and Control (2)
Computer Science - Artificial Intelligence (2)
Physics - Accelerator Physics (2)
Mathematics - Differential Geometry (1)
Mathematics - Statistics (1)
Statistics - Theory (1)
Computer Science - Computational Engineering; Finance; and Science (1)
Computer Science - Numerical Analysis (1)
Mathematics - Probability (1)
Computer Science - Data Structures and Algorithms (1)
High Energy Physics - Experiment (1)

Publications Authored By H. Zha

Event sequence, asynchronously generated with random timestamp, is ubiquitous among applications. The precise and arbitrary timestamp can carry important clues about the underlying dynamics, and has lent the event data fundamentally different from the time-series whereby series is indexed with fixed and equal time interval. One expressive mathematical tool for modeling event is point process. Read More

Point processes are becoming very popular in modeling asynchronous sequential data due to their sound mathematical foundation and strength in modeling a variety of real-world phenomena. Currently, they are often characterized via intensity function which limits model's expressiveness due to unrealistic assumptions on its parametric form used in practice. Furthermore, they are learned via maximum likelihood approach which is prone to failure in multi-modal distributions of sequences. Read More

Extreme multi-label learning or classification has been a practical and important problem since the boom of big data. The main challenge lies in the exponential label space which involves 2L possible label sets when the label dimension L is very large e.g. Read More

A variety of real-world processes (over networks) produce sequences of data whose complex temporal dynamics need to be studied. More especially, the event timestamps can carry important information about the underlying network dynamics, which otherwise are not available from the time-series evenly sampled from continuous signals. Moreover, in most complex processes, event sequences and evenly-sampled times series data can interact with each other, which renders joint modeling of those two sources of data necessary. Read More

We propose the first multistage intervention framework that tackles fake news in social networks by combining reinforcement learning with a point process network activity model. The spread of fake news and mitigation events within the network is modeled by a multivariate Hawkes process with additional exogenous control terms. By choosing a feature representation of states, defining mitigation actions and constructing reward functions to measure the effectiveness of mitigation activities, we map the problem of fake news mitigation into the reinforcement learning framework. Read More

Two standing-wave single-cell choke-mode damped structures with different choke dimensions which worked at 11.424 GHz were designed, manufactured and tuned by accelerator group in Tsinghua University. High power test was carried out to study choke-mode structure's properties in high gradient and related breakdown phenomenon. Read More

Poisson factorization is a probabilistic model of users and items for recommendation systems, where the so-called implicit consumer data is modeled by a factorized Poisson distribution. There are many variants of Poisson factorization methods who show state-of-the-art performance on real-world recommendation tasks. However, most of them do not explicitly take into account the temporal behavior and the recurrent activities of users which is essential to recommend the right item to the right user at the right time. Read More

Many real-world applications require robust algorithms to learn point process models based on a type of incomplete data --- the so-called short doubly-censored (SDC) event sequences. In this paper, we study this critical problem of quantitative asynchronous event sequence analysis under the framework of Hawkes processes by leveraging the general idea of data synthesis. In particular, given SDC event sequences observed in a variety of time intervals, we propose a sampling-stitching data synthesis method --- sampling predecessor and successor for each SDC event sequence from potential candidates and stitching them together to synthesize long training sequences. Read More

Groups of Small and Medium Enterprises (SME) back each other and form guarantee network to obtain loan from banks. The risk over the networked enterprises may cause significant contagious damage. To dissolve such risks, we propose a hybrid feature representation, which is feeded into a gradient boosting model for credit risk assessment of guarantee network. Read More

We propose an effective method to solve the event sequence clustering problems based on a novel Dirichlet mixture model of a special but significant type of point processes --- Hawkes process. In this model, each event sequence belonging to a cluster is generated via the same Hawkes process with specific parameters, and different clusters correspond to different Hawkes processes. The prior distribution of the Hawkes processes is controlled via a Dirichlet distribution. Read More

A typical viral marketing model identifies influential users in a social network to maximize a single product adoption assuming unlimited user attention, campaign budgets, and time. In reality, multiple products need campaigns, users have limited attention, convincing users incurs costs, and advertisers have limited budgets and expect the adoptions to be maximized soon. Facing these user, monetary, and timing constraints, we formulate the problem as a submodular maximization task in a continuous-time diffusion model under the intersection of a matroid and multiple knapsack constraints. Read More

Authors: The CLIC, CLICdp collaborations, :, M. J. Boland, U. Felzmann, P. J. Giansiracusa, T. G. Lucas, R. P. Rassool, C. Balazs, T. K. Charles, K. Afanaciev, I. Emeliantchik, A. Ignatenko, V. Makarenko, N. Shumeiko, A. Patapenka, I. Zhuk, A. C. Abusleme Hoffman, M. A. Diaz Gutierrez, M. Vogel Gonzalez, Y. Chi, X. He, G. Pei, S. Pei, G. Shu, X. Wang, J. Zhang, F. Zhao, Z. Zhou, H. Chen, Y. Gao, W. Huang, Y. P. Kuang, B. Li, Y. Li, J. Shao, J. Shi, C. Tang, X. Wu, L. Ma, Y. Han, W. Fang, Q. Gu, D. Huang, X. Huang, J. Tan, Z. Wang, Z. Zhao, T. Laštovička, U. Uggerhoj, T. N. Wistisen, A. Aabloo, K. Eimre, K. Kuppart, S. Vigonski, V. Zadin, M. Aicheler, E. Baibuz, E. Brücken, F. Djurabekova, P. Eerola, F. Garcia, E. Haeggström, K. Huitu, V. Jansson, V. Karimaki, I. Kassamakov, A. Kyritsakis, S. Lehti, A. Meriläinen, R. Montonen, T. Niinikoski, K. Nordlund, K. Österberg, M. Parekh, N. A. Törnqvist, J. Väinölä, M. Veske, W. Farabolini, A. Mollard, O. Napoly, F. Peauger, J. Plouin, P. Bambade, I. Chaikovska, R. Chehab, M. Davier, W. Kaabi, E. Kou, F. LeDiberder, R. Pöschl, D. Zerwas, B. Aimard, G. Balik, J. -P. Baud, J. -J. Blaising, L. Brunetti, M. Chefdeville, C. Drancourt, N. Geoffroy, J. Jacquemier, A. Jeremie, Y. Karyotakis, J. M. Nappa, S. Vilalte, G. Vouters, A. Bernard, I. Peric, M. Gabriel, F. Simon, M. Szalay, N. van der Kolk, T. Alexopoulos, E. N. Gazis, N. Gazis, E. Ikarios, V. Kostopoulos, S. Kourkoulis, P. D. Gupta, P. Shrivastava, H. Arfaei, M. K. Dayyani, H. Ghasem, S. S. Hajari, H. Shaker, Y. Ashkenazy, H. Abramowicz, Y. Benhammou, O. Borysov, S. Kananov, A. Levy, I. Levy, O. Rosenblat, G. D'Auria, S. Di Mitri, T. Abe, A. Aryshev, T. Higo, Y. Makida, S. Matsumoto, T. Shidara, T. Takatomi, Y. Takubo, T. Tauchi, N. Toge, K. Ueno, J. Urakawa, A. Yamamoto, M. Yamanaka, R. Raboanary, R. Hart, H. van der Graaf, G. Eigen, J. Zalieckas, E. Adli, R. Lillestøl, L. Malina, J. Pfingstner, K. N. Sjobak, W. Ahmed, M. I. Asghar, H. Hoorani, S. Bugiel, R. Dasgupta, M. Firlej, T. A. Fiutowski, M. Idzik, M. Kopec, M. Kuczynska, J. Moron, K. P. Swientek, W. Daniluk, B. Krupa, M. Kucharczyk, T. Lesiak, A. Moszczynski, B. Pawlik, P. Sopicki, T. Wojtoń, L. Zawiejski, J. Kalinowski, M. Krawczyk, A. F. Żarnecki, E. Firu, V. Ghenescu, A. T. Neagu, T. Preda, I-S. Zgura, A. Aloev, N. Azaryan, J. Budagov, M. Chizhov, M. Filippova, V. Glagolev, A. Gongadze, S. Grigoryan, D. Gudkov, V. Karjavine, M. Lyablin, A. Olyunin, A. Samochkine, A. Sapronov, G. Shirkov, V. Soldatov, A. Solodko, E. Solodko, G. Trubnikov, I. Tyapkin, V. Uzhinsky, A. Vorozhtov, E. Levichev, N. Mezentsev, P. Piminov, D. Shatilov, P. Vobly, K. Zolotarev, I. Bozovic Jelisavcic, G. Kacarevic, S. Lukic, G. Milutinovic-Dumbelovic, M. Pandurovic, U. Iriso, F. Perez, M. Pont, J. Trenado, M. Aguilar-Benitez, J. Calero, L. Garcia-Tabares, D. Gavela, J. L. Gutierrez, D. Lopez, F. Toral, D. Moya, A. Ruiz Jimeno, I. Vila, T. Argyropoulos, C. Blanch Gutierrez, M. Boronat, D. Esperante, A. Faus-Golfe, J. Fuster, N. Fuster Martinez, N. Galindo Muñoz, I. García, J. Giner Navarro, E. Ros, M. Vos, R. Brenner, T. Ekelöf, M. Jacewicz, J. Ögren, M. Olvegård, R. Ruber, V. Ziemann, D. Aguglia, N. Alipour Tehrani, A. Andersson, F. Andrianala, F. Antoniou, K. Artoos, S. Atieh, R. Ballabriga Sune, M. J. Barnes, J. Barranco Garcia, H. Bartosik, C. Belver-Aguilar, A. Benot Morell, D. R. Bett, S. Bettoni, G. Blanchot, O. Blanco Garcia, X. A. Bonnin, O. Brunner, H. Burkhardt, S. Calatroni, M. Campbell, N. Catalan Lasheras, M. Cerqueira Bastos, A. Cherif, E. Chevallay, B. Constance, R. Corsini, B. Cure, S. Curt, B. Dalena, D. Dannheim, G. De Michele, L. De Oliveira, N. Deelen, J. P. Delahaye, T. Dobers, S. Doebert, M. Draper, F. Duarte Ramos, A. Dubrovskiy, K. Elsener, J. Esberg, M. Esposito, V. Fedosseev, P. Ferracin, A. Fiergolski, K. Foraz, A. Fowler, F. Friebel, J-F. Fuchs, C. A. Fuentes Rojas, A. Gaddi, L. Garcia Fajardo, H. Garcia Morales, C. Garion, L. Gatignon, J-C. Gayde, H. Gerwig, A. N. Goldblatt, C. Grefe, A. Grudiev, F. G. Guillot-Vignot, M. L. Gutt-Mostowy, M. Hauschild, C. Hessler, J. K. Holma, E. Holzer, M. Hourican, D. Hynds, Y. Inntjore Levinsen, B. Jeanneret, E. Jensen, M. Jonker, M. Kastriotou, J. M. K. Kemppinen, R. B. Kieffer, W. Klempt, O. Kononenko, A. Korsback, E. Koukovini Platia, J. W. Kovermann, C-I. Kozsar, I. Kremastiotis, S. Kulis, A. Latina, F. Leaux, P. Lebrun, T. Lefevre, L. Linssen, X. Llopart Cudie, A. A. Maier, H. Mainaud Durand, E. Manosperti, C. Marelli, E. Marin Lacoma, R. Martin, S. Mazzoni, G. Mcmonagle, O. Mete, L. M. Mether, M. Modena, R. M. Münker, T. Muranaka, E. Nebot Del Busto, N. Nikiforou, D. Nisbet, J-M. Nonglaton, F. X. Nuiry, A. Nürnberg, M. Olvegard, J. Osborne, S. Papadopoulou, Y. Papaphilippou, A. Passarelli, M. Patecki, L. Pazdera, D. Pellegrini, K. Pepitone, E. Perez Codina, A. Perez Fontenla, T. H. B. Persson, M. Petrič, F. Pitters, S. Pittet, F. Plassard, R. Rajamak, S. Redford, Y. Renier, S. F. Rey, G. Riddone, L. Rinolfi, E. Rodriguez Castro, P. Roloff, C. Rossi, V. Rude, G. Rumolo, A. Sailer, E. Santin, D. Schlatter, H. Schmickler, D. Schulte, N. Shipman, E. Sicking, R. Simoniello, P. K. Skowronski, P. Sobrino Mompean, L. Soby, M. P. Sosin, S. Sroka, S. Stapnes, G. Sterbini, R. Ström, I. Syratchev, F. Tecker, P. A. Thonet, L. Timeo, H. Timko, R. Tomas Garcia, P. Valerio, A. L. Vamvakas, A. Vivoli, M. A. Weber, R. Wegner, M. Wendt, B. Woolley, W. Wuensch, J. Uythoven, H. Zha, P. Zisopoulos, M. Benoit, M. Vicente Barreto Pinto, M. Bopp, H. H. Braun, M. Csatari Divall, M. Dehler, T. Garvey, J. Y. Raguin, L. Rivkin, R. Zennaro, A. Aksoy, Z. Nergiz, E. Pilicer, I. Tapan, O. Yavas, V. Baturin, R. Kholodov, S. Lebedynskyi, V. Miroshnichenko, S. Mordyk, I. Profatilova, V. Storizhko, N. Watson, A. Winter, J. Goldstein, S. Green, J. S. Marshall, M. A. Thomson, B. Xu, W. A. Gillespie, R. Pan, M. A Tyrk, D. Protopopescu, A. Robson, R. Apsimon, I. Bailey, G. Burt, D. Constable, A. Dexter, S. Karimian, C. Lingwood, M. D. Buckland, G. Casse, J. Vossebeld, A. Bosco, P. Karataev, K. Kruchinin, K. Lekomtsev, L. Nevay, J. Snuverink, E. Yamakawa, V. Boisvert, S. Boogert, G. Boorman, S. Gibson, A. Lyapin, W. Shields, P. Teixeira-Dias, S. West, R. Jones, N. Joshi, R. Bodenstein, P. N. Burrows, G. B. Christian, D. Gamba, C. Perry, J. Roberts, J. A. Clarke, N. A. Collomb, S. P. Jamison, B. J. A. Shepherd, D. Walsh, M. Demarteau, J. Repond, H. Weerts, L. Xia, J. D. Wells, C. Adolphsen, T. Barklow, M. Breidenbach, N. Graf, J. Hewett, T. Markiewicz, D. McCormick, K. Moffeit, Y. Nosochkov, M. Oriunno, N. Phinney, T. Rizzo, S. Tantawi, F. Wang, J. Wang, G. White, M. Woodley

The Compact Linear Collider (CLIC) is a multi-TeV high-luminosity linear e+e- collider under development. For an optimal exploitation of its physics potential, CLIC is foreseen to be built and operated in a staged approach with three centre-of-mass energy stages ranging from a few hundred GeV up to 3 TeV. The first stage will focus on precision Standard Model physics, in particular Higgs and top-quark measurements. Read More

We consider the problem of how to optimize multi-stage campaigning over social networks. The dynamic programming framework is employed to balance the high present reward and large penalty on low future outcome in the presence of extensive uncertainties. In particular, we establish theoretical foundations of optimal campaigning over social networks where the user activities are modeled as a multivariate Hawkes process, and we derive a time dependent linear relation between the intensity of exogenous events and several commonly used objective functions of campaigning. Read More

In this paper, we consider the problem of finding the feature correspondences among a collection of feature sets, by using their point-wise unary features. This is a fundamental problem in computer vision and pattern recognition, which also closely relates to other areas such as operational research. Different from two-set matching which can be transformed to a quadratic assignment programming task that is known NP-hard, inclusion of merely unary attributes leads to a linear assignment problem for matching two feature sets. Read More

In this paper, we propose a novel multi-task learning (MTL) framework, called Self-Paced Multi-Task Learning (SPMTL). Different from previous works treating all tasks and instances equally when training, SPMTL attempts to jointly learn the tasks by taking into consideration the complexities of both tasks and instances. This is inspired by the cognitive process of human brain that often learns from the easy to the hard. Read More

In this paper, we propose a novel multi-label learning framework, called Multi-Label Self-Paced Learning (MLSPL), in an attempt to incorporate the self-paced learning strategy into multi-label learning regime. In light of the benefits of adopting the easy-to-hard strategy proposed by self-paced learning, the devised MLSPL aims to learn multiple labels jointly by gradually including label learning tasks and instances into model training from the easy to the hard. We first introduce a self-paced function as a regularizer in the multi-label learning formulation, so as to simultaneously rank priorities of the label learning tasks and the instances in each learning iteration. Read More

Fractal analysis has been widely used in computer vision, especially in texture image processing and texture analysis. The key concept of fractal-based image model is the fractal dimension, which is invariant to bi-Lipschitz transformation of image, and thus capable of representing intrinsic structural information of image robustly. However, the invariance of fractal dimension generally does not hold after filtering, which limits the application of fractal-based image model. Read More

Learning Granger causality for general point processes is a very challenging task. In this paper, we propose an effective method, learning Granger causality, for a special but significant type of point processes --- Hawkes process. We reveal the relationship between Hawkes process's impact function and its Granger causality graph. Read More

Over the past decade the rate of care unit (CU) use in the United States has been increasing. With an aging population and ever-growing demand for medical care, effective management of patients' transitions among different care facilities will prove indispensible for shortening the length of hospital stays, improving patient outcomes, allocating critical care resources, and reducing preventable re-admissions. In this paper, we focus on an important problem of predicting the so-called "patient flow" from longitudinal electronic health records (EHRs), which has not been explored via existing machine learning techniques. Read More

We consider the problem of predicting the time evolution of influence, the expected number of activated nodes, given a set of initially active nodes on a propagation network. To address the significant computational challenges of this problem on large-scale heterogeneous networks, we establish a system of differential equations governing the dynamics of probability mass functions on the state graph where the nodes each lumps a number of activation states of the network, which can be considered as an analogue to the Fokker-Planck equation in continuous space. We provides several methods to estimate the system parameters which depend on the identities of the initially active nodes, network topology, and activation rates etc. Read More

We propose a new majorization-minimization (MM) method for non-smooth and non-convex programs, which is general enough to include the existing MM methods. Besides the local majorization condition, we only require that the difference between the directional derivatives of the objective function and its surrogate function vanishes when the number of iterations approaches infinity, which is a very weak condition. So our method can use a surrogate function that directly approximates the non-smooth objective function. Read More

The overwhelming amount and rate of information update in online social media is making it increasingly difficult for users to allocate their attention to their topics of interest, thus there is a strong need for prioritizing news feeds. The attractiveness of a post to a user depends on many complex contextual and temporal features of the post. For instance, the contents of the post, the responsiveness of a third user, and the age of the post may all have impact. Read More

In real world social networks, there are multiple cascades which are rarely independent. They usually compete or cooperate with each other. Motivated by the reinforcement theory in sociology we leverage the fact that adoption of a user to any behavior is modeled by the aggregation of behaviors of its neighbors. Read More

Information diffusion in online social networks is affected by the underlying network topology, but it also has the power to change it. Online users are constantly creating new links when exposed to new information sources, and in turn these links are alternating the way information spreads. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics. Read More

This paper focuses on the problem of simultaneous sample and feature selection for machine learning in a fully unsupervised setting. Though most existing works tackle these two problems separately that derives two well-studied sub-areas namely active learning and feature selection, a unified approach is inspirational since they are often interleaved with each other. Noisy and high-dimensional features will bring adverse effect on sample selection, while `good' samples will be beneficial to feature selection. Read More

This paper addresses the problem of matching $N$ weighted graphs referring to an identical object or category. More specifically, matching the common node correspondences among graphs. This multi-graph matching problem involves two ingredients affecting the overall accuracy: i) the local pairwise matching affinity score among graphs; ii) the global matching consistency that measures the uniqueness of the pairwise matching results by different chaining orders. Read More

When a piece of malicious information becomes rampant in an information diffusion network, can we identify the source node that originally introduced the piece into the network and infer the time when it initiated this? Being able to do so is critical for curtailing the spread of malicious information, and reducing the potential losses incurred. This is a very challenging problem since typically only incomplete traces are observed and we need to unroll the incomplete traces into the past in order to pinpoint the source. In this paper, we tackle this problem by developing a two-stage framework, which first learns a continuous-time diffusion network model based on historical diffusion traces and then identifies the source of an incomplete diffusion trace by maximizing the likelihood of the trace under the learned model. Read More

We propose a novel approach to sufficient dimension reduction in regression, based on estimating contour directions of negligible variation for the response surface. These directions span the orthogonal complement of the minimal space relevant for the regression, and can be extracted according to a measure of the variation in the response, leading to General Contour Regression(GCR). In comparison to exiisting sufficient dimension reduction techniques, this sontour-based mothology guarantees exhaustive estimation of the central space under ellipticity of the predictoor distribution and very mild additional assumptions, while maintaining vn-consisytency and somputational ease. Read More

Events in an online social network can be categorized roughly into endogenous events, where users just respond to the actions of their neighbors within the network, or exogenous events, where users take actions due to drives external to the network. How much external drive should be provided to each user, such that the network activity can be steered towards a target state? In this paper, we model social events using multivariate Hawkes processes, which can capture both endogenous and exogenous event intensities, and derive a time dependent linear relation between the intensity of exogenous events and the overall network activity. Exploiting this connection, we develop a convex optimization framework for determining the required level of external drive in order for the network to reach a desired activity level. Read More

If a piece of information is released from a media site, can it spread, in 1 month, to a million web pages? This influence estimation problem is very challenging since both the time-sensitive nature of the problem and the issue of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with |V| nodes and |E| edges to an accuracy of $\varepsilon$ using $n=O(1/\varepsilon^2)$ randomizations and up to logarithmic factors O(n|E|+n|V|) computations. Read More

Evaluation metrics are an essential part of a ranking system, and in the past many evaluation metrics have been proposed in information retrieval and Web search. Discounted Cumulated Gains (DCG) has emerged as one of the evaluation metrics widely adopted for evaluating the performance of ranking functions used in Web search. However, the two sets of parameters, gain values and discount factors, used in DCG are determined in a rather ad-hoc way. Read More

Electronic health records contain rich textual data which possess critical predictive information for machine-learning based diagnostic aids. However many traditional machine learning methods fail to simultaneously integrate both vector space data and text. We present a supervised method using Laplacian eigenmaps to augment existing machine-learning methods with low-dimensional representations of textual predictors which preserve the local similarities. Read More


In recent years, total variation (TV) and Euler's elastica (EE) have been successfully applied to image processing tasks such as denoising and inpainting. This paper investigates how to extend TV and EE to the supervised learning settings on high dimensional data. The supervised learning problem can be formulated as an energy functional minimization under Tikhonov regularization scheme, where the energy is composed of a squared loss and a total variation smoothing (or Euler's elastica smoothing). Read More

Automatic image annotation (AIA) raises tremendous challenges to machine learning as it requires modeling of data that are both ambiguous in input and output, e.g., images containing multiple objects and labeled with multiple semantic tags. Read More

Differential quantities, including normals, curvatures, principal directions, and associated matrices, play a fundamental role in geometric processing and physics-based modeling. Computing these differential quantities consistently on surface meshes is important and challenging, and some existing methods often produce inconsistent results and require ad hoc fixes. In this paper, we show that the computation of the gradient and Hessian of a height function provides the foundation for consistently computing the differential quantities. Read More

We propose a novel approach to sufficient dimension reduction in regression, based on estimating contour directions of small variation in the response. These directions span the orthogonal complement of the minimal space relevant for the regression and can be extracted according to two measures of variation in the response, leading to simple and general contour regression (SCR and GCR) methodology. In comparison with existing sufficient dimension reduction techniques, this contour-based methodology guarantees exhaustive estimation of the central subspace under ellipticity of the predictor distribution and mild additional assumptions, while maintaining \sqrtn-consistency and computational ease. Read More

Nonlinear manifold learning from unorganized data points is a very challenging unsupervised learning and data visualization problem with a great variety of applications. In this paper we present a new algorithm for manifold learning and nonlinear dimension reduction. Based on a set of unorganized data points sampled with noise from the manifold, we represent the local geometry of the manifold using tangent spaces learned by fitting an affine subspace in a neighborhood of each data point. Read More

Many data types arising from data mining applications can be modeled as bipartite graphs, examples include terms and documents in a text corpus, customers and purchasing items in market basket analysis and reviewers and movies in a movie recommender system. In this paper, we propose a new data clustering method based on partitioning the underlying bipartite graph. The partition is constructed by minimizing a normalized sum of edge weights between unmatched pairs of vertices of the bipartite graph. Read More