Computer Science - Information Retrieval Publications (50)



Estimating vaccination uptake is an integral part of ensuring public health. It was recently shown that vaccination uptake can be estimated automatically from web data, instead of slowly collected clinical records or population surveys. All prior work in this area assumes that features of vaccination uptake collected from the web are temporally regular. Read More

Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. Read More

Medical errors are leading causes of death in the US and as such, prevention of these errors is paramount to promoting health care. Patient Safety Event reports are narratives describing potential adverse events to the patients and are important in identifying and preventing medical errors. We present a neural network architecture for identifying the type of safety events which is the first step in understanding these narratives. Read More

Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods to extract a topic hierarchy from a given text corpus, where the hierarchical structure is automatically determined by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms either are not scalable or sacrifice inference quality with mean-field assumptions. Read More

This paper describes the realization of the Ontology Web Search Engine. The Ontology Web Search Engine can be realized as an independent project or as a part of other projects. The main purpose of this paper is to present the realization details of the Ontology Web Search Engine as a part of the Semantic Web Expert System, and to present the results of its operation. Read More

Mental health forums are online communities where people express their issues and seek help from moderators and other users. In such forums, there are often posts with severe content indicating that the user is in acute distress and there is a risk of attempted self-harm. Moderators need to respond to these severe posts in a timely manner to prevent potential self-harm. Read More

In the last few years, microblogging platforms such as Twitter have given rise to a deluge of textual data that can be used for the analysis of informal communication between millions of individuals. In this work, we propose an information-theoretic approach to geographic language variation using a corpus based on Twitter. We test our models with tens of concepts and their associated keywords detected in Spanish tweets geolocated in Spain. Read More

Real-time monitoring and responses to emerging public health threats rely on the availability of timely surveillance data. During the early stages of an epidemic, the ready availability of line lists with detailed tabular information about laboratory-confirmed cases can assist epidemiologists in making reliable inferences and forecasts. Such inferences are crucial to understand the epidemiology of a specific disease early enough to stop or control the outbreak. Read More

This year, the DEFT campaign (Défi Fouilles de Textes) incorporates a task which aims at identifying the session in which articles of previous TALN conferences were presented. We describe the three statistical systems developed at LIA/ADOC for this task. A fusion of these systems enables us to obtain interesting results (micro-precision score of 0. Read More

The 2013 Défi de Fouille de Textes (DEFT) campaign focuses on two types of language analysis tasks, document classification and information extraction, in the specialized domain of cooking recipes. We present the systems that the LIA has used in DEFT 2013. Our systems show interesting results despite the complexity of the proposed tasks. Read More

In this paper we describe a dynamic normalization process applied to multilingual social network documents (Facebook and Twitter) to improve the performance of the author profiling task for short texts. After the normalization process, character $n$-grams and $n$-grams of POS tags are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). Read More

Comparing images in order to recommend items from an image inventory is a subject of continued interest. With the scalability of deep-learning architectures, the once 'manual' job of hand-crafting features has been largely alleviated, and images can be compared according to features generated from a deep convolutional neural network. In this paper, we compare distance metrics (and divergences) to rank features generated from a neural network, for content-based image retrieval. Read More
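
As a rough illustration of ranking feature vectors by a distance metric, the sketch below orders a toy inventory by Euclidean and by cosine distance to a query vector. The feature values, the item names, and the helper `rank_inventory` are hypothetical and are not taken from the paper; in practice the vectors would come from a convolutional network.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_inventory(query, inventory, distance):
    """Return item names sorted from closest to farthest from the query."""
    return sorted(inventory, key=lambda name: distance(query, inventory[name]))


# Hypothetical feature vectors standing in for network-generated features.
inventory = {
    "item_a": [1.0, 0.0, 0.0],
    "item_b": [0.0, 1.0, 0.0],
    "item_c": [0.9, 0.1, 0.0],
    "item_d": [3.0, 0.6, 0.0],  # same direction as the query, larger magnitude
}
query = [1.0, 0.2, 0.0]

by_euclidean = rank_inventory(query, inventory, euclidean)
by_cosine = rank_inventory(query, inventory, cosine_distance)
```

Here `item_d` points in the same direction as the query but has a larger norm, so it ranks first under cosine distance and last under Euclidean distance, which is one reason the choice of metric can change the recommendations.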

Recommendation systems are in common demand in daily life, and matrix completion is a widely adopted technique for this task. However, most matrix completion methods lack semantic interpretation and usually produce recommendations with weak semantics. To this end, this paper proposes a Semantic Analysis approach for Recommendation systems (SAR), which applies a two-level hierarchical generative process that assigns semantic properties and categories to users and items. Read More

Since the events of the Arab Spring, there has been increased interest in using social media to anticipate social unrest. While efforts have been made toward automated unrest prediction, we focus on filtering the vast volume of tweets to identify tweets relevant to unrest, which can be provided to downstream users for further analysis. We train a supervised classifier that is able to label Arabic language tweets as relevant to unrest with high reliability. Read More

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24. Read More

Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions---ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to feature extraction services. Read More

Collaborative filtering is a broad and powerful framework for building recommendation systems that has seen widespread adoption. Over the past decade, the propensity of such systems for favoring popular products and thus creating echo chambers has been observed. This has given rise to an active area of research that seeks to diversify recommendations generated by such algorithms. Read More

The multilabel learning problem with a large number of labels, features, and data points has generated tremendous interest recently. A recurring theme of these problems is that only a few labels are active in any given data point compared to the total number of labels. However, only a few existing works take direct advantage of this inherent extreme sparsity in the label space. Read More

In recent years, the information retrieval (IR) community has witnessed the first successful applications of deep neural network models to short-text matching and ad-hoc retrieval. It is exciting to see the research on deep neural networks and IR converge on these tasks of shared interest. However, the two communities have less in common when it comes to the choice of programming languages. Read More

In this paper we examine whether movie similarity correlates with low-level features extracted from the respective movie content. In particular, we demonstrate the extraction of multi-modal representation models of movies based on subtitles, audio and metadata mining. We focus our research on topic modeling of movies based on their subtitles. Read More

Usually bilingual word vectors are trained "online". Mikolov et al. showed they can also be found "offline", whereby two pre-trained embeddings are aligned with a linear transformation, using dictionaries compiled from expert knowledge. Read More

The continuously increasing cost of the US healthcare system has received significant attention. Central to the ideas aimed at curbing this trend is the use of technology, in the form of the mandate to implement electronic health records (EHRs). EHRs consist of patient information such as demographics, medications, laboratory test results, diagnosis codes and procedures. Read More

Statistical Relational Learning (SRL) methods have shown that classification accuracy can be improved by integrating relations between samples. Techniques such as iterative classification or relaxation labeling achieve this by propagating information between related samples during the inference process. When only a few samples are labeled and connections between samples are sparse, collective inference methods have shown large improvements over standard feature-based ML methods. Read More

In this work, we present an approach for mining user preferences and making recommendations based on reviews. Various studies have addressed the recommendation problem. However, most of these studies rely on a single aspect of user-generated content, such as user ratings or user feedback, to state user preferences. Read More

In the real world, our DNA is unique, but many people share the same name. This phenomenon often causes erroneous aggregation of documents belonging to multiple persons who are namesakes of one another. Such mistakes degrade the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. Read More

Volatility prediction--an essential concept in financial markets--has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. Read More

The effectiveness of three stop-word lists for Arabic information retrieval (General Stoplist, Corpus-Based Stoplist, and Combined Stoplist) was investigated in this study. Three popular weighting schemes were examined: the inverse document frequency weight, probabilistic weighting, and statistical language modelling. The idea is to combine the statistical approaches with linguistic approaches to reach optimal performance, and to compare their effect on retrieval. Read More

We consider the problem of multi-message private information retrieval (MPIR) from $N$ non-communicating replicated databases. In MPIR, the user is interested in retrieving $P$ messages out of $M$ stored messages without leaking the identity of the retrieved messages. The information-theoretic sum capacity of MPIR $C_s^P$ is the maximum number of desired message symbols that can be retrieved privately per downloaded symbol. Read More

Deep neural networks, and specifically fully-connected convolutional neural networks, are achieving remarkable results across a wide variety of domains. They have been trained to achieve state-of-the-art performance when applied to problems such as speech recognition, image classification, natural language processing and bioinformatics. Most of these deep learning models, when applied to classification, employ the softmax activation function for prediction and aim to minimize cross-entropy loss. Read More
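
For readers unfamiliar with the softmax and cross-entropy combination mentioned here, the following minimal sketch computes both for a single example. The logit values are made up for illustration, and the function names are hypothetical rather than drawn from the paper.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that sum to 1.

    Subtracting the maximum first keeps exp() from overflowing.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])


# Illustrative logits for a three-class problem; class 0 is the true label.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy(probs, 0)
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1, which is what training by cross-entropy minimization pushes the network toward.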

One of the most widely used approaches for providing recommendations in online environments such as e-commerce is collaborative filtering. Although this is a simple method for recommending items or services, accuracy and quality problems still exist. Thus, we propose a dynamic multi-level collaborative filtering method that improves the quality of the recommendations. Read More

Traditionally a document is visualized by a word cloud. Recently, distributed representation methods for documents have been developed, which map a document to a set of topic embeddings. Visualizing such a representation is useful to present the semantics of a document in higher granularity; it is also challenging, as there are multiple topics, each containing multiple words. Read More

Top-$N$ recommender systems typically utilize side information to address the problem of data sparsity. As side information grows increasingly high-dimensional, the performance of existing methods deteriorates in terms of both effectiveness and efficiency, which poses a severe technical challenge. In order to take advantage of high-dimensional side information, we propose in this paper an embedded feature selection method to facilitate top-$N$ recommendation. Read More

Web archives are large longitudinal collections that store webpages from the past, which might be missing on the current live Web. Consequently, temporal search over such collections is essential for finding prominent missing webpages and tasks like historical analysis. However, this has been challenging due to the lack of popularity information and proper ground truth to evaluate temporal retrieval models. Read More

We show how faceted search using a combination of traditional classification systems and mixed-membership models can move beyond keyword search to inform resource discovery, hypothesis formulation, and argument extraction for interdisciplinary research. Our test domain is the history and philosophy of scientific work on animal mind and cognition. We demonstrate an application of our methods to the problem of identifying and extracting arguments about anthropomorphism during a critical period in the development of comparative psychology. Read More

Limited search and access patterns over Web archives have been well documented. One of the key reasons is the lack of understanding of the user access patterns over such collections, which in turn is attributed to the lack of effective search interfaces. Current search interfaces for Web archives are either (a) purely navigational or (b) offer a sub-optimal search experience due to ineffective retrieval models or query modeling. Read More

The digital revolution has brought most of the world onto the World Wide Web. The data available on the WWW has increased manyfold in the past decade. Social networks, online clubs and organisations have come into existence. Read More

An outlier-resistant phase retrieval algorithm based on the alternating direction method of multipliers (ADMM) is devised in this letter. Instead of the widely used least squares criterion, which is optimal only in a Gaussian noise environment, we adopt the least absolute deviation criterion to enhance robustness against outliers. Considering both intensity- and amplitude-based observation models, the ADMM framework is developed to solve the resulting non-differentiable optimization problems. Read More

Ancient Chinese texts present an area of enormous challenge and opportunity for humanities scholars interested in exploiting computational methods to assist in the development of new insights and interpretations of culturally significant materials. In this paper we describe a collaborative effort between Indiana University and Xi'an Jiaotong University to support exploration and interpretation of a digital corpus of over 18,000 ancient Chinese documents, which we refer to as the "Handian" ancient classics corpus (Hàn diăn gŭ jí, i.e., the "Han canon" or "Chinese classics"). Read More

In the area of ad-targeting, predicting user responses is essential for many applications such as Real-Time Bidding (RTB). Many of the features available in this domain are sparse categorical features. This presents a challenge especially when the user responses to be predicted are rare, because each feature will only have very few positive examples. Read More

Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. Read More

Most existing techniques for spam detection on Twitter aim to identify and block users who post spam tweets. In this paper, we propose a Semi-Supervised Spam Detection (S3D) framework for spam detection at the tweet level. The proposed framework consists of two main modules: a spam detection module operating in real-time mode, and a model update module operating in batch mode. Read More

ImageNet is a large-scale, publicly available image database. It currently offers more than 14 million images, organised according to the WordNet hierarchy. One of the main objectives of its creators is to provide the research community with a relevant database for visual recognition applications such as object recognition, image classification or object localisation. Read More

Searching for patients based on the relevance of their medical records is challenging because of the implicit knowledge inherent in the patients' medical records and in queries. Such knowledge is known to medical practitioners but may be hidden from a search system. For example, when searching for patients with heart disease, medical practitioners commonly know that patients who are taking amiodarone are relevant, since this drug is used to combat heart disease. Read More

Incremental data mining algorithms process frequent updates to dynamic datasets efficiently by avoiding redundant computation. The existing incremental extension to the shared nearest neighbor density-based clustering (SNND) algorithm cannot handle deletions from the dataset and handles insertions only one point at a time. We present an incremental algorithm that overcomes both of these bottlenecks by efficiently identifying the affected parts of clusters while processing updates to the dataset in batch mode. Read More

Given a set of attributed subgraphs known to be from different classes, how can we discover their differences? There are many cases where collections of subgraphs may be contrasted against each other. For example, they may be assigned ground truth labels (spam/not-spam), or it may be desired to directly compare the biological networks of different species or compound networks of different chemicals. In this work we introduce the problem of characterizing the differences between attributed subgraphs that belong to different classes. Read More

The item recommendation task predicts a personalized ranking over a set of items for an individual user. One paradigm is the rating-based methods, which concentrate on explicit feedback and hence face difficulties in collecting it. Meanwhile, the ranking-based methods are presented with rated items and then rank the rated items above the unrated ones. Read More

This research presents an innovative and unique way of solving the advertisement prediction problem, which has been treated as a learning problem over the past several years. Online advertising is a multi-billion-dollar industry that is growing rapidly every year. The goal of this research is to enhance the click-through rate of contextual advertisements using linear regression. Read More

We use some of the largest order statistics of the random projections of a reference signal to construct a binary embedding that is adapted to signals correlated with that reference signal. The embedding is characterized analytically and shown to provide improved performance on tasks such as classification in a reduced-dimensionality space. Read More

Social network analysis is leveraged in a variety of applications such as identifying influential entities, detecting communities with special interests, and determining the flow of information and innovations. However, existing approaches for extracting social networks from unstructured Web content do not scale well and are only feasible for small graphs. In this paper, we introduce novel methodologies for query-based search engine mining, enabling efficient extraction of social networks from large amounts of Web data. Read More

Significant parts of cultural heritage have been produced on the web in recent decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs containing the implicit human feedback that is most relevant for today's web search. Read More