Computer Science - Databases Publications (50)


Most modern data stores tend to be distributed, to enable the scaling of the data across multiple instances of commodity hardware. Although this ensures nearly unlimited potential for storage, the data itself is not always ideally partitioned, and the cost of a network round-trip may degrade the end-user experience with respect to response latency. The problem addressed here is to bring data objects closer to the frequent sources of requests using a dynamic repartitioning algorithm.

In this paper we present a new error bound on sampling algorithms for frequent itemset mining. We show that the new bound is asymptotically tighter than the state-of-the-art bounds, i.e. ...

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which promise an increase in expressiveness and performance. But how good are these extensions at extracting high performance from modern hardware platforms? While Spark has made impressive progress, we show that for relational workloads, there is still a significant gap compared with best-of-breed query engines.

Codd's relational model describes just one possible world. To better cope with incomplete information, extended database models allow several possible worlds. Vague tables are one such convenient extended model where attributes accept sets of possible values (e.g. ...

The paper presents a novel concept that analyzes and visualizes worldwide fashion trends. Our goal is to reveal cutting-edge fashion trends without displaying an ordinary fashion style. To achieve the fashion-based analysis, we created a new fashion culture database (FCDB), which consists of 76 million geo-tagged images in 16 cosmopolitan cities.

We study changes in metrics that are defined on a Cartesian product of trees. Such metrics occur naturally in many practical applications, where a global metric (such as revenue) can be broken down along several hierarchical dimensions (such as location, gender, etc.). Given a change in such a metric, our goal is to identify a small set of non-overlapping data segments that account for the change.

Sliding window join is one of the most important operators for stream applications. To produce high-quality join results, a stream processing system must deal with the ubiquitous disorder within input streams, which is caused by network delay, asynchronous source clocks, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results.

This paper presents an intelligent user interface model dedicated to the exploration of complex databases. This model is implemented on a 3D metaphor: a virtual museum. In this metaphor, the database elements are embodied as museum objects.

This paper introduces a principled incremental view maintenance (IVM) mechanism for in-database computation described by rings. We exemplify our approach by introducing the covariance matrix ring that we use for learning linear regression models over arbitrary equi-join queries. Our approach is a higher-order IVM algorithm that exploits the factorized structure of joins and aggregates to avoid redundant computation and improve performance.

Today, a huge amount of data is available on the web. There is now a need to convert that data into knowledge that can be useful for different purposes. This paper describes the use of the data mining process and OLAP, in combination with a multi-agent system, to find knowledge from data in cloud computing.

Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules.

The amount of multidimensional data published on the semantic web (SW) is constantly increasing, due to initiatives such as Open Data and Open Government Data, among others. Models, languages, and tools that make it possible to obtain valuable information efficiently are thus required. Multidimensional data are typically represented as data cubes, and exploited using Online Analytical Processing (OLAP) techniques.

The latest developments in digital have provided large data sets that can be accessed and used with increasing ease. These data sets often contain indirect localisation information, such as historical addresses. Historical geocoding is the process of transforming the indirect localisation information into direct localisation that can be placed on a map, which enables spatial analysis and cross-referencing.

Local Process Models (LPM) describe structured fragments of process behavior occurring in the context of less structured business processes. Traditional LPM discovery aims to generate a collection of process models that describe highly frequent behavior, but these models do not always provide useful answers for questions posed by process analysts aiming at business process improvement. We propose a framework for goal-driven LPM discovery, based on utility functions and constraints.

This paper explores methodologies, advantages and challenges related to the use of Information Centric Network (ICN) technology for developing NoSQL distributed databases, which are expected to play a central role in the forthcoming IoT and BigData era. ICN services make it possible to simplify the development of the database software, improve performance, and provide data-level access control. We use our findings to devise a NoSQL spatio-temporal database, named OpenGeoBase, and evaluate its performance with a real data set related to Intelligent Transport System applications.

Knowledge bases play a crucial role in many applications, for example question answering and information retrieval. Despite the great effort invested in creating and maintaining them, even the largest representatives (e.g. ...

Subgraph querying, also known as subgraph isomorphism search, is a fundamental problem in querying graph-structured data. It consists of enumerating the subgraphs of a data graph that match a query graph. This problem arises in many real-world applications related to query processing or pattern recognition, such as computer vision, social network analysis, bioinformatics and big data analytics.

A Process-Aware Information System (PAIS) is an IT system that supports business processes and generates large amounts of event logs from the execution of those processes. An event log is represented as a tuple of CaseID, Timestamp, Activity and Actor. Process Mining is a new and emerging field that aims at analyzing the event logs to discover, enhance and improve business processes and to check conformance between run-time and design-time business processes.

In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset.

Log-linear models are arguably the most successful class of graphical models for large-scale applications because of their simplicity and tractability. Learning and inference with these models require calculating the partition function, which is a major bottleneck and intractable for large state spaces. Importance Sampling (IS) and MCMC-based approaches are lucrative.

We introduce Fonduer, a knowledge base construction (KBC) framework for richly formatted information extraction (RFIE), where entity relations and attributes are conveyed via structural, tabular, visual, and textual expressions. Fonduer introduces a new programming model for KBC built around a unified data representation that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer is the first KBC system for richly formatted data and uses a human-in-the-loop paradigm for training machine learning systems, referred to as data programming.

We introduce a unified framework for a class of optimization-based statistical learning problems used by LogicBlox retail-planning and forecasting applications, where the input data is given by queries over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. The main challenge posed by computing these problems is the large number of records and of categorical features in the input data, which leads to very long compute times or failure to process the entire data.

Today's process modeling languages often force the analyst or modeler to straightjacket real-life processes into simplistic or incomplete models that fail to capture the essential features of the domain under study. Conventional business process models only describe the lifecycles of individual instances (cases) in isolation. Although process models may include data elements (cf. ...

In the last decade, many business applications have moved into the cloud. In particular, the "database-as-a-service" paradigm has become mainstream. While existing multi-tenant data management systems focus on single-tenant query processing, we believe that it is time to rethink how queries can be processed across multiple tenants in such a way that we not only gain more valuable insights, but also do so at minimal cost.

Heretofore the concept of "blockchain" has not been precisely defined. Accordingly the potential useful applications of this technology have been largely inflated. This work sidesteps the question of what constitutes a blockchain as such and focuses on the architectural components of the Bitcoin cryptocurrency, insofar as possible, in isolation.

Blockchain technologies are taking the world by storm. Public blockchains, such as Bitcoin and Ethereum, enable secure peer-to-peer applications like crypto-currency or smart contracts. Their security and performance are well studied.

We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries.

In this extended abstract we describe, mainly by examples, the main elements of the Ontological Multidimensional Data Model, which considerably extends a relational reconstruction of the multidimensional data model proposed by Hurtado and Mendelzon by means of tuple-generating dependencies, equality-generating dependencies, and negative constraints as found in Datalog±. We briefly mention some good computational properties of the model.

Query evaluation over probabilistic databases is known to be intractable in many cases, even in data complexity, i.e., when the query is fixed.

We define and study the Functional Aggregate Query (FAQ) problem, which captures common computational tasks across a very wide range of domains including relational databases, logic, matrix and tensor computation, probabilistic graphical models, constraint satisfaction, and signal processing. Simply put, an FAQ is a declarative way of defining a new function from a database of input functions. We present "InsideOut", a dynamic programming algorithm, to evaluate an FAQ.

The DGCC protocol has been shown to achieve good performance on multi-core in-memory systems. However, distributed transactions complicate the dependency resolution, and therefore, an effective transaction partitioning strategy is essential to reduce expensive multi-node distributed transactions. During failure recovery, the log must be examined from the last checkpoint onwards and the affected transactions are re-executed based on the way they are partitioned and executed.

A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. For example, a particularly interesting geometric pattern in astronomy is the Einstein cross, which is an astronomical phenomenon in which a single quasar is observed as four distinct sky objects (due to gravitational lensing) when captured by earth telescopes.

Nowadays, most systems and applications produce log records that are useful for security and monitoring purposes such as debugging programming errors, checking system status, and detecting configuration problems or even attacks. To this end, a log repository becomes necessary whereby logs can be accessed and visualized in a timely manner. This paper presents Loginson, a high-performance log centralization system for large-scale log collection and processing in large IT infrastructures.

Energy consumption has become a first-class optimization goal in the design and implementation of data-intensive computing systems. This is particularly true in the design of database management systems (DBMSs), which were found to be the major consumers of energy in the software stack of modern data centers. Among all database components, the storage system is one of the most power-hungry elements.

Video is one of the fastest-growing sources of data and is rich with interesting semantic information. Furthermore, recent advances in computer vision, in the form of deep convolutional neural networks (CNNs), have made it possible to query this semantic information with near-human accuracy (in the form of image tagging). However, performing inference with state-of-the-art CNNs is computationally expensive: analyzing videos in real time (at 30 frames/sec) requires a $1200 GPU per video stream, posing a serious computational barrier to CNN adoption in large-scale video data management systems.

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free".

Users are rarely familiar with the content of a data source they are querying, and therefore cannot avoid using keywords that do not exist in the data source. Traditional systems may respond with an empty result, causing dissatisfaction, while the data source in effect holds semantically related content. In this paper we study this no-but-semantic-match problem on XML keyword search and propose a solution which enables us to present the top-k semantically related results to the user.

Transactional frequent subgraph mining identifies frequent subgraphs in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Spark or Apache Flink.

With the need for flexible and on-demand decision support, Dynamic Data Warehouses (DDW) provide benefits over traditional data warehouses due to their dynamic characteristics in structuring and access mechanisms. A DDW is a data framework that accommodates data source changes easily to allow users seamless querying. Materialized Views (MV) are proven to be an effective methodology to enhance the process of retrieving data from a DDW, as results are pre-computed and stored in it.

This paper presents a personal account of the early legacy of discovery informatics, especially surrounding the first published definition of domain-independent DI. The state of DI is traced across various reference sources and the literature on the fourth paradigm of the scientific method. Observations are offered on DI, concluding that it will retain its appeal as a highly apt descriptor for research and practice activities that are inherent in our human nature.

Applications running on parallel systems often need to join a streaming relation or a stored relation with data indexed in a parallel data storage system. Some applications also compute UDFs on the joined tuples. The join can be done at the data storage nodes, corresponding to reduce-side joins, or by fetching data from the storage system, corresponding to map-side joins.

Time series visualization of streaming telemetry (i.e., charting of key metrics such as server load over time) is increasingly prevalent in recent application deployments.

Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing provides a platform to acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates of population parameters.

An increasing amount of information is generated from the rapidly increasing number of sensor networks and smart devices. A wide variety of sources generate and publish information in different formats, thus highlighting interoperability as one of the key prerequisites for the success of the Internet of Things (IoT). The BT Hypercat Data Hub provides a focal point for the sharing and consumption of available datasets from a wide range of sources.

It is essential for cellular network operators to provide cellular location services to meet the needs of their users and mobile applications. However, cellular locations, estimated by network-based methods at the server side, suffer from high spatial errors and arbitrary missing locations. Moreover, auxiliary sensor data at the client side are not available to the operators.

Platforms such as AirBnB, TripAdvisor, Yelp and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or Categorical. Discovering the skyline of such datasets over a subset of attributes would identify entries that stand out while enabling numerous applications.

We investigate the query evaluation problem for fixed queries over fully dynamic databases, where tuples can be inserted or deleted. The task is to design a dynamic algorithm that immediately reports the new result of a fixed query after every database update. We consider queries in first-order logic (FO) and its extension with modulo-counting quantifiers (FO+MOD), and show that they can be efficiently evaluated under updates, provided that the dynamic database does not exceed a certain degree bound.

This paper presents an approach for transforming data granularity in hierarchical databases for binary decision problems by applying regression to categorical attributes at the lower grain levels. Attributes from a lower hierarchy entity in the relational database have their information content optimized through regression on the category histogram trained on a small exclusive labelled sample, instead of the usual mode category of the distribution. The paper validates the approach on a binary decision task for assessing the quality of secondary schools, focusing on how logistic regression transforms the student and teacher attributes into school attributes.

Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy.