Computer Science - Digital Libraries Publications (50)


Complex networks have emerged as a simple yet powerful framework to represent and analyze a wide range of complex systems. The problem of ranking the nodes and the edges in complex networks is critical for a broad range of real-world problems because it affects how we access online information and products, how success and talent are evaluated in human activities, and how scarce resources are allocated by companies and policymakers, among others. This calls for a deep understanding of how existing ranking algorithms perform and of the possible biases that may impair their effectiveness. Read More

The Durham High Energy Physics Database (HEPData) has been built up over the past four decades as a unique open-access repository for scattering data from experimental particle physics papers. It comprises data points underlying several thousand publications. Over the last two years, the HEPData software has been completely rewritten using modern computing technologies as an overlay on the Invenio v3 digital library framework. Read More

Progress in science has advanced the development of human society across history, with dramatic revolutions shaped by information theory, genetic cloning, and artificial intelligence, among the many scientific achievements produced in the 20th century. However, the way that science advances itself is much less well-understood. In this work, we study the evolution of scientific development over the past century by presenting an anatomy of 89 million digitalized papers published between 1900 and 2015. Read More

A central question in the science of science concerns how time affects citations. Despite long-standing interest and broad impact, we lack systematic answers to this simple yet fundamental question. By reviewing and classifying prior studies for the past 50 years, we find a significant lack of consensus in the literature, primarily due to the coexistence of retrospective and prospective approaches to measuring citation age distributions. Read More

We describe a set of tools, services and strategies of the Latin American Giant Observatory (LAGO) data repository network, to implement Data Accessibility, Reproducibility and Trustworthiness. Read More

Determining how scientific achievements influence the subsequent process of knowledge creation is a fundamental step in order to build a unified ecosystem for studying the dynamics of innovation and competitiveness. Yet, relying separately on data about scientific production on one side, through bibliometric indicators, and about technological advancements on the other side, through patent statistics, gives only limited insight into the key interplay between science and technology which, as a matter of fact, move forward together within the innovation space. In this paper, using citation data of both scientific papers and patents, we quantify the direct impact of the scientific outputs of nations on further advancements in science and on the introduction of new technologies. Read More

Since a number of journals specifically focus on the review and publication of data sets, reviewing their policies seems an appropriate place to start in assessing what existing practice looks like in the 'real world' of reviewing and publishing data. This article outlines a study of the publicly available peer review policies of 39 scientific publications that publish data papers to discern which criteria are most and least frequently referenced. It also compares current practice with proposed criteria published in 2012. Read More

Recently, two new indicators (the Equalized Mean-based Normalized Proportion Cited, EMNPC, and the Mean-based Normalized Proportion Cited, MNPC) were proposed which are intended for sparse data. We propose a third indicator (Mantel-Haenszel quotient, MHq) belonging to the same indicator family. The MHq is based on the MH analysis - an established method for pooling the data from multiple 2x2 contingency tables based on different subgroups. Read More
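
To make the pooling step concrete, here is a minimal sketch of the classical Mantel-Haenszel pooled odds ratio over several 2x2 tables. It is not the MHq as the authors define it, which the truncated abstract does not spell out. The function name and the cell layout (a, b, c, d) are assumptions chosen for illustration, for example cited/uncited papers in a focal set versus a reference set within one subgroup.

```python
from typing import Iterable, Tuple

# One 2x2 table per subgroup, given as (a, b, c, d).
# Assumed layout (illustrative only):
#   a = focal set, cited      b = focal set, uncited
#   c = reference set, cited  d = reference set, uncited
Table = Tuple[float, float, float, float]


def mantel_haenszel_odds_ratio(tables: Iterable[Table]) -> float:
    """Classical Mantel-Haenszel pooled odds ratio across 2x2 tables.

    OR_MH = sum_i(a_i * d_i / n_i) / sum_i(b_i * c_i / n_i),
    where n_i is the total count in table i.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty subgroup contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled estimate undefined: sum of b*c/n is zero")
    return numerator / denominator


if __name__ == "__main__":
    subgroups = [(10, 20, 5, 40), (8, 12, 6, 30)]
    print(mantel_haenszel_odds_ratio(subgroups))
```

With a single table the estimate reduces to the ordinary odds ratio a*d / (b*c); with several tables each subgroup is weighted by its size rather than simply averaged.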

The aim of this study is to introduce an application that enables information sharing and communication between visually-impaired and able-bodied individuals. For the purposes of the study, web-based audio library automation was designed and the usability of the system was analyzed regarding the volunteers who record audio books and the visually-impaired individuals. The visually-impaired individuals who took part in the test procedures in order to make a general evaluation of the system reported that the system was theoretically necessary and successful. Read More

Even as we advance the frontiers of physics knowledge, our understanding of how this knowledge evolves remains at the descriptive levels of Popper and Kuhn. Using the APS publications data sets, we ask in this letter how new knowledge is built upon old knowledge. We do so by constructing year-to-year bibliographic coupling networks, and identify in them validated communities that represent different research fields. Read More
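
As a rough illustration of bibliographic coupling, the sketch below links two papers whenever their reference lists overlap and weights the link by the number of shared references. The function name and the input format (a mapping from paper ID to a set of cited IDs) are assumptions; the year-to-year construction and community validation mentioned in the abstract are not reproduced here.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def bibliographic_coupling(
    references: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], int]:
    """Return weighted coupling edges between papers.

    `references` maps each paper ID to the set of IDs it cites.
    Two papers are coupled when they cite at least one common reference;
    the edge weight is the number of references they share.
    """
    edges: Dict[Tuple[str, str], int] = {}
    for p, q in combinations(sorted(references), 2):
        shared = len(references[p] & references[q])
        if shared > 0:
            edges[(p, q)] = shared
    return edges


if __name__ == "__main__":
    refs = {
        "A": {"r1", "r2", "r3"},
        "B": {"r2", "r3"},
        "C": {"r9"},
    }
    print(bibliographic_coupling(refs))
```

The pairwise loop is quadratic in the number of papers, which is fine for a toy example; a real corpus would more likely invert the mapping (reference to citing papers) first.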

Like it or not, attempts to evaluate and monitor the quality of academic research have become increasingly prevalent worldwide. Performance reviews range from the level of individuals, through research groups and departments, to entire universities. Many of these are informed by, or functions of, simple scientometric indicators and the results of such exercises impact on careers, funding and prestige. Read More

The OpenITI team has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines. These accuracy rates not only represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for classical Arabic-script texts, but, equally important, they are produced using open-source OCR software, thus enabling us to make this Arabic-script OCR technology freely available to the broader Islamic, Persian, and Arabic Studies communities. Read More

Archival efforts such as (C)LOCKSS and Portico are in place to ensure the longevity of traditional scholarly resources like journal articles. At the same time, researchers are depositing a broad variety of other scholarly artifacts into emerging online portals that are designed to support web-based scholarship. These web-native scholarly objects are largely neglected by current archival practices and hence they become scholarly orphans. Read More

Only a few digital libraries and reference managers offer recommender systems, although such systems could assist users facing information overload. In this paper, we introduce Mr. DLib's recommendations-as-a-service, which allows third parties to easily integrate a recommender system into their products. Read More

It is widely recognized that citation counts for papers from different fields cannot be directly compared because different scientific fields adopt different citation practices. Citation counts are also strongly biased by paper age since older papers had more time to attract citations. Various procedures aim at suppressing these biases and give rise to new normalized indicators, such as the relative citation count. Read More
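
One common way to compute a relative citation count is to divide each paper's citations by the mean citation count of papers from the same field and publication year. The sketch below follows that definition; the tuple layout and function name are assumptions, and the abstract's own normalization procedure may differ in detail.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (paper_id, field, year, citations): an assumed record layout
Paper = Tuple[str, str, int, int]


def relative_citation_counts(papers: Iterable[Paper]) -> Dict[str, float]:
    """Citations divided by the mean citations of the same field and year."""
    papers = list(papers)
    # Accumulate [citation sum, paper count] per (field, year) group.
    totals = defaultdict(lambda: [0, 0])
    for _, field, year, citations in papers:
        group = totals[(field, year)]
        group[0] += citations
        group[1] += 1

    result: Dict[str, float] = {}
    for paper_id, field, year, citations in papers:
        total, count = totals[(field, year)]
        mean = total / count
        # If no paper in the group is cited, report 0.0 instead of dividing by zero.
        result[paper_id] = citations / mean if mean > 0 else 0.0
    return result


if __name__ == "__main__":
    sample = [
        ("p1", "physics", 2010, 10),
        ("p2", "physics", 2010, 30),
        ("p3", "biology", 2010, 100),
        ("p4", "biology", 2010, 300),
    ]
    print(relative_citation_counts(sample))
```

In the sample data, p1 and p3 both come out at half their group mean even though their raw counts differ by a factor of ten, which is the kind of cross-field comparison the normalization is meant to enable.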

A major challenge in network science is to determine whether an observed network property reveals some non-trivial behavior of the network's nodes, or if it is a consequence of the network's elementary properties. Statistical null models serve this purpose by producing random networks whilst keeping chosen network properties fixed. While there is increasing interest in networks that evolve in time, we still lack a robust time-aware framework to assess the statistical significance of the observed structural properties of growing networks. Read More

The latest developments in digital have provided large data sets that can increasingly easily be accessed and used. These data sets often contain indirect localisation information, such as historical addresses. Historical geocoding is the process of transforming the indirect localisation information to direct localisation that can be placed on a map, which enables spatial analysis and cross-referencing. Read More

This study presents a large scale analysis of the distribution and presence of Mendeley readership scores over time and across disciplines. We study whether Mendeley readership scores (RS) can identify highly cited publications more effectively than journal citation scores (JCS). Web of Science (WoS) publications with DOIs published during the period 2004-2013 and across 5 major scientific fields have been analyzed. Read More

Every network scientist knows that preferential attachment combines with growth to produce networks with power-law in-degree distributions. So how, then, is it possible for the network of American Physical Society journal collection citations to enjoy a log-normal citation distribution when it was found to have grown in accordance with preferential attachment? This anomalous result, which we exalt as the preferential attachment paradox, has remained unexplained since the physicist Sidney Redner first brought it to light over a decade ago. In this paper we propose a resolution to the paradox. Read More

In this study we have investigated the relationship between different document characteristics and the number of Mendeley readership counts, tweets, Facebook posts, mentions in blogs and mainstream media for 1.3 million papers published in journals covered by the Web of Science (WoS). It aims to demonstrate how factors affecting various social media-based indicators differ from those influencing citations and which document types are more popular across different platforms. Read More

This is the first detailed study on the coverage of Microsoft Academic (MA). Based on the complete and verified publication list of a university, the coverage of MA was assessed and compared with two benchmark databases, Scopus and Web of Science (WoS), on the level of individual publications. Citation counts were analyzed and issues related to data retrieval and data quality were examined. Read More

The aim of this paper is to propose a simple modification to the original measure, the relative Hirsch index, which assigns each researcher a value between 0 (the bottom) and 1 (the top), expressing his/her distance to the top in a given field. By this normalization scholars from different scientific disciplines can be compared. Read More

We describe a simple model of how a publication's citations change over time, based on pure-birth stochastic processes with a linear cumulative advantage effect. The model is applied to citation data from the Physical Review corpus provided by APS. Our model reveals that papers fall into three different clusters: papers that have rapid initial citations and ultimately high impact (fast-hi), fast to rise but quick to plateau (fast-flat), or late bloomers (slow-late), which may either never achieve many citations, or do so many years after publication. Read More

In this paper, the scientometric evaluation of faculty members of 50 Greek Science and Engineering University Departments is presented. In total, 1978 academics were examined. The number of papers, citations, h-index and i10-index have been collected for each academic, department, school and university using Google Scholar and the citations analysis program Publish or Perish. Read More
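
For readers unfamiliar with the two indices, the sketch below computes them from a list of per-paper citation counts using their standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. The function names are illustrative; the study itself collected these values from Google Scholar and Publish or Perish rather than computing them this way.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so no larger h exists
    return h


def i10_index(citations: Iterable[int]) -> int:
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


if __name__ == "__main__":
    counts = [25, 12, 10, 8, 3, 0]
    print(h_index(counts), i10_index(counts))
```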

The objective of the OSCOSS research project on "Opening Scholarly Communication in the Social Sciences" is to build a coherent collaboration environment that facilitates scholarly communication workflows of social scientists in the roles of authors, reviewers, editors and readers. This paper presents the implementation of the core of this environment: the integration of the Fidus Writer academic word processor with the Open Journal Systems (OJS) submission and review management system. Read More

Scholia is a tool to handle scientific bibliographic information in Wikidata. The Scholia Web service creates on-the-fly scholarly profiles for researchers, organizations, journals, publishers, individual scholarly works, and for research topics. To collect the data, it queries the SPARQL-based Wikidata Query Service. Read More
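
The following sketch shows what a query against the Wikidata Query Service can look like. It is not one of Scholia's own queries: the helper names are hypothetical, and the query simply lists works whose author property (P50) points at a given Wikidata item. The code only builds the request URL and does not send it, so it runs offline.

```python
import re
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def works_by_author_query(author_qid: str, limit: int = 10) -> str:
    """Build a SPARQL query listing works whose author (P50) is `author_qid`."""
    if not re.fullmatch(r"Q[1-9]\d*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  ?work wdt:P50 wd:{author_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def wdqs_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Q42 is used purely as an example item ID.
    print(wdqs_url(works_by_author_query("Q42", limit=5)))
```

Validating the item ID before interpolating it keeps arbitrary text out of the query string.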

When comparing the average citation impact of research groups, universities and countries, field normalisation reduces the influence of discipline and time. Confidence intervals for these indicators can help with attempts to infer whether differences between sets of publications are due to chance factors. Although both bootstrapping and formulae have been proposed for these, their accuracy is unknown. Read More

Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representation at the datetime), often also present in the TimeMap. Read More

Since Lawrence proposed the open access (OA) citation advantage in 2001, the potential benefit of OA in relation to the citation impact has been discussed in depth. The methodology to test this postulate ranges from comparing the impact factors of OA journals versus traditional ones, to comparing citations of OA versus non-OA articles published in the same non-OA journals. However, conclusions are not entirely consistent among fields, and two possible explanations have been suggested in those fields where a citation advantage has been observed for OA: the early view and the selection bias postulates. Read More

Most scientometricians reject the use of the journal impact factor for assessing individual articles and their authors. The well-known San Francisco Declaration on Research Assessment also strongly objects to this way of using the impact factor. Arguments against the use of the impact factor at the level of individual articles are often based on statistical considerations. Read More

The "reproducibility crisis" has been a highly visible source of scientific controversy and dispute. Here, I propose and review several avenues for identifying and prioritizing research studies for the purpose of targeted validation. Of the various proposals discussed, I identify scientific data science as being a strategy that merits greater attention among those interested in reproducibility. Read More

The emergence of new digital technologies has allowed the study of human behaviour at a scale and at a level of granularity that were unthinkable just a decade ago. In particular, by analysing the digital traces left by people interacting in the online and offline worlds, we are able to trace the spreading of knowledge and ideas at both local and global scales. In this article we will discuss how these digital traces can be used to map knowledge across the world, outlining both the limitations and the challenges in performing this type of analysis. Read More

We show that the greater the scientific wealth of a nation, the more likely that it will tend to concentrate this excellence in a few premier institutions. That is, great wealth implies great inequality of distribution. The scientific wealth is interpreted in terms of citation data harvested by Google Scholar Citations for profiled institutions from all countries in the world. Read More

As digital collections of scientific literature are widespread and used frequently in knowledge-intense working environments, it has become a challenge to identify author names correctly. The treatment of homonyms is crucial for the reliable resolution of author names. Apart from varying handling of first, middle and last names, vendors as well as the digital library community created tools to address the problem of author name disambiguation. Read More

This study responds to the first measure undertaken on July 17, 2015 by the IDNEUF project, that of an exploratory analysis of the existing portals and aggregators of free French-language academic resources. The idea is to provide an overview of the most common trends and practices in the constitution and organization of digital online learning resource portals. The study of these trends would help to define the appropriate choices and conditions for designing the future common French-language portal and to optimize its services for the conservation, exchange, integration and pooling of educational resources within the distributed technological framework of French-language universities. Read More

This paper describes how semantic indexing can help to generate a contextual overview of topics and visually compare clusters of articles. The method was originally developed for an innovative information exploration tool, called Ariadne, which operates on bibliographic databases with tens of millions of records. In this paper, the method behind Ariadne is further developed and applied to the research question of the special issue "Same data, different results" - the better understanding of topic (re-)construction by different bibliometric approaches. Read More

After a clustering solution is generated automatically, labelling these clusters becomes important to help understanding the results. In this paper, we propose to use a Mutual Information based method to label clusters of journal articles. Topical terms which have the highest Normalised Mutual Information (NMI) with a certain cluster are selected to be the labels of the cluster. Read More

Keeping track of the ever-increasing body of scientific literature is an escalating challenge. We present PubTree, a hierarchical search tool that efficiently searches the PubMed/MEDLINE dataset based upon a decision tree constructed using >26 million abstracts. The tool is implemented as a webpage, where users are asked a series of eighteen questions to locate pertinent articles. Read More

This paper investigates the extent to which researchers display citation data, and examines whether this personal display of citations differs by university, country, and academic rank. Physicists in 11 well-known universities in the USA, Britain, and China were chosen as the object of study. We manually identified whether physicists had mentioned citation counts, citation-based indices, or a link to Google Scholar Citations (GSC) on their personal websites. Read More

The Cooperative Patent Classifications (CPC) jointly developed by the European and US Patent Offices provide a new basis for mapping and portfolio analysis. This update provides an occasion for rethinking the parameter choices. The new maps are significantly different from previous ones, although this may not always be obvious on visual inspection. Read More

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24. Read More

Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Read More

In this article we consider the basic ideas, approaches and results of developing mathematical knowledge management technologies based on ontologies. These solutions form the basis of a specialized digital ecosystem, OntoMath, which consists of the ontology of the logical structure of mathematical documents Mocassin and the ontology of mathematical knowledge OntoMathPRO, tools of text analysis, a recommender system and other applications to manage mathematical knowledge. The studies are in accordance with the ideas of creating a distributed system of interconnected repositories of digitized versions of mathematical documents and the project to create a World Digital Mathematical Library. Read More

Document clustering is generally the first step for topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents which keep their semantics as much as possible and are also suitable for efficient similarity calculation. The metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. Read More

The research was proposed to exploit and extend the relational and contextual nature of the information assets of the Catasto Gregoriano, kept at the Archivio di Stato in Rome. Developed within the MODEUS project (Making Open Data Effectively Usable), this study originates from the following key ideas of MODEUS: to require Open Data to be expressed in terms of an ontology, and to include such an ontology as a documentation of the data themselves. Thus, Open Data are naturally linked by means of the ontology, which meets the requirements of the Linked Open Data vision. Read More

A new index for estimating an author's popularity is presented in this paper. The index is calculated on the basis of an analysis of the Wikipedia encyclopedia (Wikipedia Index - WI). Unlike conventional citation indices, the suggested measure allows one to evaluate not only the popularity of the author, as can be done by means of calculating the total citation count or by the Hirsch index, which is often used to measure the author's research rate. Read More

Recently, a review concluded that Google Scholar (GS) is not a suitable source of information "for identifying recent conference papers or other gray literature publications". The goal of this letter is to demonstrate that GS can be an effective tool to search and find gray literature, as long as appropriate search strategies are used. To do this, we took as examples the same two case studies used by the original review, describing first how GS processes the original review's search strategies, then proposing alternative search strategies, and finally generalizing each case study to compose a general search procedure aimed at finding gray literature in Google Scholar for two wide selected case studies: a) all contributions belonging to a congress (the ASCO Annual Meeting); and b) indexed guidelines as well as gray literature within medical institutions (National Institutes of Health) and governmental agencies (U. Read More

The German Broadcasting Archive (DRA) maintains the cultural heritage of radio and television broadcasts of the former German Democratic Republic (GDR). The uniqueness and importance of the video material stimulates a large scientific interest in the video content. In this paper, we present an automatic video analysis and retrieval system for searching in historical collections of GDR television recordings. Read More

Clustering scientific publications is an important problem in bibliometric research. We demonstrate how two software tools, CitNetExplorer and VOSviewer, can be used to cluster publications and to analyze the resulting clustering solutions. CitNetExplorer is used to cluster a large set of publications in the field of astronomy and astrophysics. Read More

In spite of recent advances in field delineation methods, bibliometricians still don't know the extent to which their topic detection algorithms reconstruct 'ground truths', i.e. thematic structures in the scientific literature. Read More