Computer Science - Digital Libraries Publications (50)


It is widely recognized that citation counts for papers from different fields cannot be directly compared because different scientific fields adopt different citation practices. Citation counts are also strongly biased by paper age since older papers had more time to attract citations. Various procedures aim at suppressing these biases and give rise to new normalized indicators, such as the relative citation count.
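
As a rough illustration of the kind of normalization mentioned above, the sketch below divides each paper's citation count by the mean count of papers from the same field and year. This is only one common way to define a relative citation count and is an assumption here, not necessarily the procedure used in the paper; the record keys (`id`, `field`, `year`, `citations`) and the function name are hypothetical.

```python
from collections import defaultdict


def relative_citation_counts(papers):
    """Divide each paper's citations by the mean of its (field, year) group.

    `papers` is a list of dicts with the hypothetical keys
    'id', 'field', 'year' and 'citations'.
    """
    # Accumulate [sum of citations, number of papers] per (field, year) group.
    totals = defaultdict(lambda: [0, 0])
    for paper in papers:
        key = (paper["field"], paper["year"])
        totals[key][0] += paper["citations"]
        totals[key][1] += 1

    relative = {}
    for paper in papers:
        citation_sum, n_papers = totals[(paper["field"], paper["year"])]
        group_mean = citation_sum / n_papers
        # A group whose papers are all uncited has no meaningful ratio; use 0.0.
        relative[paper["id"]] = (
            paper["citations"] / group_mean if group_mean > 0 else 0.0
        )
    return relative


papers = [
    {"id": "m1", "field": "math", "year": 2010, "citations": 2},
    {"id": "m2", "field": "math", "year": 2010, "citations": 4},
    {"id": "b1", "field": "biology", "year": 2010, "citations": 20},
    {"id": "b2", "field": "biology", "year": 2010, "citations": 40},
]
scores = relative_citation_counts(papers)
```

In this toy data the two fields differ tenfold in raw citations, yet `m1` and `b1` receive the same relative score because each sits at two thirds of its own group mean.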

A major challenge in network science is to determine whether an observed network property reveals some non-trivial behavior of the network's nodes, or if it is a consequence of the network's elementary properties. Statistical null models serve this purpose by producing random networks whilst keeping chosen network properties fixed. While there is increasing interest in networks that evolve in time, we still lack a robust time-aware framework to assess the statistical significance of the observed structural properties of growing networks.

The latest developments in digital have provided large data sets that can increasingly easily be accessed and used. These data sets often contain indirect localisation information, such as historical addresses. Historical geocoding is the process of transforming the indirect localisation information to direct localisation that can be placed on a map, which enables spatial analysis and cross-referencing.

This study presents a large-scale analysis of the distribution and presence of Mendeley readership scores over time and across disciplines. We study whether Mendeley readership scores (RS) can identify highly cited publications more effectively than journal citation scores (JCS). Web of Science (WoS) publications with DOIs published during the period 2004-2013 and across 5 major scientific fields have been analyzed.

Every network scientist knows that preferential attachment combines with growth to produce networks with power-law in-degree distributions. So how, then, is it possible for the network of American Physical Society journal collection citations to enjoy a log-normal citation distribution when it was found to have grown in accordance with preferential attachment? This anomalous result, which we exalt as the preferential attachment paradox, has remained unexplained since the physicist Sidney Redner first made light of it over a decade ago. In this paper we propose a resolution to the paradox.

In this study we have investigated the relationship between different document characteristics and the number of Mendeley readership counts, tweets, Facebook posts, mentions in blogs and mainstream media for 1.3 million papers published in journals covered by the Web of Science (WoS). It aims to demonstrate how factors affecting various social media-based indicators differ from those influencing citations and which document types are more popular across different platforms.

This is the first in-depth study on the coverage of Microsoft Academic (MA). The coverage of a verified publication list of a university was analyzed on the level of individual publications in MA, Scopus, and Web of Science (WoS). Citation counts were analyzed and issues related to data retrieval and data quality were examined.

The aim of this paper is to propose a simple modification to the original measure, the relative Hirsch index, which assigns each researcher a value between 0 (the bottom) and 1 (the top), expressing his/her distance to the top in a given field. By this normalization scholars from different scientific disciplines can be compared.

We describe a simple model of how a publication's citations change over time, based on pure-birth stochastic processes with a linear cumulative advantage effect. The model is applied to citation data from the Physical Review corpus provided by APS. Our model reveals that papers fall into three different clusters: papers that have rapid initial citations and ultimately high impact (fast-hi), fast to rise but quick to plateau (fast-flat), or late bloomers (slow-late), which may either never achieve many citations, or do so many years after publication.

In this paper, the scientometric evaluation of faculty members of 50 Greek Science and Engineering University Departments is presented. 1978 academics were examined in total. The number of papers, citations, h-index and i10-index have been collected for each academic, department, school and university using Google Scholar and the citations analysis program Publish or Perish.
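
For readers unfamiliar with the two indices named above, the following minimal sketch computes them from a list of per-paper citation counts using their standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. The function names and the sample data are illustrative only and are not taken from the paper or from Publish or Perish.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts for one academic's papers.
sample = [25, 12, 10, 8, 5, 3, 0]
print(h_index(sample), i10_index(sample))
```

Aggregating to a department or university, as the study does, would amount to applying the same functions to the pooled citation lists of its members.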

The objective of the OSCOSS research project on "Opening Scholarly Communication in the Social Sciences" is to build a coherent collaboration environment that facilitates scholarly communication workflows of social scientists in the roles of authors, reviewers, editors and readers. This paper presents the implementation of the core of this environment: the integration of the Fidus Writer academic word processor with the Open Journal Systems (OJS) submission and review management system.

Scholia is a tool to handle scientific bibliographic information in Wikidata. The Scholia Web service creates on-the-fly scholarly profiles for researchers, organizations, journals, publishers, individual scholarly works, and for research topics. To collect the data, it queries the SPARQL-based Wikidata Query Service.

When comparing the average citation impact of research groups, universities and countries, field normalisation reduces the influence of discipline and time. Confidence intervals for these indicators can help with attempts to infer whether differences between sets of publications are due to chance factors. Although both bootstrapping and formulae have been proposed for these, their accuracy is unknown.

Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representation at the datetime), often also present in the TimeMap.

Since Lawrence proposed the open access (OA) citation advantage in 2001, the potential benefit of OA in relation to the citation impact has been discussed in depth. The methodology to test this postulate ranges from comparing the impact factors of OA journals versus traditional ones, to comparing citations of OA versus non-OA articles published in the same non-OA journals. However, conclusions are not entirely consistent among fields, and two possible explanations have been suggested in those fields where a citation advantage has been observed for OA: the early view and the selection bias postulates.

Most scientometricians reject the use of the journal impact factor for assessing individual articles and their authors. The well-known San Francisco Declaration on Research Assessment also strongly objects to this way of using the impact factor. Arguments against the use of the impact factor at the level of individual articles are often based on statistical considerations.

The "reproducibility crisis" has been a highly visible source of scientific controversy and dispute. Here, I propose and review several avenues for identifying and prioritizing research studies for the purpose of targeted validation. Of the various proposals discussed, I identify scientific data science as being a strategy that merits greater attention among those interested in reproducibility.

The emergence of new digital technologies has allowed the study of human behaviour at a scale and at a level of granularity that were unthinkable just a decade ago. In particular, by analysing the digital traces left by people interacting in the online and offline worlds, we are able to trace the spreading of knowledge and ideas at both local and global scales. In this article we will discuss how these digital traces can be used to map knowledge across the world, outlining both the limitations and the challenges in performing this type of analysis.

We show that the greater the scientific wealth of a nation, the more likely that it will tend to concentrate this excellence in a few premier institutions. That is, great wealth implies great inequality of distribution. The scientific wealth is interpreted in terms of citation data harvested by Google Scholar Citations for profiled institutions from all countries in the world.

As digital collections of scientific literature are widespread and used frequently in knowledge-intense working environments, it has become a challenge to identify author names correctly. The treatment of homonyms is crucial for the reliable resolution of author names. Apart from varying handling of first, middle and last names, vendors as well as the digital library community created tools to address the problem of author name disambiguation.

This study responds to the first measure undertaken on July 17, 2015 by the IDNEUF project, namely an exploratory analysis of the existing portals and aggregators of free French-language academic resources. The idea is to provide an overview of the most common trends and practices in the constitution and organization of digital online learning resource portals. The study of these trends would help to define the appropriate choices and conditions for designing the future common French-language portal and to optimize its services for the conservation, exchange, integration and pooling of educational resources within the distributed technological framework of French-language universities.

This paper describes how semantic indexing can help to generate a contextual overview of topics and visually compare clusters of articles. The method was originally developed for an innovative information exploration tool, called Ariadne, which operates on bibliographic databases with tens of millions of records. In this paper, the method behind Ariadne is further developed and applied to the research question of the special issue "Same data, different results" - the better understanding of topic (re-)construction by different bibliometric approaches.

After a clustering solution is generated automatically, labelling these clusters becomes important to help understand the results. In this paper, we propose to use a Mutual Information based method to label clusters of journal articles. Topical terms which have the highest Normalised Mutual Information (NMI) with a certain cluster are selected to be the labels of the cluster.

Keeping track of the ever-increasing body of scientific literature is an escalating challenge. We present PubTree, a hierarchical search tool that efficiently searches the PubMed/MEDLINE dataset based upon a decision tree constructed using >26 million abstracts. The tool is implemented as a webpage, where users are asked a series of eighteen questions to locate pertinent articles.

This paper investigates the extent to which researchers display citation data and examines whether this personal display of citations differs by university, country, and academic rank. Physicists at 11 well-known universities in the USA, Britain, and China were chosen as the object of study. Personal websites were manually checked for mentions of citation counts, citation-based indices, or a link to Google Scholar Citations (GSC).

The Cooperative Patent Classifications (CPC) jointly developed by the European and US Patent Offices provide a new basis for mapping and portfolio analysis. This update provides an occasion for rethinking the parameter choices. The new maps are significantly different from previous ones, although this may not always be obvious on visual inspection.

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.

Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity.

In this article we consider the basic ideas, approaches and results of developing mathematical knowledge management technologies based on ontologies. These solutions form the basis of a specialized digital ecosystem, OntoMath, which consists of the ontology of the logical structure of mathematical documents (Mocassin), the ontology of mathematical knowledge (OntoMathPRO), text analysis tools, a recommender system and other applications for managing mathematical knowledge. The studies are in accordance with the ideas of creating a distributed system of interconnected repositories of digitized versions of mathematical documents and with the project to create a World Digital Mathematical Library.

Document clustering is generally the first step for topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents which keep their semantics as much as possible and are also suitable for efficient similarity calculation. The metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne.

The research was proposed to exploit and extend the relational and contextual nature of the information assets of the Catasto Gregoriano, kept at the Archivio di Stato in Rome. Developed within the MODEUS project (Making Open Data Effectively Usable), this study originates from the following key ideas of MODEUS: to require Open Data to be expressed in terms of an ontology, and to include such an ontology as a documentation of the data themselves. Thus, Open Data are naturally linked by means of the ontology, which meets the requirements of the Linked Open Data vision.

A new index for estimating an author's popularity is presented in this paper. The index is calculated on the basis of an analysis of the Wikipedia encyclopedia (Wikipedia Index, WI). Unlike conventional citation indices, the suggested measure allows one to evaluate not only the popularity of the author, as can be done by calculating the total citation count or the Hirsch index, which is often used to measure the author's research rate.

Recently, a review concluded that Google Scholar (GS) is not a suitable source of information "for identifying recent conference papers or other gray literature publications". The goal of this letter is to demonstrate that GS can be an effective tool to search and find gray literature, as long as appropriate search strategies are used. To do this, we took as examples the same two case studies used by the original review, describing first how GS processes the original review's search strategies, then proposing alternative search strategies, and finally generalizing each case study to compose a general search procedure aimed at finding gray literature in Google Scholar for two wide selected case studies: a) all contributions belonging to a congress (the ASCO Annual Meeting); and b) indexed guidelines as well as gray literature within medical institutions (National Institutes of Health) and governmental agencies (U.

The German Broadcasting Archive (DRA) maintains the cultural heritage of radio and television broadcasts of the former German Democratic Republic (GDR). The uniqueness and importance of the video material stimulates a large scientific interest in the video content. In this paper, we present an automatic video analysis and retrieval system for searching in historical collections of GDR television recordings.

Clustering scientific publications is an important problem in bibliometric research. We demonstrate how two software tools, CitNetExplorer and VOSviewer, can be used to cluster publications and to analyze the resulting clustering solutions. CitNetExplorer is used to cluster a large set of publications in the field of astronomy and astrophysics.

In spite of recent advances in field delineation methods, bibliometricians still don't know the extent to which their topic detection algorithms reconstruct 'ground truths', i.e. thematic structures in the scientific literature.

Affiliations: 1SciCom Research Group, Universidad de Castilla-La Mancha, Spain, 2SciCom Research Group, Universidad de Castilla-La Mancha, Spain, 3SciCom Research Group, Universidad de Castilla-La Mancha, Spain

Critical analysis of the state of the art is a necessary task when identifying new research lines worthwhile to pursue. To such an end, all the available work related to the field of interest must be taken into account. The key point is how to organize, analyze, and make sense of the huge amount of scientific literature available today on any topic.

Advancements in technology and culture lead to changes in our language. These changes create a gap between the language known by users and the language stored in digital archives. This gap affects users' ability first to find content and then to interpret it.

The evolution of named entities affects exploration and retrieval tasks in digital libraries. An information retrieval system that is aware of name changes can actively support users in finding former occurrences of evolved entities. However, current structured knowledge bases, such as DBpedia or Freebase, do not provide enough information about evolutions, even though the data is available on their resources, like Wikipedia.

Accessing Web archives raises a number of issues caused by their temporal characteristics. Additional knowledge is needed to find and understand older texts. Entities mentioned in texts are especially subject to change.

Working with Web archives raises a number of issues caused by their temporal characteristics. Depending on the age of the content, additional knowledge might be needed to find and understand older texts. Facts about entities are especially subject to change.

The Web is our primary source of all kinds of information today. This includes information about software as well as associated materials, like source code, documentation, related publications and change logs. Such data is of particular importance in research in order to conduct, comprehend and reconstruct scientific experiments that involve software.

Software has long been established as an essential aspect of the scientific process in mathematics and other disciplines. However, reliably referencing software in scientific publications is still challenging for various reasons. A crucial factor is that software dynamics with temporal versions or states are difficult to capture over time.

The Web has been around and maturing for 25 years. The popular websites of today have undergone vast changes during this period, with a few being there almost since the beginning and many new ones becoming popular over the years. This makes it worthwhile to take a look at how these sites have evolved and what they might tell us about the future of the Web.

We show how faceted search using a combination of traditional classification systems and mixed-membership models can move beyond keyword search to inform resource discovery, hypothesis formulation, and argument extraction for interdisciplinary research. Our test domain is the history and philosophy of scientific work on animal mind and cognition. We demonstrate an application of our methods to the problem of identifying and extracting arguments about anthropomorphism during a critical period in the development of comparative psychology.

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability.

Ancient Chinese texts present an area of enormous challenge and opportunity for humanities scholars interested in exploiting computational methods to assist in the development of new insights and interpretations of culturally significant materials. In this paper we describe a collaborative effort between Indiana University and Xi'an Jiaotong University to support exploration and interpretation of a digital corpus of over 18,000 ancient Chinese documents, which we refer to as the "Handian" ancient classics corpus (Hàn diǎn gǔ jí, i.e., the "Han canon" or "Chinese classics").

Benford's law is an empirical observation, first reported by Simon Newcomb in 1881 and then independently by Frank Benford in 1938: the first significant digits of numbers in large data are often distributed according to a logarithmically decreasing function. Being contrary to intuition, the law was forgotten as a mere curious observation. However, in the last two decades, the relevant literature has grown exponentially, an evolution typical of "Sleeping Beauties" (SBs): publications that go unnoticed (sleep) for a long time and then suddenly become the center of attention (are awakened).
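
The "logarithmically decreasing function" above is usually written as P(d) = log10(1 + 1/d) for the leading digit d = 1, ..., 9. The sketch below compares that expectation with observed first-digit frequencies; the helper names and the sample data (powers of 2, a classic Benford-like sequence) are illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Benford probability that the first significant digit equals `digit`."""
    return math.log10(1 + 1 / digit)


def first_digit(value):
    """First significant digit of a non-zero number.

    Uses scientific notation, so values extremely close to a power of ten
    may round up to the next leading digit.
    """
    return int(f"{abs(value):e}"[0])


def first_digit_frequencies(values):
    """Observed share of each leading digit 1..9 among the non-zero values."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        return {d: 0.0 for d in range(1, 10)}
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Illustrative data: the first 100 powers of 2.
observed = first_digit_frequencies([2 ** k for k in range(1, 101)])
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

The nine expected probabilities sum to 1, with digit 1 expected in roughly 30% of cases.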

In this paper we study the implications for conference program committees of adopting single-blind reviewing, in which committee members are aware of the names and affiliations of paper authors, versus double-blind reviewing, in which this information is not visible to committee members. WSDM 2017, the 10th ACM International Conference on Web Search and Data Mining, performed a controlled experiment in which each paper was reviewed by four committee members. Two of these four reviewers were chosen from a pool of committee members who had access to author information; the other two were chosen from a disjoint pool who did not have access to this information.

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages.