Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24. Read More

Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Read More

In this article we consider the basic ideas, approaches and results of developing mathematical knowledge management technologies based on ontologies. These solutions form the basis of a specialized digital ecosystem, OntoMath, which consists of Mocassin (an ontology of the logical structure of mathematical documents), OntoMathPRO (an ontology of mathematical knowledge), text analysis tools, a recommender system and other applications for managing mathematical knowledge. The studies follow the idea of creating a distributed system of interconnected repositories of digitized mathematical documents and the project to create a World Digital Mathematical Library. Read More

Document clustering is generally the first step for topic identification. Since many clustering methods operate on the similarities between documents, it is important to build representations of these documents which keep their semantics as much as possible and are also suitable for efficient similarity calculation. The metadata of articles in the Astro dataset contribute to a semantic matrix, which uses a vector space to capture the semantics of entities derived from these articles and consequently supports the contextual exploration of these entities in LittleAriadne. Read More

The research was proposed to exploit and extend the relational and contextual nature of the information assets of the Catasto Gregoriano, kept at the Archivio di Stato in Rome. Developed within the MODEUS project (Making Open Data Effectively Usable), this study originates from the following key ideas of MODEUS: to require Open Data to be expressed in terms of an ontology, and to include such an ontology as documentation of the data themselves. Thus, Open Data are naturally linked by means of the ontology, which meets the requirements of the Linked Open Data vision. Read More

This paper presents a new index for estimating an author's popularity. The index is calculated on the basis of an analysis of the Wikipedia encyclopedia (Wikipedia Index - WI). Unlike conventional citation indices, the suggested measure allows one to evaluate not only the popularity of the author, as can be done by counting the total number of citations or by the Hirsch index, which is often used to measure the author's research rate. Read More

Recently, a review concluded that Google Scholar (GS) is not a suitable source of information "for identifying recent conference papers or other gray literature publications". The goal of this letter is to demonstrate that GS can be an effective tool to search and find gray literature, as long as appropriate search strategies are used. To do this, we took as examples the same two case studies used by the original review, describing first how GS processes the original review's search strategies, then proposing alternative search strategies, and finally generalizing each case study to compose a general search procedure aimed at finding gray literature in Google Scholar for two broadly selected case studies: a) all contributions belonging to a congress (the ASCO Annual Meeting); and b) indexed guidelines as well as gray literature within medical institutions (National Institutes of Health) and governmental agencies (U. Read More

The German Broadcasting Archive (DRA) maintains the cultural heritage of radio and television broadcasts of the former German Democratic Republic (GDR). The uniqueness and importance of the video material stimulate a large scientific interest in the video content. In this paper, we present an automatic video analysis and retrieval system for searching in historical collections of GDR television recordings. Read More

Clustering scientific publications is an important problem in bibliometric research. We demonstrate how two software tools, CitNetExplorer and VOSviewer, can be used to cluster publications and to analyze the resulting clustering solutions. CitNetExplorer is used to cluster a large set of publications in the field of astronomy and astrophysics. Read More

In spite of recent advances in field delineation methods, bibliometricians still don't know the extent to which their topic detection algorithms reconstruct 'ground truths', i.e. thematic structures in the scientific literature. Read More

Critical analysis of the state of the art is a necessary task when identifying new research lines worth pursuing. To that end, all the available work related to the field of interest must be taken into account. The key point is how to organize, analyze, and make sense of the huge amount of scientific literature available today on any topic. Read More

Advancements in technology and culture lead to changes in our language. These changes create a gap between the language known by users and the language stored in digital archives. This gap affects users' ability first to find content and then to interpret it. Read More

The evolution of named entities affects exploration and retrieval tasks in digital libraries. An information retrieval system that is aware of name changes can actively support users in finding former occurrences of evolved entities. However, current structured knowledge bases, such as DBpedia or Freebase, do not provide enough information about evolutions, even though the data is available in their underlying resources, such as Wikipedia. Read More

Accessing Web archives raises a number of issues caused by their temporal characteristics. Additional knowledge is needed to find and understand older texts. Entities mentioned in texts are especially subject to change. Read More

Working with Web archives raises a number of issues caused by their temporal characteristics. Depending on the age of the content, additional knowledge might be needed to find and understand older texts. Facts about entities are especially subject to change. Read More

The Web is our primary source of all kinds of information today. This includes information about software as well as associated materials, like source code, documentation, related publications and change logs. Such data is of particular importance in research in order to conduct, comprehend and reconstruct scientific experiments that involve software. Read More

Software has long been established as an essential aspect of the scientific process in mathematics and other disciplines. However, reliably referencing software in scientific publications is still challenging for various reasons. A crucial factor is that software dynamics with temporal versions or states are difficult to capture over time. Read More

The Web has been around and maturing for 25 years. The popular websites of today have undergone vast changes during this period, with a few being there almost since the beginning and many new ones becoming popular over the years. This makes it worthwhile to take a look at how these sites have evolved and what they might tell us about the future of the Web. Read More

We show how faceted search using a combination of traditional classification systems and mixed-membership models can move beyond keyword search to inform resource discovery, hypothesis formulation, and argument extraction for interdisciplinary research. Our test domain is the history and philosophy of scientific work on animal mind and cognition. We demonstrate an application of our methods to the problem of identifying and extracting arguments about anthropomorphism during a critical period in the development of comparative psychology. Read More

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Read More

Ancient Chinese texts present an area of enormous challenge and opportunity for humanities scholars interested in exploiting computational methods to assist in the development of new insights and interpretations of culturally significant materials. In this paper we describe a collaborative effort between Indiana University and Xi'an Jiaotong University to support exploration and interpretation of a digital corpus of over 18,000 ancient Chinese documents, which we refer to as the "Handian" ancient classics corpus (Hàn diǎn gǔ jí, i.e., the "Han canon" or "Chinese classics"). Read More

Benford's law is an empirical observation, first reported by Simon Newcomb in 1881 and then independently by Frank Benford in 1938: the first significant digits of numbers in large data sets are often distributed according to a logarithmically decreasing function. Being contrary to intuition, the law was forgotten as a mere curious observation. However, in the last two decades, the relevant literature has grown exponentially, an evolution typical of "Sleeping Beauties" (SBs): publications that go unnoticed (sleep) for a long time and then suddenly become the center of attention (are awakened). Read More
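
The "logarithmically decreasing function" mentioned in this abstract is commonly written as P(d) = log10(1 + 1/d) for a first significant digit d from 1 to 9. The following is a minimal illustrative sketch, not taken from the paper: the function names are hypothetical, and powers of two are used only as a convenient sequence widely known to follow the law closely.

```python
import math
from collections import Counter


def benford_expected(d: int) -> float:
    """Expected share of numbers whose first significant digit is d (1..9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


def first_digit(x) -> int:
    """First significant digit of a nonzero number."""
    x = abs(float(x))
    if x == 0:
        raise ValueError("zero has no first significant digit")
    # Scientific notation with 17 significant digits puts the leading digit first.
    return int(f"{x:.16e}"[0])


def first_digit_shares(values) -> dict:
    """Observed share of each first digit 1..9 among the nonzero values."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Illustration data (an assumption of this sketch): the first 1000 powers of two.
observed = first_digit_shares(2 ** n for n in range(1000))
expected = {d: benford_expected(d) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

Comparing the two columns gives a quick visual check of how closely a data set follows the law; a formal goodness-of-fit test would be needed for any real claim.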

In this paper we study the implications for conference program committees of adopting single-blind reviewing, in which committee members are aware of the names and affiliations of paper authors, versus double-blind reviewing, in which this information is not visible to committee members. WSDM 2017, the 10th ACM International Conference on Web Search and Data Mining, performed a controlled experiment in which each paper was reviewed by four committee members. Two of these four reviewers were chosen from a pool of committee members who had access to author information; the other two were chosen from a disjoint pool who did not have access to this information. Read More

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. Read More

Curated web archive collections contain focused digital contents which are collected by archiving organizations to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In this paper, we discuss how to best support collaborative construction and exploration of these collections through the ArchiveWeb system. ArchiveWeb has been developed using an iterative evaluation-driven design-based research approach, with considerable user feedback at all stages. Read More

Significant parts of cultural heritage have been produced on the web in recent decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs that contain the implicit human feedback most relevant for today's web search. Read More

Scientific evaluation is a determinant of how scientists, institutions and funders behave, and as such is a key element in the making of science. In this article, we propose an alternative to the current norm of evaluating research with journal rank. Following a well-defined notion of scientific value, we introduce qualitative processes that can also be quantified and give rise to meaningful and easy-to-use article-level metrics. Read More

In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a statistical profile available that provides an estimate of error classes with associated frequencies, and that points to conjectured errors and suspicious tokens. The method introduced in Reffle (2013) computes such a profile, combining lexica, pattern sets and advanced matching techniques in a specialized Expectation Maximization (EM) procedure. Read More

The distributions of scientific citations for publications selected with different rules (author, topic, institution, country, journal, etc.) collapse onto a single curve if one plots the citations relative to their mean value. We find that the distribution of shares of Facebook posts rescales in the same manner onto the very same curve as scientific citations. Read More

We analyzed the longitudinal activity of nearly 7,000 editors at the mega-journal PLOS ONE over the 10-year period 2006-2015. Using the article-editor associations, we develop editor-specific measures of power, activity, article acceptance time, citation impact, and editorial remuneration (an analogue to self-citation). We observe remarkably high levels of power inequality among the PLOS ONE editors, with the top-10 editors responsible for 3,366 articles -- corresponding to 2. Read More

Allometric scaling can reflect underlying mechanisms, dynamics and structures in complex systems; examples include typical scaling laws in biology, ecology and urban development. In this work, we study allometric scaling in scientific fields. By performing an analysis of the outputs/inputs of various scientific fields, including the numbers of publications, citations, and references, with respect to the number of authors, we find that in all fields that we have studied thus far, including physics, mathematics and economics, there are allometric scaling laws relating the outputs/inputs and the sizes of scientific fields. Read More
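
An allometric scaling law of the kind described here is usually a power law, output = a * size^b, which becomes a straight line on log-log axes. The sketch below shows one common way to estimate the exponent b with an ordinary least-squares fit in log space. It is an illustration only: the function name and the synthetic numbers are assumptions and do not come from the paper, whose fitting procedure is not given in this excerpt.

```python
import math


def fit_power_law(sizes, outputs):
    """Least-squares fit of log(output) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(o) for o in outputs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space = scaling exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log of prefactor
    return a, b


# Synthetic, made-up data: field sizes (number of authors) and publication counts
# generated from an exact power law with prefactor 2.0 and exponent 1.2.
authors = [100, 1_000, 10_000, 100_000]
publications = [2.0 * n ** 1.2 for n in authors]

a, b = fit_power_law(authors, publications)
print(f"prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

An exponent above 1 would indicate that output grows faster than field size, and an exponent below 1 that it grows more slowly.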

The number of published scholarly articles is growing exponentially. To tackle this information overload, researchers are increasingly depending on niche academic search engines. Recent works have shown that two major general web search engines, Google and Bing, have a high level of agreement in their top search results. Read More

This article offers a personal perspective on the current state of academic publishing, and posits that the scientific community is beset with journals that contribute little valuable knowledge, overload the community's capacity for high-quality peer review, and waste resources. Open access publishing can offer solutions that benefit researchers and other information users, as well as institutions and funders, but commercial journal publishers have influenced open access policies and practices in ways that favor their economic interests over those of other stakeholders in knowledge creation and sharing. One way to free research from constraints on access is the diamond route of open access publishing, in which institutions and funders that produce new knowledge reclaim responsibility for publication via institutional journals or other open platforms. Read More

Whereas the generation of Shannon-type information is coupled to the second law of thermodynamics, redundancy--that is, the complement of information to the maximum entropy--can be increased by further distinctions: new options can discursively be generated. The dynamics of discursive knowledge production thus infuse the historical dynamics with a cultural evolution based on expectations (as different from observations). We distinguish among (i) the communication of information, (ii) the sharing of meaning, and (iii) discursive knowledge. Read More

The importance of dimensional analysis and dimensional homogeneity in bibliometric studies is always overlooked. In this paper, we look at this issue systematically and show that most h-type indices have the dimensions of [P], where [P] is the basic dimensional unit in bibliometrics which is the unit publication or paper. The newly introduced Euclidean index, based on the Euclidean length of the citation vector, has the dimensions [P^{3/2}]. Read More
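
For readers unfamiliar with the two indicators being compared, the following minimal sketch computes both from a list of per-paper citation counts. It uses the standard definition of the h-index (the largest h such that h papers have at least h citations each) and takes the Euclidean index as the Euclidean length of the citation vector, as the abstract describes. The function names and example numbers are hypothetical, and the sketch does not attempt the dimensional analysis itself.

```python
import math


def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def euclidean_index(citations):
    """Euclidean length (square root of the sum of squares) of the citation vector."""
    return math.sqrt(sum(c * c for c in citations))


# Made-up citation counts for one hypothetical author.
example = [10, 8, 5, 4, 3]
print(h_index(example), euclidean_index(example))
```

Note that the h-index ignores how far the most cited papers exceed the threshold, whereas the Euclidean index is dominated by them.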

A researcher may publish tens or hundreds of papers, yet these contributions to the literature are not uniformly distributed over a career. Past analyses of the trajectories of faculty productivity suggest an intuitive and canonical pattern: after being hired, productivity tends to rise rapidly to an early peak and then gradually declines. Here, we test the universality of this conventional narrative by analyzing the structures of individual faculty productivity time series, constructed from over 200,000 publications matched with hiring data for 2453 tenure-track faculty in all 205 Ph. Read More

As more scholarly content is being born digital or digitized, digital libraries are becoming increasingly vital to researchers leveraging scholarly big data for scientific discovery. Given the abundance of scholarly products, especially in environments created by the advent of social networking services, little is known about international scholarly information needs, information-seeking behavior, or information use. This paper aims to address these gaps by conducting an in-depth analysis of researchers in the United States and Qatar; learning about their research attitudes, practices, tactics, strategies, and expectations; and addressing the obstacles faced during research endeavors. Read More

We provide an up-to-date view on the knowledge management system ScienceWISE (SW) and address issues related to the automatic assignment of articles to research topics. So far, SW has proven to be an effective platform for managing large volumes of technical articles by means of ontological concept-based browsing. However, as the publication of research articles accelerates, the expressivity and the richness of the SW ontology turn into a double-edged sword: a more fine-grained characterization of articles is possible, but at the cost of introducing more spurious relations among them. Read More

International collaboration in science continues to grow at a remarkable rate, but little agreement exists about the dynamics of growth and organization at the discipline level. Some suggest that disciplines differ in their collaborative tendencies, reflecting their epistemic culture. This study examines collaborative patterns in six previously studied specialties to add new data and conduct analyses over time. Read More

Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. Read More

Collections of Web documents about specific topics are needed for many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs). Read More

The recently proposed Euclidean index offers a novel approach to measure the citation impact of academic authors, in particular as an alternative to the h-index. We test if the index provides new, robust information, not covered by existing bibliometric indicators, and discuss the measurement scale and the degree of distinction between analytical units the index offers. We find that the Euclidean index does not outperform existing indicators on these topics and that the main application of the index would be solely for ranking, which is not seen as a recommended practice. Read More

It is no secret that the number of scholarly events and venues available for researchers is and has been dramatically expanding. While this tremendous expansion is certainly a boon for academia as a whole, it has become increasingly difficult for many researchers to identify events and venues related to their work. Therefore, as opportunities to share scholarly work continue to expand, researchers may find themselves unable to determine effectively which venues publish data and research most in line with their scholarly interests. Read More

Cities are engines of the knowledge-based economy, because they are the primary sites of knowledge production activities that subsequently shape the rate and direction of technological change and economic growth. Patents provide a wealth of information to analyse the knowledge specialization at specific places, such as technological details and information on inventors and entities involved, including address information. The technology codes on each patent document indicate the specialization and scope of the underlying technological knowledge of a given invention. Read More

Web archives capture the history of the Web and are therefore an important source to study how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose the research methodology of extracting and studying sub-collections of the archive focused on specific topics and events. Read More

The CENDARI infrastructure is a research-support platform designed to provide tools for transnational historical research, focusing on two topics: Medieval culture and World War I. It offers end users modern web-based tools that rely on a sophisticated infrastructure to collect, enrich, annotate, and search through large document corpora. Supporting researchers in their daily work is a novel concern for infrastructures. Read More

The scientific paper output of the United Nations University (UNU) was bibliometrically analysed. It was found that (i) a noticeable continuous paper output starts in 1995, (ii) about 65% of the research papers have been published as international cooperations and 18% as single-authored papers, (iii) the research papers rank above world average according to the Pudovkin-Garfield Percentile Rank Index, and (iv) paper content indicates the wide variety of scientific topics UNU has been and is working on. Read More

Capabilities to exchange health information are critical to accelerate discovery and its diffusion to healthcare practice. However, the same ethical and legal policies that protect privacy hinder these data exchanges, and the issues accumulate when data move across geographical or organizational borders. This can be seen as one of the reasons why many health technologies and research findings are limited to very narrow domains. Read More

The Journal Impact Factor (JIF) has been heavily criticized over decades. This opinion piece argues that the JIF should not be demonized. It still can be employed for research evaluation purposes by carefully considering the context and academic environment. Read More

Citations are commonly held to represent scientific impact. To date, however, there is no empirical evidence in support of this postulate that is central to research assessment exercises and Science of Science studies. Here, we report on the first empirical verification of the degree to which citation numbers represent scientific impact as it is actually perceived by experts in their respective field. Read More