# Computer Science - Mathematical Software Publications (50)

## Search

## Computer Science - Mathematical Software Publications

**Authors:**Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, Pengcheng Yin

We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet's dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, and the user is free to use different network structures for each input. Read More

This thesis examines a modern concept for machine numbers based on interval arithmetic called 'Unums' and compares it to IEEE 754 floating-point arithmetic, evaluating possible uses of this format where floating-point numbers are inadequate. In the course of this examination, this thesis builds theoretical foundations for IEEE 754 floating-point numbers, interval arithmetic based on the projectively extended real numbers and Unums. Read More

This paper introduces a method to reduce communication that is injected into the network during a sparse matrix-vector multiply by reorganizing messages on each node. This results in a reduction of the inter-node communication, replaced by less-costly intra-node communication, which reduces both the number and size of messages that are injected into the network. Read More

This is the user manual for the software package BSEPACK (Bethe--Salpeter Eigenvalue Solver Package). Read More

In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for the symmetric tridiagonal matrices based on ScaLAPACK and the hierarchically semiseparable (HSS) matrices. HSS is an important type of rank-structured matrices.Most time of the DC algorithm is cost by computing the eigenvectors via the matrix-matrix multiplications (MMM). Read More

We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach. Read More

We consider the extension of the method of Gauss-Newton from complex floating-point arithmetic to the field of truncated power series with complex floating-point coefficients. With linearization we formulate a linear system where the coefficient matrix is a series with matrix coefficients. The structure of the linear system leads in the regular case to a block triangular system. Read More

We present efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in-terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit higher degree of parallelism where parallelism is quantified as a number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on theoretical analysis of classical HT, an opportunity re-arrange computations in the classical HT is identified that results in Modified HT (MHT) where it is shown that MHT exhibits 1. Read More

SimTensor is a multi-platform, open-source software for generating artificial tensor data (either with CP/PARAFAC or Tucker structure) for reproducible research on tensor factorization algorithms. SimTensor is a stand-alone application based on MATALB. It provides a wide range of facilities for generating tensor data with various configurations. Read More

Simulations of physical phenomena are essential to the expedient design of precision components in aerospace and other high-tech industries. These phenomena are often described by mathematical models involving partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution that is difficult to achieve in a reasonable amount of time even in effectively parallelized solvers. Read More

Iterative methods on irregular grids have been used widely in all areas of comptational science and engineering for solving partial differential equations with complex geometry. They provide the flexibility to express complex shapes with relatively low computational cost. However, the direction of the evolution of high-performance processors in the last two decades have caused serious degradation of the computational efficiency of iterative methods on irregular grids, because of relatively low memory bandwidth. Read More

We present the library Moore, which implements Interval Arithmetic in modern C++. This library is based on a new feature in the C++ language called concepts, which reduces the problems caused by template meta programming, and leads to a new approach for implementing interval arithmetic libraries in C++. Read More

Software for mixed-integer linear programming can return incorrect results for a number of reasons, one being the use of inexact floating-point arithmetic. Even solvers that employ exact arithmetic may suffer from programming or algorithmic errors, motivating the desire for a way to produce independently verifiable certificates of claimed results. Due to the complex nature of state-of-the-art MILP solution algorithms, the ideal form of such a certificate is not entirely clear. Read More

High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Thus,achieving high performance for the Gemm kernel translates into a high performance linear algebra stack above this kernel. Read More

In recent years, deep neural networks (DNNs), have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite increasing hardware flexibility and software programming toolchain maturity, high efficiency GPU programming remains difficult: it suffers from high complexity, low productivity, and low portability. Read More

We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. Read More

The paper presents a parallel math library, dMath, that demonstrates leading scaling when using intranode, internode, and hybrid-parallelism for deep learning (DL). dMath provides easy-to-use distributed primitives and a variety of domain-specific algorithms including matrix multiplication, convolutions, and others allowing for rapid development of scalable applications like deep neural networks (DNNs). Persistent data stored in GPU memory and advanced memory management techniques avoid costly transfers between host and device. Read More

We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. Read More

DiffSharp is an algorithmic differentiation or automatic differentiation (AD) library for the .NET ecosystem, which is targeted by the C# and F# languages, among others. The library has been designed with machine learning applications in mind, allowing very succinct implementations of models and optimization routines. Read More

We show that Automatic Differentiation (AD) operators can be provided in a dynamic language without sacrificing numeric performance. To achieve this, general forward and reverse AD functions are added to a simple high-level dynamic language, and support for them is included in an aggressive optimizing compiler. Novel technical mechanisms are discussed, which have the ability to migrate the AD transformations from run-time to compile-time. Read More

Heretofore, automatic checkpointing at procedure-call boundaries, to reduce the space complexity of reverse mode, has been provided by systems like Tapenade. However, binomial checkpointing, or treeverse, has only been provided in Automatic Differentiation (AD) systems in special cases, e.g. Read More

Arb is a C library for arbitrary-precision interval arithmetic using the midpoint-radius representation, also known as ball arithmetic. It supports real and complex numbers, polynomials, power series, matrices, and evaluation of many special functions. The core number types are designed for versatility and speed in a range of scenarios, allowing performance that is competitive with non-interval arbitrary-precision types such as MPFR and MPC floating-point numbers. Read More

The task of integrating a large number of independent ODE systems arises in various scientific and engineering areas. For nonstiff systems, common explicit integration algorithms can be used on GPUs, where individual GPU threads concurrently integrate independent ODEs with different initial conditions or parameters. One example is the fifth-order adaptive Runge-Kutta-Cash-Karp (RKCK) algorithm. Read More

Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particularly noticeable for non-square matrices. Such implementations also require considerable workspace and modifications to the standard BLAS interface. Read More

The R package GFA provides a full pipeline for factor analysis of multiple data sources that are represented as matrices with co-occurring samples. It allows learning dependencies between subsets of the data sources, decomposed into latent factors. The package also implements sparse priors for the factorization, providing interpretable biclusters of the multi-source data Read More

Gramian matrices are a well-known encoding for properties of input-output systems such as controllability, observability or minimality. These so called system Gramian matrices were developed in linear system theory for applications such as model order reduction of control systems. Empirical Gramian matrices are an extension to the system Gramians for parametric and nonlinear systems as well as a data-driven method of computation. Read More

Despite decades of research in this area, mesh adaptation capabilities are still rarely found in numerical simulation software. We postulate that the primary reason for this is lack of usability. Integrating mesh adaptation into existing software is difficult as non-trivial operators, such as error metrics and interpolation operators, are required, and integrating available adaptive remeshers is not straightforward. Read More

Future architectures designed to deliver exascale performance motivate the need for novel algorithmic changes in order to fully exploit their capabilities. In this paper, the performance of several numerical algorithms, characterised by varying degrees of memory and computational intensity, are evaluated in the context of finite difference methods for fluid dynamics problems. It is shown that, by storing some of the evaluated derivatives as single thread- or process-local variables in memory, or recomputing the derivatives on-the-fly, a speed-up of ~2 can be obtained compared to traditional algorithms that store all derivatives in global arrays. Read More

We present a new probabilistic model checker Storm. Using state-of-the-art libraries, we aim for both high performance and versatility. This extended abstract gives a brief overview of the features of Storm. Read More

Ordering vertices of a graph is key to minimize fill-in and data structure size in sparse direct solvers, maximize locality in iterative solvers, and improve performance in graph algorithms. Except for naturally parallelizable ordering methods such as nested dissection, many important ordering methods have not been efficiently mapped to distributed-memory architectures. In this paper, we present the first-ever distributed-memory implementation of the reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse matrix. Read More

The development of fluid-structure interaction (FSI) software involves trade-offs between ease of use, generality, performance, and cost. Typically there are large learning curves when using low-level software to model the interaction of an elastic structure immersed in a uniform density fluid. Many existing codes are not publicly available, and the commercial software that exists usually requires expensive licenses and may not be as robust or allow the necessary flexibility that in house codes can provide. Read More

Even though in recent years the scale of statistical analysis problems has increased tremendously, many statistical software tools are still limited to single-node computations. However, statistical analyses are largely based on dense linear algebra operations, which have been deeply studied, optimized and parallelized in the high-performance-computing community. To make high-performance distributed computations available for statistical analysis, and thus enable large scale statistical computations, we introduce RElem, an open source package that integrates the distributed dense linear algebra library Elemental into R. Read More

Basic Linear Algebra Subprograms (BLAS) play key role in high performance and scientific computing applications. Experimentally, yesteryear multicore and General Purpose Graphics Processing Units (GPGPUs) are capable of achieving up to 15 to 57% of the theoretical peak performance at 65W to 240W respectively for compute bound operations like Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth bound operations like Single/Double precision Matrix-vector Multiplication (XGEMV) the performance is merely 5 to 7% of the theoretical peak performance in multicores and GPGPUs respectively. Read More

We present new versions of the previously published C and CUDA programs for solving the dipolar Gross-Pitaevskii equation in one, two, and three spatial dimensions, which calculate stationary and non-stationary solutions by propagation in imaginary or real time. Presented programs are improved and parallelized versions of previous programs, divided into three packages according to the type of parallelization. First package contains improved and threaded version of sequential C programs using OpenMP. Read More

We consider the problem of sampling $n$ numbers from the range $\{1,\ldots,N\}$ without replacement on modern architectures. The main result is a simple divide-and-conquer scheme that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time $\mathcal{O}\left(n/p+\log p\right)$ on $p$ processors. The amount of communication between the processors is very small and independent of the sample size. Read More

We present the Macaulay2 package NumericalImplicitization, which allows for user-friendly computation of the basic invariants of the image of a polynomial map, such as dimension, degree, and Hilbert function values. This package relies on methods of numerical algebraic geometry, such as homotopy continuation and monodromy. Read More

The estimation of probability densities based on available data is a central task in many statistical applications. Especially in the case of large ensembles with many samples or high-dimensional sample spaces, computationally efficient methods are needed. We propose a new method that is based on a decomposition of the distribution to be estimated in terms of so-called distribution elements (DEs). Read More

The Hermite methods of Goodrich, Hagstrom, and Lorenz (2006) use Hermite interpolation to construct high order numerical methods for hyperbolic initial value problems. The structure of the method has several favorable features for parallel computing. In this work, we propose algorithms that take advantage of the many-core architecture of Graphics Processing Units. Read More

We develop an algorithm to find all solutions of a generic system in a family of polynomial systems with parametric coefficients using numerical homotopy continuation and the action of the monodromy group. We argue that the expected number of homotopy paths that this algorithm needs to follow is roughly linear in the number of solutions. We demonstrate that our software implementation is competitive with the existing state-of-the-art methods implemented in other software packages. Read More

Graph algorithms have wide applicablity to a variety of domains and are often used on massive datasets. Recent standardization efforts such as the GraphBLAS specify a set of key computational kernels that hardware and software developers can adhere to. Graphulo is a processing framework that enables GraphBLAS kernels in the Apache Accumulo database. Read More

The software package developed in the MS thesis research implements functions for the intelligent guessing of polynomial sequence formulas based on user-defined expected sequence factors of the input coefficients. We present a specialized hybrid approach to finding exact representations for polynomial sequences that is motivated by the need for an automated procedures to discover the precise forms of these sums based on user guidance, or intuition, as to special sequence factors present in the formulas. In particular, the package combines the user input on the expected special sequence factors in the polynomial coefficient formulas with calls to the existing functions as subroutines that then process formulas for the remaining sequence terms already recognized by these packages. Read More

General infrastructure and scalable algorithms for sparse matrix multiplication enable succinct high-performance implementation of numerical methods and graph algorithms. We showcase the theoretical and practical quality of novel sparse matrix multiplication routines in Cyclops Tensor Framework (CTF) via MFBC: a Maximal Frontier Betweenness Centrality algorithm. Our sparse matrix multiplication algorithms and consequently MFBC perform asymptotically less communication than previous approaches. Read More

Parallel finite element algorithms based on object-oriented concepts are presented. Moreover, the design and implementation of a data structure proposed are utilized in realizing a parallel geometric multigrid method. The ParFEMapper and the ParFECommunicator are the key components of the data structure in the proposed parallel scheme. Read More

Inference on time series data is a common requirement in many scientific disciplines and internet of things (IoT) applications, yet there are few resources available to domain scientists to easily, robustly, and repeatably build such complex inference workflows: traditional statistical models of time series are often too rigid to explain complex time domain behavior, while popular machine learning packages require already-featurized dataset inputs. Moreover, the software engineering tasks required to instantiate the computational platform are daunting. cesium is an end-to-end time series analysis framework, consisting of a Python library as well as a web front-end interface, that allows researchers to featurize raw data and apply modern machine learning techniques in a simple, reproducible, and extensible way. Read More

Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. Read More

Scientists are drawn to synchrotrons and accelerator based light sources because of their brightness, coherence and flux. The rate of improvement in brightness and detector technology has outpaced Moore's law growth seen for computers, networks, and storage, and is enabling novel observations and discoveries with faster frame rates, larger fields of view, higher resolution, and higher dimensionality. Here we present an integrated software/algorithmic framework designed to capitalize on high throughput experiments, and describe the streamlined processing pipeline of ptychography data analysis. Read More

**Affiliations:**

^{1}University of Waterloo

Search engines for equations now exist, which return results matching the query's mathematical meaning or structural presentation. Operating over scientific papers, online encyclopedias, and math discussion forums, their content includes physics, math, and other sciences. They enable physicists to avoid jargon and more easily target mathematical content within and across disciplines. Read More

We describe GTApprox - a new tool for medium-scale surrogate modeling in industrial design. Compared to existing software, GTApprox brings several innovations: a few novel approximation algorithms, several advanced methods of automated model selection, novel options in the form of hints. We demonstrate the efficiency of GTApprox on a large collection of test problems. Read More

Exascale computing will feature novel and potentially disruptive hardware architectures. Exploiting these to their full potential is non-trivial. Numerical modelling frameworks involving finite difference methods are currently limited by the 'static' nature of the hand-coded discretisation schemes and repeatedly may have to be re-written to run efficiently on new hardware. Read More

Modular arithmetic is widely used in crytography and symbolic computation. This paper presents a vectorized Montgomery algorithm for modular multiplication, the key to fast modular arithmetic, that fully utilizes the SIMD instructions. We further show how the vectorized algorithm can be automatically generated by the {\SPIRAL} system, as part of the effort for automatic generation of a modular polynomial multiplication library. Read More