Computer Science - Mathematical Software Publications (50)



This paper describes fast sorting techniques using the recent AVX-512 instruction set. Our implementations benefit from the latest possibilities offered by AVX-512 to vectorize a two-part hybrid algorithm: we sort the small arrays using a branch-free Bitonic variant, and we provide a vectorized partitioning kernel which is the main component of the well-known Quicksort. Our algorithm sorts in-place and is straightforward to implement thanks to the new instructions.
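
To illustrate the branch-free Bitonic idea mentioned above, here is a minimal scalar sketch in Python. It is not the paper's AVX-512 implementation: the helper names `compare_exchange` and `bitonic_sort` are hypothetical, and `min`/`max` merely stand in for the vector min/max instructions a SIMD version would use. The point is that every comparison position depends only on indices, never on the data.

```python
def compare_exchange(values, i, j):
    """Place the smaller value at index i and the larger at index j."""
    lo, hi = min(values[i], values[j]), max(values[i], values[j])
    values[i], values[j] = lo, hi


def bitonic_sort(values):
    """Sort a list in place with a bitonic network (length must be a power of two)."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if i & k == 0:
                        compare_exchange(values, i, partner)  # ascending block
                    else:
                        compare_exchange(values, partner, i)  # descending block
            j //= 2
        k *= 2
    return values


data = [5, 1, 4, 2, 8, 7, 3, 6]
bitonic_sort(data)
```

Because the sequence of compare-exchange operations is fixed for a given length, such a network maps naturally onto vector registers, which is presumably why it suits small arrays.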

A flexible and highly-extensible data assimilation testing suite, named DATeS, is described in this paper. DATeS aims to offer a unified testing environment that allows researchers to compare different data assimilation methodologies and understand their performance in various settings. The core of DATeS is implemented in Python and takes advantage of its object-oriented capabilities.

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutations are determined at runtime. To overcome this limitation, we introduce the open-source C++ library High-Performance Tensor Transposition (HPTT).

In this paper, we present two algorithms based on the Froidure-Pin Algorithm for computing a finite semigroup. If $U$ is any semigroup and $A$ is a subset of $U$, then we denote by $\langle A\rangle$ the least subsemigroup of $U$ containing $A$. If $B$ is any other subset of $U$, then, roughly speaking, the first algorithm we present describes how to use any information about $\langle A\rangle$ that has been found using the Froidure-Pin Algorithm to compute the semigroup $\langle A, B\rangle$.
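
The notions of $\langle A\rangle$ and $\langle A, B\rangle$ can be made concrete with a small sketch. The code below is a naive closure computation, not the Froidure-Pin Algorithm or the paper's method; the helpers `compose`, `enumerate_semigroup` and `extend_semigroup` are hypothetical names, and transformations of a finite set stored as tuples are an assumed representation. It only illustrates why knowing $\langle A\rangle$ saves work: products of old elements with old generators never need to be recomputed.

```python
from collections import deque


def compose(f, g):
    """Product of two transformations stored as tuples: apply f first, then g."""
    return tuple(g[x] for x in f)


def enumerate_semigroup(gens):
    """Naive closure: the set of all products of the generators."""
    elements = set(gens)
    queue = deque(elements)
    while queue:
        s = queue.popleft()
        for g in gens:
            p = compose(s, g)
            if p not in elements:
                elements.add(p)
                queue.append(p)
    return elements


def extend_semigroup(known, gens_a, gens_b):
    """Compute <A, B>, assuming `known` is already the full set <A>."""
    elements = set(known)
    queue = deque()

    def visit(p):
        if p not in elements:
            elements.add(p)
            queue.append(p)

    for b in gens_b:
        visit(b)
    # Old elements times old generators stay inside <A>, so only the
    # new generators need to be applied to them.
    for s in known:
        for b in gens_b:
            visit(compose(s, b))
    # Newly found elements must be multiplied by every generator.
    all_gens = list(gens_a) + list(gens_b)
    while queue:
        s = queue.popleft()
        for g in all_gens:
            visit(compose(s, g))
    return elements


a = (1, 2, 0)  # cyclic permutation of {0, 1, 2}
b = (0, 0, 2)  # a non-invertible transformation
s_a = enumerate_semigroup([a])
s_ab = extend_semigroup(s_a, [a], [b])
```

Here `s_a` is the cyclic group of order 3 generated by `a`, and `s_ab` agrees with what `enumerate_semigroup([a, b])` would compute from scratch.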

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued.

BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization. A key difference with respect to existing high-performance implementations of BLAS is that the computational performance is optimized for small to medium scale matrices, i.e.

Conical functions appear in a large number of applications in physics and engineering. In this paper we describe an extension of our module CONICAL for the computation of conical functions. Specifically, the module now includes a routine for computing the function ${{\rm R}}^{m}_{-\frac{1}{2}+i\tau}(x)$, a real-valued numerically satisfactory companion of the function ${\rm P}^m_{-\tfrac12+i\tau}(x)$ for $x>1$.

Background: The component-based modeling language Modelica (OpenModelica is an open-source implementation) is used for the numerical simulation of complex processes of different nature represented by ODE systems. However, the OpenModelica standard library has no routines for pseudo-random number generation, which makes it impossible to use for modeling stochastic processes. Purpose: The goal of this article is to give a brief overview of a number of algorithms for generating a sequence of uniformly distributed pseudo-random numbers, to assess the quality of the sequences they produce, and to describe ways to implement some of these algorithms in the OpenModelica system.

Domain-specific languages (DSLs) are of increasing importance in scientific high-performance computing to reduce development costs, raise the level of abstraction and, thus, ease scientific programming. However, designing and implementing DSLs is not an easy task, as it requires knowledge of the application domain and experience in language engineering and compilers. Consequently, many DSLs follow a weak approach using macros or text generators, which lack many of the features that make a DSL comfortable for programmers.

Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding.
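
As a reminder of what embedding a resource in base64 looks like in practice, the following sketch builds and parses a `data:` URI with Python's standard `base64` module. It shows the use case only, not the accelerated encoder or decoder the paper is about; `to_data_uri` and `from_data_uri` are hypothetical helper names.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Embed raw bytes as a base64 data URI, e.g. for an HTML src attribute."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    header, _, encoded = uri.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(encoded, validate=True)


uri = to_data_uri(b"hello", "text/plain")
```

Every 3 input bytes become 4 output characters, so the encoded form is about a third larger than the payload, which is one reason decoding speed matters at scale.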

Large scale parameter estimation problems are among some of the most computationally demanding problems in numerical analysis. An academic researcher's domain-specific knowledge often precludes that of software design, which results in inversion frameworks that are technically correct, but not scalable to realistically-sized problems. On the other hand, the computational demands for realistic problems result in industrial codebases that are geared solely for high performance, rather than comprehensibility or flexibility.

We study the effect of adaptive mesh refinement on a parallel domain decomposition solver of a linear system of algebraic equations. These concepts need to be combined within a parallel adaptive finite element software. A prototype implementation is presented for this purpose.

A novel and scalable geometric multi-level algorithm is presented for the numerical solution of elliptic partial differential equations, specially designed to run with high occupancy of streaming processors inside Graphics Processing Units (GPUs). The algorithm consists of iterative, superposed operations on a single grid, and it is composed of two simple full-grid routines: a restriction and a coarsened interpolation-relaxation. The restriction is used to collect sources using recursive coarsened averages, and the interpolation-relaxation simultaneously applies coarsened finite-difference operators and interpolations.

The contraction method is a procedure that allows one to establish non-trivial relations between Lie algebras and has had successful applications in both mathematics and theoretical physics. This work deals with generalizations of the contraction procedure with a main focus on the so-called S-expansion method, as it includes most of the other generalized contractions. Basically, the S-expansion combines a Lie algebra $\mathcal{G}$ with a finite abelian semigroup $S$ in order to define new S-expanded algebras.

This report provides an introduction to some Machine Learning tools within the most common development environments. It mainly focuses on practical problems, skipping any theoretical introduction. It is oriented to both students trying to approach Machine Learning and experts looking for new frameworks.

We describe a parallel, adaptive, multi-block algorithm for explicit integration of time dependent partial differential equations on two-dimensional Cartesian grids. The grid layout we consider consists of a nested hierarchy of fixed size, non-overlapping, logically Cartesian grids stored as leaves in a quadtree. Dynamic grid refinement and parallel partitioning of the grids are done through the use of the highly scalable quadtree/octree library p4est.

This research investigates the implementation mechanism of a block-wise ILU(k) preconditioner on GPUs. The block-wise ILU(k) algorithm requires both the level k and the block size to be designed as variables. A decoupled ILU(k) algorithm consists of a symbolic phase and a factorization phase.

Constructing active sets is a key part of the Multivariate Decomposition Method. An algorithm for constructing optimal or quasi-optimal active sets is proposed in the paper. By numerical experiments, it is shown that the new method can provide sets that are significantly smaller than the sets constructed by the already existing method.

Extreme-scale computational science increasingly demands multiscale and multiphysics formulations. Combining software developed by independent groups is imperative: no single team has resources for all predictive science and decision support capabilities. Scientific libraries provide high-quality, reusable software components for constructing applications with improved robustness and portability.

The Landau collision integral is an accurate model for the small-angle dominated Coulomb collisions in fusion plasmas. We investigate a high order accurate, fully conservative, finite element discretization of the nonlinear multi-species Landau integral with adaptive mesh refinement using the PETSc library (www.mcs.

Regional hydrology studies are often supported by high resolution simulations of subsurface flow that require expensive and extensive computations. Efficient usage of the latest high performance parallel computing systems becomes a necessity. The simulation software ParFlow has been demonstrated to meet this requirement and shown to have excellent solver scalability for up to 16,384 processes.

The R package frailtySurv for simulating and fitting semi-parametric shared frailty models is introduced. frailtySurv implements semi-parametric consistent estimators for a variety of frailty distributions, including gamma, log-normal, inverse Gaussian and power variance function, and provides consistent estimators of the standard errors of the parameters' estimators. The parameters' estimators are asymptotically normally distributed, and therefore statistical inference based on the results of this package, such as hypothesis testing and confidence intervals, can be performed using the normal distribution.

In this paper, we import tensor index notation including Einstein summation notation into programming by introducing two kinds of functions, tensor functions and scalar functions. Tensor functions are functions that contract the tensors given as an argument, and scalar functions are the others. As with ordinary functions, when a tensor function obtains a tensor as an argument, the tensor function treats the tensor as it is as a tensor.

Simflowny is an open platform which automatically generates parallel code of scientific dynamical models for different simulation frameworks. Here we present major upgrades to this software to support an extended set of families of models, in particular: i) a new generic family for partial differential equations, which can include spatial derivatives of any order, ii) a new family for agent based models to study complex phenomena, either on a spatial domain or on a graph. Additionally we introduce a flexible graphical user interface (GUI) to accommodate these and future families of equations.

Scikit-multilearn is a Python library for performing multi-label classification. The library is compatible with the scikit/scipy ecosystem and uses sparse matrices for all internal operations. It provides native Python implementations of popular multi-label classification methods alongside a novel framework for label space partitioning and division.

Optimization of Mixed-Integer Non-Linear Programming (MINLP) supports important decisions in applications such as Chemical Process Engineering. But current solvers have limited ability for deductive reasoning or the use of domain-specific theories, and the management of integrality constraints does not yet exploit automated reasoning tools such as SMT solvers. This seems to limit both scalability and reach of such tools in practice.

Fast Fourier Transforms (FFTs) are exploited in a wide variety of fields ranging from computer science to natural sciences and engineering. With the rising data production bandwidths of modern FFT applications, judging which algorithmic tool is best to apply can be vital to any scientific endeavor. As tailored FFT implementations exist for an ever increasing variety of high performance computer hardware, choosing the best performing FFT implementation has strong implications for future hardware purchase decisions, for resources FFTs consume and for possibly decisive financial and time savings ahead of the competition.

This paper presents our work on designing scalable linear solvers for large-scale reservoir simulations. The main objective is to support implementation of parallel reservoir simulators on distributed-memory parallel systems, where MPI (Message Passing Interface) is employed for communications among computation nodes. Distributed matrix and vector modules are designed, which are the base of our parallel linear systems.

This article describes the implementation of an all-in-one numerical procedure within the runtime StarPU. In order to limit the complexity of the method, for the sake of clarity of the presentation of the non-classical task-driven programming environment, we have limited the numerics to first order in space and time. Results show that the task distribution is efficient if the tasks are numerous and individually large enough so that the task heap can be saturated by tasks whose computational time covers the task management overhead.

We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet's dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, and the user is free to use different network structures for each input.

This thesis examines a modern concept for machine numbers based on interval arithmetic called 'Unums' and compares it to IEEE 754 floating-point arithmetic, evaluating possible uses of this format where floating-point numbers are inadequate. In the course of this examination, this thesis builds theoretical foundations for IEEE 754 floating-point numbers, interval arithmetic based on the projectively extended real numbers and Unums.

This paper introduces a method to reduce communication that is injected into the network during a sparse matrix-vector multiply by reorganizing messages on each node. This results in a reduction of the inter-node communication, replaced by less-costly intra-node communication, which reduces both the number and size of messages that are injected into the network.

This is the user manual for the software package BSEPACK (Bethe--Salpeter Eigenvalue Solver Package).

In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for the symmetric tridiagonal matrices based on ScaLAPACK and the hierarchically semiseparable (HSS) matrices. HSS is an important type of rank-structured matrices. Most of the time in the DC algorithm is spent computing the eigenvectors via the matrix-matrix multiplications (MMM).

We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantages of this new approach.

We consider the extension of the method of Gauss-Newton from complex floating-point arithmetic to the field of truncated power series with complex floating-point coefficients. With linearization we formulate a linear system where the coefficient matrix is a series with matrix coefficients. The structure of the linear system leads in the regular case to a block triangular system.

We present an efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design where we achieve performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700. Theoretical and experimental analysis of classical HT is performed for opportunities to exhibit higher degree of parallelism where parallelism is quantified as a number of parallel operations per level in the Directed Acyclic Graph (DAG) of the transform. Based on theoretical analysis of classical HT, an opportunity to rearrange computations in the classical HT is identified that results in Modified HT (MHT) where it is shown that MHT exhibits 1.

SimTensor is a multi-platform, open-source software for generating artificial tensor data (either with CP/PARAFAC or Tucker structure) for reproducible research on tensor factorization algorithms. SimTensor is a stand-alone application based on MATLAB. It provides a wide range of facilities for generating tensor data with various configurations.

Simulations of physical phenomena are essential to the expedient design of precision components in aerospace and other high-tech industries. These phenomena are often described by mathematical models involving partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution that is difficult to achieve in a reasonable amount of time even in effectively parallelized solvers.

Iterative methods on irregular grids have been used widely in all areas of computational science and engineering for solving partial differential equations with complex geometry. They provide the flexibility to express complex shapes with relatively low computational cost. However, the direction of the evolution of high-performance processors in the last two decades has caused serious degradation of the computational efficiency of iterative methods on irregular grids, because of relatively low memory bandwidth.

We present the library Moore, which implements Interval Arithmetic in modern C++. This library is based on a new feature in the C++ language called concepts, which reduces the problems caused by template meta programming, and leads to a new approach for implementing interval arithmetic libraries in C++.
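
For readers unfamiliar with interval arithmetic, a minimal sketch may help. It is written in Python rather than C++ and does not reflect the Moore API: the `Interval` class and the `_down`/`_up` helpers are hypothetical. Instead of switching the hardware rounding mode, as a production library typically would, it widens each computed bound by one unit in the last place using `math.nextafter` (Python 3.9 or later), which still guarantees that the exact result is enclosed. Infinite bounds and NaN handling are ignored.

```python
import math
from dataclasses import dataclass


def _down(x):
    """Next representable float below x (assumed substitute for rounding down)."""
    return math.nextafter(x, -math.inf)


def _up(x):
    """Next representable float above x (assumed substitute for rounding up)."""
    return math.nextafter(x, math.inf)


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __mul__(self, other):
        # The extreme values of a product are attained at endpoint products.
        products = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(_down(min(products)), _up(max(products)))

    def __contains__(self, x):
        return self.lo <= x <= self.hi


total = Interval(0.1, 0.1) + Interval(0.2, 0.2)
product = Interval(-1.0, 2.0) * Interval(3.0, 4.0)
```

The floating-point sum `0.1 + 0.2` is not exactly three tenths, but `total` is guaranteed to contain the exact sum of the two stored operands, and `product` encloses the range from -4 to 8.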

Software for mixed-integer linear programming can return incorrect results for a number of reasons, one being the use of inexact floating-point arithmetic. Even solvers that employ exact arithmetic may suffer from programming or algorithmic errors, motivating the desire for a way to produce independently verifiable certificates of claimed results. Due to the complex nature of state-of-the-art MILP solution algorithms, the ideal form of such a certificate is not entirely clear.

High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Thus, achieving high performance for the Gemm kernel translates into a high performance linear algebra stack above this kernel.

In recent years, deep neural networks (DNNs) have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite increasing hardware flexibility and software programming toolchain maturity, high efficiency GPU programming remains difficult: it suffers from high complexity, low productivity, and low portability.

We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task.

The paper presents a parallel math library, dMath, that demonstrates leading scaling when using intranode, internode, and hybrid-parallelism for deep learning (DL). dMath provides easy-to-use distributed primitives and a variety of domain-specific algorithms including matrix multiplication, convolutions, and others allowing for rapid development of scalable applications like deep neural networks (DNNs). Persistent data stored in GPU memory and advanced memory management techniques avoid costly transfers between host and device.

We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor.

DiffSharp is an algorithmic differentiation or automatic differentiation (AD) library for the .NET ecosystem, which is targeted by the C# and F# languages, among others. The library has been designed with machine learning applications in mind, allowing very succinct implementations of models and optimization routines.
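
The idea behind automatic differentiation can be shown with a short forward-mode sketch based on dual numbers. This is a generic textbook construction in Python, not the DiffSharp API; the names `Dual`, `_lift`, `sin` and `derivative` are hypothetical, and only addition, multiplication and sine are supported.

```python
import math
from dataclasses import dataclass


def _lift(x):
    """Treat plain numbers as constants, i.e. duals with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


@dataclass(frozen=True)
class Dual:
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(x):
    """Sine with the chain rule applied to the derivative part."""
    x = _lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


slope = derivative(lambda x: x * x * x + 2 * x, 2.0)  # 3*2**2 + 2 = 14.0
```

Each arithmetic operation propagates the derivative alongside the value, so the result is exact up to floating-point rounding rather than a finite-difference approximation.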

We show that Automatic Differentiation (AD) operators can be provided in a dynamic language without sacrificing numeric performance. To achieve this, general forward and reverse AD functions are added to a simple high-level dynamic language, and support for them is included in an aggressive optimizing compiler. Novel technical mechanisms are discussed, which have the ability to migrate the AD transformations from run-time to compile-time.

Heretofore, automatic checkpointing at procedure-call boundaries, to reduce the space complexity of reverse mode, has been provided by systems like Tapenade. However, binomial checkpointing, or treeverse, has only been provided in Automatic Differentiation (AD) systems in special cases, e.g.