# Cristobal Navarro

## Contact Details

Name: Cristobal Navarro


## Pub Categories

- Computer Science - Distributed, Parallel, and Cluster Computing (6)
- Physics - Computational Physics (2)
- Physics - Statistical Mechanics (1)
- Computer Science - Architecture (1)

## Publications Authored By Cristobal Navarro

A novel parallel simulation algorithm on the GPU, implemented in CUDA and C++, is presented for the simulation of Brownian particles that display excluded volume repulsion and interact through long- and short-range forces. When an explicit Euler-Maruyama integration step is performed to take into account the pairwise forces and Brownian motion, particle overlaps can appear. The excluded volume property brings up the need for correcting these overlaps as they happen, since predicting them is not feasible due to the random displacement of Brownian particles.
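The integration step and the overlaps it can cause can be sketched in a few lines. This is a minimal serial sketch in Python rather than the CUDA/C++ implementation the abstract describes. The function names, the 2D setting, the parameters `diffusion` and `gamma`, and the all-pairs overlap test are assumptions made for illustration. The overlap correction itself is not shown, because the abstract does not describe it.

```python
import math
import random

def euler_maruyama_step(pos, force, dt, diffusion, gamma, rng):
    """Advance every 2D particle by one explicit Euler-Maruyama step.

    Drift term: (force / gamma) * dt. Brownian term: Gaussian kick with
    standard deviation sqrt(2 * diffusion * dt) per coordinate.
    """
    sigma = math.sqrt(2.0 * diffusion * dt)
    new_pos = []
    for (x, y), (fx, fy) in zip(pos, force):
        new_pos.append((x + fx / gamma * dt + sigma * rng.gauss(0.0, 1.0),
                        y + fy / gamma * dt + sigma * rng.gauss(0.0, 1.0)))
    return new_pos

def find_overlaps(pos, radius):
    """Return index pairs (i, j), i < j, whose centres are closer than 2 * radius."""
    pairs = []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            if math.dist(pos[i], pos[j]) < 2.0 * radius:
                pairs.append((i, j))
    return pairs

# Hypothetical usage: three hard disks of radius 1, no deterministic forces.
rng = random.Random(0)
positions = [(0.0, 0.0), (2.1, 0.0), (10.0, 10.0)]
forces = [(0.0, 0.0)] * 3
positions = euler_maruyama_step(positions, forces, dt=0.01,
                                diffusion=1.0, gamma=1.0, rng=rng)
print(find_overlaps(positions, radius=1.0))
```

Because the Brownian kick is random, whether the first two disks overlap after the step cannot be known in advance, which is why overlaps have to be detected and corrected after the step.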

The problem of parallel thread mapping is studied for the case of discrete orthogonal $m$-simplices. The possibility of an $O(1)$ time recursive block-space map $\lambda: \mathbb{Z}^m \to \mathbb{Z}^m$ is analyzed from the point of view of parallel space efficiency and potential performance improvement. The $2$-simplex and $3$-simplex are analyzed as special cases, where constant time maps are found that are potentially up to $2\times$ and $6\times$ more efficient than a bounding-box approach, respectively.
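The $2\times$ and $6\times$ figures can be checked by counting elements. The sketch below assumes that a discrete orthogonal $m$-simplex of side $n$ contains $\binom{n+m-1}{m}$ elements (diagonal included) and that the bounding box contains $n^m$. The function names are hypothetical. It compares parallel-space sizes only and says nothing about measured run time.

```python
from math import comb

def simplex_size(n, m):
    """Elements in a discrete orthogonal m-simplex of side n (assumed convention)."""
    return comb(n + m - 1, m)

def bounding_box_ratio(n, m):
    """How many times larger the n^m bounding box is than the simplex it covers."""
    return n ** m / simplex_size(n, m)

for m in (2, 3):
    print(m, round(bounding_box_ratio(1024, m), 3))
```

Under these assumptions the ratio approaches $m!$ as $n$ grows, that is $2$ for the $2$-simplex and $6$ for the $3$-simplex.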

There is a stage in the GPU computing pipeline where a grid of thread-blocks, in *parallel space*, is mapped onto the problem domain, in *data space*. Since the parallel space is restricted to a box-type geometry, the mapping approach is typically a $k$-dimensional bounding box (BB) that covers a $p$-dimensional data space. Threads that fall inside the domain perform computations while threads that fall outside are discarded at runtime.
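A bounding-box launch can be emulated serially to see how many threads end up discarded. The sketch below assumes a 2D lower-triangular data space $x \le y < n$ and square blocks. The function name, the domain and the block shape are illustrative choices, not taken from the abstract.

```python
def bounding_box_launch(n, block):
    """Emulate a 2D bounding-box grid over the triangular domain x <= y < n.

    Returns (active, discarded): threads that land inside the domain and
    threads that fall outside it and would exit immediately.
    """
    blocks_per_side = (n + block - 1) // block  # ceil(n / block)
    active = discarded = 0
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            for ty in range(block):
                for tx in range(block):
                    x = bx * block + tx
                    y = by * block + ty
                    if x <= y < n:
                        active += 1
                    else:
                        discarded += 1
    return active, discarded

print(bounding_box_launch(8, 4))  # → (36, 28)
```

For this domain close to half of the launched threads do no work, and the fraction does not shrink as $n$ grows.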

Data-parallel domain re-organization and thread-mapping techniques are relevant topics, as they can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work we study the potential benefits of applying a succinct data re-organization of a tetrahedral data-parallel domain of size $O(n^3)$ combined with an efficient block-space GPU map of the form $g:\mathbb{N} \rightarrow \mathbb{N}^3$. Results from the analysis suggest that in theory the combination of these two optimizations produces a significant performance improvement, as block-based data re-organization allows a coalesced one-to-one correspondence at local thread-space while $g(\lambda)$ produces an efficient block-space spatial correspondence between groups of data and groups of threads, reducing the number of unnecessary threads from $O(n^3)$ to $O(n^2\rho^3)$ where $\rho$ is the linear block-size and typically $\rho^3 \ll n$.
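The reduction in unnecessary threads can be illustrated with a direct count. The sketch below assumes the tetrahedral domain $0 \le x \le y \le z < n$, cubic blocks of side $\rho$ with $\rho$ dividing $n$, and a block-space map that launches only blocks with $b_x \le b_y \le b_z$. These choices and the function names are assumptions for illustration, not the construction from the abstract. The count covers the block-space map only, not the data re-organization.

```python
def tetra_size(n):
    """Number of integer points with 0 <= x <= y <= z < n."""
    return n * (n + 1) * (n + 2) // 6

def wasted_threads(n, rho):
    """Return (bounding_box_waste, block_map_waste) for side n and block side rho."""
    if n % rho != 0:
        raise ValueError("this sketch assumes rho divides n")
    blocks_per_side = n // rho
    useful = tetra_size(n)
    # Bounding box: n^3 threads are launched in total.
    box_waste = n ** 3 - useful
    # Block-space map: only blocks with bx <= by <= bz are launched,
    # each holding rho^3 threads.
    block_waste = tetra_size(blocks_per_side) * rho ** 3 - useful
    return box_waste, block_waste

print(wasted_threads(64, 4))  # → (216384, 6464)
```

Under these assumptions the bounding-box waste grows cubically in $n$, while the block-map waste grows quadratically in $n$ and stays within the $O(n^2\rho^3)$ bound quoted above.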

The computational cost of transfer matrix methods for the Potts model is directly related to the problem of *in how many ways can two adjacent blocks of a lattice be connected*. Answering this question leads to the generation of a combinatorial set of lattice configurations. This set defines the *configuration space* of the problem, and the smaller it is, the faster the transfer matrix method can be.

There is a stage in the GPU computing pipeline where a grid of thread-blocks is mapped to the problem domain. Normally, this grid is a k-dimensional bounding box that covers a k-dimensional problem no matter its shape. Threads that fall inside the problem domain perform computations; the rest are discarded at runtime.
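One alternative to a bounding box, for a triangular 2D domain, is to launch only as many blocks as the domain needs and recover each block's coordinates from its linear index. The abstract is truncated, so the map below is not taken from it. It is the standard inverse of the triangular-number enumeration, written with integer arithmetic, and the function name is hypothetical.

```python
from math import isqrt

def triangular_map(k):
    """Map a linear index k >= 0 to (row, col) with 0 <= col <= row.

    Row r starts at index r * (r + 1) / 2, so the row is the largest r
    whose start does not exceed k.
    """
    row = (isqrt(8 * k + 1) - 1) // 2
    col = k - row * (row + 1) // 2
    return row, col

print([triangular_map(k) for k in range(6)])
# → [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]
```

With this enumeration, $n(n+1)/2$ linear indices cover a lower-triangular domain of side $n$ exactly once, so no block lands outside the domain.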

The transfer-matrix technique is a convenient way to study strip lattices in the Potts model, since the computational costs depend just on the periodic part of the lattice and not on the whole. However, even when the cost is reduced, the transfer-matrix technique is still an NP-hard problem since the time T(|V|, |E|) needed to compute the matrix grows exponentially as a function of the graph width. In this work, we present a parallel transfer-matrix implementation that scales in performance on multi-core architectures.