scirpy.ir_dist.metrics.GPUHammingDistanceCalculator

class scirpy.ir_dist.metrics.GPUHammingDistanceCalculator(*, cutoff=2, gpu_n_blocks=10, gpu_block_width=1000)

Computes pairwise distances between gene sequences based on the “hamming” distance metric with GPU support.

The code of this class is based on pwseqdist. Reused under MIT license, Copyright (c) 2020 Andrew Fiore-Gartland.

For performance reasons, the computation of the final result matrix is split up into several blocks. The parameter gpu_n_blocks determines the number of those blocks. The parameter gpu_block_width determines how much GPU memory is reserved for the computed result of each block in SPARSE representation.

For example, consider a 1000x1000 result matrix (dense representation) that has not yet been computed, with gpu_n_blocks=10 and gpu_block_width=20. The result matrix is then computed in 10 blocks of 1000x100 (dense representation). Once computed, each of these blocks must fit into a 1000x20 block in SPARSE representation, and this 1000x20 block must fit into GPU memory. Consequently, no row of a block may contain more than 20 values <= cutoff.
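The block arithmetic above can be checked on the CPU with a small NumPy sketch. This is only an illustration under stated assumptions, not the class's GPU implementation: the helper name max_hits_per_block is hypothetical, the columns are assumed to be split into contiguous, nearly equal blocks, and a toy distance matrix |i - j| stands in for real sequence distances.

```python
import numpy as np


def max_hits_per_block(dist, cutoff, gpu_n_blocks):
    """For each column block, return the largest per-row count of values <= cutoff.

    Hypothetical helper: assumes the result matrix is split column-wise
    into `gpu_n_blocks` contiguous blocks of nearly equal width.
    """
    n_cols = dist.shape[1]
    blocks = np.array_split(np.arange(n_cols), gpu_n_blocks)
    return [int((dist[:, cols] <= cutoff).sum(axis=1).max()) for cols in blocks]


# Toy 1000x1000 "distance" matrix: dist[i, j] = |i - j|.
idx = np.arange(1000)
dist = np.abs(idx[:, None] - idx[None, :])

hits = max_hits_per_block(dist, cutoff=2, gpu_n_blocks=10)

# Each 1000x100 block must fit into 1000 x gpu_block_width entries,
# so every row of every block may hold at most gpu_block_width hits.
fits_width_20 = all(h <= 20 for h in hits)
fits_width_4 = all(h <= 4 for h in hits)
```

In this toy case every block has at most 5 values <= cutoff per row, so a gpu_block_width of 20 would suffice while a width of 4 would not.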

The parameter gpu_block_width should be chosen based on the available GPU memory. Choosing lower values for gpu_n_blocks improves performance but also increases the risk of running out of reserved memory, because each result block, which must fit into the reserved GPU memory in sparse representation, becomes larger.

Parameters:
  • cutoff (int (default: 2)) – Distances > cutoff are eliminated to make efficient use of sparse matrices.

  • gpu_n_blocks (int (default: 10)) – Number of blocks in which the final result matrix should be computed. Each block reserves GPU memory in which the computed result block has to fit in sparse representation. Lower values give better performance but increase the risk of running out of reserved memory. This value should be chosen based on the estimated sparsity of the result matrix and the size of the GPU device memory.

  • gpu_block_width (int (default: 1000)) – Maximum width of blocks in which the final result matrix should be computed. Each block reserves GPU memory in which the computed result block has to fit in sparse representation. Higher values allow for a lower number of result blocks (gpu_n_blocks), which improves performance. This value should be chosen based on the GPU device memory.
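As a rough aid for choosing gpu_block_width, the reserved space per block can be estimated as rows x gpu_block_width entries. The sketch below is a back-of-the-envelope calculation only: both helper names are hypothetical, and bytes_per_entry=8 (for instance a 4-byte value plus a 4-byte column index) is an assumption, not a figure taken from the implementation.

```python
def reserved_bytes(n_rows, gpu_block_width, bytes_per_entry=8):
    """Estimate the bytes reserved for one sparse result block.

    bytes_per_entry is an assumption; the real per-entry size depends
    on the data types used by the implementation.
    """
    return n_rows * gpu_block_width * bytes_per_entry


def max_block_width(n_rows, budget_bytes, bytes_per_entry=8):
    """Largest gpu_block_width whose estimated block fits into budget_bytes."""
    return budget_bytes // (n_rows * bytes_per_entry)


# The 1000 x 20 block from the example above, under the 8-byte assumption.
estimate = reserved_bytes(n_rows=1000, gpu_block_width=20)
width = max_block_width(n_rows=1000, budget_bytes=estimate)
```

The estimate ignores any other GPU allocations (input sequences, intermediate buffers), so a real budget should leave headroom.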

Methods table

calc_dist_mat(seqs[, seqs2])

Calculates the pairwise distances between two vectors of gene sequences based on the distance metric of the derived class and returns a CSR distance matrix.

Methods

GPUHammingDistanceCalculator.calc_dist_mat(seqs, seqs2=None)

Calculates the pairwise distances between two vectors of gene sequences based on the distance metric of the derived class and returns a CSR distance matrix. Also creates a histogram based on the minimum value per row of the distance matrix if histogram is set to True.

Return type:

csr_matrix
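To illustrate the kind of result described here, the following is a minimal CPU reference sketch that builds a sparse CSR matrix of Hamming distances up to a cutoff. It is not the GPU implementation: the function name hamming_dist_mat_cpu is hypothetical, pairs of unequal length are assumed to be treated as above the cutoff, and the sketch stores distance + 1 so that identical sequences (distance 0) remain distinguishable from absent entries. That offset is a convention of this sketch and should be checked against the scirpy documentation before relying on it.

```python
from scipy.sparse import csr_matrix


def hamming_dist_mat_cpu(seqs, seqs2=None, cutoff=2):
    """Naive O(n*m) Hamming distance matrix in CSR format (illustration only)."""
    seqs2 = seqs if seqs2 is None else seqs2
    rows, cols, vals = [], [], []
    for i, a in enumerate(seqs):
        for j, b in enumerate(seqs2):
            if len(a) != len(b):
                # Sketch assumption: unequal lengths count as above cutoff.
                continue
            d = sum(x != y for x, y in zip(a, b))
            if d <= cutoff:
                rows.append(i)
                cols.append(j)
                # Sketch convention: store d + 1 so that a stored value is never 0.
                vals.append(d + 1)
    return csr_matrix((vals, (rows, cols)), shape=(len(seqs), len(seqs2)))


seqs = ["CASSL", "CASSF", "CAWWW", "CASS"]
mat = hamming_dist_mat_cpu(seqs, cutoff=1)
```

With cutoff=1, only the pair "CASSL"/"CASSF" (one mismatch) and the self-comparisons are stored; all other entries stay implicit zeros.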