cuda
Here are 2,250 public repositories matching this topic...
The current implementation of join can be improved by performing the operation in a single call to the backend kernel instead of multiple calls.
This is a fairly easy kernel and may be a good issue for someone getting to know CUDA/ArrayFire internals; a rough sketch of the single-kernel approach is below. Ping me if you want additional info.
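As a rough illustration only, here is a minimal sketch of joining several 1-D device arrays with a single kernel launch instead of one copy per input. The names (JoinArgs, join_kernel, MAX_INPUTS) are hypothetical and this is not ArrayFire's actual backend code.

#include <cuda_runtime.h>

constexpr int MAX_INPUTS = 16;

// Hypothetical argument pack: device pointers to each input plus a prefix
// sum of their lengths, passed to the kernel by value.
struct JoinArgs {
    const float* in[MAX_INPUTS];
    size_t offset[MAX_INPUTS + 1];  // input i covers [offset[i], offset[i+1]); offset[n_inputs] = total size
    int n_inputs;
};

__global__ void join_kernel(float* out, JoinArgs args)
{
    size_t total = args.offset[args.n_inputs];
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < total;
         i += (size_t)blockDim.x * gridDim.x) {
        // Find which input this output element comes from; a linear scan is
        // fine for a small, bounded number of inputs.
        int src = 0;
        while (i >= args.offset[src + 1]) ++src;
        out[i] = args.in[src][i - args.offset[src]];
    }
}

The same idea extends to an N-dimensional join along an arbitrary dimension: compute the per-input offsets once on the host and let one grid-stride kernel route each output element to its source, rather than launching one copy per input.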
PR NVIDIA/cub#218 fixes this in CUB's radix sort. We should:
- Check whether Thrust's other backends handle this case correctly.
- Provide a guarantee of this in the stable_sort documentation.
- Add regression tests to enforce this on all backends (a sketch of one possible test follows this list).
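As a starting point, here is a minimal sketch of what such a stability regression test could look like, assuming the guarantee being tested is that thrust::stable_sort_by_key preserves the relative order of equal keys. The structure is illustrative, not the actual Thrust test harness.

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cassert>
#include <cstdlib>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<int> h_keys(n);
    thrust::host_vector<int> h_vals(n);
    for (int i = 0; i < n; ++i) {
        h_keys[i] = std::rand() % 16;  // many duplicate keys
        h_vals[i] = i;                 // value records the original position
    }

    thrust::device_vector<int> d_keys = h_keys;
    thrust::device_vector<int> d_vals = h_vals;
    thrust::stable_sort_by_key(d_keys.begin(), d_keys.end(), d_vals.begin());

    thrust::host_vector<int> keys = d_keys;
    thrust::host_vector<int> vals = d_vals;
    for (int i = 1; i < n; ++i) {
        // Stability: among equal keys, values (original indices) must stay increasing.
        if (keys[i] == keys[i - 1]) {
            assert(vals[i - 1] < vals[i]);
        }
    }
    return 0;
}

To cover the other backends, the same test can be compiled with THRUST_DEVICE_SYSTEM set to each of CUDA, OMP, TBB, and CPP.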
Report needed documentation
We do not have documentation specifying the different treelite Operator values that FIL supports. (https://github.com/dmlc/treelite/blob/46c8390aed4491ea97a017d447f921efef9f03ef/include/treelite/base.h#L40)
Report needed documentation
https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/test/sg/fil_test.cu
There are multiple places in the fil_test.cu file that need documentation.
I often use -v just to see that something is going on, but a progress bar (enabled by default) would serve the same purpose and be more concise.
We can just factor out the code from futhark bench for this.
Segmented reduce uses the same template type OffsetIteratorT for both the begin and end offsets:

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedReduce::Sum(
    void*           d_temp_storage,
    size_t&         temp_storage_bytes,
    InputIteratorT  d_in,
    OutputIteratorT d_out,
    int             num_segments,
    OffsetIteratorT d_begin_offsets,
    OffsetIteratorT d_end_offsets,
    cudaStream_t    stream = 0,
    bool            debug_synchronous = false)
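Because a single OffsetIteratorT is deduced from both offset arguments, mixing iterator types for the begin and end offsets does not compile. The hypothetical usage below illustrates the limitation; it is a sketch, not part of CUB's test suite.

#include <cub/cub.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical functor mapping segment index i to i * len.
struct times_len {
    int len;
    __host__ __device__ int operator()(int i) const { return i * len; }
};

void segmented_sum(const float* d_in, float* d_out, const int* d_offsets,
                   int num_segments, int seg_len,
                   void* d_temp, size_t& temp_bytes)
{
    (void)seg_len;  // used only in the commented-out call below

    // Works: begin and end offsets are both const int*, so OffsetIteratorT
    // deduces to a single type.
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    num_segments, d_offsets, d_offsets + 1);

    // Does not compile: const int* for begin offsets and a transform
    // iterator for end offsets cannot deduce one OffsetIteratorT. Separate
    // BeginOffsetIteratorT / EndOffsetIteratorT template parameters would
    // allow this mix.
    // auto end_offsets = thrust::make_transform_iterator(
    //     thrust::make_counting_iterator(1), times_len{seg_len});
    // cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
    //                                 num_segments, d_offsets, end_offsets);
}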
The current default value for the rows_per_chunk parameter of the CSV writer is 8, which means that the input table is by default broken into many small slices that are written out sequentially. This reduces performance by an order of magnitude in some cases. In the Python layer, the default is the number of rows (i.e. the table is written out in a single pass). We can follow this by setting rows_per_chunk to the number of rows by default.
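For illustration only, here is a minimal, self-contained sketch of the chunking behaviour described above, assuming a hypothetical Table type and rows_to_csv helper rather than libcudf's actual writer code.

#include <algorithm>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct Table {
    std::vector<std::vector<double>> columns;
    size_t num_rows() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Convert rows [start, end) of the table to CSV text (hypothetical helper).
static std::string rows_to_csv(const Table& t, size_t start, size_t end)
{
    std::ostringstream ss;
    for (size_t r = start; r < end; ++r) {
        for (size_t c = 0; c < t.columns.size(); ++c) {
            ss << t.columns[c][r] << (c + 1 < t.columns.size() ? ',' : '\n');
        }
    }
    return ss.str();
}

// The writer slices the table into rows_per_chunk pieces and writes each
// slice separately; a tiny default (8) means many small sequential writes,
// while rows_per_chunk == t.num_rows() writes the table in a single pass.
void write_csv(const Table& t, std::ostream& out, size_t rows_per_chunk)
{
    const size_t n = t.num_rows();
    for (size_t start = 0; start < n; start += rows_per_chunk) {
        const size_t end = std::min(start + rows_per_chunk, n);
        out << rows_to_csv(t, start, end);
    }
}

With rows_per_chunk equal to the row count the loop body runs once, which matches the single-pass behaviour of the Python layer.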