cuda
Here are 2,519 public repositories matching this topic...
Problem:
catboost version: 0.23.2
Operating System: all
Tutorial: https://github.com/catboost/tutorials/blob/master/custom_loss/custom_metric_tutorial.md
It is impossible to use a custom metric (C++).
Code example
from catboost import CatBoost
train_data = [[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]]  # sample data in the style of the tutorial
train_labels = [10, 20, 30]
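For context, the Python-side custom metric protocol from the linked tutorial looks roughly like this (a sketch; the report concerns the C++ path, which does not work):

class MseMetric(object):
    # Custom metric object protocol from the catboost tutorial.
    def is_max_optimal(self):
        return False  # smaller metric values are better

    def evaluate(self, approxes, target, weight):
        # approxes is a list of prediction lists, one per dimension.
        preds = approxes[0]
        error_sum = sum((p - t) ** 2 for p, t in zip(preds, target))
        return error_sum, len(target)

    def get_final_error(self, error, weight):
        return error / weight if weight else 0.0

The tutorial passes an instance of such a class via the eval_metric parameter; this report is that the equivalent C++ route cannot be used.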
The current default value of the rows_per_chunk parameter of the CSV writer is 8, which means the input table is by default broken into many small slices that are written out sequentially. This reduces performance by an order of magnitude in some cases.
In the Python layer, the default is the number of rows (i.e. the table is written out in a single pass). We can follow this by setting rows_per_chunk to the number of rows by default.
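To make the cost concrete, here is a plain-Python illustration (not the cudf API) of why many tiny chunks are slow: every chunk pays a fixed per-write overhead that a single-pass write amortizes.

import time

rows = ["%d,%d" % (i, i * 2) for i in range(100_000)]

def write_chunked(path, rows, rows_per_chunk):
    # Write the "table" in slices of rows_per_chunk rows each.
    with open(path, "w") as f:
        for start in range(0, len(rows), rows_per_chunk):
            f.write("\n".join(rows[start:start + rows_per_chunk]) + "\n")

t0 = time.time()
write_chunked("many.csv", rows, rows_per_chunk=8)         # current C++ default
t1 = time.time()
write_chunked("one.csv", rows, rows_per_chunk=len(rows))  # single pass
t2 = time.time()
print("8-row chunks: %.3fs, single pass: %.3fs" % (t1 - t0, t2 - t1))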
The current implementation of join can be improved by performing the operation in a single call to the backend kernel instead of multiple calls.
This is a fairly easy kernel and may be a good issue for someone getting to know CUDA/ArrayFire internals; a rough sketch of the idea follows. Ping me if you want additional info.
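A NumPy analogy of the proposed change (illustrative only, not ArrayFire internals): joining pairwise launches one call per step and allocates an intermediate each time, while a single fused call sizes the output once and copies every input into place.

import numpy as np

def join_pairwise(arrays, axis=0):
    # N-1 separate calls, each allocating an intermediate buffer.
    out = arrays[0]
    for a in arrays[1:]:
        out = np.concatenate([out, a], axis=axis)
    return out

def join_fused(arrays, axis=0):
    # One call: the output is sized once and filled in a single pass.
    return np.concatenate(arrays, axis=axis)

parts = [np.full((4, 3), i) for i in range(8)]
assert (join_pairwise(parts) == join_fused(parts)).all()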
PR NVIDIA/cub#218 fixes this in CUB's radix sort. We should:
- Check whether Thrust's other backends handle this case correctly.
- Provide a guarantee of this in the stable_sort documentation.
- Add regression tests to enforce this on all backends (a minimal illustration of the stability property follows this list).
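The property such regression tests would pin down, shown in plain Python rather than Thrust: a stable sort must keep elements with equal keys in their input order.

records = [("b", 0), ("a", 1), ("b", 2), ("a", 3)]
out = sorted(records, key=lambda kv: kv[0])  # Python's sorted() is stable
# Equal keys retain their original relative order:
assert out == [("a", 1), ("a", 3), ("b", 0), ("b", 2)]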
Report needed documentation
We do not have documentation specifying the different treelite Operator values that FIL supports. (https://github.com/dmlc/treelite/blob/46c8390aed4491ea97a017d447f921efef9f03ef/include/treelite/base.h#L40)
Report needed documentation
https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/test/sg/fil_test.cu
There are multiple places in the fil_test.cu file that need documentation.
I often use -v just to see that something is going on, but a progress bar (enabled by default) would serve the same purpose and be more concise.
We can just factor out the code from futhark bench for this.
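A minimal sketch of the kind of default progress bar meant here (plain Python, purely illustrative; the real change would reuse the existing futhark bench code):

import sys
import time

def progress(items, width=40):
    # Draw an in-place progress bar on stderr while yielding items.
    total = len(items)
    for i, item in enumerate(items, 1):
        filled = width * i // total
        sys.stderr.write("\r[%s%s] %d/%d" % ("#" * filled, " " * (width - filled), i, total))
        sys.stderr.flush()
        yield item
    sys.stderr.write("\n")

for _ in progress(range(50)):
    time.sleep(0.02)  # stand-in for real work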
Thank you for this fantastic work!
Would it be possible for the fit_transform() method to return the KL divergence of the run?
Thanks!
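A sketch of the requested usage (the package import and the return_kl flag are assumptions for illustration; currently fit_transform() returns only the embedding):

import numpy as np
from tsnecuda import TSNE  # assumption: the project this issue targets

X = np.random.rand(200, 50).astype(np.float32)
embedding = TSNE(n_components=2).fit_transform(X)  # current behavior
# Requested (hypothetical) extension:
# embedding, kl = TSNE(n_components=2).fit_transform(X, return_kl=True)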

PR #6447 adds a public API to get the maximum number of registers per thread (numba.cuda.Dispatcher.get_regs_per_thread()). There are other attributes that might be nice to provide: shared memory per block, local memory per thread, const memory usage, and maximum block size. These are all available in the FuncAttr named tuple: https://github.com/numba/numba/blob/master/numba/cuda/cudadrv/drive
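A sketch of how the extended API might look (only get_regs_per_thread() exists, per PR #6447; the other method names are hypothetical, mirroring the FuncAttr fields). Running it requires a CUDA-capable GPU:

from numba import cuda

# Eagerly compile one signature so function attributes are available.
@cuda.jit("void(float32[:], float32[:])")
def scale(out, x):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = 2.0 * x[i]

print(scale.get_regs_per_thread())  # added by PR #6447; may report per-signature values

# Hypothetical companions proposed here (illustrative names, not numba API):
# scale.get_shared_mem_per_block()
# scale.get_local_mem_per_thread()
# scale.get_const_mem_size()
# scale.get_max_threads_per_block()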