cuda
Here are 2,519 public repositories matching this topic...
Problem:
catboost version: 0.23.2
Operating System: all
Tutorial: https://github.com/catboost/tutorials/blob/master/custom_loss/custom_metric_tutorial.md
It is impossible to use a custom metric (C++).
Code example
from catboost import CatBoost
train_data = [[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]]  # sample data in the style of the tutorial
train_labels = [10, 20, 30]
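For context, the Python-side custom metric protocol from the linked tutorial looks roughly like this (a sketch; the report concerns the C++ path, which does not work):

class MseMetric(object):
    # Custom metric object protocol from the catboost tutorial.
    def is_max_optimal(self):
        return False  # smaller metric values are better

    def evaluate(self, approxes, target, weight):
        # approxes is a list of prediction lists, one per dimension.
        preds = approxes[0]
        error_sum = sum((p - t) ** 2 for p, t in zip(preds, target))
        return error_sum, len(target)

    def get_final_error(self, error, weight):
        return error / weight if weight else 0.0

The tutorial passes an instance of such a class via the eval_metric parameter; this report is that the equivalent C++ route cannot be used.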
The current default value of the rows_per_chunk parameter of the CSV writer is 8, which means the input table is by default broken into many small slices that are written out sequentially. This reduces performance by an order of magnitude in some cases.
In the Python layer, the default is the number of rows (i.e. the table is written out in a single pass). We can follow this by setting rows_per_chunk to the number of rows by default.
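To make the cost concrete, here is a plain-Python illustration (not the cudf API) of why many tiny chunks are slow: every chunk pays a fixed per-write overhead that a single-pass write amortizes.

import time

rows = ["%d,%d" % (i, i * 2) for i in range(100_000)]

def write_chunked(path, rows, rows_per_chunk):
    # Write the "table" in slices of rows_per_chunk rows each.
    with open(path, "w") as f:
        for start in range(0, len(rows), rows_per_chunk):
            f.write("\n".join(rows[start:start + rows_per_chunk]) + "\n")

t0 = time.time()
write_chunked("many.csv", rows, rows_per_chunk=8)         # current C++ default
t1 = time.time()
write_chunked("one.csv", rows, rows_per_chunk=len(rows))  # single pass
t2 = time.time()
print("8-row chunks: %.3fs, single pass: %.3fs" % (t1 - t0, t2 - t1))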
The current implementation of join can be improved by performing the operation in a single call to the backend kernel instead of multiple calls.
This is a fairly easy kernel and may be a good issue for someone getting to know CUDA/ArrayFire internals; a rough sketch of the idea follows. Ping me if you want additional info.
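A NumPy analogy of the proposed change (illustrative only, not ArrayFire internals): joining pairwise launches one call per step and allocates an intermediate each time, while a single fused call sizes the output once and copies every input into place.

import numpy as np

def join_pairwise(arrays, axis=0):
    # N-1 separate calls, each allocating an intermediate buffer.
    out = arrays[0]
    for a in arrays[1:]:
        out = np.concatenate([out, a], axis=axis)
    return out

def join_fused(arrays, axis=0):
    # One call: the output is sized once and filled in a single pass.
    return np.concatenate(arrays, axis=axis)

parts = [np.full((4, 3), i) for i in range(8)]
assert (join_pairwise(parts) == join_fused(parts)).all()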
PR NVIDIA/cub#218 fixes this in CUB's radix sort. We should:
- Check whether Thrust's other backends handle this case correctly.
- Provide a guarantee of this in the stable_sort documentation.
- Add regression tests to enforce this on all backends (a minimal illustration of the stability property follows this list).
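The property such regression tests would pin down, shown in plain Python rather than Thrust: a stable sort must keep elements with equal keys in their input order.

records = [("b", 0), ("a", 1), ("b", 2), ("a", 3)]
out = sorted(records, key=lambda kv: kv[0])  # Python's sorted() is stable
# Equal keys retain their original relative order:
assert out == [("a", 1), ("a", 3), ("b", 0), ("b", 2)]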
Report needed documentation
We do not have documentation specifying the different treelite Operator values that FIL supports. (https://github.com/dmlc/treelite/blob/46c8390aed4491ea97a017d447f921efef9f03ef/include/treelite/base.h#L40)
Report needed documentation
https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/test/sg/fil_test.cu
There are multiple places in the fil_test.cu file that need documentation.
I often use -v just to see that something is going on, but a progress bar (enabled by default) would serve the same purpose and be more concise.
We can just factor out the code from futhark bench for this.
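A minimal sketch of the kind of default progress bar meant here (plain Python, purely illustrative; the real change would reuse the existing futhark bench code):

import sys
import time

def progress(items, width=40):
    # Draw an in-place progress bar on stderr while yielding items.
    total = len(items)
    for i, item in enumerate(items, 1):
        filled = width * i // total
        sys.stderr.write("\r[%s%s] %d/%d" % ("#" * filled, " " * (width - filled), i, total))
        sys.stderr.flush()
        yield item
    sys.stderr.write("\n")

for _ in progress(range(50)):
    time.sleep(0.02)  # stand-in for real work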
Thank you for this fantastic work!
Would it be possible for the fit_transform() method to return the KL divergence of the run?
Thanks!
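A sketch of the requested usage (the package import and the return_kl flag are assumptions for illustration; currently fit_transform() returns only the embedding):

import numpy as np
from tsnecuda import TSNE  # assumption: the project this issue targets

X = np.random.rand(200, 50).astype(np.float32)
embedding = TSNE(n_components=2).fit_transform(X)  # current behavior
# Requested (hypothetical) extension:
# embedding, kl = TSNE(n_components=2).fit_transform(X, return_kl=True)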

PR #6447 adds a public API to get the maximum number of registers per thread (numba.cuda.Dispatcher.get_regs_per_thread()). There are other attributes that might be nice to provide: shared memory per block, local memory per thread, const memory usage, and maximum block size. These are all available in the FuncAttr named tuple: https://github.com/numba/numba/blob/master/numba/cuda/cudadrv/drive
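A sketch of how the extended API might look (only get_regs_per_thread() exists, per PR #6447; the other method names are hypothetical, mirroring the FuncAttr fields). Running it requires a CUDA-capable GPU:

from numba import cuda

# Eagerly compile one signature so function attributes are available.
@cuda.jit("void(float32[:], float32[:])")
def scale(out, x):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = 2.0 * x[i]

print(scale.get_regs_per_thread())  # added by PR #6447; may report per-signature values

# Hypothetical companions proposed here (illustrative names, not numba API):
# scale.get_shared_mem_per_block()
# scale.get_local_mem_per_thread()
# scale.get_const_mem_size()
# scale.get_max_threads_per_block()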