cuda
Here are 2,250 public repositories matching this topic...
The current implementation of join can be improved by performing the operation in a single call to the backend kernel instead of multiple calls.
This is a fairly easy kernel and may be a good issue for someone getting to know CUDA/ArrayFire internals; a rough sketch of the single-kernel approach is below. Ping me if you want additional info.
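As a rough illustration only, here is a minimal sketch of joining several 1-D device arrays with a single kernel launch instead of one copy per input. The names (JoinArgs, join_kernel, MAX_INPUTS) are hypothetical and this is not ArrayFire's actual backend code.

#include <cuda_runtime.h>

constexpr int MAX_INPUTS = 16;

// Hypothetical argument pack: device pointers to each input plus a prefix
// sum of their lengths, passed to the kernel by value.
struct JoinArgs {
    const float* in[MAX_INPUTS];
    size_t offset[MAX_INPUTS + 1];  // input i covers [offset[i], offset[i+1]); offset[n_inputs] = total size
    int n_inputs;
};

__global__ void join_kernel(float* out, JoinArgs args)
{
    size_t total = args.offset[args.n_inputs];
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < total;
         i += (size_t)blockDim.x * gridDim.x) {
        // Find which input this output element comes from; a linear scan is
        // fine for a small, bounded number of inputs.
        int src = 0;
        while (i >= args.offset[src + 1]) ++src;
        out[i] = args.in[src][i - args.offset[src]];
    }
}

The same idea extends to an N-dimensional join along an arbitrary dimension: compute the per-input offsets once on the host and let one grid-stride kernel route each output element to its source, rather than launching one copy per input.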
PR NVIDIA/cub#218 fixes this in CUB's radix sort. We should:
- Check whether Thrust's other backends handle this case correctly.
- Provide a guarantee of this in the stable_sort documentation.
- Add regression tests to enforce this on all backends (a sketch of one possible test follows this list).
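As a starting point, here is a minimal sketch of what such a stability regression test could look like, assuming the guarantee being tested is that thrust::stable_sort_by_key preserves the relative order of equal keys. The structure is illustrative, not the actual Thrust test harness.

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <cassert>
#include <cstdlib>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<int> h_keys(n);
    thrust::host_vector<int> h_vals(n);
    for (int i = 0; i < n; ++i) {
        h_keys[i] = std::rand() % 16;  // many duplicate keys
        h_vals[i] = i;                 // value records the original position
    }

    thrust::device_vector<int> d_keys = h_keys;
    thrust::device_vector<int> d_vals = h_vals;
    thrust::stable_sort_by_key(d_keys.begin(), d_keys.end(), d_vals.begin());

    thrust::host_vector<int> keys = d_keys;
    thrust::host_vector<int> vals = d_vals;
    for (int i = 1; i < n; ++i) {
        // Stability: among equal keys, values (original indices) must stay increasing.
        if (keys[i] == keys[i - 1]) {
            assert(vals[i - 1] < vals[i]);
        }
    }
    return 0;
}

To cover the other backends, the same test can be compiled with THRUST_DEVICE_SYSTEM set to each of CUDA, OMP, TBB, and CPP.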
Report needed documentation
We do not have documentation specifying the different treelite Operator values that FIL supports. (https://github.com/dmlc/treelite/blob/46c8390aed4491ea97a017d447f921efef9f03ef/include/treelite/base.h#L40)
Report needed documentation
https://github.com/rapidsai/cuml/blob/branch-0.15/cpp/test/sg/fil_test.cu
There are multiple places in the fil_test.cu file that need documentation.
I often use -v just to see that something is going on, but a progress bar (enabled by default) would serve the same purpose and be more concise.
We can just factor out the code from futhark bench for this.
Segmented reduce uses the same template type OffsetIteratorT for both the begin and end offsets:

static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceSegmentedReduce::Sum(
    void*           d_temp_storage,
    size_t&         temp_storage_bytes,
    InputIteratorT  d_in,
    OutputIteratorT d_out,
    int             num_segments,
    OffsetIteratorT d_begin_offsets,
    OffsetIteratorT d_end_offsets,
    cudaStream_t    stream = 0,
    bool            debug_synchronous = false)
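Because a single OffsetIteratorT is deduced from both offset arguments, mixing iterator types for the begin and end offsets does not compile. The hypothetical usage below illustrates the limitation; it is a sketch, not part of CUB's test suite.

#include <cub/cub.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Hypothetical functor mapping segment index i to i * len.
struct times_len {
    int len;
    __host__ __device__ int operator()(int i) const { return i * len; }
};

void segmented_sum(const float* d_in, float* d_out, const int* d_offsets,
                   int num_segments, int seg_len,
                   void* d_temp, size_t& temp_bytes)
{
    (void)seg_len;  // used only in the commented-out call below

    // Works: begin and end offsets are both const int*, so OffsetIteratorT
    // deduces to a single type.
    cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
                                    num_segments, d_offsets, d_offsets + 1);

    // Does not compile: const int* for begin offsets and a transform
    // iterator for end offsets cannot deduce one OffsetIteratorT. Separate
    // BeginOffsetIteratorT / EndOffsetIteratorT template parameters would
    // allow this mix.
    // auto end_offsets = thrust::make_transform_iterator(
    //     thrust::make_counting_iterator(1), times_len{seg_len});
    // cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, d_in, d_out,
    //                                 num_segments, d_offsets, end_offsets);
}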
The current default value for the rows_per_chunk parameter of the CSV writer is 8, which means that the input table is by default broken into many small slices that are written out sequentially. This reduces performance by an order of magnitude in some cases. In the Python layer, the default is the number of rows (i.e. the table is written out in a single pass). We can follow this by setting rows_per_chunk to the number of rows by default.
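For illustration only, here is a minimal, self-contained sketch of the chunking behaviour described above, assuming a hypothetical Table type and rows_to_csv helper rather than libcudf's actual writer code.

#include <algorithm>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct Table {
    std::vector<std::vector<double>> columns;
    size_t num_rows() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Convert rows [start, end) of the table to CSV text (hypothetical helper).
static std::string rows_to_csv(const Table& t, size_t start, size_t end)
{
    std::ostringstream ss;
    for (size_t r = start; r < end; ++r) {
        for (size_t c = 0; c < t.columns.size(); ++c) {
            ss << t.columns[c][r] << (c + 1 < t.columns.size() ? ',' : '\n');
        }
    }
    return ss.str();
}

// The writer slices the table into rows_per_chunk pieces and writes each
// slice separately; a tiny default (8) means many small sequential writes,
// while rows_per_chunk == t.num_rows() writes the table in a single pass.
void write_csv(const Table& t, std::ostream& out, size_t rows_per_chunk)
{
    const size_t n = t.num_rows();
    for (size_t start = 0; start < n; start += rows_per_chunk) {
        const size_t end = std::min(start + rows_per_chunk, n);
        out << rows_to_csv(t, start, end);
    }
}

With rows_per_chunk equal to the row count the loop body runs once, which matches the single-pass behaviour of the Python layer.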