Fixes #9164.

### Prelude
`lists::contains()` (introduced in #7039) returns a `BOOL8` column, indicating whether the specified search_key(s) exist at all in each corresponding list row of an input LIST column. It does not return the actual position.

### `index_of()`
This commit introduces `lists::index_of()`, which returns the INT32 positions of the specified search_key(s) in a LIST column.

Keys can be located using either `FIND_FIRST` (which returns the position of the first occurrence) or `FIND_LAST` (which returns the position of the last occurrence). Both column_view and scalar search keys are supported.

As with `lists::contains()`, nested types are not supported as search keys in `lists::index_of()`.

If the search_key cannot be found, that output row is set to `-1`. Additionally, the row `output[i]` is set to null if:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.

In all other cases (that is, when the key is found in a non-null list row), `output[i]` contains a non-negative position.
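The row-wise rules above can be sketched in plain Python. This is only a model of the described semantics, not the libcudf API: the name `index_of_model`, the `find_first` flag, and the use of `None` for null are illustrative assumptions, and positions are assumed to be zero-based.

```python
def index_of_model(lists, search_key, find_first=True):
    """Plain-Python model of the index_of() rules described above (hypothetical helper)."""
    # A scalar key is broadcast to every row; a list/tuple supplies one key per row.
    if isinstance(search_key, (list, tuple)):
        keys = list(search_key)
    else:
        keys = [search_key] * len(lists)

    out = []
    for row, key in zip(lists, keys):
        if row is None or key is None:
            out.append(None)  # null list row or null key -> null output
            continue
        # Null elements never match a key.
        positions = [i for i, elem in enumerate(row) if elem is not None and elem == key]
        if not positions:
            out.append(-1)  # key not found
        else:
            out.append(positions[0] if find_first else positions[-1])
    return out


rows = [[1, 2, 1], [3, None], None, [], [2, 2]]
print(index_of_model(rows, 1))                    # [0, -1, None, -1, -1]
print(index_of_model(rows, 1, find_first=False))  # [2, -1, None, -1, -1]
print(index_of_model(rows, [1, 3, 1, 1, None]))   # [0, 0, None, -1, None]
```

Note that an empty list row yields `-1`, not null: only a null row or a null key produces a null output.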

### Semantic changes for `lists::contains()`
This commit also modifies the semantics of `lists::contains()`: it now returns null only in the following cases:
  1. The `search_key` (scalar) or `search_keys[i]` (column_view) is null.
  2. The list row `lists[i]` is null.

In all other cases, a non-null bool is returned. Specifically, `lists::contains()` no longer follows the SQL semantics of returning `NULL` for list rows that do not contain the search key but do contain null elements. In that case, `false` is now returned.
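To make the difference concrete, here is a small plain-Python comparison of the two behaviours for a scalar key. Both functions are hypothetical models written for illustration (they are not libcudf calls), with `None` standing in for null.

```python
def contains_model(lists, key):
    """New behaviour: null only for a null list row or a null key."""
    out = []
    for row in lists:
        if row is None or key is None:
            out.append(None)
        else:
            out.append(any(e is not None and e == key for e in row))
    return out


def contains_sql_model(lists, key):
    """SQL-style behaviour: also null when the key is absent but the row has null elements."""
    out = []
    for row in lists:
        if row is None or key is None:
            out.append(None)
        elif any(e is not None and e == key for e in row):
            out.append(True)
        elif any(e is None for e in row):
            out.append(None)  # key might be hiding behind a null element
        else:
            out.append(False)
    return out


rows = [[1, None], [2, None], [2, 3], None]
print(contains_model(rows, 1))      # [True, False, False, None]
print(contains_sql_model(rows, 1))  # [True, None, False, None]
```

The two models differ only on the second row, where the key is missing and the list holds a null element.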

### `lists::contains_null_elements()`
A new function has been introduced to check if each list row contains null elements. The semantics are similar to `lists::contains()`, in that the column returned is BOOL8 typed:
  1. If at least one element in a list row is null, the returned row is `true`.
  2. If no element is null, the returned row is `false`.
  3. If the list row is null, the returned row is `null`.
  4. If the list row is empty, the returned row is `false`.
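The four rules can be written as a one-line plain-Python model. The function name is a hypothetical stand-in, not the libcudf signature, and `None` represents null.

```python
def contains_null_elements_model(lists):
    """Null row -> null; otherwise True if and only if any element is null."""
    return [None if row is None else any(e is None for e in row) for row in lists]


rows = [[1, None], [1, 2], None, []]
print(contains_null_elements_model(rows))  # [True, False, None, False]
```

The empty list falls out naturally, since `any()` over no elements is `False`.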

The current implementation is an inefficient placeholder, to be replaced once (#9588) is available. It is included here to reconstruct the SQL semantics dropped from `lists::contains()`.
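One way a caller might rebuild the SQL-style result is to combine the two output columns row by row. The sketch below is an assumption about how that combination could look, expressed over plain Python lists of `True`/`False`/`None`; `sql_contains` is a hypothetical helper, not part of libcudf.

```python
def sql_contains(contains_col, has_null_col):
    """Combine contains() and contains_null_elements() outputs into SQL-style results."""
    out = []
    for found, has_null in zip(contains_col, has_null_col):
        if found is None or found:
            out.append(found)  # null stays null, a hit stays True
        elif has_null:
            out.append(None)   # key absent, but a null element leaves the answer unknown
        else:
            out.append(False)  # key absent and no null elements
    return out


# Columns as they would look for rows [[1, None], [2, None], [2, 3], None] and key 1.
contains_col = [True, False, False, None]
has_null_col = [True, True, False, None]
print(sql_contains(contains_col, has_null_col))  # [True, None, False, None]
```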

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Jason Lowe (https://github.com/jlowe)
  - Mark Harris (https://github.com/harrism)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9510
a4dc42d

 cuDF - GPU DataFrames

NOTE: For the latest stable README.md ensure you are on the main branch.

Resources

Overview

Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

For example, the following snippet downloads a CSV, then uses the GPU to parse it into rows and columns and run calculations:

import cudf
import requests
from io import StringIO

url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

tips_df = cudf.read_csv(StringIO(content))
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

Output:

size
1    21.729201548727808
2    16.571919173482897
3    15.215685473711837
4    14.594900639351332
5    14.149548965142023
6    15.622920072028379
Name: tip_percentage, dtype: float64

For additional examples, browse our complete API documentation, or check out our more detailed notebooks.

Quick Start

Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you're running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can use cuDF.

Installation

CUDA/GPU requirements

  • CUDA 11.0+
  • NVIDIA driver 450.80.02+
  • Pascal architecture or better (Compute Capability >=6.0)
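As a rough illustration, the three minimums can be checked with ordinary tuple comparison once you know your own versions (for example, from the output of `nvidia-smi`). The helper names below are hypothetical and the sketch assumes purely numeric, dot-separated version strings.

```python
def parse_version(text):
    """Turn '450.80.02' into (450, 80, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Minimums taken from the list above.
MIN_CUDA = parse_version("11.0")
MIN_DRIVER = parse_version("450.80.02")
MIN_COMPUTE_CAPABILITY = parse_version("6.0")


def meets_requirements(cuda, driver, compute_capability):
    """Return True if all three version strings meet the stated minimums."""
    return (
        parse_version(cuda) >= MIN_CUDA
        and parse_version(driver) >= MIN_DRIVER
        and parse_version(compute_capability) >= MIN_COMPUTE_CAPABILITY
    )


print(meets_requirements("11.2", "460.32.03", "7.0"))  # True
print(meets_requirements("10.2", "460.32.03", "7.0"))  # False
```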

Conda

cuDF can be installed with conda (miniconda, or the full Anaconda distribution) from the rapidsai channel:

For cudf version == 21.08:

# for CUDA 11.0
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
    cudf=21.08 python=3.7 cudatoolkit=11.0

# or, for CUDA 11.2
conda install -c rapidsai -c nvidia -c numba -c conda-forge \
    cudf=21.08 python=3.7 cudatoolkit=11.2

For the nightly version of cudf:

# for CUDA 11.0
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
    cudf python=3.7 cudatoolkit=11.0

# or, for CUDA 11.2
conda install -c rapidsai-nightly -c nvidia -c numba -c conda-forge \
    cudf python=3.7 cudatoolkit=11.2

Note: cuDF is supported only on Linux, and with Python versions 3.7 and later.

See the Get RAPIDS version picker for more OS and version info.

Build/Install from Source

See build instructions.

Contributing

Please see our guide for contributing to cuDF.

Contact

Find out more details on the RAPIDS site

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow is supported.