parquet

Append the class to every HashCodeBuilder in Gaffer to minimise hash collisions; the test below shows the issue.

@Test
void name() {
    Foo foo = new Foo();
    Bar bar = new Bar();

    assertFalse(foo.equals(bar));
    assertNotEquals(foo.hashCode(), bar.hashCode()); // fails
}

class Bar {
    int a = 3;

    @Override
    public int hashCode() {
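
The idea behind the Gaffer issue above can be sketched outside Java as well. The following is a rough Python analogue, not Gaffer's code and not the HashCodeBuilder API: the `_Base` class and the `field_only_hash` method are hypothetical names introduced only for illustration, and `a` mirrors the field in the truncated snippet. It shows two classes with identical field values colliding when only the fields are hashed, and being separated once the class itself is mixed into the hash.

```python
class _Base:
    """Hypothetical base class; `a` mirrors the field in the Java snippet."""

    def __init__(self, a=3):
        self.a = a

    def __eq__(self, other):
        # Instances of different classes are never considered equal.
        return type(self) is type(other) and self.a == other.a

    def field_only_hash(self):
        # Hashes the field values alone, so Foo(3) and Bar(3) collide.
        return hash((self.a,))

    def __hash__(self):
        # Mixing the class into the hash separates otherwise identical fields.
        return hash((type(self), self.a))


class Foo(_Base):
    pass


class Bar(_Base):
    pass


foo, bar = Foo(), Bar()
print(foo == bar)
print(foo.field_only_hash() == bar.field_only_hash())
print(hash(foo) == hash(bar))
```

A collision is still possible in principle, since any hash maps many inputs to a bounded range; including the class only removes the systematic collision between objects of different types that happen to hold the same field values.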

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, because the cluster uses Kerberos for authentication, petastorm failed.
I figured out that petastorm relies on pyarrow, which does support Kerberos authentication.

I hacked line 250 of "petastorm/petastorm/hdfs/namenode.py"
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c

Not sure if this would be of interest, but:

When registering a table:

addr: 0.0.0.0:8084
tables:
  - name: "example"
    uri: "/data/"
    option:
      format: "parquet"
      use_memory_table: false

add a pattern to the options, either as a glob

pattern: "file_typev1*.parquet"

or as a regexp

pattern: "\wfile_type\wv1\w*.parquet"

It would allow selecting in uri's with different exte
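
To make the difference between the two proposed pattern styles concrete, here is a minimal sketch using Python's standard `fnmatch` and `re` modules. It is not roapi's implementation, and the option does not exist in the project as far as the excerpt shows; the file names and the helper functions `select_glob` and `select_regex` are hypothetical. The two patterns are copied verbatim from the proposal.

```python
import fnmatch
import re

# Hypothetical file names under the table's uri ("/data/").
files = [
    "file_typev1_part0.parquet",
    "file_typev2_part0.parquet",
    "xfile_type_v1a.parquet",
    "file_typev1_part0.csv",
]

# Patterns copied verbatim from the proposal above.
GLOB_PATTERN = "file_typev1*.parquet"
REGEX_PATTERN = r"\wfile_type\wv1\w*.parquet"


def select_glob(names, pattern):
    """Keep names that match a shell-style glob (case-sensitive)."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def select_regex(names, pattern):
    """Keep names that the regular expression matches in full."""
    compiled = re.compile(pattern)
    return [n for n in names if compiled.fullmatch(n)]


glob_hits = select_glob(files, GLOB_PATTERN)
regex_hits = select_regex(files, REGEX_PATTERN)
print(glob_hits)   # ['file_typev1_part0.parquet']
print(regex_hits)  # ['xfile_type_v1a.parquet']
```

As written, the two patterns select different files: each `\w` in the regular expression requires exactly one word character, so a name that starts directly with `file_type` does not match, and the unescaped `.` matches any character rather than only a literal dot.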

Currently, there isn't a way to get the table properties in the SparkOrcWriter via the WriterFactory.

Over time, some things have leaked into the diff methods that make it more cumbersome to use BigDiffy from code rather than from the CLI.

For example, diffAvro here: https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to pass in the schema manually; otherwise they receive an uninformative error about a null schema, add

Let's show some examples of integration with kgextension
https://kgextension.readthedocs.io/en/latest/

Could be another notebook added to the tutorial.

Where it fits, we might also integrate it as a dependency?

Problem description

Reading a dataset with eager's read functionality raises a ValueError when providing columns.

Example code (ideally copy-pastable)

import pandas as pd

from tempfile import TemporaryDirectory
from functools import partial
from storefact import get_store_from_url

from kartothek.io.eager import store_dataframes_as_dataset, read_dataset_as_data

Here are 228 public repositories matching this topic...

gchq / Gaffer

apache / drill

apache / parquet-mr

uber / petastorm

quiltdata / quilt

apache / parquet-format

roapi / roapi

HariSekhon / DevOps-Python-tools

Cinchoo / ChoETL

Netflix / iceberg

skale-me / skale

Intel-bigdata / OAP

ranaroussi / pystore

apache / parquet-cpp

RandomFractals / vscode-data-preview

moshe / elasticsearch_loader

elastacloud / parquet-dotnet

spotify / ratatool

DerwenAI / kglab

ironSource / parquetjs

mukunku / ParquetViewer

scikit-hep / awkward-0.x

Chabane / bigdata-playground

cldellow / sqlite-parquet-vtable

JDASoftwareGroup / kartothek

fraugster / parquet-go

mjakubowski84 / parquet4s

Eugene-Mark / bigdata-file-viewer

sunchao / parquet-rs

51zero / eel-sdk