
parquet

Here are 167 public repositories matching this topic...

Jonathanpro commented Jan 2, 2019

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, because the cluster uses Kerberos for authentication, petastorm failed.
I found that petastorm relies on pyarrow, which does in fact support Kerberos authentication.

As a workaround, I hacked "petastorm/petastorm/hdfs/namenode.py" at line 250
and replaced it with:

driver = 'libhdfs'
return pyarrow.hdfs.c
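To illustrate the idea behind the patch, here is a minimal sketch of the connection arguments involved. It assumes pyarrow's legacy `pyarrow.hdfs.connect` API (pyarrow < 2.0), which accepted a `driver` name and a `kerb_ticket` path; the helper name and defaults below are illustrative, not petastorm's actual code.

```python
# Hypothetical helper sketching the patched connection logic.
# Assumption: pyarrow < 2.0, whose pyarrow.hdfs.connect accepts
# driver= and kerb_ticket= keyword arguments.

def kerberized_hdfs_connect_kwargs(host, port=8020, kerb_ticket=None):
    """Build keyword arguments for pyarrow.hdfs.connect on a
    Kerberos-secured cluster. 'libhdfs' is the JNI-based driver,
    which honors the credentials obtained via kinit."""
    return {
        "host": host,
        "port": port,
        "driver": "libhdfs",
        "kerb_ticket": kerb_ticket,  # e.g. path to the Kerberos ticket cache
    }

# Usage (requires pyarrow < 2.0 and a reachable Kerberized cluster):
# import pyarrow
# fs = pyarrow.hdfs.connect(**kerberized_hdfs_connect_kwargs(
#     "namenode.example.com", kerb_ticket="/tmp/krb5cc_1000"))
```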
idreeskhan commented Dec 30, 2019

Over time, some things have leaked into the diff methods that make BigDiffy more cumbersome to use via code rather than the CLI.

For example, diffAvro here: https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to pass in the schema manually; otherwise they receive an uninformative error about a null schema.

A complete example of a big data application using: Kubernetes (kops/AWS), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, Node.js, Angular, GraphQL

  • Updated Feb 1, 2019
  • TypeScript
lr4d commented Sep 2, 2020

Something like:

from kartothek.core.dataset import DatasetMetadata
from kartothek.core.factory import DatasetFactory
from kartothek.io_components.metapartition import SINGLE_TABLE


def get_pyarrow_schema(factory: DatasetFactory, table: str = SINGLE_TABLE):
    dm = DatasetMetadata(uuid=factory.dataset_uuid).load_from_store(
        uuid=factory.dataset_uuid, store=factory.store
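The snippet above is cut off mid-call. As a hedged sketch of where such a helper might be heading, the function below assumes kartothek's DatasetMetadata exposes loaded table schemas through a `table_meta` mapping (table name to schema, as in kartothek 3.x); the names and the stand-in default are illustrative and should be checked against the version in use.

```python
# Hedged sketch, not kartothek's actual code: assumes the loaded
# DatasetMetadata object carries a `table_meta` dict mapping each
# table name to its (pyarrow-compatible) schema.

SINGLE_TABLE = "table"  # kartothek's conventional default table name


def schema_from_metadata(dm, table=SINGLE_TABLE):
    """Return the schema recorded for `table` in dataset metadata `dm`
    (any object exposing a `table_meta` mapping)."""
    return dm.table_meta[table]
```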
tobias-hd commented Aug 13, 2019

When opening a Parquet file, ParquetViewer first launches a "Select fields to load" popup, where you can either confirm loading all fields or select only the fields you want.

In all use cases relevant to me, I want to display all fields. I'm therefore wondering whether it would be possible to skip this popup altogether? It's inconvenient to always have to confirm "All fields..." before seeing any data.

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. This project contains sample Spark programs written in Scala.

  • Updated Jul 1, 2020
  • Scala
