parquet

Here are 250 public repositories matching this topic...

Jonathanpro
Jonathanpro commented Jan 2, 2019

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, because the cluster uses Kerberos for authentication, using petastorm failed.
I figured out that petastorm relies on pyarrow, which does support Kerberos authentication.

I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c
enhancement good first issue
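The workaround above can be sketched as a small helper. This is only an illustration, assuming the legacy (now deprecated) `pyarrow.hdfs.connect` API, which accepted `driver` and `kerb_ticket` arguments; newer pyarrow versions replace it with `pyarrow.fs.HadoopFileSystem`. The helper only assembles the keyword arguments, since an actual connection needs a live Kerberized cluster:

```python
# Sketch: force the native libhdfs driver so pyarrow can pick up a
# Kerberos ticket. The helper only builds the connection arguments;
# pyarrow.hdfs.connect itself requires a reachable HDFS cluster.

def libhdfs_connect_kwargs(host="default", port=0, user=None, kerb_ticket=None):
    """Arguments for the legacy pyarrow.hdfs.connect call."""
    return {
        "host": host,
        "port": port,
        "user": user,
        "kerb_ticket": kerb_ticket,  # path to a Kerberos ticket cache, if any
        "driver": "libhdfs",         # native driver; supports Kerberos auth
    }

# On a real cluster one would then call:
#   import pyarrow
#   fs = pyarrow.hdfs.connect(**libhdfs_connect_kwargs())
```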
tiphaineruy
tiphaineruy commented Oct 11, 2021

Not sure if it would be interesting, but:

When registering a table:

addr: 0.0.0.0:8084
tables:
  - name: "example"
    uri: "/data/"
    option:
      format: "parquet"
      use_memory_table: false

add in options either a glob:

pattern: "file_typev1*.parquet"

or a regexp:

pattern: "\wfile_type\wv1\w*.parquet"

It would allow selecting files within URIs with different exte

enhancement good first issue help wanted
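Outside the config file, the difference between the two proposed pattern options can be illustrated with Python's standard library: `fnmatch` for glob-style patterns and `re` for regular expressions (the file names below are made up for the example):

```python
import fnmatch
import re

files = [
    "file_typev1_2021.parquet",
    "file_typev2_2021.parquet",
    "other_v1.parquet",
]

# Glob option: shell-style wildcard matching.
glob_hits = [f for f in files if fnmatch.fnmatch(f, "file_typev1*.parquet")]

# Regexp option: full regular-expression matching.
pattern = re.compile(r"file_type.?v1.*\.parquet")
regex_hits = [f for f in files if pattern.fullmatch(f)]
```

Both options select only the first file here; the regexp form additionally tolerates variations such as a separator between `file_type` and `v1`.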

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

  • Updated May 26, 2022
  • Python
kglab
Mec-iS
Mec-iS commented Mar 30, 2022

I'm submitting a

  • [x] bug report.

Current Behaviour:

After #249
Trying to run tests with pytest tests/rdf_tests/test_rdf_basic.py -k test_rdf_runner -s, you get a report file with all the tests run.
Some tests return errors, for example:

{
    "Basic - Term 7": {
        "input": "basic/data-4.ttl",
        "query": "basic/term-7.rq",
        "error": "Expected {Sele
help wanted good first issue
idreeskhan
idreeskhan commented Dec 30, 2019

Over time we've had some things leak into the diff methods that make it more cumbersome to use BigDiffy via code instead of CLI.

For example diffAvro here https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to manually pass in the schema; otherwise they receive a non-informative error about a null schema, add

A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL

  • Updated Feb 1, 2019
  • TypeScript
amazon-s3-find-and-forget
matteofigus
matteofigus commented May 12, 2021

When an item in the queue is added with an incorrect type for the corresponding Data Mapper, the Job fails during planning, without any information about which data mapper/queue item id is involved.

Let's take a Data Mapper with an identifier of type int, for instance. If we add foo to the deletion queue, the find phase will fail with a log like this:

{
  "EventData": {
    "Error": "ValueError
bug good first issue
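One way to surface an informative error, sketched below, is to validate each queue item against its Data Mapper's column type before planning starts, so the failure names the offending item. All names here are illustrative, not the project's actual API:

```python
# Sketch: up-front type validation for a deletion-queue match so the
# error message identifies the item and data mapper involved.
# Hypothetical helper; not part of amazon-s3-find-and-forget.

def validate_match(match_id, column_type, data_mapper_id):
    """Raise an informative ValueError if match_id cannot be cast
    to the Data Mapper's declared column type."""
    casters = {"int": int, "string": str}
    try:
        casters[column_type](match_id)
    except (ValueError, TypeError):
        raise ValueError(
            f"Match id {match_id!r} is not a valid {column_type} "
            f"for data mapper {data_mapper_id!r}"
        )
```

With this check, adding `foo` to a queue backed by an `int` identifier fails immediately with a message naming both the value and the data mapper, instead of an opaque `ValueError` during the find phase.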
