parquet

Here are 250 public repositories matching this topic...

Jonathanpro
Jonathanpro commented Jan 2, 2019

Hello everyone,
Recently I tried to set up petastorm on my company's Hadoop cluster.
However, because the cluster uses Kerberos for authentication, using petastorm failed.
I figured out that petastorm relies on pyarrow, which does support Kerberos authentication.

I hacked "petastorm/petastorm/hdfs/namenode.py" line 250
and replaced it with

driver = 'libhdfs'
return pyarrow.hdfs.c
enhancement good first issue
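The workaround above can be sketched as a small helper. This is only an illustration, assuming the legacy (now deprecated) `pyarrow.hdfs.connect` API, which accepted `driver` and `kerb_ticket` arguments; newer pyarrow versions replace it with `pyarrow.fs.HadoopFileSystem`. The helper only assembles the keyword arguments, since an actual connection needs a live Kerberized cluster:

```python
# Sketch: force the native libhdfs driver so pyarrow can pick up a
# Kerberos ticket. The helper only builds the connection arguments;
# pyarrow.hdfs.connect itself requires a reachable HDFS cluster.

def libhdfs_connect_kwargs(host="default", port=0, user=None, kerb_ticket=None):
    """Arguments for the legacy pyarrow.hdfs.connect call."""
    return {
        "host": host,
        "port": port,
        "user": user,
        "kerb_ticket": kerb_ticket,  # path to a Kerberos ticket cache, if any
        "driver": "libhdfs",         # native driver; supports Kerberos auth
    }

# On a real cluster one would then call:
#   import pyarrow
#   fs = pyarrow.hdfs.connect(**libhdfs_connect_kwargs())
```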
tiphaineruy
tiphaineruy commented Oct 11, 2021

Not sure if it would be interesting, but:

When registering a table:

addr: 0.0.0.0:8084
tables:
  - name: "example"
    uri: "/data/"
    option:
      format: "parquet"
      use_memory_table: false

add in options either a glob:

pattern: "file_typev1*.parquet"

or a regexp:

pattern: "\wfile_type\wv1\w*.parquet"

It would allow selecting files within URIs with different exte

enhancement good first issue help wanted
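Outside the config file, the difference between the two proposed pattern options can be illustrated with Python's standard library: `fnmatch` for glob-style patterns and `re` for regular expressions (the file names below are made up for the example):

```python
import fnmatch
import re

files = [
    "file_typev1_2021.parquet",
    "file_typev2_2021.parquet",
    "other_v1.parquet",
]

# Glob option: shell-style wildcard matching.
glob_hits = [f for f in files if fnmatch.fnmatch(f, "file_typev1*.parquet")]

# Regexp option: full regular-expression matching.
pattern = re.compile(r"file_type.?v1.*\.parquet")
regex_hits = [f for f in files if pattern.fullmatch(f)]
```

Both options select only the first file here; the regexp form additionally tolerates variations such as a separator between `file_type` and `v1`.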

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

  • Updated May 26, 2022
  • Python
kglab
Mec-iS
Mec-iS commented Mar 30, 2022

I'm submitting a

  • [x] bug report.

Current Behaviour:

After #249
Trying to run tests with pytest tests/rdf_tests/test_rdf_basic.py -k test_rdf_runner -s, you get a report file with all the tests run.
Some tests return errors, for example:

{
    "Basic - Term 7": {
        "input": "basic/data-4.ttl",
        "query": "basic/term-7.rq",
        "error": "Expected {Sele
help wanted good first issue
idreeskhan
idreeskhan commented Dec 30, 2019

Over time we've had some things leak into the diff methods that make it more cumbersome to use BigDiffy via code instead of CLI.

For example diffAvro here https://github.com/spotify/ratatool/blob/master/ratatool-diffy/src/main/scala/com/spotify/ratatool/diffy/BigDiffy.scala#L284

The user has to manually pass in the schema; otherwise they receive a non-informative error about a null schema, add

A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLib, Apache Flink, Scala, Python, Apache Kafka, Apache Hbase, Apache Parquet, Apache Avro, Apache Storm, Twitter Api, MongoDB, NodeJS, Angular, GraphQL

  • Updated Feb 1, 2019
  • TypeScript
amazon-s3-find-and-forget
matteofigus
matteofigus commented May 12, 2021

When an item in the queue is added with an incorrect type for the corresponding Data Mapper, the Job fails during planning, without any information about which data mapper/queue item id is involved.

Let's take a Data Mapper with an identifier of type int, for instance. If we add foo to the deletion queue, the find phase will fail with a log like this:

{
  "EventData": {
    "Error": "ValueError
bug good first issue
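One way to surface an informative error, sketched below, is to validate each queue item against its Data Mapper's column type before planning starts, so the failure names the offending item. All names here are illustrative, not the project's actual API:

```python
# Sketch: up-front type validation for a deletion-queue match so the
# error message identifies the item and data mapper involved.
# Hypothetical helper; not part of amazon-s3-find-and-forget.

def validate_match(match_id, column_type, data_mapper_id):
    """Raise an informative ValueError if match_id cannot be cast
    to the Data Mapper's declared column type."""
    casters = {"int": int, "string": str}
    try:
        casters[column_type](match_id)
    except (ValueError, TypeError):
        raise ValueError(
            f"Match id {match_id!r} is not a valid {column_type} "
            f"for data mapper {data_mapper_id!r}"
        )
```

With this check, adding `foo` to a queue backed by an `int` identifier fails immediately with a message naming both the value and the data mapper, instead of an opaque `ValueError` during the find phase.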
