data-engineering

The Mixed Time-Series chart type allows configuring the titles of both the primary and the secondary y-axis.
However, only the title of the primary axis is shown next to its axis. The title of the secondary axis is placed at the upper end of the axis, where it gets hidden by bar values and zoom controls.

How to reproduce the bug

Create a mixed time-series chart
Configure axi

Hello!

I've found an issue here: parallel pipelines.

I believe this is a bit misleading, because the work is not done in parallel unless the relevant executor is chosen. I suggest adding a note which emphasises that, to actually get parallel execution, one needs to set LocalDaskExecutor or DaskExecutor as the executor class.
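The point about executors can be illustrated without Prefect at all. The sketch below is an analogy using only the standard library's concurrent.futures, not Prefect's API; run_branches and which_thread are hypothetical names introduced for illustration. The same independent branches run in the calling thread unless an executor is passed in.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_branches(branches, executor=None):
    """Run independent branches of a pipeline (hypothetical helper).

    Without an executor the branches run one after another in the calling
    thread; with one, they are submitted to it and may run concurrently.
    """
    if executor is None:
        return [branch() for branch in branches]
    futures = [executor.submit(branch) for branch in branches]
    return [future.result() for future in futures]


def which_thread():
    # Each branch simply reports the name of the thread it ran on.
    return threading.current_thread().name


branches = [which_thread, which_thread, which_thread]

# No executor chosen: everything runs in the caller's thread.
serial_threads = run_branches(branches)

# Executor chosen: the branches are handed to worker threads.
with ThreadPoolExecutor(max_workers=3, thread_name_prefix="worker") as pool:
    pooled_threads = run_branches(branches, executor=pool)
```

Here serial_threads contains only the caller's thread name, while every entry of pooled_threads names a worker thread. The pipeline definition is identical in both cases; only the choice of executor changes how it runs.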

Page reference: https://docs.prefect.io/core/c

Describe the bug
Data docs columns shrink to 1-character width with a long query

To Reproduce
Steps to reproduce the behavior:

make a batch from a long query string
run validation
render result to data docs
See screenshot
<img width="1525" alt="Data_documentation_compiled_by_Great_Expectations" src="https://user-images.githubusercontent.com/928247/103230647-30eca500-4

We should do something like https://blog.questionable.services/article/kubernetes-deployments-configmap-change/ to ensure that when the pod sweeper's config map changes, the underlying pod gets rolled.

Under the hood, the Benthos csv input uses the csv.Reader struct from the standard encoding/csv package.

The current implementation of csv input doesn't allow setting the LazyQuotes field.

We have a use case where we need to set the LazyQuotes field in order to make things work correctly.

Is your feature request related to a problem? Please describe.
Currently in feature_store.yaml, we can only specify a region for DynamoDB provider. As a result, it requires an actual DynamoDB to be available when we want to do local development/testing or integration testing in a sandbox environment.

Describe the solution you'd like
A way to solve this is to let the user pass an endpoint

When we show data for a metric, we currently don't include the current day's worth of data. Users who are just getting set up may only have events from today and want to test whether the query is working. Because events from 'today' are excluded, they can't see any results.

TODO:

In packages/back-end/src/services/experiments.ts on line 329, instead of using the current date as the value

On more advanced versions of LakeFS (probably >= v1.0.0), we would like to remove the logic that tries to fill the generation field in the DB when loading old dumps. This means we will no longer support loading dumps that were made with a version lower than v0.61.0.

When running tasks in a grid, a number is appended to the end of each task's product names. For example, the following pipeline.yaml file:

- source: example_task.py
  name: example-task
  product:
    data: output/output_dataframe.csv
  grid:
    input_dataframe: ['birds.csv', 'fish.csv', 'flowers.csv']

would result in 3 products: output_dataframe-1.csv, output_datafram

(1) Add docstrings to methods
(2) Convert .format() calls to f-strings for readability
(3) Make sure we are using Python 3.8 throughout
(4) zip extract_all() in ingest_flights.py can be simplified with a Path parameter
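Items (2) and (4) might look roughly like the sketch below. The function names describe and unzip_to, the archive name flights.zip, and the sample CSV content are all hypothetical and are not taken from ingest_flights.py. The only library behaviour relied on is that zipfile.ZipFile.extractall accepts a path argument, which may be a pathlib.Path.

```python
import tempfile
import zipfile
from pathlib import Path


def describe(year, month):
    # Before: "Ingesting {}-{:02d}".format(year, month)
    # After: the same format spec, written inline as an f-string.
    return f"Ingesting {year}-{month:02d}"


def unzip_to(archive, destination):
    """Extract every member of `archive` into `destination`; return file names."""
    destination = Path(destination)
    with zipfile.ZipFile(archive) as zf:
        # extractall takes the target directory directly, so no manual
        # chdir or per-member loop is needed.
        zf.extractall(path=destination)
    return sorted(p.name for p in destination.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny throwaway archive so the sketch is self-contained.
    archive = Path(tmp) / "flights.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("201501.csv", "FL_DATE,ORIGIN\n")
    extracted = unzip_to(archive, Path(tmp) / "out")
    message = describe(2015, 1)
```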

If they are not class methods, then the method would be invoked for every test and a session would be created for each of those tests.

class PySparkTest(unittest.TestCase):
    @classmethod
    def suppress_py4j_logging(cls):
        logger = logging.getLogger('py4j')
        logger.setLevel(logging.WARN)

    @classmethod
    def create_testing_pyspark_session(cls):
        return Sp

Background

This thread is born out of the discussion in #968, in an effort to make the documentation more beginner-friendly and more understandable.
One of the subtasks mentioned in that thread was to go through the function docstrings and include a minimal working example in each of the public functions in pyjanitor.
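As an illustration of what a minimal working example in a docstring could look like, here is a sketch using the standard library's doctest module. The function clean_name is hypothetical and is not part of pyjanitor's API; it only shows the shape of a docstring whose example can be executed and checked.

```python
import doctest


def clean_name(name):
    """Convert a column name to snake_case (hypothetical helper).

    Example:

    >>> clean_name("Flight Number")
    'flight_number'
    >>> clean_name("  Origin ")
    'origin'
    """
    return "_".join(name.strip().lower().split())


# Parse the examples out of the docstring and run them, so the
# documentation is verified against the actual behaviour.
test = doctest.DocTestParser().get_doctest(
    clean_name.__doc__, {"clean_name": clean_name}, "clean_name", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

Because the examples are executable, a docstring that drifts out of date shows up as a failing doctest rather than as silently wrong documentation.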

Criteria reiterated here for the benefit of discussion:

It sh

Follow the implementation example of ingestion/tests/integration/ometa/test_ometa_database_service_api.py to implement the testing of the Python client for PipelineService.

Here are 1,152 public repositories matching this topic...

apache / superset

eugeneyan / applied-ml

andkret / Cookbook

datastacktv / data-engineer-roadmap

PrefectHQ / prefect

great-expectations / great_expectations

airbytehq / airbyte

benthosdev / benthos

DataTalksClub / data-engineering-zoomcamp

feast-dev / feast

growthbook / growthbook

awslabs / aws-data-wrangler

treeverse / lakeFS

adilkhash / Data-Engineering-HowTo

kantord / just-dashboard

ploomber / ploomber

quiltdata / quilt

benthecoder / yt-channels-DS-AI-ML-CS

GoogleCloudPlatform / data-science-on-gcp

san089 / goodreads_etl_pipeline

AlexIoannides / pyspark-example-project

pyjanitor-devs / pyjanitor

abhishek-ch / around-dataengineering

sodadata / soda-sql

open-metadata / OpenMetadata

oleg-agapov / data-engineering-book

san089 / Udacity-Data-Engineering-Projects

gunnarmorling / awesome-opensource-data-engineering

mlrun / mlrun

automaticmode / active_workflow
