data-cleaning

Hello.
I've come across what (to me) seems to be a problem with the FILENAME and FILENUM variables.

# mlr --version
Miller v5.6.2

# cat /tmp/csv1
A,B,C
_2GB,255,2
_4GB,120,4
_6GB,50,6
_10GB,10,10

# cat /tmp/csv2
FIRST,SECOND,THIRD,FOURTH
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16

# mlr --icsv cat then put 'print FILENAME'   /tmp/csv1 /tmp/csv2
/tmp/csv1
A=_2GB,B=255,C=2
/

Right now it's not immediately obvious they're deprecated. They should either not have reference pages at all, or the reference page should just say, don't use X, use Y.

When filling an incomplete batch, the strategy specified by step_to_index_fn is currently not followed when choosing the next batch or the samples taken from it.

validate::variables returns R variables that are not columns of the data to be checked.

library(validate)
rules <- validator(A %in% letter[1:2])
variables(rules)

## [1] "A"      "letter"

As detailed in:
https://github.com/marketplace/actions/run-circleci-artifacts-redirector?version=0.1.0

It is used to link from the PR to the docs rendered by CircleCI, for instance in scikit-learn or sphinx-gallery. It helps when reviewing PRs.

The table is not labeled properly. The proper label should be Table 1.3.4. Also, the description is a bit unclear. You may want to eliminate the last sentence.

Note this is intended to just be a demo issue to suggest enhancements on the book.

Currently, the SeriesSchema object doesn't validate the index of the schema. The purpose of this task is to extend the __init__ signature of SeriesSchema to take an index argument, which would take a pa.Index or pa.MultiIndex. In the validate / __call__ call, the index should be checked.
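The proposal above can be sketched in plain Python. This is only an illustration of the shape of the change, not pandera's implementation: the IndexSchema and SeriesSchema classes below are hypothetical stand-ins, and a series is modeled as a mapping from index label to value so the sketch has no dependencies.

```python
from typing import Any, Callable, Mapping, Optional


class IndexSchema:
    """Hypothetical stand-in for an index schema (not pandera's pa.Index)."""

    def __init__(self, dtype: type, check: Optional[Callable[[Any], bool]] = None):
        self.dtype = dtype
        self.check = check

    def validate(self, labels) -> None:
        for label in labels:
            if not isinstance(label, self.dtype):
                raise ValueError(f"index label {label!r} is not of type {self.dtype.__name__}")
            if self.check is not None and not self.check(label):
                raise ValueError(f"index label {label!r} failed the index check")


class SeriesSchema:
    """Hypothetical series schema whose __init__ accepts an optional index schema."""

    def __init__(self, dtype: type, index: Optional[IndexSchema] = None):
        self.dtype = dtype
        self.index = index

    def validate(self, series: Mapping[Any, Any]) -> Mapping[Any, Any]:
        # Check the index first, but only when an index schema was supplied.
        if self.index is not None:
            self.index.validate(series.keys())
        # Then check the values themselves.
        for label, value in series.items():
            if not isinstance(value, self.dtype):
                raise ValueError(f"value at {label!r} is not of type {self.dtype.__name__}")
        return series

    # Calling the schema behaves the same as calling validate.
    __call__ = validate
```

Keeping the index argument optional means existing schemas that pass no index keep their current behavior.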

@ekstroem

Dear @ekstroem, would it be possible to add github_document as an output format? That is, the same document as the .Rmd for output: html, but with output: github_document.

Best,
CJ

This can be useful for getting an overview of the string structure of a column.
def patterns(self, input_cols, output_cols=None, mode=0):

See https://github.com/ironmussa/Optimus/blob/develop-3.0/optimus/engines/base/columns.py#L153 for more info about the parameters.
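As a rough illustration of what a string-structure overview could look like, here is a minimal sketch in plain Python. The symbol encoding (U for uppercase, l for lowercase, # for digits) and the helper names string_pattern and column_patterns are assumptions made for this example; they are not taken from Optimus, and the mode parameter is not modeled.

```python
from collections import Counter


def string_pattern(value: str) -> str:
    """Map each character to a symbol for its class (assumed encoding)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("#")   # digit
        elif ch.isupper():
            out.append("U")   # uppercase letter
        elif ch.islower():
            out.append("l")   # lowercase letter
        else:
            out.append(ch)    # keep punctuation and whitespace as-is
    return "".join(out)


def column_patterns(values):
    """Count how often each pattern occurs in a column of strings."""
    return Counter(string_pattern(v) for v in values)
```

For example, column_patterns(["AB-12", "CD-34", "ef56"]) groups the first two values under the pattern UU-## and the third under ll##, which makes inconsistent formats easy to spot.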

Context

A number of elements in the viz editor are half-baked or not optimally implemented: the user ends up reading extra, often redundant text or looking at over-emphasized or unnecessary elements, which slows down the viz-building workflow.
This need not be done now, but I am documenting them so we don't forget about them.

Idea/proposal

Clearly indicate "Advanced" state;

We had the following comment on a review:

For example, biodiversity data sets often contain multiple columns denoting the taxon (order, family, genus, etc) because often questions for those data require aggregating at different levels. One of the problems with this is that as a result the taxon information in the dataset is highly denormalized, with corresponding problems as a result. How woul

Labeled training example pairs should be stored in a table for selection and reuse. Data stored for examples should include:

Source
Source ids
Label
Label date
Comment to store labeling rules applied by labeler
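One possible shape for such a table, sketched with Python's built-in sqlite3 module so it runs without a database server. The table name, column names, column types, and the example label values are assumptions made for illustration; the fields themselves follow the list above, with "Source ids" split into one id per record of the pair.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS labeled_pairs (
    id          INTEGER PRIMARY KEY,
    source      TEXT NOT NULL,   -- where the records come from
    source_id_a TEXT NOT NULL,   -- id of the first record in the pair
    source_id_b TEXT NOT NULL,   -- id of the second record in the pair
    label       TEXT NOT NULL,   -- e.g. 'match' or 'distinct' (assumed values)
    label_date  TEXT NOT NULL,   -- ISO 8601 date string
    comment     TEXT             -- labeling rules applied by the labeler
)
"""


def create_store(path=":memory:"):
    """Open a database and make sure the examples table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_example(conn, source, id_a, id_b, label, label_date, comment=None):
    """Store one labeled pair."""
    conn.execute(
        "INSERT INTO labeled_pairs "
        "(source, source_id_a, source_id_b, label, label_date, comment) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source, id_a, id_b, label, label_date, comment),
    )
    conn.commit()


def select_examples(conn, source):
    """Return the labeled pairs recorded for one source, oldest first."""
    cur = conn.execute(
        "SELECT source_id_a, source_id_b, label FROM labeled_pairs "
        "WHERE source = ? ORDER BY id",
        (source,),
    )
    return cur.fetchall()
```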

Storing examples like this allows them to be reused in the following ways:

Select specific subsets of labeled pairs to build models from
Store multiple labels f

found via http://www.gbif.org/newsroom/uses/2016-gueta-et-al

https://www.researchgate.net/profile/Yohay_Carmel/publication/303833067_Quantifying_the_value_of_user-level_data_cleaning_for_big_data_A_case_study_using_mammal_distribution_models/links/57bf37ea08aeb95224d1039d.pdf

Some cleaning tasks done (Table 1)

repair these points

Wrong coordinate systems - I assume this means UTM vs. decimal degrees
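If that reading is right, a simple range check can flag candidate points. The sketch below is a heuristic written for this note, not the method used in the paper: UTM eastings and northings are given in meters and usually fall far outside the valid ranges for latitude and longitude in degrees. The function names are hypothetical.

```python
def looks_like_decimal_degrees(lat: float, lon: float) -> bool:
    """True when both values fall inside the valid ranges for degrees."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def flag_suspect_points(points):
    """Return the (lat, lon) pairs that cannot be decimal degrees."""
    return [p for p in points if not looks_like_decimal_degrees(*p)]
```

Passing the check does not prove a point is correct; it only rules out values that cannot be degrees.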

The episode overview, key points and one of the headings all mention renaming a column, but the episode doesn't actually contain any information on how to do this.

I would suggest adding something like this:

You can rename a column by opening the drop-down menu at the top of the column and choosing Edit column -> Rename this column. You will then be prompted to enter the new column name.

Here are 518 public repositories matching this topic...

johnkerl / miller

justmarkham / DAT8

justmarkham / pandas-videos

cgnorthcutt / cleanlab

data-forge / data-forge-ts

sfirke / janitor

Andrew-Kang-G / url-knife

msamogh / nonechucks

data-cleaning / validate

dirty-cat / dirty_cat

jim-schwoebel / voicebook

pandera-dev / pandera

ekstroem / dataMaid

ChrisMuir / refinr

HoloClean / HoloClean-Legacy-deprecated

ironmussa / Bumblebee

ajaymache / data-analysis-using-python

akvo / akvo-lumen

msberends / clean

ropensci / taxa

ammsa / DTCleaner

scottythered / gratefuldata

dssg / pgdedupe

LoLei / redditcleaner

sharmaroshan / Drugs-Recommendation-using-Reviews

ropensci / scrubr

LibraryCarpentry / lc-open-refine

umich-dbgroup / foofah

sharmaroshan / FIFA-2019-Analysis

jmcastagnetto / covid-19-data-cleanup