data-cleaning

Here are 518 public repositories matching this topic...

trantor
trantor commented Jan 23, 2020

Hello.
I've come across what (to me) seems to be a problem with the FILENAME and FILENUM variables.

# mlr --version
Miller v5.6.2

# cat /tmp/csv1
A,B,C
_2GB,255,2
_4GB,120,4
_6GB,50,6
_10GB,10,10

# cat /tmp/csv2
FIRST,SECOND,THIRD,FOURTH
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16

# mlr --icsv cat then put 'print FILENAME'   /tmp/csv1 /tmp/csv2
/tmp/csv1
A=_2GB,B=255,C=2
/
cosmicBboy
cosmicBboy commented Apr 11, 2020

Currently, the SeriesSchema object doesn't validate the index of the schema. The purpose of this task is to extend the __init__ signature of SeriesSchema to take an index argument, which would take a pa.Index or pa.MultiIndex. In the validate / __call__ call, the index should be checked.
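
The proposal could look roughly like the sketch below. It is a dependency-free illustration, not pandera's actual API: `SeriesSchemaSketch` and `IndexSketch` are hypothetical stand-ins for `SeriesSchema` and `pa.Index`, and a series is modelled as two parallel sequences (labels and values) rather than a pandas object.

```python
from typing import Any, Callable, List, Optional, Sequence

Check = Callable[[Any], bool]


class IndexSketch:
    """Hypothetical stand-in for an index schema: one check applied to every label."""

    def __init__(self, check: Check, name: Optional[str] = None):
        self.check = check
        self.name = name

    def validate(self, labels: Sequence[Any]) -> None:
        bad = [label for label in labels if not self.check(label)]
        if bad:
            raise ValueError(f"index {self.name!r} failed validation for labels {bad}")


class SeriesSchemaSketch:
    """Hypothetical series schema whose __init__ accepts an optional index schema."""

    def __init__(self, check: Check, index: Optional[IndexSketch] = None):
        self.check = check
        self.index = index

    def validate(self, labels: Sequence[Any], values: Sequence[Any]) -> List[Any]:
        if len(labels) != len(values):
            raise ValueError("labels and values must have the same length")
        # The new step: check the index first, but only if one was supplied.
        if self.index is not None:
            self.index.validate(labels)
        bad = [value for value in values if not self.check(value)]
        if bad:
            raise ValueError(f"values failed validation: {bad}")
        return list(values)

    # Calling the schema is the same as calling validate.
    __call__ = validate


schema = SeriesSchemaSketch(
    check=lambda value: value >= 0,
    index=IndexSketch(lambda label: isinstance(label, str), name="sample_id"),
)
validated = schema(["a", "b"], [1, 2])
```

Making `index` optional keeps existing schemas working unchanged, since the index check only runs when an index schema is passed in.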

Kiarii
Kiarii commented Sep 26, 2019

Context

A number of elements in the viz editor are half-baked or not optimally implemented. The user ends up reading extra, often redundant text or looking at overemphasized or unnecessary elements, which slows down the viz-building workflow.
This need not be done now, but I am documenting them so we don't forget about them.

Idea/proposal

Clearly indicate "Advanced" state;

zachary-foster
zachary-foster commented Apr 10, 2018

We had the following comment on a review:

For example, biodiversity data sets often contain multiple columns denoting the taxon (order, family, genus, etc.) because often questions for those data require aggregating at different levels. One of the problems with this is that as a result the taxon information in the dataset is highly denormalized, with corresponding problems as a result. How woul

ecsalomon
ecsalomon commented Sep 28, 2017

Labeled training example pairs should be stored in a table for selection and reuse. Data stored for examples should include:

  • Source
  • Source ids
  • Label
  • Label date
  • Comment to store labeling rules applied by labeler
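
A minimal sketch of such a table, using Python's built-in `sqlite3` module. The table name `labeled_pairs`, the column names, and the helpers `open_store`, `add_label` and `select_pairs` are all assumptions made for illustration, as is the choice to store the pair as two ids from a single source.

```python
import sqlite3
from datetime import date

# Assumed layout: one row per labeling decision on a pair of records.
SCHEMA = """
CREATE TABLE IF NOT EXISTS labeled_pairs (
    pair_id     INTEGER PRIMARY KEY,
    source      TEXT NOT NULL,
    source_id_a TEXT NOT NULL,
    source_id_b TEXT NOT NULL,
    label       TEXT NOT NULL,
    label_date  TEXT NOT NULL,
    comment     TEXT
)
"""


def open_store(path=":memory:"):
    """Open (or create) the example store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_label(conn, source, id_a, id_b, label, comment=None, label_date=None):
    """Record one labeled pair; the label date defaults to today (ISO format)."""
    label_date = label_date or date.today().isoformat()
    conn.execute(
        "INSERT INTO labeled_pairs "
        "(source, source_id_a, source_id_b, label, label_date, comment) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source, id_a, id_b, label, label_date, comment),
    )
    conn.commit()


def select_pairs(conn, source):
    """Return (id_a, id_b, label) tuples for one source, in insertion order."""
    return conn.execute(
        "SELECT source_id_a, source_id_b, label FROM labeled_pairs "
        "WHERE source = ? ORDER BY pair_id",
        (source,),
    ).fetchall()


conn = open_store()
add_label(conn, "registry_a", "101", "205", "match",
          comment="same name and birth date")
add_label(conn, "registry_b", "7", "9", "non-match")
training_rows = select_pairs(conn, "registry_a")
```

The free-text `comment` column is where the labeling rule applied by the labeler would go, so that later model builds can filter on or audit it.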

Storing examples like this allows them to be reused in the following ways:

  • Select specific subsets of labeled pairs to build models from
  • Store multiple labels f
ebunge
ebunge commented Aug 16, 2019

The episode overview, key points and one of the headings all mention renaming a column, but the episode doesn't actually contain any information on how to do this.

I would suggest adding something like this:

You can rename a column by opening the drop-down menu at the top of the column and choosing Edit column -> Rename this column. You will then be prompted to enter the new column name.
