data-cleaning
Here are 518 public repositories matching this topic...
- Updated Apr 18, 2016 - Jupyter Notebook
- Updated Feb 24, 2020 - Jupyter Notebook
- Updated Apr 23, 2020 - Python
- Updated Jun 7, 2020 - TypeScript
Right now it's not immediately obvious they're deprecated. They should either not have reference pages at all, or the reference page should just say, don't use X, use Y.
- Updated Jun 7, 2020 - JavaScript
Currently, when an incomplete batch is being filled, the strategy specified by step_to_index_fn is not followed when choosing the next batch or the samples taken from that batch.
Write tests
validate::variables returns R variables that are not columns
data to be checked.
library(validate)
rules <- validator(A %in% letter[1:2])
variables(rules)
## [1] "A" "letter"
As detailed in:
https://github.com/marketplace/actions/run-circleci-artifacts-redirector?version=0.1.0
It is used to link from the PR to the docs rendered by CircleCI, for instance in scikit-learn or sphinx-gallery. It helps when reviewing PRs.
The table is not labeled properly; the correct label is Table 1.3.4. The description is also a bit unclear. You may want to remove its last sentence.
Note this is intended to just be a demo issue to suggest enhancements on the book.
Currently, the SeriesSchema object doesn't validate the index of the schema. The purpose of this task is to extend the __init__ signature of SeriesSchema to take an index argument, which would take a pa.Index or pa.MultiIndex. In the validate / __call__ call, the index should be checked.
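The requested behavior can be illustrated with a toy sketch. This is not pandera code: `ToySeriesSchema` and `IndexSpec` are hypothetical stand-ins, and a plain mapping from index label to value stands in for a pandas Series. It only shows the shape of the change, an optional `index` argument in `__init__` that is checked in `validate` / `__call__`.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class IndexSpec:
    """Hypothetical stand-in for an index schema such as pa.Index."""
    dtype: type


class ToySeriesSchema:
    """Toy schema: checks value types and, optionally, index label types."""

    def __init__(self, dtype: type, index: Optional[IndexSpec] = None):
        self.dtype = dtype
        self.index = index  # new optional argument described in the issue

    def validate(self, series: Mapping[Any, Any]) -> Mapping[Any, Any]:
        # Check the values first, as a schema without an index would.
        for label, value in series.items():
            if not isinstance(value, self.dtype):
                raise TypeError(f"value at {label!r} is not {self.dtype.__name__}")
        # Only check the index when an index spec was supplied.
        if self.index is not None:
            for label in series:
                if not isinstance(label, self.index.dtype):
                    raise TypeError(
                        f"index label {label!r} is not {self.index.dtype.__name__}"
                    )
        return series

    # Calling the schema is the same as calling validate.
    __call__ = validate


schema = ToySeriesSchema(float, index=IndexSpec(str))
schema({"a": 1.0, "b": 2.5})  # passes: float values, str index labels
```

Keeping `index` optional means schemas that do not pass it behave as before.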
- Updated Nov 7, 2019 - C++
- Updated Sep 15, 2018 - Python
This can be useful to get an overview of the string structure of a column.
def patterns(self, input_cols, output_cols=None, mode=0):
See https://github.com/ironmussa/Optimus/blob/develop-3.0/optimus/engines/base/columns.py#L153 for more information about the parameter.
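As a rough illustration of what a string-pattern overview could look like, here is a minimal pure-Python sketch. It is not the Optimus implementation and does not reproduce the semantics of `mode`. The helper names and the placeholder symbols (`U`, `l`, `#`) are assumptions chosen for this example.

```python
from collections import Counter
from typing import Iterable


def string_pattern(value: str) -> str:
    """Replace each character with a symbol for its class (assumed symbols)."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("U")   # uppercase letter
        elif ch.islower():
            out.append("l")   # lowercase letter
        elif ch.isdigit():
            out.append("#")   # digit
        else:
            out.append(ch)    # keep punctuation and whitespace as-is
    return "".join(out)


def pattern_counts(values: Iterable[str]) -> Counter:
    """Count how often each pattern occurs in a column of strings."""
    return Counter(string_pattern(v) for v in values)


column = ["AB-12", "CD-34", "x1"]
print(pattern_counts(column).most_common())
```

Counting patterns rather than raw values makes rare, malformed entries stand out because they fall into low-frequency patterns.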
- Updated Jan 2, 2019 - Jupyter Notebook
Context
A number of elements in the viz editor are half-baked or not optimally implemented. The user ends up reading extra, often redundant text or looking at overemphasized or unnecessary elements, which slows down the viz-building workflow.
This need not be done now, but I document them just so we don't forget about them
Idea/proposal
Clearly indicate "Advanced" state;
We had the following comment on a review:
For example, biodiversity data sets often contain multiple columns denoting the taxon (order, family, genus, etc.) because questions for those data often require aggregating at different levels. One of the problems with this is that the taxon information in the dataset is highly denormalized, with corresponding problems as a result. How woul
- Updated Jun 21, 2016 - Java
- Updated Sep 3, 2019
Labeled training example pairs should be stored in a table for selection and reuse. Data stored for examples should include:
- Source
- Source ids
- Label
- Label date
- Comment to store labeling rules applied by labeler
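One possible shape for such a table is sketched below with the standard-library `sqlite3` module. The table name, column names, and helper functions are assumptions made for illustration. In particular, "Source ids" is modeled here as two ID columns, one per record in the pair.

```python
import sqlite3
from datetime import date

# Hypothetical schema: one row per labeled pair of records.
SCHEMA = """
CREATE TABLE IF NOT EXISTS labeled_pairs (
    id          INTEGER PRIMARY KEY,
    source      TEXT NOT NULL,
    source_id_a TEXT NOT NULL,
    source_id_b TEXT NOT NULL,
    label       TEXT NOT NULL,
    label_date  TEXT NOT NULL,
    comment     TEXT
)
"""


def open_store(path=":memory:"):
    """Open (or create) the example store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_example(conn, source, id_a, id_b, label, label_date, comment=""):
    """Insert one labeled pair; the date is stored as ISO text."""
    conn.execute(
        "INSERT INTO labeled_pairs "
        "(source, source_id_a, source_id_b, label, label_date, comment) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source, id_a, id_b, label, label_date.isoformat(), comment),
    )
    conn.commit()


def select_examples(conn, source=None, label=None):
    """Return (id_a, id_b, label) rows, optionally filtered by source or label."""
    query = "SELECT source_id_a, source_id_b, label FROM labeled_pairs WHERE 1=1"
    params = []
    if source is not None:
        query += " AND source = ?"
        params.append(source)
    if label is not None:
        query += " AND label = ?"
        params.append(label)
    query += " ORDER BY id"
    return conn.execute(query, params).fetchall()


conn = open_store()
add_example(conn, "crm", "1", "2", "match", date(2020, 4, 1), "same email")
add_example(conn, "crm", "3", "4", "distinct", date(2020, 4, 2))
print(select_examples(conn, label="match"))
```

Filtering in `select_examples` is one way to pick specific subsets of labeled pairs when building a model.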
Storing examples like this allows them to be reused in the following ways:
- Select specific subsets of labeled pairs to build models from
- Store multiple labels f
- Updated Apr 29, 2020 - Jupyter Notebook
found via http://www.gbif.org/newsroom/uses/2016-gueta-et-al
Some cleaning tasks done (Table 1)
repair these points
- Wrong coordinate systems - I assume this means UTM vs. dec
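A simple range check could flag such points. This is a heuristic sketch that assumes "dec" refers to decimal degrees. The function names are hypothetical. It only detects values that cannot be decimal degrees and does not convert between coordinate systems.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) as recorded


def looks_like_decimal_degrees(lat: float, lon: float) -> bool:
    """True if both values fall within valid decimal-degree ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def flag_suspect_points(points: Iterable[Point]) -> List[Point]:
    """Return points whose values are outside decimal-degree ranges.

    UTM northings/eastings are in meters and are typically far larger,
    so they fail this check; small projected values would not be caught.
    """
    return [p for p in points if not looks_like_decimal_degrees(*p)]


records = [(32.1, 34.8), (3554000.0, 680000.0)]
print(flag_suspect_points(records))
```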
The episode overview, key points and one of the headings all mention renaming a column, but the episode doesn't actually contain any information on how to do this.
I would suggest adding something like this:
You can rename a column by opening the drop-down menu at the top of the column and choosing Edit column -> Rename this column. You will then be prompted to enter the new column name.
- Updated Apr 23, 2018 - CSS
- Updated May 31, 2019 - Jupyter Notebook
- Updated Jun 11, 2020 - R
Hello.
I've come across what (to me) seems to be a problem with the FILENAME and FILENUM variables.