pandas-profiling / pandas-profiling

The program compares two files at a time and does the following: (1) gathers metadata on the individual tables (column count, record count, list of columns with data types, etc.); (2) identifies matching columns between tables based on names as well as data, using machine learning to handle syntactic and semantic variations of column names for accurate matching; (3) finds duplicate columns within a single table, with the option to deduplicate if required; (4) finds columns with missing data or null values.

python data-profiling

Updated Feb 17, 2018
Python
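
The steps in the description above (table metadata, name-based column matching, duplicate columns, null counts) can be sketched with the Python standard library. This is only an illustration, not the repository's code: the function names `profile_table`, `duplicate_columns`, and `match_columns` are hypothetical, tables are assumed to be lists of dicts with hashable values, and plain string similarity from `difflib` stands in for the machine-learning matcher, so it covers syntactic name variations only.

```python
from difflib import SequenceMatcher


def profile_table(rows):
    """Return basic metadata for a table given as a list of dicts."""
    columns = list(rows[0].keys()) if rows else []
    profile = {
        "record_count": len(rows),
        "column_count": len(columns),
        "dtypes": {},
        "null_counts": {},
    }
    for col in columns:
        values = [row.get(col) for row in rows]
        # Assumption: None and the empty string both count as missing.
        non_null = [v for v in values if v not in (None, "")]
        profile["dtypes"][col] = sorted({type(v).__name__ for v in non_null})
        profile["null_counts"][col] = len(values) - len(non_null)
    return profile


def duplicate_columns(rows):
    """Group column names whose values are identical in every row."""
    columns = list(rows[0].keys()) if rows else []
    groups = {}
    for col in columns:
        key = tuple(row.get(col) for row in rows)  # values must be hashable
        groups.setdefault(key, []).append(col)
    return [names for names in groups.values() if len(names) > 1]


def match_columns(left_cols, right_cols, threshold=0.8):
    """Pair each left column with its most similar right column name."""
    def normalize(name):
        # Lowercase and drop separators so "Customer_ID" ~ "customerid".
        return "".join(ch for ch in name.lower() if ch.isalnum())

    matches = []
    for left in left_cols:
        best, best_score = None, 0.0
        for right in right_cols:
            score = SequenceMatcher(None, normalize(left), normalize(right)).ratio()
            if score > best_score:
                best, best_score = right, score
        if best is not None and best_score >= threshold:
            matches.append((left, best, round(best_score, 2)))
    return matches
```

The `threshold` of 0.8 is an arbitrary starting point; semantic matches such as "zip" and "postal_code" would need a different approach than string similarity.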

mtna / rds-js-examples

TypeScript/JavaScript example code using the RDS API

javascript metadata data-science typescript data-validation data-mapping data-transformation open-data rds data-profiling data-ingestion data-dissemination rich-data-services metadata-technology-north-america mtna rds-api

Updated Aug 1, 2020
TypeScript

BJennWare / hitucc

Distributable UCC Discovery Algorithm based on Akka

akka distributed java8 data-profiling unique-column-combination

Updated Jan 7, 2020
Java
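
For readers unfamiliar with the term, a unique column combination (UCC) is a set of columns whose combined values identify every row. The sketch below is a naive, single-machine brute-force search in Python, not the distributed algorithm this repository implements; the names `is_unique` and `minimal_uccs` are hypothetical, and the search is exponential in the number of columns.

```python
from itertools import combinations


def is_unique(rows, columns):
    """True if no two rows share the same values for the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, columns):
    """Enumerate minimal unique column combinations, smallest first."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # A superset of a known UCC is unique but not minimal; skip it.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found
```

Because candidates are visited in order of increasing size, every combination that survives the superset check and passes `is_unique` is minimal.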

statsim / profile

Profile. Generate data profiles in the browser (work in progress)

statistics streaming-algorithms online-algorithms data-profiling data-profile

Updated Aug 1, 2020
JavaScript

ArghyaBiswas0 / FIFA19DreamTeam

An R Notebook that performs basic data profiling and exploratory data analysis on the FIFA19 players dataset and creates a dream team of the top 11 players based on various player attributes.

r exploratory-data-analysis data-profiling

Updated Nov 28, 2019
HTML

jmakeig / data-profile

Sandbox to test out ideas for profiling document data

json xml marklogic data-quality data-profiling

Updated Mar 23, 2018
HTML

camillereaves / subreddit-crossposting

Map naturally-occurring inter-subreddit content sharing patterns on Reddit by analyzing how posts are “cross-posted” between subreddits based on 2.5 million posts across the top 2,500 subreddits. Uses ECL and HPCC Systems.

data-mining reddit data-analysis social-network-analysis data-processing ecl data-cleaning data-profiling hpcc hpcc-platform mapping-tools hpcc-systems data-analysis-in-ecl

Updated Jul 14, 2019
ECL

wosaku / data-profiling-mask-analyzer

Python function to generate a mask analysis

python data-quality data-profiling mask-analysis mask-analyzer

Updated Jul 22, 2017
Jupyter Notebook
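
As an illustration of what a mask analysis does, the sketch below replaces each character with its class and counts the resulting patterns. It is not the repository's function: the names `mask` and `mask_analysis` are hypothetical, and the mask alphabet (`A` for uppercase letters, `a` for lowercase letters, `9` for digits, everything else kept as-is) is an assumed convention.

```python
from collections import Counter


def mask(value):
    """Map a value to its character-class pattern."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and whitespace unchanged
    return "".join(out)


def mask_analysis(values):
    """Return (mask, count) pairs, most frequent mask first."""
    return Counter(mask(v) for v in values).most_common()
```

For example, `mask("AB-1234")` gives `"AA-9999"`. Rare masks in a column are often the first place to look for malformed values.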

bballamudi / great_expectations

Always know what to expect from your data.

data-quality data-profiling

Updated Oct 26, 2019
Python

VIDA-NYU / sato

Fork of Sato for easy deployment as a Python package

table data-processing data-profiling

Updated Dec 18, 2019
Python

p-disha / NYC-Open-Dataset-Analysis

Identified data types for each distinct column value across 1,900 datasets. For each column, summarized the semantic types present using fuzzy logic and Levenshtein distance. Identified, and drew inferences about, the 3 most frequent 311 complaint types by borough.

visualization json big-data python3 pyspark levenshtein-distance matplotlib fuzzy-logic data-profiling big-data-analytics nyc-opendata nyc-open-data 311-data

Updated Apr 15, 2020
Python

christianbors / OpenRefineQualityMetrics

MetricDoc is an interactive visual exploration environment for assessing data quality

data-wrangling data-quality-checks visual-analytics interactive-visualizations data-quality data-profiling quality-metrics

Updated Mar 30, 2020
JavaScript

bballamudi / Optimus

🚚 Agile Data Science Workflows made easy with PySpark

pyspark data-quality data-profiling

Updated Oct 27, 2019
Jupyter Notebook

bballamudi / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

data-quality-checks data-profiling

Updated Oct 24, 2019
Scala

data-profiling

Here are 31 public repositories matching this topic...

pandas-profiling / pandas-profiling

great-expectations / great_expectations

ironmussa / Optimus

fbdesignpro / sweetviz

ironmussa / Bumblebee

ing-bank / popmon

psebenick / data-profiling

VIDA-NYU / datamart

ahmadassaf / openData-checker

mtna / rds-r

liesebekkers / data-cleaning

gandalf1819 / NYCOpenData-Profiling-Analysis

hpcc-systems / DataPatterns

darenasc / auto-fes

mtna / rds-js

giagiannis / data-profiler

rounayak / Data-Profiling-Tool

mtna / rds-js-examples

BJennWare / hitucc

statsim / profile

ArghyaBiswas0 / FIFA19DreamTeam

jmakeig / data-profile

camillereaves / subreddit-crossposting

wosaku / data-profiling-mask-analyzer

bballamudi / great_expectations

VIDA-NYU / sato

p-disha / NYC-Open-Dataset-Analysis

christianbors / OpenRefineQualityMetrics

bballamudi / Optimus

bballamudi / deequ
