Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upGitHub is where the world builds software
Millions of developers and companies build, ship, and maintain their software on GitHub — the largest and most advanced development platform in the world.
README
| |
|--\ /-|-.-. .
|__/\/ | \/_ . :.
___/ _ _ _ __ ___/ . .
/ _|| | | || \/ \ / \ . : .:
\_ \| |_| || | | || [ ] | ..
\___/|____/||_|_|_| \___/ ..
/ . .. .:
.: . .:
ByteFreq Data Profiler
Copyright, ByteSumo Limited, 2014.
All rights reserved.
Author: Andrew J Morgan
Introduction
The ByteFreq Data Profiler is a tool that allows you to gain insight into the quality of your data through profiling the data
formats of each field in your files. It is an enabling tool, which provides the primary research needed to drive automated
data quality engines downstream.
What it does
The profiler reads in your file based data, and column by column converts the data item level contents into a generalised
pattern-string that helps to characterise the format of the data held in each column. It then can output this as raw data,
or it can track the frequency of each pattern found per column in an associative array to generate frequency counts that
summarise the content of your data file, field by field.
The software enables three outputs necessary to construct an automated data quality engine: human readable reports, metrics
in a form that can be processed automatically, and a raw mapping between input data points and the profile strings calculated.
Example Output of a Human Readable Report:
(generated using the option -v report='1')
file column count pattern
======= =============== ===== =======
out.tab col_00002_email 3300 aaaa_999_9.aaaaaaa@aaaaaaaaaaaaaa.aaa
out.tab col_00002_email 1258 aaaaaa9aaaaa99999@aaaaaaa.aaa
out.tab col_00002_email 810 aaaa_99_9.aaaaaaa@aaaaaaaaaaaaaa.aaa
out.tab col_00002_email 720 aaaaa_99_9.aaaaaaa@aaaaaaaaaaaaaa.aaa
out.tab col_00002_email 398 AAAA-AAAA-999@aaaaaaaaaaaaaa.aaa
out.tab col_00002_email 382 aaaaaa9aaaaa9999@aaaaaaa.aaa
out.tab col_00002_email 120 AA_AAA_AAA_99@aaaaaaaaaaaaaa.aaa
out.tab col_00002_email 111 aaaaaaaaa.aa@aaaaaaaaaaaaa.aaa
out.tab col_00002_email 108 aaaa.aa.aa9@aaaaaaaaaaaaaa.aaa
out.tab col_00002_email 93 aaaaaaaaaaa99999@aaaaaaa.aaa
out.tab col_00002_email 92 Aaaa.Aaa@aaaaaaaaaaaaaa.aaa
To interpret the output, an understanding of how the program generalises data into the pattern-string is useful.
Below is a description of the two available algorithms you can select from when studying your data, the first resulting
in a granular study of your data, and the second a more generalised study of your data. Note the default setting is
to produce more granular output.
Highly Granular Pattern Construction:
For all data content, apply the following character level translation:
1) Translate all upper case characters in the data [A-Z] to becoming an "A"
2) Translate all lower case characters in the data [a-z] to an "a"
3) Translate all numbers in the data [0-9] to a "9"
4) Translate all tab marks to a "T"
5) Leave all remaining whitespace and high ASCII and punctuation as is
For example, a field called Address, may contain raw data such as:
John Smith, 33 High Street, Townford, Countyshire, UK, TC2 03R
which after transformation becomes:
Aaaa Aaaaa, 99 Aaaa Aaaaaa, Aaaaaaaa, Aaaaaaaaaaa, AA, AA9 99A
Less Granular Pattern Construction
For all data content items, apply the following character level translation:
1) Translate all repeating occurrences of upper case characters [A-Z]+ to becoming an "A"
2) Translate all repeating occurrences of lower case characters [a-z]+ to an "a"
3) Translate all sequences of numbers in the data [0-9]+ to a "9"
4) Translate all tab marks to a "T"
5) Leave all remaining whitespace, high ASCII, and punctuation as is
For example, a field called Address, may contain raw data such as:
John Smith, 33 High Street, Townford, Countyshire, UK, TC2 03R
which after transformation becomes:
Aa Aa, 9 Aa Aa, Aa, Aa, AA, A9 9A
Notice that the profile is more general using this rule, so the profiles match more occurances.
Technology
The profiler is written in AWK, and the intended program to run it is GAWK. There is a very specific reason for this.
AWK is a core part of the unix kernal, and as such, is both robust and ubiquitous in all types of computing environments.
This means the profiler can be introduced to highly locked down production systems with no new software installation per se
needed to run the scripts, and with minimal introduction of risk to the existing services hosted.
In addition to code being able to run on any unix system, it can also be run on windows systems via 'nix emulators such
as MKS (a 64-bit closed-source unix emulator for windows) or cygwin (a 32-bit open-source unix emulator for windows), both
common in enterprise production environments.
If you wish to run this code in a JAVA environment, it is perhaps possible via the JAWK library.
Input and Output Configuration options
The code has only a few simple configuration options carefully chosen to meet the needs of many. The three main output choices
configurable are:
-v report ="0" -- (0) produce profile metrics data, directly loadable to a database or data quality rules engine
-v report ="1" -- (1) produce profile metrics text reports, in a format easily printed
-v report ="2" -- (2) produce a modified copy of your input data, having additional columns containing the calculated profile
strings that can be loaded directly to a database or fed into a data quality rules engine.
Option Notes:
For option (0), the profiler will generate a metrics file having 4 column tab delimited file having the following delimited
file layout:
<filename><column_identifier><count><pattern>
where the column names specified in the header are prefixed with .col_<column_id>_.
For option (1) the report output includes these same fields but formatted into a more readable layout as seen in the example
given in the .What it does. section.
For option (2), the file will output the original data alongside new columns holding the calculated profiles. This is a critical
output, as when a profile is found that needs investigation, this output can be used to find the matching raw data records of
interest. The simplest was to do this is using filters in excel, but more sophisticated options are also available.
Command Line Options
The program can be called from the command line as follows in this example:
awk -F"\t" -f bytefreq_v1.04.awk -v header="1" -v report="1" -v grain="L" input.data > output.report.txt
Usage:
-F"\t" Use the native -F option in AWK to set your input data file delimiter, also known as it's Field Separator, FS.
The example above shows the setting for a tab delimited file, but there are many advanced field and record separator
choices available if you read the AWK documentation.
-f bytefreq_v1.01.awk The -f option tells awk to run the profiler code. Be sure to include a fully qualified file path
to the code if your working directory is not where the code is sitting.
-v header="1" The command line option to set the row to use to pick up the headers in the file.
If you set this value to row 1, it will use the first row as the header row. If you set it to X, the code
will use the Xth row to pick up headers and ignore all proceeding lines in the file, a feature that can
occasionally be very handy. The default value if this setting if it is not explicitly set on the command
line is "0", meaning, no header record.
-v report="0" Sets the output to a machine readable profile frequency dataset.
-v report="1" Sets the output to a human readable profile frequency report.
-v report="2" Sets the output to raw + profiled full volume data in a machine readable format.
-v grain="L" Set the Less granular level of pattern formats in the output
-v grain="H" Set the Highly granular level of pattern formats in the output
input.data The file you wish to examine. Note if the working directory doesn't hold the file, you need to set this
value to being the fully qualified path and file name.
> output.report.txt This is the standard unix notation meaning 'redirect the output to a file', which if not set
means the profiler will output it's metrics to STDOUT (standard out) be sure to use a fully qualified
name if you wish the output to be written to a different folder.
NOTES ON PARALLEL RUNNING
Should you wish, you can run this program using parallel,
using a handy tool found at http://www.gnu.org/software/parallel
An example of the syntax is:
# create a second test file
cat testdata.tab > testdata2.tab
# pass a file glob to parallel, and run the profiler over the files in parallel. Fold output into single report.
ls *.tab | parallel -q gawk -F"\t" -f bytefreq_v1.04.awk -v report="0" -v header="1" -v grain="H" ::: | gawk -F"\t" 'NF==6 {print $0}' > output.rpt
Parallel should be cited if used in acedemic work.
@article{Tange2011a,
title = {GNU Parallel - The Command-Line Power Tool},
author = {O. Tange},
address = {Frederiksberg, Denmark},
journal = {;login: The USENIX Magazine},
month = {Feb},
number = {1},
volume = {36},
url = {http://www.gnu.org/s/parallel},
year = {2011},
pages = {42-47},
doi = {10.5281/zenodo.16303}
}