trace-classifier
A library for building a classifier for location traces on Spark, using TensorFlow and TensorFrames.
Overview
trace-classifier is a skeleton for building a neural-network based classifier for location traces.
The high-level overview of trace-classifier looks like this:
What you need to provide:
- The classes you want the model to learn.
- How to cut up a trace.
- Which features to use.
- Training data.
- A model architecture (optional).
trace-classifier provides the code for:
- Cutting up and featurising a trace into model input (rectangular matrices of fixed size).
- Performing k-fold cross validation and saving the best model.
- Freezing and optimizing for faster inference.
- Using a pretrained model to infer the class of unknown traces.
The best place to start is the On Reading the Code section and the example in sample_model/.
Install
pip install git+https://github.com/mapbox/trace-classifier@v0.2.0
Develop
python setup.py develop
Make targets
- make image: Build a docker image for testing
- make venv: Create a virtual environment
- make install: Ensure that the virtual environment exists, then install requirements from requirements-dev.txt and requirements.txt
- make clean: Delete the virtual environment
- make test-local: Ensure that install has completed, then run tests
- make test-docker: Ensure that the docker image has been built, then run tests in a docker container
Environment
The code has been verified to work in the following environment:
- Python: >3.5
- PySpark: 2.4.4
  - Recommend setting spark.sql.codegen.wholeStage to False
  - Recommend setting spark.sql.caseSensitive to True
- Tensorframes: 0.4.0-s_2.11 or 0.5.0-s_2.11
- Tensorflow: must match what tensorframes is using.
  - Tensorframes 0.4.0-s_2.11 uses tensorflow 1.6
  - Tensorframes 0.5.0-s_2.11 uses tensorflow 1.10
- Keras: >2.2.1
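The recommended Spark settings can be applied when the session is built. A minimal sketch (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("trace-classifier")                      # hypothetical app name
    .config("spark.sql.codegen.wholeStage", "false")  # recommended above
    .config("spark.sql.caseSensitive", "true")        # recommended above
    .getOrCreate()
)
```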
Notes on running in a Zeppelin notebook: Zeppelin's pyspark interpreter might fail to find the tensorframes python library even when tensorframes has been added as a Spark package.
What this error looks like
```scala
%spark
// Try importing tensorframes in spark shell (scala)
import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._
//=> Success
```
But
```python
%pyspark
# Try loading tensorframes in pyspark shell
import tensorframes as tfs
tfs.__version__
#=> ImportError: No module named 'tensorframes'
```
How to fix this error
This is because tensorframes' python binding lives inside the jar file, and $PYTHONPATH does not contain the path to the tensorframes jar.
Do not pip install tensorframes as that installs an older version of tensorframes' python library that is not compatible with tensorframes 0.4.0-s_2.11 or 0.5.0-s_2.11.
First, find where tensorframes jar is located (usually in ~/.ivy2/jars/):
```sh
%sh
# Find tensorframes jar
ls ~/.ivy2/jars/
# should see databricks_tensorframes-0.x.0-s_2.11.jar where x is 4 or 5
```
Next, check if this path is in $PYTHONPATH:
```sh
%sh
# Check if path to tensorframes jar is in PYTHONPATH
echo $PYTHONPATH
```
If not, add to $PYTHONPATH:
```python
%pyspark
import sys, glob, os
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))
```
Try importing tensorframes again:
```python
%pyspark
import tensorframes as tfs
tfs.__version__
#=> '2.0.0'
```
Training
The process of turning location traces into the model input matrix is distributed to the slave nodes. Training of the model, however, is done on the driver node – i.e. all data is brought to the driver node. The current implementation assumes that all the training data fits in the driver's memory.
See sample_model/ for an example training script.
To start the tensorboard server and monitor the training progress, run
tensorboard --logdir <saved_logs_dir>
from the terminal. <saved_logs_dir> is the directory that contains tensorflow logs – see SAVE section under Training Config File.
Inferencing
Inference is distributed: the pretrained tensorflow model is shipped to the slave nodes by tensorframes.
See sample_model/ for an example inference script.
On Reading the Code
Preprocessing before Preprocessing
A location trace is a series of coordinates [longitude, latitude, altitude, timestamp].
The preprocessing pipeline in trace-classifier requires that
- Timestamps are Unix time in ms,
- Traces are stored as array<array<double>> in a pyspark DataFrame column.
If the raw data does not satisfy the above, it needs to be preprocessed before it is fed to trace-classifier's preprocessing pipeline.
See the training script in sample_model/ for an example.
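For orientation, a hedged sketch of such a preprocessing step. The raw schema here (raw_df with columns trace_id, lon, lat, alt and a timestamp ts in seconds) is hypothetical:

```python
from pyspark.sql import functions as F

# raw_df: hypothetical input DataFrame with one row per fix and columns
# trace_id, lon, lat, alt, and a timestamp ts in seconds.
coordinate = F.array(
    F.col("lon"),
    F.col("lat"),
    F.col("alt"),
    (F.col("ts") * 1000).cast("double"),  # Unix time in ms, as required
)

df = (
    raw_df.withColumn("coordinate", coordinate)
    .groupBy("trace_id")
    # collect_list does not guarantee order, so in practice the coordinates
    # must be kept sorted by timestamp.
    .agg(F.collect_list("coordinate").alias("trace"))  # array<array<double>>
)
```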
Label vs. Class
For brevity, in this repo a class is defined as the string name of a group, and a label is the integer representation of a class. Classes are usually stored as a list, where the position of a class in this list is its corresponding label.
For example (totally irrelevant to classifying traces), say that we have images of cats and dogs that need to be separated based on whether the image contains a cat or a dog:
```
classes = [ "cat", "dog" ]
              |      |
              V      V
label         0      1
```
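In code, the mapping in both directions is just list indexing:

```python
classes = ["cat", "dog"]

label = classes.index("dog")  # class -> label: 1
name = classes[label]         # label -> class: "dog"
```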
Analogies
Like the example above, documentation and variable names in this repo make use of NLP and imagery analogies:
- sentence :: a full location trace
- phrase :: a piece of a trace
- word :: a group of coordinates, which may or may not be consecutive
- word vec :: features derived from the coordinates in a word
- word vec component :: a feature (i.e. a characteristic) of a word, computed using some or all coordinates in the word
- alphabet :: a single [longitude, latitude, altitude, timestamp] coordinate
Features as 1D Signals
Each feature is treated independently by the model: neurons in convolution layers only ever see data coming from one signal.
For example (again, irrelevant to classifying traces), say we have two features: temperature and brightness. A neuron in the first layer can take {temperature at time t1, temperature at time t2} as input, but it cannot take {temperature at time t1, brightness at time t1} as input.
This is analogous to a 1-pixel-wide image with multiple channels:
- word vec size :: number of channels
- phrase length :: image height
The architecture provided in architecture.py, which is modified from [2], takes 1D features as input.
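A minimal Keras sketch of this analogy, with hypothetical sizes and layers (the real architecture lives in architecture.py). Convolving each channel on its own keeps every neuron confined to a single signal:

```python
from keras.layers import Input, Conv1D, Lambda, concatenate

phrase_length, word_vec_size = 30, 4  # hypothetical sizes

# Input is a 1-pixel-wide "image": phrase_length tall, word_vec_size channels.
inputs = Input(shape=(phrase_length, word_vec_size), name="input")

# One Conv1D per feature (channel), so no neuron ever mixes two signals.
per_feature = [
    Conv1D(filters=8, kernel_size=3)(
        Lambda(lambda t, i=i: t[:, :, i:i + 1])(inputs)
    )
    for i in range(word_vec_size)
]
x = concatenate(per_feature)
```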
Features as a 2D Signal
This is analogous to a 1-channel image:
- word vec size :: image width
- phrase length :: image height
This repo does not contain an architecture for 2D features. See Customize section on how to add your own architecture.
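For comparison, a hypothetical 2D-input layer under this analogy would treat the whole word-vec matrix as a single-channel image:

```python
from keras.layers import Input, Conv2D

phrase_length, word_vec_size = 30, 4  # hypothetical sizes

# One channel; a 2D kernel may now mix neighbouring features.
inputs = Input(shape=(phrase_length, word_vec_size, 1), name="input")
x = Conv2D(filters=8, kernel_size=(3, 2))(inputs)
```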
Performance Optimization
In a few places, a single task is deliberately broken down into separate pieces.
Example 1: cheap-ruler measurement of distance (sketched below).
- Cheap ruler is copied from [1].
- The degree-to-kilometers multipliers are computed in the cheap_ruler function in cheap_ruler.py.
- The distance calculation itself is done in the create_word_vecs function in word_vec.py.
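For reference, a sketch of the cheap-ruler multipliers following [1]; the coefficients match the original library, but check cheap_ruler.py for the authoritative version:

```python
import math

def cheap_ruler_multipliers(lat):
    """Degree -> kilometer multipliers at a given latitude, per [1]."""
    c = math.cos(lat * math.pi / 180)
    c2 = 2 * c * c - 1
    c3 = 2 * c * c2 - c
    c4 = 2 * c * c3 - c2
    c5 = 2 * c * c4 - c3
    kx = 111.41513 * c - 0.09455 * c3 + 0.00012 * c5  # km per degree of longitude
    ky = 111.13209 - 0.56605 * c2 + 0.0012 * c4       # km per degree of latitude
    return kx, ky

# Distance then reduces to a cheap Euclidean formula:
# d = sqrt((dlon * kx)**2 + (dlat * ky)**2)
```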
Example 2: normalization by mean-MAD (sketched below).
- The mean and the MAD are computed in scaler.py.
- Normalizing – subtracting the mean, then dividing by the MAD – is done in the create_word_vecs function in word_vec.py.
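A sketch of the mean-MAD step in plain numpy (the distributed implementation lives in scaler.py and word_vec.py):

```python
import numpy as np

def mean_mad_normalize(x):
    """Subtract the mean, then divide by the mean absolute deviation."""
    mean = x.mean(axis=0)                # computed in scaler.py
    mad = np.abs(x - mean).mean(axis=0)  # computed in scaler.py
    return (x - mean) / mad              # applied in create_word_vecs
```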
Splitting the work this way is to make life easier for Spark's optimizer.
Files Required for Training a Model
Data
Location traces. Not provided in this repo.
Training Config File
trace-classifier expects a training configuration json file or dictionary. See sample_model/ for an example config json file.
The configuration file should contain 5 sections:
- PREPROCESS: parameters to be used in preprocessing
- INPUT: specification of model input
- TRAIN: parameters to be used in training a model
- VALIDATE: parameters to be used in validating a model
- SAVE: where to save files
Note that field names are case sensitive.
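For orientation, a hypothetical skeleton as a Python dictionary; all values here are made up, and sample_model/ contains a real config:

```python
config = {
    "PREPROCESS": {
        "word_size": [3, 1, 3],             # (N, B, S)
        "desired_ops": [],                  # see word_vec.py docstrings
        "normalize": "mean-mad",
        "desired_phrase_length": 30,
    },
    "INPUT": {
        "ndims": 1,
        "classes": ["walking", "driving"],  # hypothetical class names
    },
    "TRAIN": {
        "loss": "categorical_crossentropy",
        "metrics": ["accuracy"],
        "optimizer": "adam",
        "k_fold_cv": 5,
        "BATCH_GENERATOR": {"batch_size": 32, "shuffle": True},
        "FIT": {},
        "EARLY_STOPPING": {},
        "MODEL_CHECKPOINT": {},
    },
    "VALIDATE": {
        "BATCH_GENERATOR": {"batch_size": 32, "shuffle": False},
    },
    "SAVE": {
        "saved_logs_dir": "logs/",
        "saved_model_dir": "models/",
    },
}
```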
PREPROCESS
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
| word_size | Tuple of integers (N, B, S). How to create a word. N = number of coordinates in a word; B = stride between coordinates in a word (B=1 for consecutive coordinates); S = stride between words (S=N for no overlap between words). See docstring for create_words function in word_vec.py. |
| desired_ops | List of tuples, or list of list of tuples. Features of a word. See docstring for create_word_vecs function in word_vec.py. |
| normalize | One of {False, 'mean-mad'}. How to normalize word vec. See docstring for create_word_vecs function in word_vec.py. |
| clip_rng | (optional) Tuple of two integers. The (min, max) range to clip each component in the word vec. See docstring for create_word_vecs function in word_vec.py. |
| ndigits | (optional) Integer. Number of digits to round off the word vec components. |
| desired_phrase_length | Integer. Standard length of a phrase. |
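A plain-Python sketch of how (N, B, S) carves a trace into words, assuming the semantics summarized in the table above (the authoritative definition is the create_words docstring):

```python
def make_words(trace, N, B, S):
    span = (N - 1) * B + 1  # coordinates covered by one word
    return [
        [trace[start + k * B] for k in range(N)]
        for start in range(0, len(trace) - span + 1, S)
    ]

make_words(list(range(7)), N=3, B=1, S=3)  # => [[0, 1, 2], [3, 4, 5]]
make_words(list(range(7)), N=3, B=2, S=2)  # => [[0, 2, 4], [2, 4, 6]]
```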
INPUT
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
| ndims | Integer. Dimension of model input signal. See Features as 1D Signals and Features as a 2D Signal section. |
| classes | List of strings. Class names. See Label vs. Class section. |
TRAIN
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
| loss | String. Loss function. String name of a Keras built-in loss function, or a custom loss function. See custom_objects.py for a list of custom loss functions currently supported. |
| metrics | String, or a list of strings. Metrics to evaluate the model performance during training. Either the string name of a Keras built-in metric or a custom stateful metric object. See custom_objects.py for a list of stateful custom metrics currently supported. |
| optimizer | String. String name of a Keras built-in optimizer; not all Keras built-in optimizers are supported – see custom_objects.py for a list of supported optimizers. |
| k_fold_cv | Integer. Number of folds for k-fold cross validation. |
| BATCH_GENERATOR | Dictionary. Arguments for the batch generator. See table below. |
| FIT | Dictionary. Arguments for Keras' fit_generator. See table below. |
| EARLY_STOPPING | Dictionary. Arguments for Keras' EarlyStopping object. See table below. |
| MODEL_CHECKPOINT | Dictionary. Arguments for Keras' ModelCheckpoint object. See table below. |
BATCH_GENERATOR
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
| batch_size | Integer. The batch size. |
| shuffle | Boolean. Whether to shuffle data. |
FIT
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
Other fields supported are all the arguments of Keras' fit_generator except epochs, validation_data, callbacks and workers.
EARLY_STOPPING
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
Other fields supported are all the arguments of Keras' EarlyStopping object.
MODEL_CHECKPOINT
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
Other fields supported are all arguments of Keras' ModelCheckpoint object except filepath, save_best_only, period and verbose.
VALIDATE
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
| BATCH_GENERATOR | Dictionary. Arguments for the batch generator. See table above. |
SAVE
| Field | Description |
|---|---|
| DESCRIPTION | (optional) For any remark. |
| saved_logs_dir | String. Directory to save tensorflow logs. |
| saved_model_dir | String. Directory to save model and metadata. |
Files Produced by Training
The train function in train.py saves the following during training:
- HDF5 models (extension .h5), which include metadata + weights + architecture + optimizer state from the epoch with the best validation performance.
- Tensorflow logs.
For k-fold cross validation, train with k_fold_CV function in cross_validation.py. This produces k models and k corresponding tensorflow logs.
Use the save_model function in save.py to freeze and optimize the HDF5 model for fast inference. This will produce 2 files:
- A serialised model (extension .pb), which contains the weights + architecture.
- A metadata json file, which contains parameters required for preprocessing a trace.
Files Required for Inference
Data
Location trace. Not provided in this repo.
Model
A frozen tensorflow model (extension .pb) and its corresponding metadata json file.
Note: if not provided, this library loads the sample model in sample_model/.
Customize
Custom Architecture
You can feed your own model architecture to train function (see train.py) and k_fold_CV function (see cross_validation.py).
Model architecture should be wrapped inside a function that
- takes n_classes and input_shape as arguments, and
- returns an uncompiled Keras model.
Note: The input layer to the model must be named input, and the output layer must be named output; otherwise, freezing the model and inferencing will fail.
See architecture.py for an example.
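A minimal sketch of that contract; the layers here are made up, and architecture.py is the real example:

```python
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Dense
from keras.models import Model

def my_architecture(n_classes, input_shape):
    inputs = Input(shape=input_shape, name="input")  # must be named "input"
    x = Conv1D(16, kernel_size=3, activation="relu")(inputs)
    x = GlobalMaxPooling1D()(x)
    # Output layer must be named "output".
    outputs = Dense(n_classes, activation="softmax", name="output")(x)
    return Model(inputs=inputs, outputs=outputs)     # returned uncompiled
```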
Custom Metrics
You can use your own metric by adding to the get_custom_metrics function in custom_objects.py.
For an example of a custom stateful metric, see F1 object in metrics.py.
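For a simpler, stateless illustration of the kind of function such a metric could be (hypothetical; the repo's F1 in metrics.py shows the preferred stateful pattern):

```python
import keras.backend as K

def recall(y_true, y_pred):
    # Hypothetical stateless metric; stateful metrics like F1 are
    # preferred here because they aggregate correctly across batches.
    true_pos = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_pos = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_pos / (possible_pos + K.epsilon())
```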
New Optimizer
You can expand the list of Keras optimizers supported by adding to the get_optimizers function in custom_objects.py.
Note: Currently there's no way to pass values to the optimizer's arguments. This requires an API-breaking change to the construct_model function in model.py and is therefore left as a to-do.
Custom Loss Function
You can use your own loss function by adding to the get_custom_losses function in custom_objects.py.
Note: Currently there's no way to pass values to a custom loss function. This requires an API-breaking change to the construct_model function in model.py and is therefore left as a to-do.
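A hypothetical custom loss illustrating that limitation – any parameter (here, the class weights) has to be hard-coded:

```python
import keras.backend as K

def weighted_categorical_crossentropy(y_true, y_pred):
    weights = K.constant([1.0, 2.0])  # hard-coded: no way to pass these in
    y_pred = K.clip(y_pred, K.epsilon(), 1.0)
    return -K.sum(weights * y_true * K.log(y_pred), axis=-1)
```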
Future
Below is a list of minor changes that can make trace-classifier faster and more flexible:
- Remove all assumptions on column names.
- Select what to return from inference: probabilities, and/or the highest probability, and/or just the name of the class with the highest probability.
- Ability to pass arguments to the optimizer (see New Optimizer section)
- Ability to pass arguments to the custom loss function (see Custom Loss Function section)
- Remove assumptions on the name for the model's input and output layer (see Custom Architecture section)
- Compute the exact class weight in k-fold cross validation by using phrase counts (see the training notebook in sample_model/).
- Support more normalization methods (see docstring of create_word_vecs function in word_vec.py).
- Add data augmentation to the batch generator.
References
[1] Agafonkin V. 2016. Fast geodesic approximations with Cheap Ruler (link).
[2] Zhang Y, Wallace B. 2016. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification (arXiv:1510.03820v4).
[3] Definition of micro vs. macro vs. weighted metric: see sklearn.metrics.f1_score
[4] Sokolova M, Japkowicz N, Szpakowicz S. 2006. Beyond Accuracy, F-score and ROC: a Family of Discriminant Measures for Performance Evaluation (link).
[5] Eesa A, Arabo W. 2017. A Normalization Methods for Backpropagation: A Comparative Study (link).
License
Copyright (c) 2018 Mapbox.
Distributed under the MIT License (MIT).
