Kx Machine Learning Notebooks

The example machine learning notebooks demonstrate the benefits of using kdb+/q alongside the Kx interfaces embedPy and JupyterQ and the Kx Natural Language Processing (NLP), Machine Learning Toolkit (ML-Toolkit) and Automated Machine Learning (AutoML) libraries. These notebooks showcase how to solve a range of machine learning problems, from feature engineering and neural network design to model training and testing.

embedPy

embedPy is part of the fusion for kdb+ initiative and allows Python functions to be applied to kdb+ data within a q process. Python and kdb+/q developers can leverage the benefits of both technologies, pairing kdb+’s high-speed analytics with Python’s rich ecosystem of machine learning libraries, including but not limited to scikit-learn, matplotlib and TensorFlow.

JupyterQ

JupyterQ is also part of the fusion for kdb+ initiative and provides users with a kdb+ kernel for the Jupyter project. This kernel allows users to create Jupyter Notebooks and additionally to leverage JupyterHub and JupyterLab. These technologies are ubiquitous within the data science community.

NLP

The Kx NLP library can be used to answer a variety of questions about unstructured text and can therefore be used to preprocess text data in preparation for model training. Input text data, in the form of emails, tweets, articles or novels, can be transformed to vectors, dictionaries and symbols which can be handled very effectively by q.
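
To illustrate the kind of transformation described above, the following is a minimal Python sketch that turns raw text into a dictionary of token counts and then into a numeric vector. It is not the Kx NLP library's API: the function names `tokenize`, `bag_of_words` and `to_vector`, the tokenization rule and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(tokenize(text))


def to_vector(counts, vocabulary):
    """Order the counts by a fixed vocabulary to obtain a numeric vector."""
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical sample document, used only to exercise the functions.
doc = "The whale surfaced and the crew watched the whale."
counts = bag_of_words(doc)
vocabulary = sorted(counts)
vector = to_vector(counts, vocabulary)

print(dict(counts))
print(vocabulary)
print(vector)
```

Once text is in dictionary or vector form, it can be stored and queried like any other structured data.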

ML-Toolkit

The toolkit contains libraries and scripts that provide kdb+/q users with general-use functions and procedures to perform machine-learning tasks on a wide variety of datasets. This includes utility functions, the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm, cross validation and grid search procedures, and clustering algorithms.
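
As a rough picture of what a cross validation procedure does, the Python sketch below splits sample indices into k contiguous folds and yields one train/test pair per fold. It is an assumption-laden illustration, not the ML-Toolkit's implementation: the name `k_fold_indices` and the choice of unshuffled, contiguous folds are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Folds differ in size by at most one sample; the first
    n_samples % k folds receive the extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each sample appears in exactly one test fold.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` set and scored on the matching `test` set, with the scores averaged across folds.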

AutoML

The Automated Machine Learning framework provides users with the ability to automate the process of applying machine learning techniques to real-world problems in kdb+/q. The pipeline comprises preprocessing, feature engineering, cross validation, model selection, hyperparameter tuning and report generation. As shown in the associated notebook, the framework is designed to be flexible enough for both novice and expert kdb+ users and machine learning engineers.

Notebooks

The contents of the notebooks are as follows:

  1. Decision Trees: A decision tree is trained to classify a patient's tumour as benign or malignant. The performance of the model is measured by computing a confusion matrix and ROC curve.

  2. Random Forests: Random forest and XGBoost classifiers are trained to identify satisfied and unsatisfied financial clients. Different parameters are tuned and tested, with classifier performance evaluated using the ROC curve.

  3. Neural Networks: A neural network is trained to identify samples of handwritten digits from the MNIST database. Performance is calculated for a test set of images, with a variety of plots used to show the results.

  4. Dimensionality Reduction: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce the dimensionality of the original dataset. Several plots are used to visualize reduced features and infer differences between the distinct groups present in the data.

  5. Feature Engineering: Examples of data preprocessing steps that can strongly affect the performance of a model are demonstrated. The first section of the notebook focuses on the robustness of different scalers against k-nearest neighbours, while the second section demonstrates the importance of one-hot encoding labels when training a neural network.

  6. Feature Extraction and Selection: The three examples provided explain how to effectively use the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm to extract features and determine how significant each feature is in predicting a target vector. The examples make use of both random forest and gradient boosting models.

  7. Cross Validation: Cross validation procedures are demonstrated against a random forest classifier, with the aim of classifying breast cancer data. Results produced for the different cross validation methods available in the toolkit are compared.

  8. Natural Language Processing: Parsing, clustering, sentiment analysis and outlier detection are demonstrated on a range of corpora, including the novel Moby Dick, the emails of the Enron CEOs and the 2014 IEEE VAST Challenge articles.

  9. K-Nearest Neighbours: The notebook details the steps to follow in a machine learning problem prior to model training. These include feature scaling, data splitting and parameter tuning, the last of which is performed by measuring the accuracy of a k-nearest neighbours model for different values of the parameter k.

  10. Automated Machine Learning: The notebook looks at predicting how likely a telecommunications customer is to churn based on their behaviour. The data and associated target are used throughout the notebook and are passed into the AutoML pipeline in both its default configuration and a custom user-defined configuration, with the steps in the pipeline explained throughout.

  11. Clustering: Examples of how to use the k-means, DBSCAN, affinity propagation, hierarchical and CURE algorithms available within the ML-Toolkit are provided. The notebook demonstrates how to effectively visualize results produced and make use of scoring functions contained within the toolkit. A real-world application is also included.
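
Several of the notebooks above evaluate classifiers with a confusion matrix and a ROC curve. The Python sketch below shows how those two quantities relate, under stated assumptions: binary labels encoded as 0 and 1, and made-up labels and scores. The helper names `confusion_matrix` and `roc_point` are hypothetical and are not taken from the notebooks, which use kdb+/q and embedPy.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    y_pred = [1 if score >= threshold else 0 for score in scores]
    m = confusion_matrix(y_true, y_pred)
    positives = m["tp"] + m["fn"]
    negatives = m["fp"] + m["tn"]
    tpr = m["tp"] / positives if positives else 0.0
    fpr = m["fp"] / negatives if negatives else 0.0
    return fpr, tpr


# Made-up labels and classifier scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.7]

# Sweeping the threshold traces out points on the ROC curve.
roc = [roc_point(y_true, scores, t) for t in (0.25, 0.5, 0.75)]
print(roc)
```

Lowering the threshold moves along the curve towards higher true and false positive rates; a full ROC curve uses every distinct score as a threshold.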

Requirements

Dependencies

Install the Python dependencies with pip:

pip install -r requirements.txt

or with conda:

conda install --file requirements.txt

N.B. The following must also be installed for all of the notebooks to run correctly:

  1. graphviz must be installed on the system running the notebooks.
  2. xgboost must be installed either via conda, using the command conda install -c anaconda py-xgboost, or by following the instructions at https://xgboost.readthedocs.io/en/latest/build.html
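
One way to confirm that these two extra requirements are in place before opening the notebooks is a short Python check such as the sketch below. It assumes that the system graphviz installation exposes the `dot` executable on the PATH and that xgboost is importable as the `xgboost` module; the helper name `missing_requirements` is hypothetical.

```python
import importlib.util
import shutil


def missing_requirements(modules, executables):
    """Return the modules that cannot be imported and the executables
    that cannot be found on the PATH."""
    missing_modules = [name for name in modules
                       if importlib.util.find_spec(name) is None]
    missing_executables = [name for name in executables
                           if shutil.which(name) is None]
    return missing_modules, missing_executables


# Assumption: "dot" is the graphviz command-line tool to look for.
modules, executables = missing_requirements(["xgboost"], ["dot"])
if modules or executables:
    print("Missing Python modules:", modules)
    print("Missing system executables:", executables)
else:
    print("xgboost and graphviz were both found.")
```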

Docker

A prebuilt Docker image is available with all the dependencies installed. If you have Docker installed, run it with:

docker run -it -p 8888:8888 --name mymlnotebooks kxsys/mlnotebooks

Now point your browser at http://localhost:8888/tree/notebooks/

For subsequent runs, you will not be prompted to redo the license setup when calling:

docker start -ai mymlnotebooks

N.B. build instructions for the image are available
