Analytics/Kraken

Wikimedia Foundation engineering activity

Kraken

A robust, distributed computing and data services platform built on top of Hadoop.

Group:	Analytics/Engineering
Start:	2011-05-22
End:
Team:	Diederik van Liere, Andrew Otto, Dan Andreescu
Backlog:	Mingle
Status:	See updates

Kraken infrastructure documentation can be found on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Kraken

Kraken is the code name for the robust distributed computing and data services platform under construction by the Wikimedia Analytics Team.

Rationale

The Wikimedia movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. It permeates nearly all our goals, yet our current analytics capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in a way that is easily accessible; our efforts are distributed among different departments; our data is fragmented over different systems and databases; and our tools are ad hoc.

Rather than merely improve existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all datastreams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.

Status

2014-03-monthly:

We reached a milestone in our ability to deploy Java applications at the Foundation this month when we stood up an Archiva build artifact repository. This enables us to consistently deploy Java libraries and applications and will be used in Hadoop and Search initially.

The first Analytics use case for this system will be Camus, LinkedIn's open source application for loading Kafka data into Hadoop. Once this is running in production, we'll have the ability to regularly load log data from our servers into Hadoop for processing and analysis.

Documentation

Getting Access
Overview
- Software Overview
Data
Pixel Service Endpoint
Request Logging
Hardware Planning
Notes
Meeting Notes
- Security Review Meeting
- Architecture Review Meeting

Components

Request Logging

Public data import endpoint and the system for capturing the incoming firehose from our front-end servers.

Data Tools

Data processing library and toolkit provided and maintained by the Analytics team, especially for use with Kraken.

Infrastructure

Bucket for general cluster infrastructure and maintenance tasks.

Meeting Notes

Tasks

Fix the one busted Cisco machine (an07) [otto]
Data owner services -- export, dashboard, visualizations (on Hue?) [dsc, dan]
- Hue plugin for Limn integration? [dsc]
Get analysts experimenting with Hive/Pig [DvL]
