Analytics/Kraken
Group: Analytics/Engineering
Start: 2011-05-22
End:
Team: Diederik van Liere, Andrew Otto, Dan Andreescu
Backlog: Mingle
Status: See updates
Kraken infrastructure documentation can be found on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Kraken
Kraken is the code-name for the robust distributed computing and data-services platform under construction by the Wikimedia Analytics Team.
Rationale
The Wikimedia movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. Analytics permeates nearly all our goals, yet our current capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in an easily accessible way; our efforts are spread across different departments; our data is fragmented over different systems and databases; and our tools are ad hoc.
Rather than merely improving existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all data streams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.
Status
The first Analytics use case for this system will be Camus, LinkedIn's open source application for loading Kafka data into Hadoop. Once this is productized, we'll be able to regularly load log data from our servers into Hadoop for processing and analysis.
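As a rough illustration of what regularly loading log data means in practice, the sketch below shows the general pattern of an incremental batch load: read a bounded batch from a log starting at a saved offset, group records into hourly buckets, and return the offset at which the next run should resume. It is a conceptual sketch only, not Camus's implementation or API. The function names, the `webrequest` topic name, the path layout, and the in-memory list standing in for a Kafka partition are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_partition(topic, ts):
    """Build a time-bucketed path (hypothetical layout: topic/YYYY/MM/DD/HH)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{topic}/{dt:%Y/%m/%d/%H}"


def load_batch(topic, log, start_offset, max_records):
    """Read up to max_records from `log`, starting at start_offset.

    `log` is a list of (unix_timestamp, line) tuples standing in for one
    partition of a message queue. Returns (partitions, next_offset):
    the lines grouped by hourly partition path, and the offset from
    which the next scheduled run should resume.
    """
    batch = log[start_offset:start_offset + max_records]
    partitions = defaultdict(list)
    for ts, line in batch:
        partitions[hourly_partition(topic, ts)].append(line)
    return dict(partitions), start_offset + len(batch)
```

A scheduled job would persist the returned offset between runs, so that each run picks up only the records that arrived since the previous one.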
Documentation
- Getting Access
- Overview
- Data
- Pixel Service Endpoint
- Request Logging
- Hardware Planning
- Notes
- Meeting Notes
Components
Request Logging
Public data import endpoint, and system for capturing the incoming firehose from our front-end servers.
Data Tools
Data processing library and toolkit provided and maintained by the Analytics team, especially for use with Kraken.
Infrastructure
Bucket for general cluster infrastructure and maintenance tasks.
Meeting Notes
Tasks
- Fix the one busted Cisco machine (an07) [otto]
- Data owner services -- export, dashboard, visualizations (on Hue?) [dsc, dan]
- Hue plugin for Limn integration? [dsc]
- Get analysts experimenting with Hive/Pig [DvL]