Analytics/Kraken

From MediaWiki.org

Kraken infrastructure documentation can be found on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Kraken

Kraken is the code-name for the robust distributed computing and data-services platform under construction by the Wikimedia Analytics Team.


Rationale

The Wikimedia movement has a chronic need for analytics. We need it to understand our editors, to encourage growth, to engender diversity, to focus our resources, to improve our engineering efforts, and to measure our success. Analytics permeates nearly all our goals, yet our current capabilities are underdeveloped: we lack infrastructure to capture editor, visitor, clickstream, and device data in a way that is easily accessible; our efforts are distributed among different departments; our data is fragmented over different systems and databases; and our tools are ad hoc.

Rather than merely improve existing jobs and data pipelines, the Analytics Team aims to construct a Data Services Platform capable of mining intelligence from all data streams of interest, providing this insight in real time, and exposing it via an API to power applications, mash up into websites, and stream to devices.

Status

2014-03-monthly:

We reached a milestone in our ability to deploy Java applications at the Foundation this month when we stood up an Archiva build artifact repository. This enables us to consistently deploy Java libraries and applications and will be used in Hadoop and Search initially.

The first Analytics use case for this system will be Camus, LinkedIn's open source application for loading Kafka data into Hadoop. Once this is productized, we'll have the ability to regularly load log data from our servers into Hadoop for processing and analysis.


Documentation

Cluster Dataflow Diagram

Components

Request Logging

Public data import endpoint, and system for capturing the incoming firehose from our front-end servers.

Data Tools

Data processing library and toolkit provided and maintained by the Analytics team, especially for use with Kraken.

Infrastructure

Bucket for general cluster infrastructure and maintenance tasks.

Meeting Notes

Tasks

  • Fix the one busted Cisco machine (an07) [otto]
  • Data owner services -- export, dashboard, visualizations (on Hue?) [dsc, dan]
    • Hue plugin for Limn integration? [dsc]
  • Get analysts experimenting with Hive/Pig [DvL]