Posts from Engineering: hadoop

Scalding 0.9: Get it while it’s hot!

It’s been just over two years since we open sourced Scalding, and today we are very excited to release version 0.9. At Twitter, Scalding powers everything from internal- and external-facing dashboards to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank and approximate user cosine similarity.

A wide range of new features has been added to Scalding since the last release: Read more…

Dremel made simple with Parquet

Columnar storage is a popular technique for optimizing analytical workloads in parallel RDBMSs. Its performance and compression benefits for storing and processing large amounts of data are well documented in the academic literature and in several commercial analytical databases. Read more…

Streaming MapReduce with Summingbird

Today we are open sourcing Summingbird on GitHub under the Apache License, Version 2.0 (ALv2). Read more…

Announcing Parquet 1.0: Columnar Storage for Hadoop

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop. Read more…

hRaven and the @HadoopSummit

Today marks the start of the Hadoop Summit, and we are thrilled to be a part of it. A few of our engineers will be participating in talks about our Hadoop usage at the summit: Read more…

Dimension Independent Similarity Computation (DISCO)

MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of commodity computers. With a large amount of processing power at hand, it’s very tempting to solve problems by brute force. However, we often combine clever sampling techniques with the power of MapReduce to extend its utility. Read more…

Visualizing Hadoop with HDFS-DU

We are heavy adopters of Apache Hadoop, with a large set of data residing in its clusters, so it’s important for us to understand how these resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to give us an interactive visualization of the underlying Hadoop Distributed File System (HDFS). Read more…