Posts from Engineering: hadoop

Scalding 0.9: Get it while it’s hot!

Thursday, April 3, 2014 | By P. Oscar Boykin (@posco) [20:36 UTC]

Tags:

hadoop, open source, and scala

It’s been just over two years since we open sourced Scalding, and today we are very excited to release the 0.9 version. Scalding at Twitter powers everything from internal- and external-facing dashboards to custom relevance and ad targeting algorithms, including many graph algorithms such as PageRank, approximate user cosine similarity and many more.

A wide breadth of new features has been added to Scalding since the last release: Read more…

Dremel made simple with Parquet

Wednesday, September 11, 2013 | By Julien Le Dem (@J_) [16:04 UTC]

Tags:

hadoop, java, and open source

Columnar storage is a popular technique to optimize analytical workloads in parallel RDBMSs. The performance and compression benefits for storing and processing large amounts of data are well documented in academic literature as well as in several commercial analytical databases. Read more…

Streaming MapReduce with Summingbird

Tuesday, September 3, 2013 | By Sam Ritchie (@sritchie) [15:47 UTC]

Tags:

hadoop, open source, and scala

Today we are open sourcing Summingbird on GitHub under the ALv2. Read more…

Announcing Parquet 1.0: Columnar Storage for Hadoop

Tuesday, July 30, 2013 | By Dmitriy Ryaboy (@squarecog) [17:41 UTC]

Tags:

hadoop and open source

In March we announced the Parquet project, the result of a collaboration between Twitter and Cloudera intended to create an open-source columnar storage format library for Apache Hadoop. Read more…

hRaven and the @HadoopSummit

Wednesday, June 26, 2013 | By Hadoop @Twitter (@twitterhadoop) [14:33 UTC]

Tags:

hadoop and open source

Today marks the start of the Hadoop Summit, and we are thrilled to be a part of it. A few of our engineers will be participating in talks about our Hadoop usage at the summit: Read more…

Dimension Independent Similarity Computation (DISCO)

Monday, November 12, 2012 | By Twitter (@twitter) [21:32 UTC]

Tags:

hadoop and research

MapReduce is a programming model for processing large data sets, typically used to do distributed computing on clusters of commodity computers. With a large amount of processing power at hand, it’s very tempting to solve problems by brute force. However, we often combine clever sampling techniques with the power of MapReduce to extend its utility. Read more…

Visualizing Hadoop with HDFS-DU

Tuesday, August 7, 2012 | By Chris Aniszczyk (@cra) [18:06 UTC]

Tags:

hadoop, java, and open source

We are a heavy adopter of Apache Hadoop with a large set of data that resides in its clusters, so it’s important for us to understand how these resources are utilized. At our July Hack Week, we experimented with developing HDFS-DU to provide us with an interactive visualization of the underlying Hadoop Distributed File System (HDFS). Read more…