21–24 Feb 2018
Bonn
Europe/Zurich timezone

Large-Scale Machine Learning at Twitter

Not scheduled
15m
50 (Bonn)

Speaker

Mr Lin Jimmy (Twitter)

Description

The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
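
As a rough illustration of why stochastic gradient descent and ensembles scale out well, the following minimal Python sketch trains several online logistic regression models, each on its own share of a data stream, and averages their predictions. It is not the Pig-based implementation described in the abstract: the class and function names, the toy data, the learning rate, and the random assignment of examples to models are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Hypothetical binary classifier updated one example at a time by SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, y):
        # For a single example, the log-loss gradient is (p - y) * x.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def train_ensemble(examples, n_models, n_features, seed=0):
    """Single pass over the data; each example updates one randomly chosen model.

    Assumption: random assignment stands in for partitioning the data
    across independent workers.
    """
    rng = random.Random(seed)
    models = [OnlineLogisticRegression(n_features) for _ in range(n_models)]
    for x, y in examples:
        rng.choice(models).update(x, y)
    return models


def ensemble_predict(models, x):
    """Average the members' probabilities and threshold at 0.5."""
    avg = sum(m.predict_proba(x) for m in models) / len(models)
    return 1 if avg >= 0.5 else 0


def make_toy_data(n, seed=1):
    """Toy, linearly separable data: label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1 if x[0] + x[1] > 0 else 0))
    return data


if __name__ == "__main__":
    models = train_ensemble(make_toy_data(2000), n_models=3, n_features=2)
    print(ensemble_predict(models, [0.8, 0.8]), ensemble_predict(models, [-0.8, -0.8]))
```

Because each model only ever sees one example at a time and never needs the others' state during training, the members could in principle be trained independently on separate partitions and combined only at prediction time.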

Author

Mr Lin Jimmy (Twitter)
