6-10 June, 2016
Technical Fellow; Head, Big Data Engineering and Cloud Information Services Lab (CISL)
TITLE: Big Data @Microsoft
Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.
Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets.
However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.
While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state-of-the-art, externally facing services such as Azure Data Lake and more. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.
TITLE: Exploring Big Urban Data
Today, 50% of the world's population lives in cities and the number will grow to 70% by 2050. Cities are the loci of economic activity and the source of innovative solutions to 21st century challenges. At the same time, cities are also the cause of looming sustainability problems in transportation, resource consumption, housing affordability, and inadequate or aging infrastructure. The large volumes of urban data, along with vastly increased computing power and improved user interfaces, enable analysts to better understand cities. Encouraging success stories show better operations, more informed planning, improved policies, and a better quality of life for residents. However, analyzing urban data often requires a staggering amount of work, from identifying relevant data sets, through cleaning and integrating them, to performing exploratory analyses over complex, spatio-temporal data.
Our long-term goal is to enable interdisciplinary teams to crack the code of cities by freely exploring the vast amounts of data cities generate. This talk describes challenges which have led us to fruitful research on data management, data analysis, and visualization techniques. I will present methods and systems we have developed to increase the level of interactivity, scalability, and usability for spatio-temporal analyses.
This work was supported in part by the National Science Foundation, a Google Faculty Research award, the Moore-Sloan Data Science Environment at NYU, IBM Faculty Awards, the NYU School of Engineering, and the Center for Urban Science and Progress.
Professor of Computer Science and Engineering and Data Science, New York University
Professor, Computer Science, University of California Santa Cruz
TITLE: Combining Statistics and Semantics to Turn Data into Knowledge
Addressing inherent uncertainty and exploiting structure are fundamental to turning data into knowledge. Statistical relational learning (SRL) builds on principles from probability theory and statistics to address uncertainty while incorporating tools from logic to represent structure.
In this talk I will give an overview of our recent work on probabilistic soft logic (PSL), an SRL framework for collective, probabilistic reasoning in relational domains.
PSL is able to reason holistically about both entity attributes and relationships among the entities, along with ontological constraints.
The underlying mathematical framework supports extremely efficient inference.
Our recent results show that by building on state-of-the-art optimization methods in a distributed implementation, we can solve large-scale knowledge graph extraction problems with millions of random variables orders of magnitude faster than existing approaches.
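As a rough illustration of the kind of soft-logic reasoning described above, the sketch below scores a single weighted rule using Lukasiewicz-style relaxations over truth values in [0, 1], which is how the published PSL formulation is commonly described. The rule, the weights, the predicate values, and the helper names (`l_and`, `distance_to_satisfaction`, `energy`) are all hypothetical, and the grid search merely stands in for the convex optimization a real system would use; this is not the PSL implementation.

```python
# Minimal sketch, assuming Lukasiewicz relaxations of logical connectives.
# All names, weights, and truth values below are illustrative assumptions.

def l_and(*xs):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(xs) - (len(xs) - 1))

def distance_to_satisfaction(body, head):
    """How far the rule 'body -> head' is from being satisfied (a hinge loss)."""
    return max(0.0, l_and(*body) - head)

# Hypothetical rule: Friends(A, B) & Votes(A, P) -> Votes(B, P), weight 2.0
# Hypothetical prior: !Votes(B, P), weight 1.0 (its distance is the value itself)
friends_ab = 0.9
votes_a = 0.8

def energy(votes_b):
    rule = 2.0 * distance_to_satisfaction([friends_ab, votes_a], votes_b)
    prior = 1.0 * votes_b
    return rule + prior

# Coarse grid search over the single unknown, for illustration only.
candidates = [i / 100 for i in range(101)]
best_votes_b = min(candidates, key=energy)
```

Because each term is a hinge function of the unknown truth values, the total energy is convex, which is what makes efficient large-scale inference plausible in this setting.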
Professor at the Faculty of Computer Science of the Free University of Bozen-Bolzano.
Member of the KRDB Research Centre for Knowledge and Data.
TITLE: Reasoning over Evolving Graph-structured Data Under Constraints
Graph-structured data are receiving increased attention in the database community. We argue that Description Logics (DLs), which are studied extensively in Knowledge Representation, are tightly connected to graph-structured data, and indeed provide quite powerful mechanisms for expressing forms of constraints capturing domain knowledge. We draw interesting connections between expressive variants of DLs and path constraints studied in databases, and derive new results on implication of such constraints. We then consider the challenging setting where graph-structured data evolve as a result of update operations that add and delete facts in the style of action languages, under DL constraints.
In this setting, we discuss two fundamental reasoning tasks, considering both lightweight and expressive variants of DLs: verification, i.e., checking the consistency of a sequence of operations with respect to constraints; and plan existence, i.e., existence of a sequence of operations leading to a goal state.
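The two tasks can be sketched on a toy scale as follows. This is a simplified illustration, not the approach of the talk: it treats a constraint as a closed-world check on a finite set of facts (real DL reasoning is open-world and far richer), and it bounds the plan search by length. The facts, the constraint `managers_are_typed`, and the functions `verify` and `plan_exists` are hypothetical names chosen for the example.

```python
from collections import deque

# A graph state is a frozenset of (subject, predicate, object) facts.
# An operation is ("add" | "del", fact). All names are illustrative.

def apply_op(state, op):
    kind, fact = op
    return state | {fact} if kind == "add" else state - {fact}

def verify(state, ops, constraint):
    """True if the start state and every intermediate state satisfy constraint."""
    if not constraint(state):
        return False
    for op in ops:
        state = apply_op(state, op)
        if not constraint(state):
            return False
    return True

def plan_exists(start, ops, constraint, goal, max_len=6):
    """Bounded breadth-first search for a constraint-preserving plan to a goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal(state):
            return plan
        if len(plan) == max_len:
            continue
        for op in ops:
            nxt = apply_op(state, op)
            if nxt not in seen and constraint(nxt):
                seen.add(nxt)
                queue.append((nxt, plan + [op]))
    return None  # no plan found within the length bound

# Hypothetical constraint: whoever manages someone must be typed as Manager.
def managers_are_typed(state):
    return all((s, "type", "Manager") in state
               for (s, p, o) in state if p == "manages")

start = frozenset({("ann", "type", "Employee")})
ops = [("add", ("ann", "manages", "bob")),
       ("add", ("ann", "type", "Manager"))]

def goal(state):
    return ("ann", "manages", "bob") in state

ok_in_given_order = verify(start, ops, managers_are_typed)
ok_reversed = verify(start, ops[::-1], managers_are_typed)
plan = plan_exists(start, ops, managers_are_typed, goal)
```

Here the given order violates the constraint after the first update, while the reversed order is consistent, and the search recovers that reversed order as a plan.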