YOW! Data 2020 Day 1

Tue, Jun 30
Timezone: Australia/Sydney (AEST)
08:45

    Session Overviews and Introductions - 15 mins

09:00

    Maryam Jahanshahi - Applying Dynamic Embeddings in Natural Language Processing to track the Evolution of Tech Skills

    09:00 - 09:45 AM, Grand Ball Room 1

    Many data scientists are familiar with word embedding models such as word2vec, which capture the semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models need either significant amounts of data or tuning, through transfer learning, on the domain-specific vocabulary that is unique to most commercial applications.

    In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets that are specialized enough to require significant modifications of a word2vec model and that contain more general data types (including categorical, count and continuous). I will discuss how my team implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last few years. I will compare data science skill sets in US jobs with those in Australian roles, focusing on how tech and data science skill sets have developed, grown and cross-pollinated other types of jobs over time.

09:45

    Break / Q&A with Maryam Jahanshahi - 25 mins

10:10
10:40

    Break / Q&A with Dr Eugene Dubossarsky - 25 mins

11:05
11:35

    Break / Q&A with Greg Roodt - 25 mins

12:00

    Virtual Lunch Break - 60 mins

YOW! Data 2020 Day 2

Wed, Jul 1
Timezone: Australia/Sydney (AEST)
08:45

    Session Overviews and Introductions - 15 mins

09:00

    Karthik Ramasamy - Apache Pulsar: The Next Generation Messaging and Queuing System

    09:00 - 09:45 AM, Grand Ball Room 1

    Apache Pulsar is the next-generation messaging and queuing system with unique design trade-offs driven by the need for scalability and durability. Its two-layered architecture, which separates message storage from serving, led to an implementation that unifies the flexibility and high-level constructs of messaging, queuing and lightweight computing with the scalable properties of log storage systems. This allows Apache Pulsar to be dynamically scaled up or down without any downtime. Using Apache BookKeeper as the underlying data storage, Pulsar guarantees data consistency and durability while maintaining strict SLAs for throughput and latency. Furthermore, Apache Pulsar integrates Pulsar Functions, a lambda-style framework for writing serverless functions that natively process data immediately upon arrival. This serverless stream processing approach is ideal for lightweight processing tasks like filtering, data routing and transformations. In this talk, we will give an overview of Apache Pulsar and delve into its unique architecture for messaging, storage and serverless data processing. We will also describe how Apache Pulsar is deployed in use case scenarios and explain how end-to-end streaming applications are written using Pulsar.

09:45

    Break / Q&A with Karthik Ramasamy - 25 mins

10:10
10:40

    Break / Q&A with Claire Carroll - 25 mins

16:00

    Fabian Hueske - Stream Processing for Everyone with Continuous SQL Queries

    04:00 - 04:45 PM, Grand Ball Room 1

    About four years ago, we started to add SQL support to Apache Flink with the primary goal of making stream processing technology accessible to non-developers. An important design decision in achieving this goal was to provide the same syntax and semantics for continuous streaming queries as for traditional batch SQL queries. Today, Flink runs hundreds of business-critical streaming SQL queries at Alibaba, Criteo, DiDi, Huawei, Lyft, Uber, Yelp, and many other companies. Flink is not the only system providing a SQL interface to process streaming data. There are several commercial and open source systems offering similar functionality. However, the syntax and semantics of the various streaming SQL offerings differ considerably.

    In late 2018, members of the Apache Calcite, Beam, and Flink communities set out to write a paper discussing their joint approach to streaming SQL. We submitted the paper "One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables" to SIGMOD, the world's no. 1 database research conference, and it was accepted. Our goal was to get our approach validated by the database research community and to trigger a wider discussion about streaming SQL semantics. Today, the SQL Standards committee is discussing an extension of the standard to pin down the syntax and semantics of streaming SQL queries.

    In my talk, I will briefly introduce the motivation for SQL queries on streams. I'll present the three-part extension proposal that we discussed in our paper, consisting of (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, and (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Finally, I'll discuss how these concepts are implemented in Apache Flink and show some streaming SQL queries in action.

16:45

    Break / Q&A with Fabian Hueske - 25 mins

17:10

    Virtual Happy Hour - 60 mins

YOW! Data 2020 Day 3

Thu, Jul 2
Timezone: Australia/Sydney (AEST)
08:45

    Session Overviews and Introductions - 15 mins

09:00
09:45

    Break / Q&A with Dean Wampler - 25 mins

10:10

    Dr. Denis Bauer - How COVID-19 has Accelerated the Journey to Data-driven Health Decisions

    10:10 - 10:40 AM, Grand Ball Room 1

    The speed with which COVID-19 has taken over the world has raised the demand for data-driven health decisions, and the shift towards virtual may actually enable the necessary data collection. This session talks about how CSIRO has leveraged cloud-native technologies to advance three areas of the COVID-19 response. Firstly, we worked with GISAID, the largest data resource for the virus causing COVID-19, and used standard health terminologies (FHIR) to help collect clinical patient data. This feeds into a Docker-based workflow that creates identifying “fingerprints” of the virus for guiding vaccine development and investigating whether there are more pathogenic versions of the virus. Secondly, we developed a fully serverless web service for tailoring diagnostic efforts, capable of differentiating between strains. Thirdly, we are creating a serverless COVID-19 analysis platform that allows distributed genomics and patient data to be shared and analysed in a privacy- and ownership-preserving manner, and that functions as a surveillance system for detecting more virulent strains early.

10:40

    Break / Q&A with Dr. Denis Bauer - 25 mins

11:05
11:35

    Break / Q&A with Mat Kelcey - 25 mins