  • Greg Roodt

    Greg Roodt - Data Maturity Levels

    Greg Roodt
    Data Engineering Lead
    Canva
    3 months ago
    Sold Out!
    30 Mins
    Invited Talk
    Intermediate

    At a startup, the main concern is typically survival. Advanced analysis techniques and machine learning are often a luxury, or even a distraction from the prime directive: don't die. However, as a startup grows, its data requirements evolve, and eventually the startup morphs into a larger company where data is a core competitive advantage that drives decision making and product features.

    In this talk, I describe what this evolution looks like and provide a framework for evaluating the data maturity level that a company may be at. The framework applies not only to a growing company but also to a team or department within an already established company.

  • Karthik Ramasamy

    Karthik Ramasamy - Apache Pulsar: The Next Generation Messaging and Queuing System

    45 Mins
    Invited Talk
    Intermediate

    Apache Pulsar is the next-generation messaging and queuing system with unique design trade-offs driven by the need for scalability and durability. Its two-layered architecture, which separates message storage from serving, led to an implementation that unifies the flexibility and high-level constructs of messaging, queuing and lightweight computing with the scalable properties of log storage systems. This allows Apache Pulsar to be scaled up or down dynamically without any downtime. Using Apache BookKeeper as the underlying data storage, Pulsar guarantees data consistency and durability while maintaining strict SLAs for throughput and latency. Furthermore, Apache Pulsar integrates Pulsar Functions, a lambda-style framework for writing serverless functions that process data natively as soon as it arrives. This serverless stream processing approach is ideal for lightweight processing tasks like filtering, data routing and transformations. In this talk, we will give an overview of Apache Pulsar and delve into its unique architecture for messaging, storage and serverless data processing. We will also describe how Apache Pulsar is deployed in use case scenarios and explain how end-to-end streaming applications are written using Pulsar.
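
    As a minimal sketch of the kind of lightweight filtering, routing and transformation mentioned above, the plain Python below applies a lambda-style function to each message as it arrives. It deliberately does not use the Pulsar client or the Pulsar Functions SDK: the `process` function, the topic names and the in-memory `run` driver are all hypothetical stand-ins chosen for illustration.

```python
import json
from collections import defaultdict

# Hypothetical topic names, used only for this illustration.
ALERT_TOPIC = "alerts"
ARCHIVE_TOPIC = "archive"


def process(raw_message):
    """Filter, transform, and route one message.

    Returns (topic, payload), or None to drop the message.
    """
    event = json.loads(raw_message)
    # Filtering: drop events that carry no numeric reading.
    value = event.get("value")
    if not isinstance(value, (int, float)):
        return None
    # Transformation: keep only the fields downstream consumers need.
    payload = {"sensor": event.get("sensor", "unknown"), "value": value}
    # Routing: high readings go to the alert topic, the rest to the archive.
    topic = ALERT_TOPIC if value > 100 else ARCHIVE_TOPIC
    return topic, payload


def run(messages):
    """In-memory stand-in for a broker: call process() on each message in order."""
    outputs = defaultdict(list)
    for raw in messages:
        result = process(raw)
        if result is not None:
            topic, payload = result
            outputs[topic].append(payload)
    return dict(outputs)
```

    Because each message is handled independently and no state is kept between calls, a function of this shape is easy to test in isolation before it is deployed to a real messaging system.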

  • Dean Wampler

    Dean Wampler - Cluster-wide Scaling of Machine Learning with Ray

    45 Mins
    Invited Talk
    Intermediate

    Popular ML techniques like reinforcement learning (RL) and hyperparameter optimization (HPO) require a variety of computational patterns for data processing, simulation (e.g., game engines), model search, training, serving, and other tasks. Few frameworks efficiently support all these patterns, especially when scaling to clusters.

    Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales applications from a laptop to a cluster. It was created to address the needs of reinforcement learning and hyperparameter tuning in particular, but it is broadly applicable to almost any distributed Python-based application, with support for other languages forthcoming.

    I'll explain the problems Ray solves and how Ray works. Then I'll discuss RLlib and Tune, the RL and HPO systems implemented with Ray. You'll learn when to use Ray versus alternatives, and how to adopt it for your projects.

  • Claire Carroll

    Claire Carroll - How to be a more impactful data analyst

    30 Mins
    Invited Talk
    Intermediate

    As the sole analyst in a fast-growing Australian startup, I experienced the pain of the traditional analyst workflow — stuck on a hamster wheel of report requests, Excel worksheets that frequently broke, an ever-growing backlog, and numbers that never quite matched up.

    This story is familiar to almost any analyst. In this talk, I’ll draw on my own experience as well as similar experiences from others in the industry to share how I broke out of this cycle. You’ll learn how you can “scale yourself” by applying software engineering best practices to your analytics code, and how to turn this knowledge into an impactful analytics career.

  • Mat Kelcey

    Mat Kelcey - Self supervised learning & making use of unlabelled data.

    30 Mins
    Invited Talk
    Intermediate

    The general supervised learning problem starts with a labelled dataset. It is common, though, to also have a large collection of unlabelled data. Self-supervision techniques are a way to make use of this data to boost performance. In this talk we'll review some contrastive learning techniques that can be used either to provide weakly labelled data or as a form of pre-training for few-shot learning.
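
    To make the idea of a contrastive objective concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is one common formulation and not necessarily the one covered in the talk; the function names, the use of cosine similarity and the temperature value are assumptions made for illustration. In practice the anchor and the positive would be embeddings of two augmented views of the same unlabelled example, and the negatives would be embeddings of other examples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive pair's similarity.

    The loss is low when the anchor is closer to the positive than to
    any of the negatives, and high otherwise.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

    Minimising this quantity over many anchor/positive pairs pulls views of the same example together and pushes other examples apart, which is what lets the unlabelled data shape the representation without any labels.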

  • Dr. Denis Bauer

    Dr. Denis Bauer - How COVID-19 has Accelerated the Journey to Data-driven Health Decisions

    30 Mins
    Invited Talk
    Intermediate

    The speed with which COVID-19 has taken over the world has raised the demand for data-driven health decisions, and the shift towards virtual may actually enable the necessary data collection. This session talks about how CSIRO has leveraged cloud-native technologies to advance three areas of the COVID-19 response. Firstly, we worked with GISAID, the largest data resource for the virus causing COVID-19, and used standard health terminologies (FHIR) to help collect clinical patient data. This feeds into a Docker-based workflow that creates identifying “fingerprints” of the virus for guiding vaccine development and investigating whether there are more pathogenic versions of the virus. Secondly, we developed a fully serverless web service for tailoring diagnostic efforts, capable of differentiating between strains. Thirdly, we are creating a serverless COVID-19 analysis platform that allows distributed genomics and patient data to be shared and analysed in a privacy- and ownership-preserving manner, and that functions as a surveillance system for detecting more virulent strains early.

  • Dr Eugene Dubossarsky

    Dr Eugene Dubossarsky - The Data Literacy Revolution

    30 Mins
    Invited Talk
    Intermediate

    The popularity and ubiquity of data science, data analytics, AI and the trend towards digital transformation have led to massive, repeated failures in many businesses. Despite billions spent, hundreds of Ph.D.s hired, and much boasting in conference presentations, many enterprises are still struggling to leverage the value of these new technologies. The missing ingredient is the literacy of the rest of the organisation, particularly senior management.

    This presentation will describe this new literacy, “data literacy”, its analogy with computer literacy, and the reasons why this skill set will soon be as essential to all professionals as computer literacy is today. It will address issues of automation, the advent of decision making as the key managerial activity, and the resulting democratisation of AI and analytics, which will nevertheless still require a class of data science and analytics experts. The presentation will address issues of mindset as well as skill set, and the ways in which management engagement with data analytics must change to leverage its value.

  • Maryam Jahanshahi

    Maryam Jahanshahi - Applying Dynamic Embeddings in Natural Language Processing to track the Evolution of Tech Skills

    Maryam Jahanshahi
    Research Scientist
    TapRecruit
    3 months ago
    Sold Out!
    45 Mins
    Invited Talk
    Intermediate

    Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models either need significant amounts of data, or tuning through transfer learning of a domain-specific vocabulary that is unique to most commercial applications.

    In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets that are specialized enough to require significant modifications of a word2vec model and that contain more general data types (including categorical, count and continuous). I will discuss how my team implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last few years. I will compare data science skill sets in US jobs vs Australian roles, specifically focusing on how tech and data science skill sets have developed, grown and pollinated other types of jobs over time.

  • Fabian Hueske

    Fabian Hueske - Stream Processing for Everyone with Continuous SQL Queries

    45 Mins
    Invited Talk
    Intermediate

    About four years ago, we started to add SQL support to Apache Flink with the primary goal of making stream processing technology accessible to non-developers. An important design decision in achieving this goal was to provide the same syntax and semantics for continuous streaming queries as for traditional batch SQL queries. Today, Flink runs hundreds of business-critical streaming SQL queries at Alibaba, Criteo, DiDi, Huawei, Lyft, Uber, Yelp, and many other companies. Flink is not the only system providing a SQL interface for processing streaming data: several commercial and open source systems offer similar functionality. However, the syntax and semantics of the various streaming SQL offerings differ considerably.

    In late 2018, members of the Apache Calcite, Beam, and Flink communities set out to write a paper discussing their joint approach to streaming SQL. We submitted the paper "One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables" to SIGMOD, the world's no. 1 database research conference, and it was accepted. Our goal was to get our approach validated by the database research community and to trigger a wider discussion about streaming SQL semantics. Today, the SQL Standards committee is discussing an extension of the standard to pinpoint the syntax and semantics of streaming SQL queries.

    In my talk, I will briefly introduce the motivation for SQL queries on streams. I'll present the three-part extension proposal that we discussed in our paper, consisting of (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, and (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Finally, I'll discuss how these concepts are implemented in Apache Flink and show some streaming SQL queries in action.
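
    The three parts of the proposal can be illustrated with a small sketch in plain Python. This is a toy model and not Flink's implementation or the paper's syntax: the function names, the `(arrival_time, event_time)` tuple layout, the window size and the hand-supplied watermark are all assumptions made for illustration. The query result is treated as a relation whose contents depend on when you look at it, rows are grouped by event time rather than arrival time, and a watermark decides which results are materialized.

```python
from collections import Counter

WINDOW = 10  # tumbling window size, in the same units as the event timestamps


def window_start(event_time):
    """Start of the tumbling window that an event time falls into."""
    return (event_time // WINDOW) * WINDOW


def relation_at(events, processing_time):
    """Snapshot of a time-varying relation.

    `events` is a list of (arrival_time, event_time) pairs. The result maps
    each window start to the count of events seen by `processing_time`,
    grouped by event time, so late arrivals update earlier windows.
    """
    counts = Counter()
    for arrival, event_time in events:
        if arrival <= processing_time:
            counts[window_start(event_time)] += 1
    return dict(counts)


def finalized(events, processing_time, watermark):
    """Materialize only the windows whose end the watermark has passed."""
    snapshot = relation_at(events, processing_time)
    return {start: n for start, n in snapshot.items() if start + WINDOW <= watermark}
```

    With events `[(1, 1), (3, 2), (12, 11), (14, 8)]`, the last event arrives late for the first window, so the same query yields a different snapshot depending on when it is evaluated, while `finalized` withholds a window's count until the watermark says that window is complete.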
