  • Fabian Hueske - Stream Processing for Everyone with Continuous SQL Queries

    45 Mins
    Keynote
    Intermediate

    About four years ago, we started to add SQL support to Apache Flink with the primary goal of making stream processing technology accessible to non-developers. An important design decision toward this goal was to provide the same syntax and semantics for continuous streaming queries as for traditional batch SQL queries. Today, Flink runs hundreds of business-critical streaming SQL queries at Alibaba, Criteo, DiDi, Huawei, Lyft, Uber, Yelp, and many other companies. Flink is obviously not the only system providing a SQL interface for processing streaming data; there are several commercial and open source systems offering similar functionality. However, the syntax and semantics of the various streaming SQL offerings differ quite a lot.

    In late 2018, members of the Apache Calcite, Beam, and Flink communities set out to write a paper discussing their joint approach to streaming SQL. We submitted the paper "One SQL to Rule Them All – an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables" to SIGMOD, the world's leading database research conference, and it was accepted. Our goal was to have our approach validated by the database research community and to trigger a wider discussion about streaming SQL semantics. Today, the SQL Standards committee is discussing an extension of the standard to define the syntax and semantics of streaming SQL queries.

    In my talk, I will briefly introduce the motivation for SQL queries on streams. I'll present the three-part extension proposal that we discussed in our paper, consisting of (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event-time semantics, and (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Finally, I'll discuss how these concepts are implemented in Apache Flink and show some streaming SQL queries in action.
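
    As a flavour of these ideas, here is a minimal PyFlink sketch of a continuous SQL query with event-time semantics: a tumbling one-hour window over a time-varying relation of click events. Table names, fields, and the connector are illustrative assumptions, not taken from the talk.

```python
# Illustrative PyFlink sketch; names and connector are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A time-varying relation of click events with a watermarked event-time
# attribute; 'datagen' just synthesizes rows for the demo.
t_env.execute_sql("""
    CREATE TABLE Clicks (
        user_name  STRING,
        url        STRING,
        click_time TIMESTAMP(3),
        WATERMARK FOR click_time AS click_time - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')
""")

# The identical query text would be valid batch SQL over a static table;
# over a stream, its result is a continuously updated time-varying relation.
hourly_counts = t_env.sql_query("""
    SELECT user_name,
           TUMBLE_END(click_time, INTERVAL '1' HOUR) AS window_end,
           COUNT(url) AS cnt
    FROM Clicks
    GROUP BY user_name, TUMBLE(click_time, INTERVAL '1' HOUR)
""")
hourly_counts.execute().print()  # runs until cancelled
```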

  • Maryam Jahanshahi - Applying Dynamic Embeddings in Natural Language Processing to track the Evolution of Tech Skills

    Research Scientist, TapRecruit
    45 Mins
    Invited Talk
    Intermediate

    Many data scientists are familiar with word embedding models such as word2vec, which capture the semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models either need significant amounts of data or must be tuned through transfer learning to handle the domain-specific vocabulary that is unique to most commercial applications.

    In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets that are specialized enough to require significant modification of a word2vec model and that contain more general data types (categorical, count, continuous). I will discuss how my team implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both the categorical and natural-language data associated with jobs, we charted the development of different skill sets over the last few years. I will compare data science skill sets in US jobs vs Australian roles, focusing on how tech and data science skill sets have developed, grown, and cross-pollinated other types of jobs over time.
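
    As a rough illustration of the dynamic embedding idea (a toy sketch, not TapRecruit's implementation), each time slice gets its own embedding matrix, tied to its neighbours by a random-walk penalty, while context vectors are shared across time:

```python
# Toy TensorFlow sketch of a dynamic Bernoulli embedding; all sizes and the
# drift weight are illustrative assumptions.
import tensorflow as tf

VOCAB, DIM, T = 5000, 100, 4          # vocabulary size, embedding dim, time slices
rho = tf.Variable(tf.random.normal([T, VOCAB, DIM], stddev=0.1))  # per-slice embeddings
alpha = tf.Variable(tf.random.normal([VOCAB, DIM], stddev=0.1))   # shared context vectors

def loss(t, targets, contexts, labels):
    """Negative-sampling style loss for (target, context) pairs in slice t.

    targets, contexts: int tensors of word ids; labels: 1.0 for observed
    pairs, 0.0 for sampled negatives (float tensor, same shape as logits).
    """
    emb = tf.gather(rho[t], targets)            # target embeddings at time t
    ctx = tf.gather(alpha, contexts)            # shared context embeddings
    logits = tf.reduce_sum(emb * ctx, axis=-1)  # Bernoulli natural parameter
    nll = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    # Random-walk prior: embeddings of successive slices should drift slowly.
    drift = tf.reduce_sum(tf.square(rho[1:] - rho[:-1]))
    return tf.reduce_mean(nll) + 1e-4 * drift
```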

  • Fabian Hueske - Workshop - Stream Processing with Apache Flink

    480 Mins
    Workshop
    Intermediate

    Apache Flink is a distributed stream processor that makes it easy to implement stateful stream processing applications and operate them at scale.

    In this workshop, you will learn the basics of stream processing with Apache Flink. You will implement a stream processing application that ingests events from Apache Kafka (a minimal sketch of such a job follows below) and submit it to a local, Docker-based Flink cluster for execution. You will learn how to manage and operate a continuously running application and how to access job and framework metrics.
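
    A minimal PyFlink sketch of the kind of job described above, reading string events from a Kafka topic. The topic, server, and group names are hypothetical, and the Kafka connector jar must be available to the cluster:

```python
# Hypothetical Kafka-ingesting Flink job; requires the flink-sql-connector-kafka
# jar on the classpath (e.g. via env.add_jars).
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()

consumer = FlinkKafkaConsumer(
    topics='events',                                   # hypothetical topic
    deserialization_schema=SimpleStringSchema(),
    properties={'bootstrap.servers': 'localhost:9092',
                'group.id': 'workshop'},
)

stream = env.add_source(consumer)
stream.print()                 # replace with real transformations and sinks
env.execute('kafka-ingest')
```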

    In the afternoon, we will have a look at Flink's streaming SQL interface. You will submit SQL queries that are evaluated over unbounded data streams, producing results that are continuously updated as more and more data is ingested.

  • Ruby Tahboub - TBA

  • Karthik Ramasamy - TBA

  • J. Rosenbaum - Parenting Neural Networks - what kind of AI do you want to raise?

    Artist
    30 Mins
    Talk
    Intermediate

    You put your bias in, you take your bias out, you put your bias in and you shake it all about! What can we do to turn it all around? Because that's what it's all about. I am a proud parent and a proud trainer of different machine learning systems. From classification to the generation of text and images, I have seen it go beautifully well and go horribly wrong. It's hilarious, rewarding, bewildering and sometimes frustrating, but I wouldn't trade either role for the world, because raising a good neural network is a lot like raising a child. They know only what they have been taught, and you have to have patience and an understanding of the things you wish to teach to ensure they learn the right lessons and become the best versions of themselves. I will explore supervised vs unsupervised learning, image and text generation, image recognition and classification, creativity, bias, and different forms of reinforcement. Come with me on a journey through the training of machines to learn some of the pitfalls and some of the proudest moments of being the parent of a bouncing baby neural network.

  • Fiona Coath - Social Implications of Bias in Machine Learning

    30 Mins
    Talk
    Intermediate

    The adoption of Machine Learning in decision making has amplified the risk of socially-biased outcomes. Everyone working on ML tools holds immense power over shaping the future of our world. However, we can use this power for good and train models that help to drive positive social change sooner. This talk will provide context for this issue, explore real-world examples and discuss ideas for potential solutions.

    Together let's discover:
    - How our datasets and algorithms inherit society's historical social biases.
    - Why this could cause results to be inaccurate and further exaggerate existing discrimination.
    - How we can measure the impact and change this from a risk into a powerful solution.

  • Maulik Soneji / Chakravarthy Varaga - BEAST: Building an event processing library to handle millions of events

    30 Mins
    Talk
    Intermediate

    Building an event processing library comes with its own baggage. At Gojek, we created BEAST, our own event processing library, to consume events from Kafka and push them to BigQuery.

    In this talk, we will cover our learnings around:

    Why we built our own event processing tool, Beast: the limitations of existing systems for our use case, including having to customise code for each input/output combination and our old way of deploying.

    Kafka: quick context on Kafka and its consumers, and why we had to run consumers with auto-commit disabled and commit offsets manually and synchronously (a minimal sketch follows the demo link below).

    Reliability: ensuring no data loss, how we tested the application against data-loss scenarios, and how we monitor and alert on data loss in BigQuery.

    Performance: how we achieved high performance, handling high throughput with acceptable latency.

    Architecture: how the consumer and producer threads communicate through a blocking queue, why we didn't pick Redis as the store, and why we couldn't write it in Go.

    Scalability: how we scale the system using Kubernetes.

    Demo
    (https://github.com/gojek/beast)
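
    For illustration only (BEAST itself is linked above; this is not its code), the consumer-side pattern described in the Kafka and Architecture sections might look like this in Python with kafka-python: auto-commit disabled, a blocking queue between threads, and offsets committed only after the sink has accepted the batch.

```python
# Sketch of manual-commit Kafka consumption with a blocking hand-off queue;
# topic, group, and sink names are hypothetical.
import queue
import threading
from kafka import KafkaConsumer  # pip install kafka-python

batches = queue.Queue(maxsize=10)    # blocking queue between threads

def push_to_bigquery(records):
    pass                             # stand-in for the real BigQuery insert

def sink_worker():
    while True:
        records = batches.get()      # blocks until a batch is available
        push_to_bigquery(records)
        batches.task_done()

threading.Thread(target=sink_worker, daemon=True).start()

consumer = KafkaConsumer(
    'events',                        # hypothetical topic
    bootstrap_servers='localhost:9092',
    group_id='beast-demo',
    enable_auto_commit=False,        # offsets are committed manually
)

for message in consumer:
    batches.put([message.value])     # blocks if the worker falls behind
    batches.join()                   # wait until the batch reaches the sink
    consumer.commit()                # only then acknowledge the offset
```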

  • Hercules Konstantopoulos - A story that starts with Excel and ends with a data science platform.

    30 Mins
    Case Study
    Intermediate

    Data Science is a New Age field for a New Age sector. Tech is firmly on board, but what about traditional industries? Most of the corporate world last updated its OS before data science was even a buzzword! But wait, isn’t the corporate world how most of our basic services operate? Think transport and logistics, energy, government, food... literally keeping the lights on and our stomachs full.

    These industries are overdue for some disruption, but they are taking a while. First, they’re not seen as cool and are having trouble getting talented grads. Second, they are in a different mode of thinking when it comes to data governance: typically they will lock everything down because the (societal or business) risk of leakage is just too high. In such environments it takes a set of behavioural changes to safely steer the boat into the calm waters of data science.

    This is that story.

    With case studies from the public and private sectors, across industries and focus areas, I will share my experience of what works (say, helping people adopt new tools) and what doesn't (e.g., rolling our eyes at IT) in bringing about these transformations. And the wonderful part is that at the end of this story everybody wins.

  • Simon Aubury - Islands in the Stream - What country music can teach us about event driven systems

    30 Mins
    Talk
    Intermediate

    Event-driven systems are all the rage, and it's with good reason that we're witnessing businesses adopting them in a wave of transformation. But before we sail away to another world, let's avoid the common pitfalls of designing and running event-driven systems.

    Islands in the Stream - what Kenny Rogers can teach us about event-driven systems, from the wisdom of a country music classic:

    • Techniques for management of schemas across environments and divisions for the peace unknown of your compliance teams
    • Microservices and the honest love of self-contained services using domain driven design
    • No one in between us; best practices for using declarative high-level events across teams to deliver the right cross functional solution
    • The fine-toothed comb to keep the DBAs happy running CDC against tier-1 production systems
    • And we rely on each other by gracefully handling a mix of traditional batch and newly developed streaming systems

  • Greg Roodt - Building a scalable Data platform at Canva

    Data Engineering Lead, Canva
    30 Mins
    Case Study
    Intermediate

    Canva is a successful Australian startup with millions of users who have created billions of designs.

    The data platform has a 4+ year history, supporting both analytics and data science capabilities. The platform has evolved a lot from the early days of supporting a single data specialist to the current state, where it supports 20+ data analysts and data scientists across a company of 700+ people. Traditionally it took too long to deliver analysis or ship product features using applied data science, but as the platform has evolved it has enabled greater autonomy and reduced cycle time, while scaling to larger team sizes.

    In this talk, I will explore the evolution of Canva's Data Platform. The focus will be on how we've evolved the platform to empower autonomy and improve productivity and compare this with the idea of a Data Mesh. I will look at some of the techniques, tools and technologies we are using as well as the challenges we are still facing.

  • Claire Carroll - How to be a more impactful data analyst

    30 Mins
    Talk
    Intermediate

    As the sole analyst in a fast-growing Australian startup, I experienced the pain of the traditional analyst workflow — stuck on a hamster wheel of report requests, Excel worksheets that frequently broke, an ever-growing backlog, and numbers that never quite matched up.

    This story is familiar to almost any analyst. In this talk, I’ll draw on my own experience as well as similar experiences from others in the industry to share how I broke out of this cycle. You’ll learn how you can “scale yourself” by applying software engineering best practices to your analytics code, and how to turn this knowledge into an impactful analytics career.

  • James Strain / Wai Chee Yau - The hidden costs and trade-offs of DynamoDB at scale

    30 Mins
    Case Study
    Intermediate

    Over the past two years Zendesk has been working on a high volume event streaming customer data platform built on AWS. The primary data store that we initially selected was AWS DynamoDB - a highly scalable, managed NoSQL database. Along the way we encountered a number of challenges developing and operating our platform on top of DynamoDB, which at times made us question our initial decision to use it. As a result, we have re-architected some components of the system to use a traditional relational datastore where we have found it’s a much better fit for the use case.

    In this talk we will share:

    • Practical tips on operating a large-scale, data-intensive service backed by DynamoDB that processes tens of terabytes of data every month
    • Tools and techniques for evaluating DynamoDB vs other data stores
    • Instances where we feel DynamoDB falls short, especially for global multi-tenant software deployments
    • Hidden operational and financial costs to consider when planning a workload backed by DynamoDB

  • Noon van der Silk - Building an efficient labelling pipeline with prodigy

    Director, Braneshop
    30 Mins
    Demonstration
    Advanced

    Prodigy is an extensible tool for easily and efficiently labelling text and images. It also features active learning, which allows it to start making suggestions very quickly while you label. This capability makes it very practical for any in-house data labelling effort. In this talk, we'll see how to use some of Prodigy's out-of-the-box features, and then how to integrate a custom computer-vision model. We'll also look into the Prodigy community and discuss how to productionise it.
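
    As a flavour of that extensibility, here is a minimal sketch of a custom Prodigy recipe (the recipe name, model call, and scoring are illustrative assumptions) that streams images and attaches scores from your own model so uncertain examples can be prioritised for labelling:

```python
# Sketch of a custom Prodigy recipe; names and the model are hypothetical.
import prodigy
from prodigy.components.loaders import Images

def my_model_predict(image):
    """Stand-in for a real computer-vision model returning a confidence."""
    return 0.5

@prodigy.recipe("classify-images")
def classify_images(dataset, source):
    def add_scores(stream):
        for task in stream:
            score = my_model_predict(task["image"])
            # Attach a tentative label plus the model's confidence so the
            # annotator can accept or reject it.
            yield {**task, "label": "INTERESTING", "meta": {"score": score}}

    return {
        "dataset": dataset,               # annotations are saved here
        "stream": add_scores(Images(source)),
        "view_id": "classification",      # built-in accept/reject interface
    }
```

    A recipe like this would typically be started from the command line, along the lines of prodigy classify-images my_dataset ./images -F recipe.py.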

  • Rose Skandari - Failing Asset Prediction in Power Networks using AI

    Senior Data Scientist, Powercor
    30 Mins
    Case Study
    Intermediate

    Utilities, like many other sectors, are collecting huge amounts of data from a variety of internal and external sources. Data from smart meters, SCADA, GIS, LIDAR, asset information, and many other sources, along with concepts such as smart grids, digital substations, and the growth of renewables, make the power industry a natural fit for data analytics.

    In this session, I will explain some of the applications of data science in the energy industry and utilities. I will present a case study of finding failing assets, such as overhead fuses, using anomaly detection techniques on smart meter data. I will cover data preparation, model selection, troubleshooting, and tuning parameters against site inspection results.
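
    As a generic illustration of the approach (not Powercor's actual model), an off-the-shelf anomaly detector such as scikit-learn's IsolationForest can flag meters with unusual consumption profiles as candidates for failing upstream assets:

```python
# Toy anomaly detection over synthetic smart meter profiles.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Rows: meters; columns: mean consumption per hour of day (kWh), synthetic.
profiles = rng.normal(loc=1.0, scale=0.2, size=(500, 24))
profiles[:5] *= 0.1          # a few meters with abnormally low readings

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(profiles)        # -1 marks anomalous meters
suspect_meters = np.where(labels == -1)[0]  # candidates for inspection
print(suspect_meters)
```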

  • Habiba Habiba - Interpretable machine learning on graphs with saliency maps

    Research Engineer, CSIRO Data61
    30 Mins
    Talk
    Intermediate

    Knowing how a machine learning model works is fundamentally the first step towards building better models.

    When it comes to understanding the predictions generated by machine learning models, we must ask: why was a certain prediction made? If we cannot understand or interpret how or why a prediction was made, then it is difficult to trust it, let alone act on it. This talk is about going a step beyond the standard practice of machine learning to actually explain its outcomes.

    Specifically, we’ll discuss interpretability of outcomes of machine learning methods on graph-structured data. A standard Graph Neural Network (GNN) model will solve an archetypal graph machine learning problem of node classification by predicting the class label of a given target node based on both the features of nodes in the graph, and the structure of the graph. The interpretability problem on top of this is: which nodes, edges and features led to this prediction?

    We’ll also explore the use of saliency maps - a technique successfully used in computer vision applications - to interpret the decisions of GNN models in graph-structured data. Using this approach we are essentially answering a counterfactual question: how would the prediction change if these node features were different, or if these edges did not exist? Overall, such interpretations can potentially provide insights into the underlying mechanism of the GNN models.
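
    As a generic sketch of the saliency idea (synthetic data and a toy model, not the speaker's implementation), one can differentiate a target node's class score with respect to all node features in a two-layer GCN:

```python
# Gradient-based saliency for node classification; all shapes are synthetic.
import numpy as np
import tensorflow as tf

N, F, C = 6, 4, 3                              # nodes, features, classes
A = tf.constant(np.eye(N, dtype=np.float32))   # stand-in normalized adjacency
X = tf.Variable(np.random.rand(N, F).astype(np.float32))  # node features
W1 = tf.Variable(tf.random.normal([F, 8]))
W2 = tf.Variable(tf.random.normal([8, C]))

def gcn(features):
    """Toy two-layer GCN: propagate features over A, then classify."""
    hidden = tf.nn.relu(A @ features @ W1)
    return A @ hidden @ W2                     # per-node class logits

target_node, target_class = 2, 1
with tf.GradientTape() as tape:
    score = gcn(X)[target_node, target_class]

# The saliency map: how sensitive the prediction is to each node feature.
saliency = tf.abs(tape.gradient(score, X))
print(saliency.numpy())
```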

  • Rachel Ragell - Pathways to net zero emissions

    Lead Analyst, Kinesis
    30 Mins
    Talk
    Intermediate

    With an ever-growing global climate problem, many cities are setting net zero emissions targets for 2050 or earlier. To reach these targets, cities must reduce emissions from energy, transport and waste caused by residents, businesses and visitors.

    This talk works through the process of assessing possible pathways for cities to achieve net zero emissions. This involves:

    • leveraging available datasets to help cities understand their current emissions profile, including the impact in different locations and across different sectors.
    • modelling land use and emissions growth to understand how the future emissions profile may evolve.
    • exploring the impact that new technologies, such as electric vehicles, or city-led policies, such as higher building standards, will have in reducing emissions.
    • empowering cities to track their progress and easily model necessary amendments to achieve their goals.

  • Yanir Seroussi - Confidence intervals aren't credible, but can you be confident in credible intervals?

    30 Mins
    Talk
    Intermediate

    Confidence intervals are very easy to misuse and misunderstand. Common misconceptions include the beliefs that the confidence level is the probability of the true parameter being in the interval, and that narrower intervals indicate more precise knowledge about the parameter. Bayesian credible intervals are often promoted as an alternative to confidence intervals, but they suffer from their own set of problems. This talk gives a brief overview of issues with confidence intervals, why credible intervals tend to be a better choice, and how they can be used in a confident manner.
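
    A small worked example of the contrast: for a binomial proportion, a frequentist 95% confidence interval (Wald approximation) versus a Bayesian 95% credible interval under a uniform Beta(1, 1) prior.

```python
# Frequentist vs Bayesian intervals for a binomial proportion.
from scipy import stats

k, n = 12, 40                     # successes, trials
p_hat = k / n

# Frequentist: normal-approximation (Wald) confidence interval.
z = stats.norm.ppf(0.975)
se = (p_hat * (1 - p_hat) / n) ** 0.5
ci = (p_hat - z * se, p_hat + z * se)

# Bayesian: posterior is Beta(1 + k, 1 + n - k); take the central 95% mass.
posterior = stats.beta(1 + k, 1 + n - k)
credible = (posterior.ppf(0.025), posterior.ppf(0.975))

print(f"95% confidence interval: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% credible interval:   ({credible[0]:.3f}, {credible[1]:.3f})")
# Only the credible interval supports the statement "the parameter lies in
# this interval with 95% probability" -- and only under the stated prior.
```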

  • Xin Liang - Getting Over the Boring Stuff Quicker - Building a Semi-Automated Speech Audio Annotation Tool

    Machine Learning Engineer, Eliiza
    30 Mins
    Talk
    Intermediate

    Developing a new deep learning model requires a large amount of data to be collected and annotated. While the process of data collection can be expedited by making use of publicly available data, it can be time-consuming to annotate and label the large amounts of data needed to train a high-accuracy model.

    Annotation tools for audio data, especially speech data, are currently very limited. This talk explores the development of a tool that takes a novel 'semi-automated' approach to speech audio annotation. This new approach streamlines the normally monotonous process of manual annotation by creating a modular system and graphical interface. It combines manual human annotation with automated annotation that leverages a mixture of technologies, including pre-trained models (such as Mozilla DeepSpeech), existing speech-recognition APIs (such as the Google Cloud Speech API), and model training-inference loops.
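
    As an illustration of the automated half of such a pipeline (a sketch, not the actual tool), a draft transcript could be pre-filled with the Google Cloud Speech-to-Text API for a human annotator to correct:

```python
# Pre-filling draft transcripts with Google Cloud Speech-to-Text; the file
# path and audio parameters are illustrative assumptions.
from google.cloud import speech  # pip install google-cloud-speech

def draft_transcript(wav_path):
    client = speech.SpeechClient()  # assumes credentials are configured
    with open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-AU",
    )
    response = client.recognize(config=config, audio=audio)
    # Return the top hypothesis per utterance for a human to review and fix.
    return [r.alternatives[0].transcript for r in response.results]
```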

    The talk will discuss the concepts and building blocks of such a semi-automated pipeline for data annotation. A live demo of the annotation interface will be shown.

  • Xuanyi Chew - Yepoko Lessons For Machine Learning on Small Data

    Chief Data Scientist, Ordermentum
    30 Mins
    Talk
    Intermediate

    Let's face it, in most companies, the amount of good data available to perform machine learning is very small. Most data are small data. So how can we do good machine learning on small data?

    In this talk, I present a problem, followed by two methods of machine learning: one using deep learning and BERT, the other using simpler techniques. Then I compare and contrast them. Finally, I present lessons for dealing with small data.
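
    As a sketch of the "simpler techniques" side of such a comparison (placeholder data, not the talk's dataset), a TF-IDF plus logistic regression baseline often holds its own against BERT on small training sets:

```python
# Minimal small-data text-classification baseline with placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, fast delivery", "never arrived, want refund",
         "works as described", "broken on arrival"]
labels = [1, 0, 1, 0]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(),
)
baseline.fit(texts, labels)
print(baseline.predict(["arrived quickly and works"]))
```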
