ODSC India 2020
Tue, Nov 24
Timezone: Asia/Kolkata (IST)
Opening Keynote - 45 mins
Welcome Note - 15 mins
Coffee Break - 15 mins
Kuldeep Jiwani - Non-Parametric PDF estimation for advanced Anomaly Detection
Anomaly detection has been one of the most sought-after analytical solutions for businesses operating in Network Operations, Service Operations, Manufacturing, and many other sectors where continuity of operations is essential. Any degradation in operational service, or an outage, implies high losses and possible customer churn. The data in such real-world applications is generally noisy, has complex patterns, and is often correlated.
Techniques like auto-encoders are available for modelling complex patterns, but they cannot explain the cause in the original feature space. Traditional univariate anomaly detection techniques use z-score and p-value methods, which rely on unimodality and the choice of a correct parametric form. If these assumptions are not satisfied, there will be a high number of false positives and false negatives.
This is where the need arises for estimating a PDF (Probability Density Function) without assuming a prior parametric form, i.e. a non-parametric approach. The PDF needs to be modelled as close to the true distribution as possible; that is, it should have low bias and low variance to avoid both over-smoothing and under-smoothing. Only then do we stand a good chance of identifying true anomalies.
Approaches like KDE (Kernel Density Estimation) assist in such non-parametric estimation. Research shows that the choice of kernel matters less than the choice of bandwidth for a good PDF estimate. The default bandwidth selection techniques used in both Python and R packages over-smooth the PDF and are not suitable for anomaly detection.
We will explain another method, in which we optimise a cost function based on modelling the Gaussian kernel via the FFT (Fast Fourier Transform) to obtain an appropriate bandwidth. We will then show how to apply it to anomaly detection even when the data is multi-modal (has multiple peaks) and the distribution can be of any shape.
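As a rough illustration of non-parametric bandwidth selection (not the FFT cost function of the talk; this sketch uses a leave-one-out log-likelihood over a small bandwidth grid, with synthetic bimodal data):

```python
import math, random

def gaussian_kde(data, h):
    """Return a PDF estimate using a Gaussian kernel with bandwidth h."""
    n = len(data)
    def pdf(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) \
               / (n * h * math.sqrt(2 * math.pi))
    return pdf

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood: one possible cost for bandwidth selection."""
    total = 0.0
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]
        dens = gaussian_kde(rest, h)(x)
        total += math.log(max(dens, 1e-300))  # numerical floor for isolated points
    return total

random.seed(0)
# Bimodal sample: a unimodal parametric fit would mask one of the modes.
sample = [random.gauss(0, 1) for _ in range(100)] + \
         [random.gauss(6, 1) for _ in range(100)]
best_h = max([0.1, 0.3, 0.5, 1.0, 2.0], key=lambda h: loo_log_likelihood(sample, h))
pdf = gaussian_kde(sample, best_h)
# Density near the modes should exceed density in the valley between them.
print(best_h, pdf(0.0), pdf(3.0))
```

A well-chosen bandwidth keeps both modes visible; an over-smoothed one would flatten the valley and hide true anomalies sitting there.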
Based on a research paper under publication, "Optimal Kernel Density Estimation using FFT based cost function", currently scheduled for ICDM 2020, New York.
Akshay Bahadur - Indian Sign Language Recognition (ISLAR)
Sample this: two cities in India, Mumbai and Pune, though only 80 km apart, have distinctly varied spoken dialects. Stranger still, their sign languages are also distinct, with some very varied signs for the same objects/expressions/phrases. While regional diversification in spoken languages and scripts is well known and widely documented, it has apparently percolated into sign language as well, essentially resulting in multiple sign languages across the country. To help overcome these inconsistencies and to standardize sign language in India, I am collaborating with the Centre for Research and Development of Deaf & Mute (an NGO in Pune) and Google, adopting a two-pronged approach: a) I have developed an Indian Sign Language Recognition system (ISLAR) which utilizes artificial intelligence to accurately identify signs and translate them into text/vocals in real-time, and b) I have proposed standardization of sign languages across India to the Government of India and the Indian Sign Language Research and Training Centre.
As previously mentioned, the initiative aims to develop a lightweight machine-learning model, for 14 million speech- and hearing-impaired Indians, that is suitable for Indian conditions, along with the flexibility to incorporate multiple signs for the same gesture. More importantly, unlike other implementations, which utilize additional external hardware, this approach, which requires only a common surgical glove and a ubiquitous smartphone camera, has the potential for hardware-related savings at an all-India level. ISLAR received great attention from the open-source community, with Google inviting me to its India and global headquarters in Bangalore and California, respectively, to share my work with the TensorFlow team.
Gunjan Dewan - Developing a match-making algorithm between customers and Go-Jek products!
20+ products. Millions of active customers. An insane amount of data and a complex domain. Join me in this talk to hear the journey we at Gojek took to predict which of our products a user is most likely to use next.
A major problem we faced, as a company, was targeting our customers with promos and vouchers that were relevant to them. We developed a generalized model that takes into account the transaction history of users and gives a ranked list of our services that they are most likely to use next. From here on, we are able to determine the vouchers that we can target these customers with.
In this talk, I will cover how we used recommendation engines to solve this problem, the challenges we faced along the way, and the impact it had on our conversion rates. I will also cover the different iterations we went through and how our problem statement evolved as we were solving the problem.
Debanjana Banerjee - CRESST: Complete Rare Event Specification using Stochastic Treatment
In today's fast-moving world, rare events are becoming increasingly common. From incidents of safety hazards to transaction fraud, they all fall under the umbrella of rare events. Identifying and studying rare events becomes crucially important, particularly when the underlying event concerns a sensitive or adverse issue. Notably, despite the probability of occurrence being very close to zero, the potential specification of the rare event can be quite extensive. For example, within the parent rare event of Product Safety, there could be multiple types of potential hazard (fire, electrical, pharmaceutical, etc.), rendering the sub-classes rarer still. In this talk, we discuss a novel algorithm designed to study a rare event and its sub-classes over time, with a primary focus on forecasting and detecting anomalies.
The anomalies studied here are relative anomalies i.e., they may not contribute to the long-term trend of the rare time series but represent deviation from the base state as seen in the immediate past.
Relevant Sister Classes
Count Time Series
Non Homogeneous Poisson Process
Discrete Space Optimization
Sparsity Treatment in Text
Dynamic Time Warping
Relative Local Density
Density Based Clustering
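The non-homogeneous Poisson process listed above can be simulated with Lewis-Shedler thinning; the intensity function below is hypothetical:

```python
import random

def simulate_nhpp(rate_fn, rate_max, horizon, rng):
    """Lewis-Shedler thinning: simulate event times of a non-homogeneous
    Poisson process with intensity rate_fn(t) <= rate_max on [0, horizon]."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_max)            # candidate from homogeneous process
        if t > horizon:
            return events
        if rng.random() < rate_fn(t) / rate_max:  # accept with prob rate(t)/rate_max
            events.append(t)

rng = random.Random(42)
# Hypothetical intensity: a rare event becoming slightly more common over time.
rate = lambda t: 0.5 + 0.05 * t
events = simulate_nhpp(rate, rate_max=0.5 + 0.05 * 100, horizon=100, rng=rng)
# Expected count = integral of the rate over [0, 100] = 50 + 250 = 300.
print(len(events))
```

Binning the simulated event times per day yields exactly the kind of count time series the talk studies.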
Rajesh Shreedhar Bhat / Pranay Dugar - Text Extraction from Images using deep learning techniques
Extracting text of various sizes, shapes, and orientations from images containing multiple objects is an important problem in many contexts, especially in connection with e-commerce, augmented-reality assistance systems in natural scenes, content moderation on social media platforms, etc. Text from an image can be a richer and more accurate source of data than human input, and it can be used in several applications like attribute extraction, profanity checks, etc.
Typically, text extraction is achieved in two stages:
Text detection: this module identifies the regions of the input image where text is present.
Text recognition: given the regions of the image where text is present, this module extracts the raw text from them.
In this session, I will talk about character-level text detection for detecting normal and arbitrarily shaped text. Later, we will discuss the CRNN-CTC network and the need for CTC loss to obtain the raw text from the images.
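The final decoding step of a CRNN-CTC pipeline can be sketched with greedy (best-path) CTC decoding; the per-frame probabilities below are made up:

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Toy per-frame distributions over [blank, 'c', 'a', 't'] (hypothetical numbers).
alphabet = ["-", "c", "a", "t"]
frames = [
    [0.1, 0.8, 0.05, 0.05],  # c
    [0.1, 0.8, 0.05, 0.05],  # c (repeat, collapsed)
    [0.7, 0.1, 0.1, 0.1],    # blank
    [0.1, 0.05, 0.8, 0.05],  # a
    [0.1, 0.05, 0.05, 0.8],  # t
    [0.1, 0.05, 0.05, 0.8],  # t (repeat, collapsed)
]
print(ctc_greedy_decode(frames, alphabet))  # "cat"
```

This collapse-and-drop rule is exactly why CTC loss lets the recognizer emit variable-length text without frame-level character alignment.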
Ashay Tamhane - Food Recommendation at Swiggy
Do you like to explore new dishes every time you order food? Or do you stick to your usual favourites? Do you like a Raita with your Biryani, or just a soft drink? Do you prefer veg or non veg? Do these preferences change depending on time of day or day of the week?
These are among the several questions that need to be automatically inferred from data while building a food recommendation engine at scale. From ranking dishes on the menu to suggesting complementary dishes for the cart, data science is at the very core of constantly improving our suggestions. In this talk, we will dive into multiple scenarios that arise while building a food recommendation system, such as handling data sparsity and handling new restaurants and users, among other challenges that crop up due to scale. We will look at the different algorithms that need to be explored to handle these challenges effectively.
Venkata Pingali - Privacy-Law Aware ML Data Preparation
The new PDP (Personal Data Protection) Law, which is similar to GDPR
and CCPA, is being implemented in India. All enterprise data services
within the scope of the law, including analytics and data science, are
required to comply with it. Almost all major geographies have now
passed similar laws, and the expectation of responsible data handling
from organizations is also increasing.
Enrich, our product, is a high-trust data preparation platform for
enterprises that provides data input to analysts and models at scale
every day. Such data preparation services are on organizations'
compliance and privacy-activity critical path because of their
'fan-out' nature, and they provide a convenient location to enforce
policy and safety mechanisms.
In this talk we discuss some of the mechanisms that we are building
for clients in Enrich. They include an open-source compliance
checklist to help with the process, a 'right to forget' service built
on an anonymized lookup-key service, and a metadata service to enable
tracking of datasets. The focus will be on the generic capabilities,
not on Scribble or our product.
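One way such a 'right to forget' lookup-key service might work can be sketched as keyed-hash pseudonymization; the class and method names below are hypothetical illustrations, not Enrich's API:

```python
import hashlib, hmac, secrets

class ForgetService:
    """Sketch: map user IDs to anonymized keys via a per-user secret.
    Deleting the secret makes stored pseudonyms unlinkable ('right to forget')."""
    def __init__(self):
        self._salts = {}  # user_id -> secret salt (in practice, a secure store)

    def pseudonym(self, user_id: str) -> str:
        salt = self._salts.setdefault(user_id, secrets.token_bytes(16))
        return hmac.new(salt, user_id.encode(), hashlib.sha256).hexdigest()

    def forget(self, user_id: str) -> None:
        self._salts.pop(user_id, None)  # existing pseudonyms become unlinkable

svc = ForgetService()
p1 = svc.pseudonym("user-42")
assert svc.pseudonym("user-42") == p1   # stable while the salt is kept
svc.forget("user-42")
assert svc.pseudonym("user-42") != p1   # new salt: old records are unlinkable
```

Datasets downstream store only the pseudonym, so "forgetting" is a single key deletion rather than a scan of every derived dataset.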
Piyush Arora - Natural Language Querying for Industry Grade Data Analytics Systems
This talk focuses on querying industry-grade big data systems. Enterprises have vast amounts of information spread across structured data stores (relational databases, data warehouses, etc.). Descriptive analytics over this data is limited to experts familiar with complex query languages (e.g., Structured Query Language) as well as the metadata and schema associated with such large data stores. The ability to convert natural language questions to SQL statements would make descriptive analytics and reporting much easier and more widespread. The problem of automatically converting natural language questions to SQL is well studied as Natural Language Interface to Databases (NLIDB). We present our work on an end-to-end (E2E) NLIDB system.
We describe two main aspects of E2E NLIDB systems: i) converting natural language to structured language and ii) understanding natural language. Such E2E systems have applications across many domains, e.g., healthcare, finance, and logistics.
Priyanshu Jain - Automated Ticket Routing for Large Enterprises
Large enterprises that provide services to consumers may receive millions of customer complaint tickets every month. Handling these tickets on time is very critical, as this directly impacts the quality of service and network efficiency.
A ticket may be assigned to multiple teams before it gets resolved. Assigning a ticket to the appropriate group is usually done manually, as the complaint information provided by the customer is not very specific and may sometimes be inaccurate. This manual process incurs enormous labor costs and is very time-inefficient, as each ticket may sit in a queue for hours.
In this talk, we will present an approach to completely automate the process of ticket routing. We will start by discussing how Markov chains can model the flow of tickets across different teams. Next, we will discuss feature engineering and why factorization machine models are essential for such a use case. This will be followed by a discussion on learning decision rule sets in a supervised manner. These decision rules can be used to route tickets across multiple teams in an automated fashion, thus automating the complete process of ticket routing. We will also discuss how the proposed framework can be validated easily by SMEs, unlike other AI solutions, resulting in its quick acceptance within an organization. Finally, we will go through the different settings in which this solution can fit, demonstrating its broad applicability.
The framework can provide substantial cost savings to enterprises. It can also significantly reduce ticket response time by almost eliminating queue time. Overall, it can help large enterprises in:
1. Saving costs by reducing the workforce of the ticket-handling team
2. Increasing revenue by improving the quality of the customer experience
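The Markov-chain view of ticket flow can be sketched by estimating transition probabilities from historical routing logs; the team names and logs below are hypothetical:

```python
from collections import Counter, defaultdict

def estimate_transitions(ticket_paths):
    """Estimate a Markov transition matrix from observed team-to-team hops."""
    counts = defaultdict(Counter)
    for path in ticket_paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {src: {dst: c / sum(ctr.values()) for dst, c in ctr.items()}
            for src, ctr in counts.items()}

# Hypothetical routing logs: each list is the sequence of teams a ticket visited.
logs = [
    ["L1", "Network", "Resolved"],
    ["L1", "Billing", "Resolved"],
    ["L1", "Network", "FieldOps", "Resolved"],
    ["L1", "Network", "Resolved"],
]
P = estimate_transitions(logs)
print(P["L1"])       # {'Network': 0.75, 'Billing': 0.25}
print(P["Network"])  # Resolved ~2/3, FieldOps ~1/3
```

With such a matrix, the most probable resolution path for a new ticket is simply the highest-probability walk from its entry team.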
Kuldeep Singh - Simplify Experimentation, Deployment and Collaboration for ML and AI Models
Machine Learning and AI have changed the way businesses operate. However, the data science community still lacks good practices for organizing projects, collaborating effectively, and experimenting quickly to reduce "time to market".
During this session, we will learn about one such open-source tool, DVC,
which can help you make ML models shareable and reproducible.
It is designed to handle large files, datasets, machine learning models, and metrics, as well as code.
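A typical DVC workflow looks roughly like this (a configuration sketch; the bucket URL and file paths are placeholders):

```shell
# Initialize DVC alongside git in a project.
git init && dvc init

# Track a large data file with DVC instead of git;
# this writes a small data/train.csv.dvc pointer file.
dvc add data/train.csv
git add data/train.csv.dvc .gitignore
git commit -m "Track training data with DVC"

# Point at a shared remote (placeholder bucket) and upload the data.
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push

# Teammates clone the repo and fetch the exact same data version.
dvc pull
```

Because the `.dvc` pointer files are versioned in git, checking out an old commit and running `dvc pull` reproduces the exact data and model state of that experiment.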
Darshan Ganji / Deepesh Agrawal - On-Demand Accelerating Deep Neural Network Inference via Edge Computing
Deep neural networks are both computationally and memory intensive, making them difficult to deploy on mobile phones and embedded systems with limited hardware resources, and increasing the time needed for inference and training. For many mobile-first companies such as Baidu and Facebook, various apps are updated via different app stores, and they are very sensitive to the size of the binary files. For example, the App Store has the restriction that "apps above 100 MB will not download until you connect to Wi-Fi". As a result, a feature that increases the binary size by 100 MB will receive much more scrutiny than one that increases it by 10 MB. It is challenging to run computation-intensive DNN-based tasks on mobile devices due to limited computational resources.
This talk introduces the Algorithms and Hardware that can be used to accelerate the Inferencing or reduce the latency of deep learning workloads. We will discuss how to compress the Deep Neural Networks and techniques like Graph Fusion, Kernel Auto-Tuning for accelerating inference, as well as Data and model parallelization, automatic mixed precision, and other techniques for accelerating training. We will also discuss specialized hardware for deep learning such as GPUs, FPGAs, and ASICs, including the Tensor Cores in NVIDIA’s Volta GPUs as well as Google’s Tensor Processing Units (TPUs). We will also discuss the Deployment of the Large Size Deep Learning Models on the Edge devices like NVIDIA Jetson Nano, Google's Edge TPU(Coral).
Keywords: Graph Optimization, Tensor Fusion, Kernel Auto Tuning, Pruning, Weight sharing, quantization, low-rank approximations, binary networks, ternary networks, Winograd transformations, data parallelism, model parallelism, mixed precision, FP16, FP32, model distillation, Dense-Sparse-Dense training, NVIDIA Volta, Tensor Core, Google TPU.
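One of the listed compression techniques, int8 quantization, can be illustrated framework-free; this is a plain symmetric post-training scheme, not any specific library's implementation:

```python
def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)         # int8 values: 4x smaller than float32 storage
print(max_err)   # rounding error, bounded by scale / 2
```

The same idea, applied per-channel with calibrated scales, is what lets Tensor Cores and Edge TPUs run int8 inference with minimal accuracy loss.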
Vinayaka Mayura G G - Metamorphic Testing for Machine Learning Models with Search Relevancy Example
The accuracy of a model can be improved at several levels, across multiple variables, boundaries, and guidelines. Even with a well-known problem statement and solution, it is difficult to evaluate whether the model would predict expected outcomes for all given cases. Machine learning models often solve problems for which the correct results are unknown. This gives rise to the test-oracle problem. Recent surveys and work have shown that this difficulty can be reduced by black-box testing techniques such as metamorphic testing, fuzzing, and dual coding.
Even though the output of a model is not known, we can make predictions based on metamorphic relations. A metamorphic relation is a relationship between a change in the software's input and the change in its output across multiple program executions. Many metamorphic relations are created by transforming the training or test data set. We further classify them into coarse-grained and fine-grained data transformations.
We will discuss different transformations, go through the example of a search relevancy problem, and analyse how metamorphic testing can verify the machine learning model built.
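One simple metamorphic relation for a search-relevancy model can be sketched as follows; the toy overlap scorer and documents are hypothetical, and the relation checked is that permuting the corpus must not change the top hit:

```python
def search(query, docs):
    """Toy relevance scorer: rank documents by query-term overlap."""
    qt = set(query.lower().split())
    return sorted(docs, key=lambda d: len(qt & set(d.lower().split())), reverse=True)

def check_permutation_relation(query, docs):
    """Metamorphic relation: rotating the corpus must not change the top hit.
    No oracle for the 'correct' ranking is needed, only output consistency."""
    top = search(query, docs)[0]
    return all(search(query, docs[i:] + docs[:i])[0] == top
               for i in range(len(docs)))

docs = ["fast anomaly detection", "cooking with rice", "scaling detection systems"]
print(check_permutation_relation("anomaly detection", docs))  # True
```

A violation of such a relation flags ranking instability (e.g. untied score handling) without ever knowing the ground-truth relevance.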
LunchBreak - 60 mins
POOJA BALUSANI - Model Interpretability and Explainable AI in Manufacturing
In this talk, we present an industrial use case on “anomaly detection” in steel mills based on IoT sensor data. In large steel mills and manufacturing plants, the top reasons for unplanned downtime are:
• Failure of critical asset
• Quality spec of the end product in line not being met
• Operational limits outside the recommended range (e.g. process, human-safety, equipment-safety, etc.)
Unplanned downtime or line stoppage leads to loss of production or throughput and revenue loss.
Anomaly detection can serve as an early warning system, providing alerts on anomalous behavior that could be detrimental to equipment health or affect process quality. In this work, we perform multivariate anomaly detection on time-series sensor data in a steel mill to help maintenance engineers and process operators take proactive actions and reduce plant downtime. An anomaly is presented to the customer in terms of:
• “time-intervals” – startTime: endTime chunks that exhibit deviant behavior
• “anomaly-state” – type association of anomaly to a specific pattern or cluster state
• “anomaly-contribution” – priority association to sensor signals that exhibited deviant behavior within the multi-variate list (more like signal importance)
We shall introduce the approach, in which we reformulate the unsupervised model as a supervised formulation in order to incorporate SHAP, LIME, and other explainability tools. We shall illustrate the steps to provide the above-mentioned metadata for an anomaly, making it explainable and consumable for the end customer.
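The "anomaly-contribution" idea can be sketched with a deliberately simple stand-in: ranking sensors by their absolute z-score at the anomalous timestamp (the talk's SHAP/LIME treatment is more principled; the sensor names and readings below are hypothetical):

```python
import statistics

def anomaly_contributions(history, observation):
    """Per-sensor |z-score| at an anomalous timestamp: a crude stand-in for
    'anomaly-contribution' ranking (signal importance) in a multivariate alert."""
    contrib = {}
    for sensor, values in history.items():
        mu, sd = statistics.fmean(values), statistics.stdev(values)
        contrib[sensor] = abs(observation[sensor] - mu) / sd
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical mill sensors: temperature drifted, the rest are nominal.
history = {
    "temp_C":   [710, 712, 709, 711, 710, 713, 708, 711],
    "rpm":      [1480, 1495, 1502, 1490, 1488, 1499, 1493, 1485],
    "vib_mm_s": [2.1, 2.3, 2.0, 2.2, 2.1, 2.4, 2.2, 2.0],
}
obs = {"temp_C": 770, "rpm": 1491, "vib_mm_s": 2.2}
for sensor, score in anomaly_contributions(history, obs):
    print(sensor, round(score, 1))  # temp_C ranks first by a wide margin
```

The ranked list is what the operator consumes: which signal to inspect first when the multivariate alert fires.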
Ujwala Musku - Supply Path Optimization in Video Advertising Landscape
In the programmatic era, with many players in the market, it is quite complex for a buyer to reach the destination (the advertising slot) from the source (the publisher). Auction duplication, internal deals between DSPs and SSPs, and fraudulent activities are making an already complex route even more complex day by day. For these reasons, it is fairly evident that a single impression is sold through multiple routes by multiple sellers at multiple prices. The dilemma that has emerged recently is: which route/path should the buyer choose, and what is a fair price to pay?
In this talk, we will discuss a framework that solves the problem of choosing the best path at the right price in programmatic video advertising. We will first give an overview of the different approaches tried, i.e., clustering, classification modelling, DEA, and scoring based on classification modelling. Of these, DEA and the scoring methodology gave better results, so a detailed comparison of results, and why these approaches worked better, will be illustrated. The final framework combines the two best-performing techniques: 1. Data Envelopment Analysis (DEA) and 2. scoring based on classification modelling. DEA is a non-parametric method used to rank an unsupervised dataset of supply paths by estimating their relative efficiencies; these efficiencies are calculated by comparing all possible production frontiers of the decision-making units (here, supply paths). The scoring method, a statistical and machine learning hybrid, calculates a score for each supply path, helping us decide whether a path is worth bidding on.
The results of these models are compared with each other to choose the best one based on campaign KPIs, i.e., CPM (cost per 1,000 impressions) and CPCV (cost per completed view of the video ad). A 4-8% improvement in CPM is observed in multiple test video ad campaigns; however, there is a dip in the number of impressions delivered, which is tackled by including impressions as an input in both techniques. These clear improvements in CPM indicate that the technique yields better ROI than the heuristic approach. The approach can also be used in other sectors, such as banking (determining credit scores) and retail (supply path optimization in operations).
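The intuition behind DEA-style ranking can be sketched in the single-input, single-output special case (real DEA solves a linear program per decision-making unit; the supply paths and figures below are hypothetical):

```python
def efficiency_scores(paths):
    """Single-input/single-output efficiency: delivered value per unit cost,
    normalized against the best path (the efficiency frontier scores 1.0).
    Full DEA generalizes this ratio to many inputs/outputs via linear programs."""
    ratios = {name: views / cost for name, (cost, views) in paths.items()}
    best = max(ratios.values())
    return {name: r / best for name, r in ratios.items()}

# Hypothetical supply paths: (media cost in $, completed video views).
paths = {
    "SSP-A direct":    (1000, 90000),
    "SSP-B resold":    (1000, 60000),
    "Exchange-C deal": (1500, 120000),
}
scores = efficiency_scores(paths)
print(scores)  # SSP-A direct sits on the frontier at 1.0
```

Paths scoring well below 1.0 are candidates to drop, which is how frontier-relative efficiency feeds the bid/no-bid decision.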
Dr. Manjeet Dahiya - Learning Maps from Geospatial Data Captured by Logistics Operations
Logistics operations produce a huge amount of geospatial data, and this talk shows how we can use it to build a mapping service such as Google Maps or Here Maps!
E-commerce and logistics operations produce a vast amount of geospatial data while moving and delivering packages. As a logistics company supporting e-commerce operations in multiple Asian countries, Delhivery produces over 50 million geo-coordinates daily. These geo-coordinates represent the movement of trucks and bikes, or delivery events at the given postal addresses. The data has great potential for mining geospatial knowledge, and we demonstrate that a mapping service similar to Google Maps and Here Maps can be built automatically from it. Specifically, we describe learning regional maps (localities, cities, etc.) from addresses labeled with geo-coordinates, and learning roads from the geo-coordinates associated with movement.
We propose an algorithm to construct polygons and polylines of map entities from a set of geo-coordinates. The algorithm involves non-parametric spatial probability modelling of the map entities, followed by classification of the cells of a hexagonal grid to the respective map entity. We show that our algorithm is capable of handling noise, which is significantly high in our setting due to reasons such as scale and device issues. We present a property relating the noise and the correct information under which our algorithm infers the correct map entity. We quantitatively measure the accuracy of our system by comparing its output with the available ground truth. We will showcase some localities that have incorrect polygons in Google Maps, whereas we learn the correct version from our data and algorithm. We also discuss multiple applications of the generated maps in the context of e-commerce and logistics operations.
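The cell-classification step can be sketched with a square grid standing in for the paper's hexagonal grid; the localities and coordinates below are hypothetical:

```python
from collections import Counter, defaultdict

def learn_locality_grid(points, cell_size, min_share=0.6):
    """Assign labelled geo-coordinates to grid cells and classify each cell by
    majority vote; cells without a clear majority stay unresolved (noise)."""
    votes = defaultdict(Counter)
    for lat, lon, locality in points:
        cell = (int(lat // cell_size), int(lon // cell_size))
        votes[cell][locality] += 1
    grid = {}
    for cell, ctr in votes.items():
        label, count = ctr.most_common(1)[0]
        if count / sum(ctr.values()) >= min_share:  # noise tolerance
            grid[cell] = label
    return grid

# Hypothetical delivery events: (lat, lon, locality from the address label).
points = [
    (28.541, 77.201, "Hauz Khas"), (28.542, 77.203, "Hauz Khas"),
    (28.543, 77.202, "Hauz Khas"), (28.541, 77.204, "Green Park"),  # noisy label
    (28.551, 77.211, "Green Park"), (28.552, 77.212, "Green Park"),
]
grid = learn_locality_grid(points, cell_size=0.01)
print(grid)
```

The union of same-label cells then forms the locality polygon; the vote threshold is what absorbs mislabelled addresses.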
A part of this work was accepted for publication at ACM/SIGAPP Symposium On Applied Computing 2020:
"Learning Locality Maps from Noisy Geospatial Labels. In SAC 2020 at Brno, Czech Republic"
Soham Chakraborty - A Spurious Outlier Detection System For High Frequency Time Series Data
As we live in the age of IoT, more and more processes use information gathered from well-placed sensors to make better inferences and predictions about their businesses. These sensor data are typically continuous and of enormous volume. Like any other data source, they are contaminated by noise (outliers), which may or may not be preventable. The presence of these outlier points adversely affects the performance of any analytical model. Note that we differentiate between contextual anomalies and noisy outliers; the former are important for building predictive models. Here we propose an integrated and scalable approach to detect spurious outliers. The main modules of this proposed system are taken from the literature, but to our knowledge no comparable concerted approach exists in which an end-to-end robust system is proposed. Even though this method was developed specifically using manufacturing IoT data, it is equally applicable to any domain dealing with time-series data, such as CPG, retail, healthcare, and agrotech.
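One standard building block for spurious-spike detection from the literature is the Hampel identifier (illustrative only; not claimed to be the talk's exact module, and the sensor values are made up):

```python
import statistics

def hampel_flags(series, window=5, n_sigmas=3.0):
    """Flag points deviating from the rolling median by more than
    n_sigmas * 1.4826 * MAD (the Hampel identifier for spurious spikes)."""
    half = window // 2
    flags = [False] * len(series)
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        med = statistics.median(chunk)
        mad = statistics.median(abs(x - med) for x in chunk)
        if mad > 0 and abs(series[i] - med) > n_sigmas * 1.4826 * mad:
            flags[i] = True
    return flags

# Smooth sensor signal with one spurious spike at index 6 (hypothetical data).
signal = [10.0, 10.1, 9.9, 10.2, 10.0, 10.1, 55.0, 10.0, 9.8, 10.1, 10.0]
print([i for i, f in enumerate(hampel_flags(signal)) if f])  # [6]
```

Because it uses the median and MAD rather than mean and standard deviation, the filter itself is not dragged by the spike it is trying to flag.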
Soumya Jain - Unsupervised learning approach for identifying retail store employees using footfall data
Analysis of customer visits (footfall) in a store, traced via geolocation-enabled devices, helps digital firms better understand customers and their buying behavior. Insights gained through geo-footfall analysis help clients and advertisers make informed decisions, choose profitable regions, recognize relevant advertising opportunities, and analyze their competitors to increase the success rate. But all this information can be misleading if people who walk past the store without entering, and the store's staff, are not excluded. Therefore, two groups of people contributing to footfall at the store can be considered outliers: people passing by the store, and employees of the store. The behavior of these outliers is expected to differ from that of actual customers.
Since the data collected by geofencing the stores and the pings from the SDKs of geo-enabled devices do not contribute much to tagging these outliers exclusively, the outliers are not very evident and cannot be removed by extreme-value analysis. To tackle this problem we formulated a multivariate approach to identify and remove these outliers from our source data. As we have no labeled data marking a footfall as an employee or a customer, we use an unsupervised outlier detection model based on the DBSCAN algorithm to provide a coherent and complete dataset with labeled outliers. In this process, different techniques were considered to assess the effectiveness of features. Features like time spent by a visitor in and around the store compared to other locations, monthly visit frequency, and daily visit frequency were dominant in tagging the outliers.
Discovering the structure of the data was another key step in optimizing the DBSCAN parameters for our use case, namely epsilon and minimum points.
Finally, the results were evaluated against those of the k-means algorithm, which showed that DBSCAN has a higher detection rate and a lower rate of false positives in discovering outliers for the given problem.
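DBSCAN itself is compact enough to sketch; the minimal implementation and the two-feature footfall points below are illustrative (a production pipeline would use a library implementation and the full feature set described above):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster label per point (-1 = outlier/noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1               # tentatively noise
            continue
        cluster += 1                     # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point, previously marked noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:       # j is also core: expand the cluster
                queue.extend(js)
    return labels

# Hypothetical footfall features: (hours per visit, visits per month).
# Two customer groups plus two outliers (an employee and a passer-by).
pts = [(0.5, 2), (0.6, 3), (0.5, 3), (0.7, 2),   # regular customers
       (1.0, 1), (1.1, 1), (0.9, 2),             # occasional customers
       (8.0, 22),                                # employee: long, daily presence
       (0.02, 30)]                               # passer-by: seconds, very often
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)  # employee and passer-by come out as -1
```

The points labelled -1 are exactly the records dropped before footfall analysis, with no labelled training data required.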
Amogh Kamat Tarcar - Privacy Preserving Machine Learning Techniques
Privacy-preserving machine learning is an emerging field under active research. The most successful machine learning models today are built by aggregating all data at a central location. While centralised techniques are great, there are plenty of scenarios, such as user privacy, legal concerns, business competitiveness, or bandwidth limitations, in which data cannot be aggregated. Federated Learning can help overcome these challenges with its decentralised strategy for building machine learning models. Paired with privacy-preserving techniques such as encryption and differential privacy, Federated Learning presents a promising new way to advance machine learning solutions.
In this talk I'll bring the audience up to speed on progress in privacy-preserving machine learning, discuss platforms for developing models, and present a demo on healthcare use cases.
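The core federated-learning aggregation step (FedAvg) can be sketched in a few lines; the hospital names, weights, and dataset sizes are hypothetical:

```python
def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: size-weighted mean of client model parameters.
    Only parameters travel to the server; raw data never leaves the clients."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(w[k] * n / total for w, n in zip(client_weights, client_sizes))
            for k in range(n_params)]

# Hypothetical hospital clients: locally trained weights + local dataset sizes.
hospital_a = [0.2, -1.0, 0.5]
hospital_b = [0.4, -0.8, 0.7]
global_model = federated_average([hospital_a, hospital_b], client_sizes=[100, 300])
print(global_model)  # weighted 1:3 toward hospital_b
```

In a real round this averaging repeats after each burst of local training, and the exchanged updates can additionally be encrypted or noised for differential privacy.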
Aravind Kondamudi / Upasana Roy Chowdhury - AI in Manufacturing - Improving Process using Prescriptive Analytics
With the rise of Industry 4.0, computational power, data warehousing, and automation, factories have been becoming increasingly intelligent. Preventive maintenance of machines and prediction of failures are an increasingly common sight. AI also empowers planning and logistics, where the quantity of items to be manufactured, and their timing, are decided from the outputs of ML models. Manufacturers are now increasingly focused on improving process quality and throughput through sustainable methods, as rising global warming is a concern. To improve efficiency and make the process sustainable, machine learning models coupled with optimization are used for prescriptive analytics. Industrial process data is often huge, with many process and control variables involved, and understanding the variables requires domain expertise coupled with feature-engineering techniques. A search-based optimization can be used to find a Pareto-optimal solution, with objectives of maximizing the KPI and finding support in historical data. Interaction effects are identified by learning the data through a prediction model, and the resulting process performance is predicted by modelling the KPI. Sensitivity analysis is conducted to understand the effect of the variables on the uncertainty of the model output and the KPI. The process, optimized to maximize throughput, then provides prescriptive analytics, improving performance and reducing energy consumption.
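The Pareto-optimal search mentioned above can be illustrated with a toy dominance check; the setting names and numbers are hypothetical:

```python
def pareto_front(candidates):
    """Return settings not dominated on (throughput up, energy down)."""
    front = []
    for name, throughput, energy in candidates:
        dominated = any(t2 >= throughput and e2 <= energy
                        and (t2, e2) != (throughput, energy)
                        for _, t2, e2 in candidates)
        if not dominated:
            front.append(name)
    return front

# Hypothetical process settings: (name, throughput in t/h, energy in kWh/t).
settings = [
    ("low-temp",  90, 40),   # efficient but slower
    ("balanced", 100, 45),
    ("high-rpm",  95, 50),   # dominated by "balanced": less output, more energy
]
print(pareto_front(settings))  # ['low-temp', 'balanced']
```

The prescriptive recommendation is then a choice along this front, filtered to settings with enough support in the historical data.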
Kavita Dwivedi - Portfolio Valuation for a Retail Bank using Monte Carlo Simulation and Forecasting for Risk Measurement
Banks today need a very good assessment of their portfolio value at any point in time. This is both a regulatory requirement and an operational metric that helps banks assess portfolio risk and calculate the capital adequacy they need to maintain at portfolio level and product level, aggregated up to bank level.
This presentation will walk you through a case study discussing in detail how we calculated portfolio value for a home loan book on sample data. The bank wanted a scientific/statistical approach so that they could take it to regulators for approval and convince them about the capital held for a particular portfolio.
The other interesting dimension was that, should the bank want to sell a particular loan book to another bank or a third-party financial institution, it would be able to quote a price within the confidence interval of the calculated price. The same model/tool could also be shared with the buyer to justify the quoted price, making negotiation and sale smooth.
We used Monte Carlo simulation on historical portfolio data to measure the portfolio value of a home loan portfolio for the next five years. It is a two-step modeling process: machine learning models predict default, and simulation then calculates portfolio value year on year for the next five years, taking diminishing returns into account.
The presentation will take you through the approach and the modeling process, and how Monte Carlo simulation helped us deliver it to the customer with high accuracy and confidence.
This is a real case study and will focus on why risk measurement is important, and why Basel and CCAR implementations across banks worldwide help central banks manage risk in the event of a financial downturn or a Black Swan event.
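As a rough illustration of the simulation step (not the bank's actual model; all figures are hypothetical, and the ML default model is reduced to a constant annual default probability):

```python
import random

def simulate_portfolio_value(n_loans, balance, annual_rate, pd_annual,
                             recovery, years, n_sims, seed=0):
    """Monte Carlo sketch: expected discounted cash from a home-loan book.
    Each year every surviving loan defaults with probability pd_annual
    (a stand-in for the per-loan ML default model of the two-step process)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        value, alive = 0.0, n_loans
        for year in range(1, years + 1):
            defaults = sum(rng.random() < pd_annual for _ in range(alive))
            alive -= defaults
            # Interest from survivors plus partial recovery on defaults.
            cash = alive * balance * annual_rate + defaults * balance * recovery
            value += cash / (1 + annual_rate) ** year   # discount to today
        totals.append(value)
    return sum(totals) / len(totals)

# Hypothetical book: 1,000 loans of 5M INR, 9% rate, 2% annual PD, 60% recovery.
est = simulate_portfolio_value(1000, 5e6, 0.09, 0.02, 0.60, years=5, n_sims=200)
print(round(est / 1e9, 2), "billion INR over a 5-year horizon")
```

Sorting the simulated totals also gives the confidence interval around the quoted price that the abstract describes for buyer negotiations.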
Parthiban Srinivasan - Coronavirus: Through The Lens Of AI
In a global pandemic such as COVID-19, technology, artificial intelligence, and data science have become critical to helping societies deal effectively with the outbreak. In this talk, I will discuss three case studies of how AI is being used in coronavirus research. The first part of the talk covers how a deep learning model detected COVID-19-caused pneumonia from computed tomography (CT) scans with performance comparable to expert radiologists; specifically, I will discuss the UNet++ architecture that researchers implemented for evaluating lung infection in COVID-19 CT images. The second part is devoted to recent attempts in natural language processing to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration of new coronavirus literature, which makes it difficult for the medical research community to keep up. To be precise, a BERT-based search engine for COVID-19 literature will be discussed.
The third part of the talk deals with a deep learning based generative modeling framework to design drug candidates specific to a given target protein sequence. One of the most important COVID-19 protein targets is the 3C-like protease, for which the crystal structure is known. We present different deep learning models designed for generating novel drug molecules with multiple desirable properties. The deep learning framework involves Variational Autoencoders, Generative Adversarial Networks, Reinforcement Learning, and Transfer Learning. The generated molecules might serve as a blueprint for creating drugs that can potentially bind to the viral protein with high target affinity as well as high drug-likeness. Finally, this talk will also touch upon how the world community responded by making data available to researchers, which enabled data scientists to explore and support the scientific community.
Coffee Break - 15 mins
Dr. Sri Vallabha Deevi - Machine health monitoring with AI
Predictive maintenance is the most recent technique in maintenance engineering. Machine operational parameters are used to assess the health of equipment and decide on a maintenance schedule. In aviation, aircraft engine manufacturers continuously monitor engine parameters in flight to evaluate performance and deviations from normal.
Applying AI in this field enables measurement of behavior that is not observable by traditional means. AI-based monitoring provides the edge required to operate in Industry 4.0, where connected machines do away with buffers between processes and any unscheduled downtime of one machine affects the entire production chain.
This demonstration will walk you through the development of AI models using IoT data for one of the largest metal manufacturing companies in India. It will help you master different types of AI models that answer questions like:
- When do I plan the maintenance of a given equipment?
- Will a component last till the next maintenance cycle or do I replace it during the current maintenance?
- How to identify faulty equipment in the long production line?
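As a hedged illustration of the simplest kind of condition-monitoring baseline such questions build on, a rolling z-score flags sensor readings that drift from recent behavior. The window size, threshold, and signal below are invented for this sketch; the talk's actual models are not shown here.

```python
from statistics import mean, stdev

def health_alerts(readings, window=20, z_threshold=3.0):
    """Flag readings that deviate strongly from recent behavior.

    An index is flagged when its reading lies more than `z_threshold`
    standard deviations from the trailing window's mean.
    """
    alerts = []
    for i in range(window, len(readings)):
        hist = readings[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_threshold:
            alerts.append(i)
    return alerts

# Stable vibration-like signal with one injected fault spike
signal = [1.0 + 0.01 * ((i * 7) % 5) for i in range(100)]
signal[60] = 5.0
print(health_alerts(signal))  # the spike at index 60 is flagged
```

Real equipment-health models add multivariate features, trend components, and remaining-useful-life estimation on top of this kind of deviation detection.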
Dat Tran / Tanuj Jain - imagededup - Finding duplicate images made easy!
The problem of finding duplicates in an image collection is widespread. Many online businesses rely on image galleries to deliver a good customer experience and, consequently, generate more revenue. Hence, the image galleries need to be of the highest quality. The presence of duplicates in such galleries can degrade the customer experience. Additionally, image-based machine learning models can produce misleading results due to duplicates in the training/evaluation/test sets.
Therefore, finding and removing duplicates is an important requirement across several use cases. In this talk, we present imagededup, a Python package we built to solve the problem of finding exact and near duplicates in an image collection. We will speak about the motivation behind building it and its functionality, and also give a demo.
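To make the idea concrete: perceptual-hashing approaches to near-duplicate detection reduce each image to a compact fingerprint and compare fingerprints by Hamming distance. The toy difference-hash below operates on a hand-written grayscale grid rather than real images; it illustrates the principle only and is not imagededup's implementation.

```python
def dhash(pixels):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    `pixels` is a 2-D list of grayscale values. Near-duplicate images
    keep the same brightness gradients, so their hashes stay close.
    """
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left < right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

img = [[10, 20, 30, 40], [40, 30, 20, 10], [10, 10, 20, 20]]
# A uniformly brightened copy keeps every left/right comparison intact
near_dup = [[v + 5 for v in row] for row in img]
assert hamming(dhash(img), dhash(near_dup)) == 0
```

Exact duplicates can be caught with a plain cryptographic hash; the perceptual variant above is what makes *near* duplicates (resized, re-encoded, brightened copies) findable.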
Anuj Gupta - Data Augmentation for NLP
It is a well known fact that the more data we have, the better performance ML models can achieve. However, getting a large amount of training data annotated is a luxury most practitioners cannot afford. Computer vision has circumvented this via data augmentation techniques and has reaped rich benefits. Can NLP not do the same? In this talk we will look at various techniques available for practitioners to augment data for their NLP application and various bells and whistles around these techniques.
In the area of AI, it is a well-established fact that data beats algorithms, i.e. large amounts of data with a simple algorithm often yield far superior results compared to the best algorithm with little data. This is especially true for deep learning algorithms, which are known to be data guzzlers. Getting data labeled at scale is a luxury most practitioners cannot afford. What does one do in such a scenario?
This is where data augmentation comes into play. Data augmentation is a set of techniques to increase the size of datasets and introduce more variability into the data. This helps to train better and more robust models. Data augmentation is very popular in computer vision: from simple techniques like rotation, translation, and adding salt-and-pepper noise to GANs, we have a whole range of techniques to augment images. Augmentation is one of the key anchors behind the success of computer vision models in industrial applications.
Most natural language processing (NLP) projects in industry still suffer from data scarcity. This is where recent advances in data augmentation for NLP can be very helpful. When it comes to NLP, data augmentation is not that straightforward: you want to augment data while preserving the syntactic and semantic properties of the text. In this talk we will take a deep dive into the various techniques available to practitioners to augment data for NLP. The talk is meant for data scientists, NLP engineers, ML engineers, and industry leaders working on NLP problems.
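As a minimal sketch of one such technique, synonym replacement swaps words for near-equivalents to create meaning-preserving variants. The tiny lexicon below is a made-up stand-in for a real resource like WordNet or an embedding-based nearest-neighbor lookup.

```python
import random

# Hypothetical mini-lexicon; real pipelines use WordNet or embeddings
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "ship": ["deliver", "send"],
    "order": ["purchase"],
}

def augment(sentence, n_variants=3, p_replace=0.5, seed=0):
    """Generate variants of `sentence` by randomly swapping in synonyms.

    A classic NLP augmentation baseline: meaning is roughly preserved
    while surface form varies, enlarging the training set.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        words = []
        for w in sentence.split():
            if w in SYNONYMS and rng.random() < p_replace:
                words.append(rng.choice(SYNONYMS[w]))
            else:
                words.append(w)
        variants.append(" ".join(words))
    return variants

print(augment("please ship my quick order"))
```

Unlike image rotation, even this simple swap can break meaning (polysemy, collocations), which is exactly why NLP augmentation needs the extra care the talk discusses.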
Piyush Makhija - Normalizing User-Generated Text Data
A large fraction of NLP work in academia and research groups deals with clean datasets that are well structured and free of noise. However, when building real-world NLP applications, one often has to collect data from sources such as chats, user-discussion forums, and social-media conversations. Invariably, NLP applications in industrial settings have to deal with much noisier and more varied data: data with spelling mistakes, typos, acronyms, emojis, embedded metadata, etc.
There is a high level of disparity between the data SOTA language models were trained on & the data these models are expected to work on in practice. This renders most commercial NLP applications working with noisy data unable to take advantage of SOTA advances in the field of language computation.
Handcrafting rules and heuristics to correct this data is not a scalable option for most industrial applications. Moreover, most SOTA models in NLP are not designed with noisy data in mind, and often give substandard performance on it.
In this talk, we share our approach, experience, and learnings from designing a robust system to clean noise in data, without handcrafting the rules, using Machine Translation, and effectively making downstream NLP tasks easier to perform.
This work is motivated by our business use case where we are building a conversational system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from tier-2 and tier-3 cities of India. Their responses to our conversational bot are mostly a code mix of Hindi and English coupled with non-canonical text (ex: typos, non-standard syntactic constructions, spelling variations, phonetic substitutions, foreign language words in a non-native script, grammatically incorrect text, colloquialisms, abbreviations, etc). The raw text our system gets is far from clean well-formatted text and text normalization becomes a necessity to process it any further.
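The MT-based normalizer itself is beyond a snippet, but a toy lookup baseline makes the task concrete, and also shows why hand-built lexicons do not scale to this kind of input. Every entry below is illustrative.

```python
# Illustrative noisy-to-canonical lexicon for code-mixed chat text
NORMALIZATION_MAP = {
    "plz": "please", "u": "you", "r": "are",
    "thx": "thanks",
    "job chahiye": "need a job",  # phrase-level (code-mixed) entry
}

def normalize(text):
    """Lower-case, apply phrase entries first, then token entries."""
    text = text.lower()
    for noisy, clean in NORMALIZATION_MAP.items():
        if " " in noisy:                      # multi-word entries
            text = text.replace(noisy, clean)
    return " ".join(NORMALIZATION_MAP.get(tok, tok) for tok in text.split())

print(normalize("Plz tell if u r hiring"))  # → "please tell if you are hiring"
```

Each new typo, spelling variation, or phonetic substitution needs another entry here, which is precisely the brittleness a learned (e.g. MT-style) normalizer avoids.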
This talk is meant for computational language researchers, NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups, and linguists working with resource-rich as well as resource-constrained (i.e. vernacular and code-mixed) languages.
Closing Keynote - 45 mins
Closing Talk - 15 mins
ODSC India 2020
Wed, Nov 25
Ashwathi Nambiar / lazy desk - Quantization To The Rescue: An Edge AI Story
Over the last decade, deep neural networks have brought in a resurgence in artificial intelligence, with machines outperforming humans in some of the most popular image recognition problems. But all that jazz comes with its costs – high compute complexity and large memory requirements. These requirements translate to higher power consumption resulting in steep electricity bills and a sizeable carbon footprint. Optimizing model size and complexity thus becomes a necessity for a sustainable future for AI.
Memory and compute complexity optimizations also bring in the promise of unimaginable possibilities with edge AI - self-driving cars, predictive maintenance, smart speakers, body monitoring are only the beginning. The smartphone market, with its reach to nearly 4 billion people, is only a fraction of the potential edge devices waiting to be truly ‘smart’. Think smart hospitals or mining, oil and gas industrial automation and so much more.
In this session we will talk about:
- Challenges in deep neural network (DNN) deployment on embedded systems with resource constraints
- Quantization, which has been popularly used in mathematics and digital signal processing to map values from a large often continuous set to values in a countable smaller set, now reimagined as a possible solution for compressing DNNs and accelerating inference.
It is gaining popularity not only in machine learning frameworks like MATLAB, TensorFlow, and PyTorch but also in hardware toolchains like NVIDIA® TensorRT and Xilinx® DNNDK. The core idea behind quantization is the resiliency of neural networks to noise. Deep neural networks, in particular, are trained to pick up key patterns and ignore noise. This means the networks can cope with the small changes resulting from quantization error, as backed by research indicating minimal impact of quantization on the overall accuracy of the network. This, coupled with a significant reduction in memory footprint and power consumption and gains in computational speed, makes quantization an efficient approach for deploying neural networks to embedded hardware.
- Example of a quantization solution for an object detection problem
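The core mapping can be sketched, assuming the common affine (asymmetric) post-training scheme that maps an observed float range onto 8-bit integers; real toolchains add calibration, per-channel scales, and fused integer kernels on top.

```python
def quantize(values, num_bits=8):
    """Affine (asymmetric) quantization of floats to unsigned integers.

    Maps the observed [min, max] range onto [0, 2^bits - 1] via a scale
    and zero point; the basic scheme behind post-training quantization.
    """
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0           # avoid zero scale
    zero_point = round(-lo / scale)           # integer that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, -0.4, 0.0, 0.7, 1.5]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
# Round-trip error is bounded by half a quantization step per weight
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

Storing 8-bit integers instead of 32-bit floats is the 4x memory saving the talk refers to, and integer arithmetic is what unlocks the speed and power gains on embedded hardware.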
lazy desk - Deep Reinforcement Learning Based RecSys Using Distributed Q Table
Recommendation systems (RecSys) are the core engine for any personalized experience on eCommerce and online media websites. Most companies leverage RecSys to increase user interaction, enrich shopping potential, and generate upsell and cross-sell opportunities. Amazon uses recommendations as a targeted marketing tool throughout its website, contributing 35% of its total revenue. Netflix users watch ~75% of the recommended content and artwork. Spotify employs a recommendation system to update personal playlists every week so that users won't miss newly released music by artists they like; this helped Spotify grow its monthly users from 75 million to 100 million. YouTube's personalized recommendations help users find relevant videos quickly and easily, accounting for around 60% of video clicks from the homepage.
In general, RecSys generates recommendations based on user browsing history and preferences, past purchases, and item metadata. Most existing recommendation systems are based on three paradigms: collaborative filtering (CF) and its variants, content-based recommendation engines, and hybrid engines that combine the two or exploit additional user information in content-based recommendation. However, they suffer from limitations such as rapidly changing user data and preferences, static recommendations, grey sheep, cold start, and malicious users.
A classical RecSys algorithm like content-based recommendation performs great on item-to-item similarities but will only recommend items related to one category, and may not recommend anything in other categories because the user never viewed those items before. Collaborative filtering solves this problem by exploiting users' behavior and preferences over items when recommending items to new users. However, collaborative filtering suffers from drawbacks like cold start, popularity bias, and sparsity. Classical recommendation models also treat recommendation as a static process. Reinforcement learning (RL) can address static recommendation on rapidly changing user data: an RL-based RecSys captures the user's temporal intentions and responds promptly. However, as the user-action and item matrices grow, it becomes difficult to provide recommendations using plain RL. Deep RL based solutions like actor-critic and deep Q-networks overcome the aforementioned drawbacks.
Present systems suffer from two limitations: first, they consider recommendation a static procedure and ignore the dynamic, interactive nature between users and the recommender system; second, most works focus on the immediate feedback of recommended items and neglect long-term rewards. We propose a recommendation system that uses Q-learning. We use an ε-greedy policy combined with Q-learning, a powerful reinforcement learning method that handles these issues proficiently and gives the customer more chances to explore new pages or products that are not so popular. When applying reinforcement learning (RL) to real-world problems, both the state space and the action space are usually very large. To address this, we propose a multiple/distributed Q-table approach that can deal with the large state-action space and helps apply the Q-learning algorithm to recommendation at scale.
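A minimal sketch of the ε-greedy Q-learning loop on a toy recommendation environment follows. States, items, rewards, and hyperparameters are all invented for illustration; the distributed multi-table approach proposed in the talk would partition this single table across state subspaces.

```python
import random

def train_q_table(n_states, n_items, reward_fn, episodes=3000,
                  alpha=0.1, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Q[s][a] estimates the long-term reward of recommending item `a`
    in user state `s`; with probability epsilon we explore a random
    item, otherwise we exploit the current best estimate.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_items for _ in range(n_states)]
    state = 0
    for _ in range(episodes):
        if rng.random() < epsilon:                       # explore
            action = rng.randrange(n_items)
        else:                                            # exploit
            action = max(range(n_items), key=lambda a: Q[state][a])
        reward, next_state = reward_fn(state, action)
        td_target = reward + gamma * max(Q[next_state])  # Bellman target
        Q[state][action] += alpha * (td_target - Q[state][action])
        state = next_state
    return Q

# Toy environment: in each state, item == state is the "right" recommendation
def reward_fn(state, action):
    reward = 1.0 if action == state else 0.0
    return reward, (state + 1) % 3

Q = train_q_table(n_states=3, n_items=3, reward_fn=reward_fn)
policy = [max(range(3), key=lambda a: Q[s][a]) for s in range(3)]
print(policy)
```

The ε parameter is what gives less popular items a chance to be shown, and the discounted `td_target` is what lets the system optimize long-term rather than immediate feedback.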
- "Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modelling": https://arxiv.org/pdf/1810.12027.pdf
- "Deep Reinforcement Learning for Page-wise Recommendations": https://arxiv.org/pdf/1805.02343.pdf
- "Deep Reinforcement Learning for List-wise Recommendations": https://arxiv.org/pdf/1801.00209.pdf
- "Deep Reinforcement Learning Based RecSys Using Distributed Q Table": http://www.ieomsociety.org/ieom2020/papers/274.pdf
Sharmistha Chatterjee - Machine Learning and Data Governance in Telecom Industry
The key to solving ML problems in the telecom industry lies in continuous data collection and evaluation across different categories of customers and networks, so as to track and dive into varying performance metrics. KPIs form the basis of network monitoring, helping network/telecom operators automatically add and scale network resources. Such smart automated systems are built to increase customer engagement through enhanced customer experience and to detect and correct customer-behavior anomalies in a timely manner. Further, the system is designed to scale and serve current LTE/4G and upcoming 5G networks with minimal non-effective cell site visits and quick root cause analysis (RCA).
Network congestion has remained an ever-increasing problem. Operators have attempted a variety of strategies to match network demand with existing infrastructure, as deploying additional network capacity is expensive. To keep costs under control, operators apply control measures that attempt to allocate bandwidth fairly among users and throttle users who consume excessive bandwidth. This approach has had limited success. Alternatively, over-provisioning the network with extra bandwidth for quality of experience (QoE) has proved ineffective and inefficient due to the lack of proper estimation.
The evolution of 5G networks will lead manufacturers and telecom operators to use high data-transfer rates, wide network coverage, and low latency to build smart factories using automation, artificial intelligence, and the Internet of Things (IoT). Advanced data science and AI can provide predictive insights that improve network capacity-planning accuracy. Better network provisioning yields better network utilization, both for next-generation 5G networks and for current LTE and 4G networks. Further, AI models can be designed to link application throughput with network performance, prompting users to plan their daily usage based on their current location and total monthly budget.
In this talk, we will understand the current challenges in the telecom industry, the need for an AIOps platform, and the mission held by telecom operators and communication service providers across the world in designing such AI frameworks, platforms, and best practices. We will see how increasing operator collaborations are helping to create, deploy, and productionize AI platforms for different AI use cases. We will study one industrial use case (with code), based on real-world field research, to predict network capacity. In this respect, we will investigate how deep learning networks can be trained on large volumes of data at scale (millions of network cells), and how this can help the upcoming 5G networks. We will also examine an end-to-end pipeline for hosting the scalable framework on Google Cloud. As the data volume is huge and the data needs to be stored in highly secured systems, we built our high-performing system with extra security features; it can process millions of requests within a few milliseconds. The session highlights the parameters and metrics used in creating an LSTM-based neural network, and discusses the challenges and key aspects involved in designing and scaling the system.
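The LSTM pipeline itself is out of scope for a snippet, but the supervised framing it trains on is simple: slide a window over each cell's KPI series and predict the next value. The window length and throughput numbers below are illustrative only.

```python
def make_windows(series, lookback=4):
    """Turn a univariate KPI series into (input window, next value) pairs,
    the supervised format an LSTM (or any sequence model) trains on."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])   # trailing `lookback` observations
        y.append(series[i + lookback])     # value to predict
    return X, y

# Hourly cell throughput (illustrative numbers, in Mbps)
throughput = [120, 135, 150, 160, 155, 148, 170, 180]
X, y = make_windows(throughput, lookback=4)
print(X[0], "->", y[0])  # [120, 135, 150, 160] -> 155
```

At the scale described (millions of cells), the same windowing runs per cell, and the resulting tensors feed the LSTM training job hosted on the cloud pipeline.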