ODSC India 2020
Mon, Nov 23
Timezone: Asia/Kolkata (IST)
Opening Keynote - 45 mins
Welcome Note - 15 mins
Coffee Break - 15 mins
Kuldeep Jiwani - Non-Parametric PDF estimation for advanced Anomaly Detection
Anomaly detection has been one of the most sought-after analytical solutions for businesses operating in network operations, service operations, manufacturing and many other sectors where continuity of operations is essential. Any degradation in operational service, or an outage, implies high losses and possible customer churn. The data in such real-world applications is generally noisy, has complex patterns and is often correlated.
Techniques such as auto-encoders are available for modelling complex patterns, but they cannot explain the cause in the original feature space. Traditional univariate anomaly detection techniques use z-score and p-value methods. These rely on unimodality and on choosing the correct parametric form. If these assumptions are not satisfied, there will be a high number of false positives and false negatives.
This is where the need arises to estimate a PDF (Probability Density Function) without assuming a prior parametric form, i.e. a non-parametric approach. The PDF needs to be modelled as close to the true distribution as possible. That is, it should have low bias and low variance, to avoid over-smoothing and under-smoothing. Only then do we have a better chance of identifying true anomalies.
Approaches like KDE (Kernel Density Estimation) assist in such non-parametric estimation. As per research, the type of kernel plays a lesser role than the bandwidth in obtaining a good PDF estimate. The default bandwidth selection technique used in both Python and R packages over-smooths the PDF and is not suitable for anomaly detection.
We will explain another method, in which we optimise a cost function based on modelling the Gaussian kernel via FFT (Fast Fourier Transform) to obtain an appropriate bandwidth. We will then show how to apply it to anomaly detection even when the data is multi-modal (has multiple peaks) and the distribution can be of any shape.
Based on the research paper under publication, "Optimal Kernel Density Estimation using FFT based cost function", currently scheduled for ICDM 2020, New York.
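To make the contrast between a z-score and a density estimate concrete, here is a minimal sketch of a plain Gaussian KDE used for anomaly flagging. It is not the FFT-based bandwidth selection described in the talk: the bandwidth and the density threshold are hand-picked assumptions, and the readings are hypothetical.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a density function built from one Gaussian kernel per sample point."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density

def z_score(x, sample):
    """Absolute z-score of x, which implicitly assumes a single Gaussian mode."""
    mean = sum(sample) / len(sample)
    std = math.sqrt(sum((s - mean) ** 2 for s in sample) / len(sample))
    return abs(x - mean) / std

# Hypothetical sensor readings with two operating modes, near 10 and near 20.
sample = [9.6, 9.8, 10.0, 10.1, 10.3, 10.4, 19.5, 19.8, 20.0, 20.2, 20.5]

density = gaussian_kde(sample, bandwidth=0.5)  # hand-picked bandwidth (assumption)

def is_anomaly(x, threshold=0.01):
    """Flag x when the estimated density falls below a hypothetical threshold."""
    return density(x) < threshold

# Print the z-score and the KDE verdict for a typical and an in-between reading.
for x in (10.0, 15.0):
    print(x, round(z_score(x, sample), 2), is_anomaly(x))
```

A reading of 15.0 lies between the two modes, so its z-score is small even though almost no probability mass sits there. The density estimate flags it, which is the kind of case where a unimodal assumption produces false negatives.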
Akshay Bahadur - Indian Sign Language Recognition (ISLAR)
Sample this: two cities in India, Mumbai and Pune, though only 80 km apart, have distinctly different spoken dialects. Even stranger is the fact that their sign languages are also distinct, with very different signs for the same objects, expressions and phrases. While regional diversification in spoken languages and scripts is well known and widely documented, it has apparently percolated into sign language as well, essentially resulting in multiple sign languages across the country. To help overcome these inconsistencies and to standardize sign language in India, I am collaborating with the Centre for Research and Development of Deaf & Mute (an NGO in Pune) and Google, adopting a two-pronged approach: a) I have developed an Indian Sign Language Recognition System (ISLAR), which uses artificial intelligence to accurately identify signs and translate them into text/vocals in real time, and b) I have proposed standardization of sign languages across India to the Government of India and the Indian Sign Language Research and Training Centre.
As previously mentioned, the initiative aims to develop a lightweight machine-learning model, for 14 million speech/hearing impaired Indians, that is suitable for Indian conditions along with the flexibility to incorporate multiple signs for the same gesture. More importantly, unlike other implementations, which utilize additional external hardware, this approach, which utilizes a common surgical glove and a ubiquitous camera smartphone, has the potential of hardware-related savings at an all-India level. ISLAR received great attention from the open-source community with Google inviting me to their India and global headquarters in Bangalore and California, respectively, to interact with and share my work with the TensorFlow team.
Gunjan Dewan - Developing a match-making algorithm between customers and Go-Jek products!
20+ products. Millions of active customers. Insane amount of data and complex domain. Come join me in this talk to know the journey we at Gojek took to predict which of our products a user is most likely to use next.
A major problem we faced, as a company, was targeting our customers with promos and vouchers that were relevant to them. We developed a generalized model that takes into account the transaction history of users and gives a ranked list of our services that they are most likely to use next. From here on, we are able to determine the vouchers that we can target these customers with.
In this talk, I will be talking about how we used recommendation engines to solve this problem, the challenges we faced during the time and the impact it had on our conversion rates. I will also be talking about the different iterations we went through and how our problem statement evolved as we were solving the problem.
Venkata Pingali - Privacy-Law Aware ML Data Preparation
The new PDP (Personal Data Protection) Law, which is similar to GDPR
and CCPA, is being implemented in India. All enterprise data services
including analytics and data science within the scope of the law are
required to comply with the same. Almost all major geographies have now
passed similar laws. The expectation of responsible data handling from
organizations is also increasing.
Enrich, our product, is a high-trust data preparation platform for
enterprises that provides data input to analysts and models at scale
every day. Such data preparation services are on organizations’
compliance and privacy-activity critical path because of their
‘fan-out’ nature. They provide a convenient location to enforce policy
and safety mechanisms.
In this talk we discuss some of the mechanisms that we are building
for clients in our data preparation platform, Enrich. They include
an open-source compliance checklist to help with the process, a ‘right to
forget’ service built on an anonymized lookup-key service, and a metadata
service to enable tracking of the datasets. The focus will be on the
generic capabilities, and not on Scribble or our product.
Piyush Arora - Natural Language Querying for Industry Grade Data Analytics Systems
This talk focuses on querying industry-grade big data systems. Enterprises have vast amounts of information spread across structured data stores (relational databases, data warehouses, etc.). Descriptive analytics over this data is limited to experts familiar with complex query languages (e.g., Structured Query Language) as well as the metadata and schema associated with such large data stores. The ability to convert natural language questions to SQL statements would make descriptive analytics and reporting much easier and more widespread. The problem of automatically converting natural language questions to SQL, known as Natural Language Interface to Databases (NLIDB), is well studied. We present our work on an end-to-end (E2E) system focussed on NLIDB.
We describe two main aspects of E2E NLIDB systems: i) Converting natural language to structured language and ii) understanding natural language. There is a plenitude of applications of such E2E systems across domains e.g., healthcare, finance, logistics, etc.
Priyanshu Jain - Automated Ticket Routing for Large Enterprises
Large enterprises that provide services to consumers may receive millions of customer complaint tickets every month. Handling these tickets on time is very critical, as this directly impacts the quality of service and network efficiency.
A ticket may be assigned to multiple teams before it gets resolved. Assigning a ticket to an appropriate group is usually done manually, as the complaint information provided by the customer is not very specific and may sometimes be inaccurate. This manual process incurs enormous labor costs and is very time-inefficient, as each ticket may end up in the queue for hours.
In this talk, we will present an approach to automate the process of ticket routing completely. We will start by discussing how we can use Markov Chains to model the flow of tickets across different teams. Next, we will discuss the feature engineering part and why Factorization Machine Models are essential for such a use case. This will be followed by a discussion on the learning of decision rule sets in a supervised manner. These decision rules can be used to route tickets across multiple teams in an automated fashion, thus automating the complete process of ticket routing. We will also discuss how the proposed framework can be validated easily by SMEs, unlike other AI solutions, resulting in its quick acceptance in an organization. Finally, we will go through the different settings in which this solution can fit, which gives it broad applicability.
The framework can provide substantial cost savings to enterprises. It can also reduce the response time to tickets significantly by almost eliminating the queue time. Overall, it can help large enterprises in:
1. Saving costs by reducing the workforce of ticket handling team
2. Increasing revenue by improving quality of customer experience
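As an illustration of the Markov chain idea mentioned in the abstract, the sketch below estimates first-order transition probabilities between teams from historical ticket routes. The team names and routes are hypothetical, and this is only one simple way to model ticket flow, not necessarily the speakers' formulation.

```python
from collections import Counter, defaultdict

def transition_probabilities(routes):
    """Estimate P(next team | current team) from observed ticket routes."""
    counts = defaultdict(Counter)
    for route in routes:
        # Count every consecutive hand-off (current team -> next team).
        for src, dst in zip(route, route[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(nxt.values()) for dst, c in nxt.items()}
        for src, nxt in counts.items()
    }

# Hypothetical routing histories; "resolved" marks the absorbing end state.
routes = [
    ["helpdesk", "network", "resolved"],
    ["helpdesk", "billing", "resolved"],
    ["helpdesk", "network", "field_ops", "resolved"],
    ["helpdesk", "network", "resolved"],
]

probs = transition_probabilities(routes)
for team, nxt in probs.items():
    print(team, nxt)
```

With such a table, the most probable next team for a ticket can be read off directly, and long chains of low-probability hand-offs point to routing inefficiencies.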
Kuldeep Singh - Simplify Experimentation, Deployment and Collaboration for ML and AI Models
Machine Learning and AI are changing, or one could say have already changed, the way businesses operate. However, the Data Science community still lacks good practices for organizing projects, collaborating effectively and experimenting quickly to reduce “time to market”.
During this session, we will learn about one such open-source tool, “DVC”,
which can help you make ML models shareable and reproducible.
It is designed to handle large files, data sets, machine learning models and metrics, as well as code.
Darshan Ganji / Deepesh Agrawal - On-Demand Accelerating Deep Neural Network Inference via Edge Computing
Deep Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on mobile phones and embedded systems with limited hardware resources and taking more time for Inference and Training. For many mobile-first companies such as Baidu and Facebook, various apps are updated via different app stores, and they are very sensitive to the size of the binary files. For example, App Store has the restriction “apps above 100 MB will not download until you connect to Wi-Fi”. As a result, a feature that increases the binary size by 100MB will receive much more scrutiny than one that increases it by 10MB. It is challenging to run computation-intensive DNN-based tasks on mobile devices due to the limited computation resources.
This talk introduces the algorithms and hardware that can be used to accelerate inference or reduce the latency of deep learning workloads. We will discuss how to compress deep neural networks, techniques like graph fusion and kernel auto-tuning for accelerating inference, and data and model parallelization, automatic mixed precision and other techniques for accelerating training. We will also discuss specialized hardware for deep learning such as GPUs, FPGAs and ASICs, including the Tensor Cores in NVIDIA’s Volta GPUs as well as Google’s Tensor Processing Units (TPUs). Finally, we will discuss the deployment of large deep learning models on edge devices like the NVIDIA Jetson Nano and Google's Edge TPU (Coral).
Keywords: Graph Optimization, Tensor Fusion, Kernel Auto Tuning, Pruning, Weight sharing, quantization, low-rank approximations, binary networks, ternary networks, Winograd transformations, data parallelism, model parallelism, mixed precision, FP16, FP32, model distillation, Dense-Sparse-Dense training, NVIDIA Volta, Tensor Core, Google TPU.
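Of the compression techniques listed in the keywords, quantization is the easiest to show in a few lines. The following is a minimal sketch of symmetric 8-bit linear quantization with a single shared scale. The weights are hypothetical, and production toolchains typically use per-channel scales and calibration data, which are omitted here.

```python
def quantize(weights, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1                       # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0   # fall back if all zeros
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in q_weights]

# Hypothetical layer weights.
weights = [0.5, -1.27, 0.03, 1.0]

q_weights, scale = quantize(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half of the scale.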
Vinayaka Mayura G G - Metamorphic Testing for Machine Learning Models with Search Relevancy Example
The accuracy of a model can be improved at several levels and through multiple variables, boundaries and guidelines. Even with a well-known problem statement and solution, it is difficult to evaluate whether the model predicts the expected outcome in every given case. Machine learning models mostly solve problems for which the results are unknown. This gives rise to the test oracle problem. Recent surveys and work have shown that this difficulty can be reduced by black-box testing techniques such as Metamorphic Testing, Fuzzing, Dual Coding, etc.
Even though the output of a model is not known, we can make a few predictions based on metamorphic relations. A metamorphic relation refers to the relationship between a change in the software input and the change in output across multiple program executions. Many metamorphic relations are created from transformations of the training data set or test data set. We further classify them into coarse-grained data transformations and fine-grained data transformations.
We will discuss different transformations. We will go through the example of a search relevancy problem and analyse the application of metamorphic testing to verify the machine learning model built.
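A small sketch may help show what a metamorphic relation looks like for search relevancy. The ranker below is a toy term-overlap scorer standing in for a real model, and the three relations are illustrative assumptions rather than the ones covered in the talk. None of them needs to know the correct ranking in advance.

```python
def search(query, docs):
    """Toy ranker: score = number of query terms present in the document."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.lower().split())), d) for d in docs]
    # Sort by score descending, then by text so that ties are deterministic.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ranked if score > 0]

# Hypothetical product catalogue.
docs = ["red running shoes", "blue running shorts", "red wine glass", "leather office chair"]

def mr_permutation(query, docs):
    """Reordering the corpus must not change the ranking."""
    return search(query, docs) == search(query, list(reversed(docs)))

def mr_irrelevant_document(query, docs):
    """Adding a document that shares no terms must not change the ranking."""
    return search(query, docs) == search(query, docs + ["zzz qqq"])

def mr_case_insensitive(query, docs):
    """Changing the case of the query must not change the ranking."""
    return search(query, docs) == search(query.upper(), docs)

query = "red shoes"
print(all(mr(query, docs) for mr in (mr_permutation, mr_irrelevant_document, mr_case_insensitive)))
```

Each check compares two executions of the same system, so a violated relation reveals a defect even when no test oracle for the individual outputs exists.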
Lunch Break - 60 mins
Parthiban Srinivasan - Coronavirus: Through The Lens Of AI
In a global pandemic such as COVID-19, technology, artificial intelligence and data science have become critical to helping societies deal effectively with the outbreak. In this talk, I will discuss three case studies of how AI is being used in coronavirus research. The first part of the talk will discuss how a deep learning model detected COVID-19-caused pneumonia from computed tomography (CT) scans with performance comparable to expert radiologists. More specifically, I will discuss the UNet++ architecture that researchers implemented for evaluating lung infection in COVID-19 CT images. The second part of the talk will be devoted to recent attempts in natural language processing to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, which makes it difficult for the medical research community to keep up. To be precise, a BERT-based literature search engine for COVID-19 literature will be discussed.
The third part of the talk deals with a deep learning based generative modeling framework to design drug candidates specific to a given target protein sequence. One of the most important COVID-19 protein targets is the 3C-like protease, for which the crystal structure is known. We present different deep learning models designed for generating novel drug molecules with multiple desirable properties. The deep learning framework involves Variational Autoencoders, Generative Adversarial Networks, Reinforcement Learning and Transfer Learning. The generated molecules might serve as a blueprint for creating drugs that can potentially bind to the viral protein with high target affinity as well as high drug-likeness. Last but not least, this talk will also touch upon how the world community responded by making data available to researchers, which enabled data scientists to explore it and support the scientific community.
Coffee Break - 15 mins
Dr. Sri Vallabha Deevi - Machine health monitoring with AI
Predictive maintenance is the most recent technique in maintenance engineering. Machine operational parameters are used to assess the health of equipment and decide on maintenance schedule. In Aviation, aircraft engine manufacturers continuously monitor their engine parameters in flight to evaluate performance and deviations from normal.
Application of AI in this field enables measurement of behavior that is not observable using traditional means. AI-based monitoring provides the edge required to operate in Industry 4.0, where connected machines do away with buffers between processes and any unscheduled downtime of one machine affects the entire production chain.
This demonstration will walk you through the development of AI models using IoT data for one of the largest metal manufacturing companies in India. It will help you master different types of AI models to answer questions like:
- When do I plan the maintenance of a given equipment?
- Will a component last till the next maintenance cycle or do I replace it during the current maintenance?
- How to identify faulty equipment in the long production line?
Dat Tran / Tanuj Jain - imagededup - Finding duplicate images made easy!
The problem of finding duplicates in an image collection is widespread. Many online businesses rely on image galleries to deliver a good customer experience and consequently, generate more revenue. Hence, the image galleries need to be of the highest quality. Presence of duplicates in such galleries could potentially degrade the customer experience. Additionally, image-based machine learning models could generate misleading results due to the duplicates present in the training/evaluation/test sets.
Therefore, finding and removing duplicates is an important requirement across several use cases. In this talk, we want to present imagededup, a Python package we built to solve the problem of finding exact and near duplicates in an image collection. We will speak about the motivation behind building it, its functionality and also give a demo.
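To give a feel for how hashing-based near-duplicate detection works, here is a minimal sketch of an average hash compared by Hamming distance. It does not use imagededup or reproduce its API. The "images" are tiny hypothetical grayscale grids, whereas a real pipeline would first resize and grayscale actual image files.

```python
def average_hash(pixels):
    """Bit i is 1 when pixel i is brighter than the image's mean intensity."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Hypothetical 2x2 grayscale images (0 = black, 255 = white).
original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]    # same picture, slightly brighter
different = [[200, 10], [30, 220]]   # inverted layout

h_original = average_hash(original)
print(hamming(h_original, average_hash(brighter)))
print(hamming(h_original, average_hash(different)))
```

Images whose hashes differ in only a few bits are treated as near duplicates. The distance threshold is a tuning choice that trades missed duplicates against false matches.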
Anuj Gupta - Data Augmentation for NLP
It is a well known fact that the more data we have, the better performance ML models can achieve. However, getting a large amount of training data annotated is a luxury most practitioners cannot afford. Computer vision has circumvented this via data augmentation techniques and has reaped rich benefits. Can NLP not do the same? In this talk we will look at various techniques available for practitioners to augment data for their NLP application and various bells and whistles around these techniques.
In the area of AI, it is a well-established fact that data beats algorithms, i.e. large amounts of data with a simple algorithm often yield far superior results compared to the best algorithm with little data. This is especially true for deep learning algorithms, which are known to be data guzzlers. Getting data labeled at scale is a luxury most practitioners cannot afford. What does one do in such a scenario?
This is where data augmentation comes into play. Data augmentation is a set of techniques to increase the size of datasets and introduce more variability in the data. This helps to train better and more robust models. Data augmentation is very popular in the area of computer vision. From simple techniques like rotation, translation and adding salt noise to GANs, we have a whole range of techniques to augment images. It is well known that augmentation is one of the key anchors of the success of computer vision models in industrial applications.
Most natural language processing (NLP) projects in industry still suffer from data scarcity. This is where recent advances in data augmentation for NLP can be very helpful. When it comes to NLP, data augmentation is not that straightforward. You want to augment data while keeping the syntactic and semantic properties of the text. In this talk we will take a deep dive into the world of various techniques that are available to practitioners to augment data for NLP. The talk is meant for Data Scientists, NLP engineers, ML engineers and industry leaders working on NLP problems.
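Two of the simplest text augmentation operations can be sketched as follows: synonym replacement and random word swap. The synonym table is a tiny hypothetical stand-in for a real lexical resource, and the abstract does not say which techniques the talk covers, so treat this only as an illustration of the trade-off it describes.

```python
import random

# Hypothetical synonym table; a real system would use a thesaurus or embeddings.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}

def synonym_replace(tokens, rng):
    """Replace every token that has known synonyms with a randomly chosen one."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def random_swap(tokens, rng):
    """Swap two randomly chosen positions; keeps the words but may break syntax."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

rng = random.Random(0)  # seeded so that runs are repeatable
tokens = "the quick dog is happy".split()

replaced = synonym_replace(tokens, rng)
swapped = random_swap(tokens, rng)
print(" ".join(replaced))
print(" ".join(swapped))
```

Synonym replacement tends to preserve both meaning and grammar, while random swaps add variability at the cost of syntactic correctness, which is why text augmentation needs more care than flipping or rotating an image.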
Piyush Makhija / Jaydeep Kulkarni - Normalizing User-Generated Text Data
A large fraction of NLP work in academia and research groups deals with clean datasets that are much more structured and free of noise. However, when it comes to building real-world NLP applications, one often has to collect data from applications such as chats, user-discussion forums, social-media conversations, etc. Invariably, all NLP applications in industrial settings have to deal with much noisier and more varied data: data with spelling mistakes, typos, acronyms, emojis, embedded metadata, etc.
There is a high level of disparity between the data SOTA language models were trained on & the data these models are expected to work on in practice. This renders most commercial NLP applications working with noisy data unable to take advantage of SOTA advances in the field of language computation.
Handcrafting rules and heuristics to correct this data on a large scale might not be a scalable option for most industrial applications. Most SOTA models in NLP are not designed keeping in mind noise in the data. They often give a substandard performance on noisy data.
In this talk, we share our approach, experience, and learnings from designing a robust system to clean noise in data, without handcrafting the rules, using Machine Translation, and effectively making downstream NLP tasks easier to perform.
This work is motivated by our business use case where we are building a conversational system over WhatsApp to screen candidates for blue-collar jobs. Our candidate user base often comes from tier-2 and tier-3 cities of India. Their responses to our conversational bot are mostly a code mix of Hindi and English coupled with non-canonical text (ex: typos, non-standard syntactic constructions, spelling variations, phonetic substitutions, foreign language words in a non-native script, grammatically incorrect text, colloquialisms, abbreviations, etc). The raw text our system gets is far from clean well-formatted text and text normalization becomes a necessity to process it any further.
This talk is meant for computational language researchers, NLP practitioners, ML engineers, data scientists, senior leaders of AI/ML/DS groups and linguists working with non-canonical text in resource-rich and resource-constrained (i.e. vernacular and code-mixed) languages.
Dr. Manjeet Dahiya / Anand Bagmar - Learning Maps from Geospatial Data Captured by Logistics Operations
Logistics operations produce a huge amount of geospatial data and this talk tells how we can use it to create a mapping service such as Google Maps and Here Maps!
E-commerce and logistics operations produce a vast amount of geospatial data while moving and delivering packages. As a logistics company supporting the e-commerce operations in multiple Asian countries, Delhivery produces over 50 million geo-coordinates daily. These geo-coordinates represent the movement of trucks and bikes or delivery events to the given postal addresses. The data has great potential to mine geospatial knowledge, and we demonstrate that a mapping service similar to Google Maps and Here Maps can be automatically built using the same. Specifically, we describe the learning of regional maps (localities, cities, etc) from the addresses labeled with geo-coordinates and the learning of roads from the geo-coordinates associated with movement.
We propose an algorithm to construct polygons and polylines of the map entities given a set of geo-coordinates. The algorithm involves non-parametric spatial probability modelling of the map entities, followed by classification of the cells in a hexagonal grid to the respective map entity. We show that our algorithm is capable of handling noise, which is significantly high in our setting due to various reasons such as scale and device issues. We present a property relating the noise and the correct information under which our algorithm infers the correct map entity. We quantitatively measure the accuracy of our system by comparing its output with the available ground truth. We will showcase some localities that have incorrect polygons in Google Maps, whereas we can learn the correct version from our data with our algorithm. We also discuss multiple applications of the generated maps in the context of e-commerce and logistics operations.
A part of this work was accepted for publication at ACM/SIGAPP Symposium On Applied Computing 2020:
"Learning Locality Maps from Noisy Geospatial Labels. In SAC 2020 at Brno, Czech Republic"
Soham Chakraborty - A Spurious Outlier Detection System For High Frequency Time Series Data
As we are living in the age of IoT, more and more processes use information gathered from well-placed sensors to make better inferences and predictions about their businesses. This sensor data is typically continuous and of enormous volume. Like any other data source, it is contaminated by noise (outliers), which may or may not be preventable. The presence of these outlier points adversely affects the performance of any analytical model. Note that we differentiate between contextual anomalies and noisy outliers; the former are important to us for building predictive models. Here we propose an integrated and scalable approach to detect spurious outliers. The main modules of the proposed system are taken from the literature, but to our knowledge no such concerted approach exists in which an end-to-end robust system is proposed. Even though this method was developed specifically using manufacturing IoT data, it is equally applicable to any domain dealing with time series data, such as CPG, retail, healthcare, agrotech, etc.
Soumya Jain - Unsupervised learning approach for identifying retail store employees using footfall data
Analysis of customer visits (or footfall) in a store, traced via geolocation-enabled devices, helps digital firms understand customers and their buying behavior better. Insights gained through geo footfall analysis help clients and advertisers make informed decisions, choose profitable regions, recognize relevant advertising opportunities and analyze their competitors to increase the success rate. But all this information can be misleading if people who walk past the store without entering, and the staff of the store, are not excluded. Therefore, two groups of people contributing to the footfall at the store can be considered outliers: people passing by the store, and employees of the store. The behavior of these outliers is expected to be different from that of actual customers.
Since the data collected by geofencing the stores and pings from the SDK of the geo-enabled devices do not contribute much in tagging these outliers exclusively, these outliers are not very evident and cannot be removed by extreme value analysis. To tackle this problem we have formulated a multivariate approach to identify and remove these outliers from our source data. As we have no labeled data that marks a footfall as an employee or customer, we are using an unsupervised outlier detection model using the DBSCAN algorithm to provide a coherent and complete dataset with the labeled outliers. In this process, different techniques were taken into consideration to handle the effectiveness of features. Features like time spent by a visitor in and around the stores compared to other locations, monthly visit frequency, daily visit frequency, etc. were dominant in tagging the outliers.
Discovering the structure of the data was another key step in optimizing the parameters of the DBSCAN algorithm for our use case, namely epsilon and minimum points.
Finally, the results were evaluated against those obtained with the k-means algorithm, which showed that DBSCAN has a higher detection rate and a low rate of false positives in discovering outliers for the given problem statement.
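The role of the two DBSCAN parameters can be seen in a compact, pure-Python sketch of the algorithm. The feature values below are hypothetical, pre-scaled numbers (for example, visit frequency and dwell time), not the speakers' data. Points that DBSCAN leaves unclustered are labeled -1 and treated as outlier candidates.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nearby)
    return labels

# Hypothetical scaled features: (visit frequency, dwell time).
visitors = [
    (0.10, 0.12), (0.12, 0.10), (0.11, 0.15), (0.14, 0.13), (0.09, 0.11),  # typical customers
    (0.90, 0.95),                                                          # very frequent, long stays
    (0.50, 0.02),                                                          # frequent, very short stays
]

labels = dbscan(visitors, eps=0.1, min_pts=3)
print(labels)
```

A larger `eps` or a smaller `min_pts` absorbs more points into clusters and flags fewer outliers, which is why both parameters need to be tuned against the structure of the data.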
Amogh Kamat Tarcar - Privacy Preserving Machine Learning Techniques
Privacy-preserving machine learning is an emerging field under active research. The most successful machine learning models today are built by aggregating all data at a central location. While centralised techniques are great, there are plenty of scenarios, such as user privacy, legal concerns, business competitiveness or bandwidth limitations, in which data cannot be aggregated. Federated Learning can help overcome these challenges with its decentralised strategy for building machine learning models. Paired with privacy-preserving techniques such as encryption and differential privacy, Federated Learning presents a promising new way of advancing machine learning solutions.
In this talk, I’ll bring the audience up to speed with the progress in privacy-preserving machine learning, discuss platforms for developing models and present a demo on healthcare use cases.
Ujwala Musku - Supply Path Optimization in Video Advertising Landscape
In the programmatic era, with a lot of players in the market, it is quite complex for a buyer to reach the destination, namely advertising slot from the source, namely publisher. Auction Duplication, internal deals between DSP & SSP, and fraudulent activities are making the existing complex route even more complex day by day. Due to the aforementioned reasons, it is fairly evident that a single impression is being sold through multiple routes by multiple sellers at multiple prices. The new dilemma that has emerged recently is: Which route/path should the buyer choose and what should be the fair price to pay?
In this talk, we will discuss a framework that solves the problem of choosing the best path at the right price in programmatic video advertising. Initially, we will give an overview of all the different approaches tried, i.e., Clustering, Classification Modelling, DEA, and Scoring based on Classification Modelling. Of these, DEA and the Scoring methodology had better results, so a detailed comparison of the results, and of why a particular approach worked better, will be presented. The final framework covers the two techniques that worked best: 1. Data Envelopment Analysis and 2. Scoring based on Classification Modelling. DEA is a non-parametric method used to rank the unlabelled dataset of various supply paths by estimating their relative efficiencies. These efficiencies are calculated by comparing all the possible production frontiers of decision-making units (here, supply paths). As a statistical and machine learning hybrid, the Scoring method calculates a score for each supply path, helping us decide whether a path is worth bidding on.
The results of these models are compared with each other to choose the best one based on campaign KPIs, i.e., CPM (cost per 1000 impressions) and CPCV (cost per completed view of the video ad). A 4-8% improvement in CPM is observed in multiple test video ad campaigns; however, there is a dip in the number of impressions delivered. This is tackled by including impressions as an input in both techniques. These clear improvements in CPM indicate that the technique results in better ROI compared to the heuristic approach. This approach can be used in various sectors like banking (determining credit scores) and retail (supply path optimization in operations).
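The two campaign KPIs are simple ratios, and a short sketch shows how they can rank supply paths differently. The spend, impression and completed-view figures are hypothetical and chosen only to illustrate the arithmetic.

```python
def cpm(spend, impressions):
    """Cost per 1000 impressions."""
    return spend * 1000 / impressions

def cpcv(spend, completed_views):
    """Cost per completed view of the video ad."""
    return spend / completed_views

# Hypothetical delivery figures for two supply paths to the same inventory.
paths = {
    "path_a": {"spend": 500.0, "impressions": 100_000, "completed_views": 20_000},
    "path_b": {"spend": 480.0, "impressions": 100_000, "completed_views": 16_000},
}

kpis = {
    name: {"cpm": cpm(p["spend"], p["impressions"]),
           "cpcv": cpcv(p["spend"], p["completed_views"])}
    for name, p in paths.items()
}

best_by_cpm = min(kpis, key=lambda name: kpis[name]["cpm"])
best_by_cpcv = min(kpis, key=lambda name: kpis[name]["cpcv"])
cpm_improvement = (kpis["path_a"]["cpm"] - kpis["path_b"]["cpm"]) / kpis["path_a"]["cpm"]
print(kpis, best_by_cpm, best_by_cpcv, cpm_improvement)
```

In this made-up case the cheaper path by CPM delivers fewer completed views per unit of spend, so the choice of KPI changes which path looks best.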
Pooja Balusani - Model Interpretability and Explainable AI in Manufacturing
In this talk, we present an industrial use case on “anomaly detection” in steel mills based on IoT sensor data. In large steel mills and manufacturing plants, the top reasons for unplanned downtime are:
• Failure of critical asset
• Quality spec of the end product in line not being met
• Operational limits outside the recommended range (e.g. process, human-safety, equipment-safety, etc.)
Unplanned downtime or line stoppage leads to loss of production or throughput and revenue loss.
Anomaly detection can serve as an early warning system, providing alerts on anomalous behavior that could be detrimental to the equipment health or affect process quality. In this work, we are performing multi-variate anomaly detection on time-series sensor data in a steel mill to help the maintenance engineers and process operators take proactive actions and help reduce plant downtime. Anomaly is presented to the customer in terms of:
• “time-intervals” – startTime: endTime chunks that exhibit deviant behavior
• “anomaly-state” – type association of anomaly to a specific pattern or cluster state
• “anomaly-contribution” – priority association to sensor signals that exhibited deviant behavior within the multi-variate list (more like signal importance)
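The "time-intervals" output can be sketched as follows, assuming an upstream detector has already produced one boolean anomaly flag per timestamp. The function name and sample data are hypothetical and not taken from the talk.

```python
def anomaly_intervals(timestamps, flags):
    """Group consecutive flagged timestamps into (start, end) intervals."""
    intervals = []
    start = None
    prev = None
    for ts, flagged in zip(timestamps, flags):
        if flagged and start is None:
            start = ts                       # a new anomalous run begins
        elif not flagged and start is not None:
            intervals.append((start, prev))  # run ended at the previous timestamp
            start = None
        prev = ts
    if start is not None:                    # run extends to the end of the series
        intervals.append((start, prev))
    return intervals

# Hypothetical per-minute flags from an upstream anomaly detector.
timestamps = list(range(10))
flags = [False, True, True, False, False, True, False, True, True, True]
intervals = anomaly_intervals(timestamps, flags)
```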
We shall introduce the approach, in which we reformulate the unsupervised modeling as a supervised problem so that SHAP, LIME, and other explainability tools can be applied. We shall illustrate the steps that produce the above-mentioned metadata for an anomaly, making it explainable and consumable for the end customer.
Debanjana Banerjee / Sandeep Shetty - CRESST: Complete Rare Event Specification using Stochastic Treatment
In today's fast-moving world, rare events are becoming increasingly common. From incidents of safety hazards to transaction fraud, many problems fall under the umbrella of rare events. Identifying and studying rare events becomes crucially important, particularly when the underlying event relates to a sensitive or adverse issue. Note that even though the probability of occurrence is very close to zero, the potential specification of the rare event can be quite extensive. For example, within the parent rare event of Product Safety, there could be multiple types of potential hazard (Fire, Electrical, Pharmaceutical, etc.), rendering the sub-classes rarer still. In this talk, we will discuss a novel algorithm designed to study a rare event and its sub-classes over time, with a primary focus on forecasting and anomaly detection.
The anomalies studied here are relative anomalies, i.e., they may not contribute to the long-term trend of the rare time series but represent a deviation from the base state seen in the immediate past.
Relevant Sister Classes
Count Time Series
Non-Homogeneous Poisson Process
Discrete Space Optimization
Sparsity Treatment in Text
Dynamic Time Warping
Relative Local Density
Density Based Clustering
Rajesh Shreedhar Bhat / Pranay Dugar - Text Extraction from Images using deep learning techniques
Extracting text of various sizes, shapes and orientations from images containing multiple objects is an important problem in many contexts, especially in e-commerce, augmented reality assistance systems in natural scenes, content moderation on social media platforms, etc. Text from an image can be a richer and more accurate source of data than human input, and can be used in several applications such as Attribute Extraction, Profanity Checks, etc.
Typically, text extraction is achieved in two stages:
Text detection: this module helps to know the regions in the input image where the text is present.
Text recognition: given the regions in the image where the text is present, this module gives the raw text out of it.
In this session, I will talk about character-level text detection for detecting normal and arbitrarily shaped text. I will then discuss the CRNN-CTC network and the need for CTC loss to obtain the raw text from the images.
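To illustrate why CTC relies on a blank symbol, here is a minimal sketch of greedy (best-path) CTC decoding. This is one common decoding scheme and not necessarily the one used in the talk. It assumes the recognition network has already produced one predicted label per time step; the `BLANK` marker and the sample frame sequence are illustrative.

```python
BLANK = "-"  # illustrative blank symbol

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse runs of repeated labels, then drop blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return "".join(decoded)

# Per-frame predictions for a hypothetical word image.
frames = list("hh-e-ll-lo--")
text = ctc_greedy_decode(frames)
```

The blank between the two runs of "l" is what lets the decoder keep a genuine double letter while still collapsing repeated frames of the same character.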
Ashay Tamhane - Food Recommendation at Swiggy
Do you like to explore new dishes every time you order food? Or do you stick to your usual favourites? Do you like a Raita with your Biryani, or just a soft drink? Do you prefer veg or non veg? Do these preferences change depending on time of day or day of the week?
These are among the several questions that need to be automatically inferred from data while building a food recommendation engine at scale. From ranking dishes on the menu to suggesting complementary dishes in the cart, Data Science is at the very core of constantly improving our suggestions. In this talk, we will dive into multiple scenarios that arise while building a food recommendation system, including handling data sparsity and new restaurants and users, among other challenges that crop up due to scale. We will look at different algorithms that need to be explored in order to handle these challenges effectively.
Aravind Kondamudi / Sandeep Shetty / Upasana Roy Chowdhury - AI in Manufacturing - Improving Process using Prescriptive Analytics
With the rise of Industry 4.0, computation power, data warehousing and automation, factories have become increasingly intelligent. Preventive maintenance of machines and failure prediction are now common. AI has also empowered planning and logistics, where the quantity of items to be manufactured and the timing of production are decided from the outputs of ML models. Manufacturers are now increasingly focused on improving process quality and throughput through sustainable methods, as global warming is a growing concern. To improve efficiency and make the process sustainable, Machine Learning models coupled with optimization are used for Prescriptive Analytics. Industrial process data is often very large, with many process and control variables involved. Understanding the variables requires domain expertise coupled with feature engineering techniques. A search-based optimization can be used to find the Pareto optimal solution, with the objectives of maximizing the KPI and finding support in historical data. Interaction effects are identified by learning from the data through a prediction model, and the performance after the process is predicted by modelling the KPI. Sensitivity analysis is conducted to understand the effect of the variables on the uncertainty of the model output and the KPI. The process, once optimized for maximum throughput, yields prescriptive analytics that improve performance and reduce energy consumption.
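The Pareto-optimal search mentioned above can be sketched as a simple non-dominated filter. The sketch assumes two objectives, both to be maximized: a predicted KPI and a support score (for example, the number of similar runs in historical data). The candidate settings and their values are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the names of candidates not dominated by any other candidate."""
    front = []
    for name, score in candidates.items():
        others = (s for n, s in candidates.items() if n != name)
        if not any(dominates(other, score) for other in others):
            front.append(name)
    return front

# Hypothetical process settings: (predicted KPI, support in historical data).
candidates = {
    "setting_1": (0.90, 5),
    "setting_2": (0.85, 40),
    "setting_3": (0.80, 30),
    "setting_4": (0.70, 60),
}
front = pareto_front(candidates)
```

Here `setting_3` is dropped because `setting_2` is better on both objectives; the remaining settings represent different trade-offs between predicted KPI and historical support.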
Sandeep Shetty - Portfolio Valuation for a Retail Bank using Monte Carlo Simulation and Forecasting for Risk Measurement
Banks today need a very good assessment of their portfolio value at any point in time. This is both a regulatory requirement and an operational metric that helps banks assess the risk of their portfolio and calculate the Capital Adequacy they need to maintain at portfolio and product levels, aggregated at bank level.
This presentation will walk you through a case study that discusses in detail how we calculated the portfolio value for a home loan portfolio on sample data. The bank wanted a scientific/statistical approach so that it could take the results to regulators for approval and convince them about the capital it holds for a particular portfolio.
The other interesting dimension was that, if the bank wants to sell a particular loan book to another bank or a third-party financial institution, it would be able to quote a price within the confidence interval of the calculated price. The same model/tool could also be shared with the buyer to justify the quoted price and make negotiation and selling smoother.
We used Monte Carlo Simulation on historical data of the portfolio to measure the portfolio value of a home loan portfolio for the next 5 years. It is a two-step modeling process: Machine Learning models predict default, and simulation is then used to calculate the portfolio value year on year for the next 5 years, also taking diminishing returns into account.
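A minimal sketch of the two-step idea, under strong simplifying assumptions: each loan has a fixed annual default probability (standing in for the output of a default-prediction model), a default means all remaining payments are lost, and future cash flows are discounted at a flat rate. The discounting, the function name, and all numbers are assumptions for illustration, not details from the case study.

```python
import random
import statistics

def simulate_portfolio_value(loans, years=5, discount_rate=0.08, n_sims=2000, seed=42):
    """Simulate discounted portfolio cash flows over `years` for `n_sims` scenarios."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_sims):
        total = 0.0
        for annual_payment, p_default in loans:
            for year in range(1, years + 1):
                if rng.random() < p_default:
                    break  # loan defaults; no further payments in this scenario
                total += annual_payment / (1 + discount_rate) ** year
        values.append(total)
    return values

# Hypothetical loans: (annual payment, annual default probability).
loans = [(100.0, 0.02), (250.0, 0.05), (80.0, 0.10)]
values = simulate_portfolio_value(loans)

mean_value = statistics.mean(values)
values_sorted = sorted(values)
# Empirical 95% interval from the simulated distribution.
low = values_sorted[int(0.025 * len(values))]
high = values_sorted[int(0.975 * len(values)) - 1]
```

The interval `(low, high)` plays the role of the price range discussed above: a quoted price can be checked against the spread of simulated outcomes rather than against a single point estimate.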
The presentation will take you through the approach and modeling process, and show how Monte Carlo Simulation helped us deliver this to the customer with high accuracy and confidence.
This is a real case study and will focus on why Risk Measurement is important and why Basel and CCAR implementation across banks worldwide helps Central Banks manage risk in case of a financial downturn or Black Swan events.