Alexis Networks
Understanding Anomaly Detection

Wednesday, July 8, 2020

How insurers can use big data to manage the COVID-19 pandemic

Caught between higher demand and tightening supply, insurers are looking to technology to compute adequate rates and come up with new types of coverage. 





Insurers are stuck between a rock and a hard place.

The recent COVID-19 pandemic has forced much of the world to grind to a halt. As companies struggle to manage the impacts on their supply chains and bottom line, greater pressure is placed on insurers to support their clients while also mitigating the downside for their own businesses.

On one side, the needs of insureds are growing: business operations are global and more complex, and clients expect more bespoke service and support.

At the same time, markets are hardening due to catastrophe loss experience and higher reinsurance costs. Even before the novel coronavirus outbreak, insurers were struggling to keep up with increased severity, frequency, and uncertainty of natural catastrophes and other emerging risks. 

Caught between higher demands and tightening supply, insurers have begun looking to next-generation technologies to compute adequate rates and come up with new types of coverage such as parametric insurance while also ensuring their own underwriting profitability. 

Big data is one area drawing greater attention, and big data analytics can improve strategic decision making. Yet, it isn’t as simple as collecting vast amounts of information. To take advantage of big data technology, insurance managers and underwriters need to first understand how to leverage its power in supporting their specific needs. 

Personalize and optimize

According to a 2019 Deloitte survey, only 40% of organizations utilize big data and analytics, even though more than two-thirds of respondents believed the emerging technology would dramatically increase operational efficiency and enhance risk analysis and detection. 

Big data analytics allows large and small insurers to see the big picture and enhance underwriting capabilities by providing a faster and deeper understanding of risk exposure. By incorporating data from internal and external sources, as well as real-time monitoring of catastrophic impacts, infection rates and prevention methods, insurers are able to compute adequate premium rates for their clients and allow real-time risk assurance. 

As P&C premium rates continue to rise, case-by-case analysis adds a valuable layer to renewal conversations with clients who have unique expectations and demands. Insurers can understand their clients’ loss history and the status of current claims, differentiate their risk against the total portfolio, and develop a high-quality underwriting submission tailored to industry trends and challenges. 

Close the gap

Insurers are facing pressure from all sides to close the protection gap between insured and economic losses. This gap has continued to widen as the scale of serious natural catastrophes has increased and black swan events like COVID-19 have emerged, challenging us to react to an unprecedented situation and to understand the potential impacts across several dimensions: health, economic and financial. 

Truly understanding current risks is vital to narrow the gap, and as data becomes increasingly open and transparent, insurers that use technologies to leverage that information can identify opportunities and mitigate damages. 

A critical benefit of big data analytics is in improving the accuracy of loss projections and producing real-time insights of events as they occur. Climate risk is constantly evolving, and past observations are not necessarily representative of the potential damages of today’s risk. Modern tools and platforms open up possibilities to maneuver through the complexity that defines the new world of insurance, especially during a pandemic when information is coming from a multitude of sources. 

Accurate and timely projections have cascading effects and material impacts on reserves, renewal premiums and T&Cs. Investors, rating agencies and reinsurers look for risk resilience and strong mitigation processes. Therefore, insurers that invest in big data analytics tools can showcase their ability to pay future claims and demonstrate the capacity to responsibly underwrite risk. 

Halt the DRIP

The number of companies using big data analytics is projected to more than double in the coming years. Staying ahead of the curve is absolutely essential for insurers to ensure longevity. 

As the author and business management guru Robert Waterman observed 35 years ago, companies are “data-rich and information poor.” In spite of (or as a result of) having more access to data, business leaders still struggle with incorporating data into their business strategies. Top management can often rely excessively on experience over data due to frustrating data silos, poor integrations with existing systems, or outdated technologies. 

The “Data-Rich, Information Poor” (DRIP) status must change as the insurance world heads into the new decade. More attention should be placed on making sense of the immense amount of public and private data insurers have access to, and insurance companies that fall behind on innovation will inevitably see their own demise. Companies that halt the DRIP and make the financial, social, and cultural investments to become information-rich will be capable of adapting to and taking advantage of the new normal. 

Moving forward

In a global and increasingly interconnected world, risks do not spread in isolation; their speed and impact are magnified because they are intertwined. This trend will not reverse, so it is on everyone in the insurance industry to adapt. We already fear the ripple effects of the current pandemic on business continuity, supply chains, and public health. As the world continues to face greater levels of risk, big data analytics technology is, and will remain, crucial in helping insurance companies gain actionable insights as the volume of external risk data explodes; in monitoring and mitigating large-scale events as they unfold; and in developing real-time business intelligence on emerging risks. 



Wednesday, June 17, 2020

Anomaly Detection by Aashrith Sai

 

Aashrith Sai


 
About Myself
I am currently pursuing my master's degree with a keen focus on deep learning and data science. I am passionate about data at every stage of the data life cycle in a project. I have a strong understanding of several classification and regression algorithms, along with a comprehensive grasp of the data science project pipeline, feature engineering, data visualization, data modeling, and statistics. I am currently working as a Deep Learning Intern at Alexis Networks, developing AI solutions that detect human anomalies by tracking people's actions. 

LinkedIn | Email Me


Anomaly detection by Aashrith Sai
To me, an anomaly is anything that deviates from what is expected, normal, or standard. In every organization or business, everything is planned to meet the expectations of clients or consumers. In all these scenarios, one must be aware of anomalies to successfully execute the plan. The more we know about the probability of an event happening in the future, the better prepared we can be in the present to handle that situation.


Real-Time Scenario: Let us look at an example of a grocery store.

Fig 1: Every weekend there are a lot of customers.

Fig 2: The weekend when there was a thunderstorm.

Because the number of customers is very high on weekends, the grocery store owner hires more staff and purchases more goods from vendors for the weekend. The owner would have saved a lot of money if he had known in advance that a thunderstorm would affect his sales. In this case, the thunderstorm is an anomaly. A lot of resources can be saved, and plans can be executed more efficiently, if we have information about anomalies in the past, present, and future.
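
To make the idea concrete, here is a minimal sketch of one simple way to flag such an anomaly with a z-score test; the weekend customer counts below are made-up numbers for the grocery store example, not real data.

```python
import numpy as np

# Hypothetical weekend customer counts; the last weekend is the thunderstorm.
weekend_customers = np.array([420, 455, 430, 470, 445, 460, 140])

mean = weekend_customers.mean()
std = weekend_customers.std()
z_scores = (weekend_customers - mean) / std

# Flag weekends that deviate from the average by more than 2 standard deviations.
anomalous_weekends = np.where(np.abs(z_scores) > 2)[0]
print("Anomalous weekend index:", anomalous_weekends)  # -> [6], the thunderstorm weekend
```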


Internship experience at Alexis:

Alexis Networks is the first company I have worked with in the US. Over the last few weeks, I have been given the opportunity to explore many new technologies in machine learning, the cloud, and other areas that are essential for product development. Through this opportunity, I have discovered a new interest in AWS, which is driving me to work toward certification as an AWS machine learning developer. I am thoroughly enjoying the freedom I have been given to explore new paths and have a complete learning experience.
My excitement about new projects, products, and tasks has reaffirmed that I am headed in the right direction, exploring technology and building the skill set I need to succeed in my career. 


What have you learned thus far compared to what you initially expected?

I have always wanted to work on products that impact people and make their lives easier and more comfortable. At Alexis, I am working on exactly what I wanted to do. I wanted hands-on experience with deep learning, and in the first few weeks of my internship I gained it by building models to detect human anomalies. I have been working with and learning more about the AWS cloud for machine learning and deep learning applications. I am learning how to package and deliver products to clients, from development through deployment. I am really enjoying this process and never expected that this opportunity would open up so many paths to learn and explore.

My cough and sneeze detection model:



What has your experience been like during COVID19 and how has it impacted you?

It is always good to have people around you to discuss new ideas and share experiences with. I am glad that even though this situation is keeping us all at home, we can still connect with each other and work through blockers using social media and virtual meetings. I am really happy with how available everyone is and how they are contributing to making my internship experience complete.



Wednesday, October 30, 2019

Discovering the Power of AWS Glue in Large Scale Data Analysis

The Challenge
A modern technology-driven business is awash in mountains of sorted and unsorted data that can be challenging to transform into something that allows for deep analysis. From a development perspective, defining, building, and normalizing or cataloging data in a way that allows for easier analysis is a daunting task, especially if new data sets are identified as needing analysis.  
Developers or database administrators have to create processes to extract the information from whichever repository it is stored in, transform it into a common set of data types or even fields, and ultimately load that information in an indexed format to a location that can then be accessed by business intelligence toolsets. 
It is this Extract/Transform/Load (ETL) process that is typically the bottleneck in the overall analysis process. Besides the need to create a new process each time a new dataset is identified, the need to set up routines that run these ETL processes at set times, or when certain conditions are met, can be a maintenance nightmare.  
The Benefits of AWS Glue
With the introduction by Amazon Web Services (AWS) of a service called AWS Glue, this formerly painstaking task has been largely eliminated. By integrating closely with other key AWS services, such as DynamoDB and the various RDS database engines, Glue allows an organization to simply point to the location where the raw data resides; Glue then takes care of extracting, transforming, and loading the data into a format that data analysis tools can access.
Moreover, since AWS Glue provides a serverless environment where clients only pay for the resources they use at the time Glue is invoked to process data, and by generating ETL code that is customizable, reusable and portable, it gives developers in an organization more freedom to sculpt processes that suit each particular need.
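As an illustration of what "pointing Glue at the raw data" can look like in practice, here is a minimal, hypothetical sketch that creates and starts a Glue crawler over an S3 prefix using boto3, so the data is cataloged before any ETL runs; the crawler name, IAM role ARN, database, and bucket path are placeholders, not values from any client environment.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans the raw S3 prefix and registers tables in the Data Catalog.
glue.create_crawler(
    Name="raw-customers-crawler",                                     # placeholder name
    Role="arn:aws:iam::123456789012:role/ExampleGlueServiceRole",     # placeholder role ARN
    DatabaseName="customer_db",                                       # placeholder catalog database
    Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/customers/"}]},
)

# Run the crawler once; a schedule could also be attached for recurring crawls.
glue.start_crawler(Name="raw-customers-crawler")
```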
How Alexis Networks Applied AWS Glue to Two Different Use Cases
Alexis Networks, an AWS Certified Premier Partner, was asked by two clients how they could best utilize AWS Glue to solve ongoing challenges within their organizations. Even though each client had very different concerns when attempting to aggregate their data sources, Alexis Networks was able to apply AWS Glue in a very similar manner to overcome both sets of challenges.
Client 1: How to Manage and Aggregate Widely Different Data Sources
The first client needed to compile a large amount of data from numerous data sources into one location so that it could be analyzed. Their challenge was that they were not able to make accurate predictions from their customer data. For them, the complexity of the data meant that a significant amount of resources was needed to cleanse it, and they struggled to find ways to bring the data together in a fashion that was easily understood.
With AWS Glue, Alexis Networks was able to define custom scripts that could identify all of the various data sources the client relied on and bring them together, cleanse and transform the data, enrich the datasets, and then partition them into Hive tables for later use. This meant that the complexity of the data sources no longer stood as a barrier for the client, and it ultimately led to a clearer understanding of customer predictability.
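The script below is a simplified, hypothetical sketch of this kind of Glue ETL job, not the client's actual pipeline; the catalog database, table, column mappings, and S3 path are invented for illustration.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a crawled source table from the Glue Data Catalog (placeholder names).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="customer_db", table_name="raw_customers"
)

# Cleanse and transform: rename and cast fields, then drop all-null columns.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("cust_id", "string", "customer_id", "string"),
        ("signup_dt", "string", "signup_date", "date"),
        ("ltv", "double", "lifetime_value", "double"),
    ],
)
cleaned = DropNullFields.apply(frame=mapped)

# Write partitioned Parquet to S3, where it can back Hive/Athena tables.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": "s3://example-curated-bucket/customers/",  # placeholder bucket
        "partitionKeys": ["signup_date"],
    },
    format="parquet",
)
job.commit()
```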
Client 2: Extreme Load Latency Means Greater Costs and Inefficiencies
The second client had challenges ingesting all of the data they needed. With only two instances of their data available at any point in time, a bottleneck occurred within RDS each time they needed to consolidate the data into a centralized location. This meant that hours were spent aggregating the data, which in turn impacted other reporting activities and ultimately led to higher operational costs. They were interested in finding ways to speed the process up and reduce overall costs.
Alexis Networks implemented AWS Glue for this client as well. Because of its inherently distributed nature within the AWS service ecosystem, Glue was able to take the client’s large volume of data from disparate sources and load it within minutes instead of hours. Because AWS only charges for the time that Glue runs, and because the service’s low cost is built into the pricing model, the client realized an immediate reduction in overall operational costs while streamlining the data ingestion process.
Architecture Diagram (For Client 1 and Client 2)
Alexis Networks Charts the Path
As with any AWS service, not every service fits every client’s need. Depending on the complexity and specific needs of the client, Alexis Networks can assess the situation and determine which approach and which services are in the client’s best interest for long-term success.
With AWS Glue, the implementations Alexis Networks executed showed that both clients were able to recognize the immediate benefit of the solution. Delivering Glue meant very low costs for ongoing operations, high data ingestion speeds because of the distributed nature of the AWS landscape, and consistent cleansing and transformation paradigms that were reproducible using customizable ETL code which was reusable going forward. Leveraging Alexis Networks as a partner in your own company’s journey means that the outcome will benefit your organization, your infrastructure, and your customers for years to come.

Basics of Spark Performance Tuning & Introducing SparkLens

Apache Spark has colossal importance in the big data field, and unless one has been living under a rock, every big data professional has likely used Spark for data processing. Spark can sometimes appear to be a beast that is difficult to tame in terms of configuration and query tuning. Mistakes such as long-running operations, inefficient queries, and excessive concurrency will, individually or collectively, hurt a Spark job.
Before going further, let's go through some basic Spark terminology:
Executor: An executor is a single JVM process which is launched for an application on a worker node. Executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A single node can run multiple executors, and executors for an application can span multiple worker nodes.
Task: A task is a unit of work that can be run on a partition of a distributed dataset, and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks within a single stage can be executed in parallel.
Partitions: A partition is a small chunk of a large distributed data set. Spark manages data using partitions that help parallelize data processing with minimal data shuffle across the executors.
Cores: A core is a basic computation unit of CPU and a CPU may have one or more cores to perform tasks at a given time. The more cores we have, the more work we can do. In Spark, this controls the number of parallel tasks an executor can run.
Cluster Manager: An external service for acquiring resources on the cluster (e.g., the standalone manager, Mesos, or YARN). Spark is agnostic to the cluster manager as long as it can acquire executor processes and those processes can communicate with each other. 
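
To tie these terms together, here is a minimal PySpark sketch of how executors, cores, memory, and the cluster manager are typically specified when creating a session; the numbers are illustrative assumptions, not tuning recommendations, and the example assumes a YARN-managed cluster is available.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resource-sizing-sketch")
    .master("yarn")                                # cluster manager (use local[*] for local tests)
    .config("spark.executor.instances", "4")       # number of executors
    .config("spark.executor.cores", "4")           # parallel tasks each executor can run
    .config("spark.executor.memory", "8g")         # memory per executor JVM
    .config("spark.sql.shuffle.partitions", "64")  # partitions created after shuffles
    .getOrCreate()
)
```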


A few steps to improve the efficiency of Spark jobs:

Use the cache: Spark provides its own native caching mechanisms, which can be used through methods such as .persist(), .cache(), and CACHE TABLE (illustrated in the sketch after the list below). This native caching is effective with small data sets as well as in ETL pipelines where you need to cache intermediate results. However, Spark native caching currently does not work well with partitioning, since a cached table does not retain the partitioning data. A more generic and reliable technique is caching at the storage layer.
Use memory efficiently: Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster’s memory efficiently.
  • Prefer smaller data partitions and account for data size, types, and distribution in your partitioning strategy.
  • Consider the newer, more efficient Kryo data serialization, rather than the default Java serialization.
  • Prefer using YARN, as it separates Spark-submit by batch.
To address ‘out of memory’ messages, try:
  • Review how the DAG manages shuffles: reduce shuffle volume with map-side reductions, pre-partition (or bucketize) source data, maximize single shuffles, and reduce the amount of data sent.
  • Prefer ReduceByKey, which has a fixed memory limit, over GroupByKey, which provides aggregations, windowing, and other functions but has an unbounded memory limit (see the sketch after this list).
  • Prefer TreeReduce, which does more work on the executors or partitions, to Reduce, which does all work on the driver.
  • Leverage DataFrames rather than the lower-level RDD objects.
  • Create ComplexTypes that encapsulate actions, such as “Top N”, various aggregations, or windowing operations.
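
The sketch below, referenced in the items above, illustrates three of these points in PySpark: caching an intermediate result, switching to Kryo serialization, and preferring reduceByKey over groupByKey. The session name and sample data are hypothetical.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Kryo is generally faster and more compact than the default Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
sc = spark.sparkContext

# Cache an intermediate DataFrame that several downstream queries reuse.
events = spark.createDataFrame(
    [("ACTIVE", "east"), ("ACTIVE", "west"), ("CLOSED", "east")],
    ["status", "region"],
)
active = events.filter("status = 'ACTIVE'")
active.persist(StorageLevel.MEMORY_AND_DISK)   # .cache() or CACHE TABLE would also work
active.count()                                 # an action materializes the cache
by_region = active.groupBy("region").count()   # reuses the cached data

# reduceByKey combines values on each partition before the shuffle (map-side
# reduce), so far less data crosses the network than with groupByKey.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
sums = pairs.reduceByKey(lambda x, y: x + y)        # preferred
grouped_sums = pairs.groupByKey().mapValues(sum)    # avoid for simple aggregations
print(sums.collect(), grouped_sums.collect())
```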

What is SparkLens?

SparkLens is a profiling and performance prediction tool for Spark with a built-in Spark Scheduler simulator. Its primary goal is to make it easy to understand the scalability limits of Spark applications. It helps in understanding how efficiently a given Spark application uses the compute resources provided to it. Maybe your application will run faster with more executors, and maybe it won't. SparkLens can answer this question by looking at a single run of your application.
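As a rough sketch of how SparkLens can be attached to a PySpark run: the package coordinates, version, and repository below are assumptions based on the SparkLens documentation and may differ for your Spark/Scala version.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparklens-sketch")
    # Resolve the SparkLens package from the spark-packages repository (assumed coordinates).
    .config("spark.jars.repositories", "https://repos.spark-packages.org")
    .config("spark.jars.packages", "qubole:sparklens:0.3.2-s_2.11")
    # Register the SparkLens listener so it can collect task-level metrics.
    .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
    .getOrCreate()
)

# Run the application as usual; SparkLens prints its report (driver vs. executor
# time, ideal application time, per-stage metrics) when the application finishes.
```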
Efficiency Statistics by SparkLens
The total Spark application wall clock time can be divided into time spent in the driver and time spent in the executors. When a Spark application spends too much time in the driver, it wastes the executors’ compute time. Executors can also waste compute time because of a lack of tasks or because of skew. Finally, the critical path time is the minimum time the application will take even if it is given infinite executors. The ideal application time is computed by assuming ideal partitioning (tasks == cores and no skew) of the data in all stages of the application. 

Fig 1: The chart depicts the time taken for the job execution on the driver and on the executors

Fig 2: This chart lays out the actual time and ideal time comparison for a Spark job with resources available

Simulation data by SparkLens
Using the fine-grained task-level data and the relationships between stages, SparkLens can simulate how the application will behave when the number of executors is changed. Specifically, SparkLens will predict wall clock time and cluster utilization. Note that cluster utilization is not cluster CPU utilization; it only means that some task was scheduled on a core. CPU utilization further depends on whether the task is CPU-bound or IO-bound.
Ideal Executors Data by SparkLens
If autoscaling or dynamic allocation is enabled, we can see how many executors were available at any given time. SparkLens plots the executors used by the different Spark jobs within the application, along with the minimal (ideal) number of executors that could have finished the same work in the same amount of wall clock time.
The four important metrics printed per stage by SparkLens are:
• PRatio: Parallelism in the stage. The higher, the better.
• TaskSkew: Skew in the stage, specifically the ratio of the largest to the median task time. The lower, the better.
• StageSkew: Ratio of the largest task to the total stage time. The lower, the better.
• OIRatio: Output-to-input ratio in the stage.
This has been a short guide to point out the main concerns one should be aware of, when tuning a Spark application – most importantly, data serialization and memory tuning. We hope it was informative!
References: Microsoft Blog, SparkLens Documentation 



