Alexis Networks

Alexis Networks
Understanding Anomaly Detection

Wednesday, October 30, 2019

Discovering the Power of AWS Glue in Large Scale Data Analysis

The Challenge
A modern technology-driven business is awash in mountains of sorted and unsorted data that can be challenging to transform into something that allows for deep analysis. From a development perspective, defining, building, and normalizing or cataloging data in a way that allows for easier analysis is a daunting task, especially if new data sets are identified as needing analysis.  
Developers or database administrators have to create processes to extract the information from whichever repository it is stored in, transform it into a common set of data types or even fields, and ultimately load that information in an indexed format to a location that can then be accessed by business intelligence toolsets. 
It is this Extract/Transform/Load (ETL) process that typically is a bottleneck in the overall analysis process. Besides the need to create a new process each time a new dataset is identified, the need to set up routines to run these ELT processes at set times when certain conditions are met can be a maintenance nightmare.  
The Benefits of AWS Glue
With the introduction by Amazon Web Services (AWS)  of a service called AWS Glue, this formerly painstaking task has been eliminated.  By integrating closely with other key AWS services, such as DynamoDB and other RDS database interfaces, Glue allows an organization to simply point to the location where the raw data resides and Glue will take care of the extraction, transformation and loading of the data into a format whereby data analysis tools can access it.
Moreover, since AWS Glue provides a serverless environment where clients only pay for the resources they use at the time Glue is invoked to process data, and by generating ETL code that is customizable, reusable and portable, it gives developers in an organization more freedom to sculpt processes that suit each particular need.
How Alexis Networks Applied AWS Glue to Two Different Use Cases
Alexis Networks, an AWS Certified Premier Partner, was asked by two clients on ways they could best utilize AWS Glue to solve ongoing challenges they were experiencing within their organizations.  Even though each client had very different concerns when attempting to aggregate their data sources together, Alexis Networks was able to utilize AWS Glue in a very similar manner to overcome their challenges.
Client 1: How to Manage and Aggregate Widely Different Data Sources
The first client needed to compile a lengthy amount of data from numerous data sources into one location so that it could be analyzed.  Their challenge was that they were not able to accurately predict decisions based on the customer data. For them, the complexity of the data meant a significant amount of resources were needed to cleanse the data and they struggled to find ways to bring the data together in a fashion that was easily understood.
With AWS Glue, Alexis Networks was able to define custom scripts that could identify all of the various data sources the client was reliant on and bring them together, cleanse and transform the data, enrich the datasets, and then divide them into Hive tables for use later on.  This meant that the complexity of the data sources no longer stood as a barrier for the client and ultimately led to a clearer understanding of customer predictability.
Client 2: Extreme Load Latency Means Greater Costs and Inefficiencies
The second client had challenges in ingesting all of the data they needed to.  With only two instances of their data available at any point in time, a bottleneck occurred within RDS each time they needed to consolidate the data into a centralized location.  This in turn meant that hours would be spent as the data was aggregated together which meant that other activities around reporting were impacted, ultimately leading to higher operational costs.  They were interested in finding ways to speed the process up and reduce overall costs.
Alexis Networks implemented AWS Glue for this client as well.  Because of its inherent distributed nature within the AWS Service ecosystem, Glue was able to take all of the client’s large volume of data from disparate sources and load them within minutes instead of hours.  Because AWS only charges for the time that Glue operates, and because the low cost for utilizing the service is built into the pricing model, the client was able to recognize an immediate reduction in overall operational costs while streamlining the data ingestion process.
Architecture Diagram (For Client 1 and Client 2)
alt
Alexis Networks Charts the Path
Like any of the AWS Services, not every service fits every client’s need.  Depending on the complexity and specific needs of the client, Alexis Networks can assess the need and determine which approach and which services are in the best interest of the client for long-term success.
With AWS Glue, the implementations Alexis Networks executed showed that both clients were able to recognize the immediate benefit of the solution. Delivering Glue meant very low costs for ongoing operations, high data ingestion speeds because of the distributed nature of the AWS landscape, and consistent cleansing and transformation paradigms that were reproducible using customizable ETL code which was reusable going forward. Leveraging Alexis Networks as a partner in your own company’s journey means that the outcome will benefit your organization, your infrastructure, and your customers for years to come.

Basics of Spark Performance Tuning & Introducing SparkLens

Apache Spark has a colossal importance in the Big Data field and unless one is living under a rock, every Big Data professional might have used Spark for data processing. Spark may sometimes appear to be a beast that’s difficult to tame, in terms of configuration and tuning queries. Mistakes like long running operations, inefficient queries, high concurrency will individually or collectively affect the Spark job.
Before going further let’s go through some basic jargons of Spark:
Executor: An executor is a single JVM process which is launched for an application on a worker node. Executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A single node can run multiple executors, and executors for an application can span multiple worker nodes.
Task: A task is a unit of work that can be run on a partition of a distributed dataset, and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks within a single stage can be executed in parallel.
Partitions: A partition is a small chunk of a large distributed data set. Spark manages data using partitions that help parallelize data processing with minimal data shuffle across the executors.
Cores: A core is a basic computation unit of CPU and a CPU may have one or more cores to perform tasks at a given time. The more cores we have, the more work we can do. In Spark, this controls the number of parallel tasks an executor can run.
Cluster Manager: An external service for acquiring resources on cluster (e.g.
standalone manager, Mesos, YARN). Spark is agnostic to a cluster manager as long as it can acquire executor processes, and they can communicate. 


Few steps to improve the efficiency of Spark Jobs:

Use the cache: Spark provides its own native caching mechanisms, which can be used through different methods such as .persist(), .cache(), and CACHE TABLE. This native caching is effective with small data sets as well as in ETL pipelines, where you need to cache intermediate results. However, Spark native caching currently does not work well with partitioning, since a cached table does not retain the partitioning data. A more generic and reliable caching technique is the storage layer caching.
Use memory efficiently: Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster’s memory efficiently.
  • Prefer smaller data partitions and account for data size, types, and distribution in your partitioning strategy.
  • Consider the newer, more efficient Kryo data serialization, rather than the default Java serialization.
  • Prefer using YARN, as it separates Spark-submit by batch.
To address ‘out of memory’ messages, try:
  • Review DAG Management Shuffles. Reduce by map-side reducing, pre-partition (or bucketize) source data, maximize single shuffles, and reduce the amount of data sent.
  • Prefer ReduceByKey with its fixed memory limit to GroupByKey, which provides
    aggregations, windowing, and other functions but it has an unbounded memory limit.
  • Prefer TreeReduce, which does more work on the executors or partitions, to Reduce, which does all work on the driver.
  • Leverage DataFrames rather than the lower-level RDD objects.
  • Create ComplexTypes that encapsulate actions, such as “Top N”, various aggregations, or windowing operations.

What is SparkLens?

SparkLens is a profiling and performance prediction tool for Spark with built-in Spark Scheduler simulator. Its primary goal is to make it easy to understand the scalability limits of Spark applications. It helps in understanding how efficiently a
given Spark application uses the compute resources provided. Maybe your application will run faster with more executors, and maybe it won’t. SparkLens can answer this question by looking at a single run of your application.
Efficiency Statistics by SparkLens
The total Spark application wall clock time can be divided into time spent in driver and time spent in executors. When a Spark application spends too much time in the driver, it wastes the executors’ compute time. Executors can also waste compute time, because of lack of tasks or skew. And finally, critical path time is the minimum time that this application will take even if we give it infinite executors. Ideal application time is computed by assuming ideal partitioning (tasks == cores and no skew) of data in all stages of the application. 

Fig 1: The chart depicts the time taken for the job execution on the driver and on the executors

Fig 2: This chart lays out the actual time and ideal time comparison for a Spark job with resources available

Simulation data by SparkLens
Using the fine-grained task level data and the relationship between stages, SparkLens can simulate how the application will behave when the number of executors is changed. Specifically, SparkLens will predict wall clock time and
cluster utilization. Note that cluster utilization is not cluster cpu utilization. It only means some task was scheduled on a core. The cpu utilization will further depend upon if the task is cpu bound or IO bound.
Ideal Executors Data by SparkLens
If autoscaling or dynamic allocation is enabled, we can see how many executors were available at any given time. SparkLens plots the executors used by different Spark jobs within the application, and what is the minimal number of executors (ideal) which could have finished the same work in the same amount of wall clock time.
The four important metrics printed per stage by SparkLens are:
• PRatio: Parallelism in the stage. The higher, the better.
• TaskSkew: Skew in the stage, specifically the ratio of largest to medial task times.
The lower, the better.
• StageSkew: Ratio of largest task to total stage time. The lower, the better.
• OIRatio: Output to input ratio in the stage.
This has been a short guide to point out the main concerns one should be aware of, when tuning a Spark application – most importantly, data serialization and memory tuning. We hope it was informative!
References: Microsoft Blog, SparkLens Documentation 




How insurers can use big data to manage the COVID-19 pandemic

Caught between higher demand and tightening supply, insurers are looking to technology to compute adequate rates and come up with new types ...