XN Project for GE Aviation

ALY6080

Danlin Shen

Northeastern University

February 12, 2020

Danlin Shen

Data Preview:

Let’s take a overall look from the four GE Aviation files:
Table heat_scores:
SHARED_INDICATION_NAME = SHARED_INDICATOR_APPLICATION + SHARED_INDICATOR_TYPE + SHARED_INDICATOR_ACTIVITY + SHARED_INDICATOR_SUFFIX

1
SELECT * FROM heat_scores LIMIT 10 \G

Table obfuscated_demo_data has a description file Data Dictionary - Demographics. Expect the following columns:

  • DEPARTMENT_NAME
  • JOB_TITLE
  • LEGAL_ENTITY_NAME
  • DIVISION_NAME
  • BUSINESS_SEGMENT
1
SELECT * FROM obfuscated_demo_data LIMIT 10 \G

Table indicator_data has a description file Data Dictionary - Alerts. Expect three columns that are stored in table heat_scores:

  • indicators
  • indicator_source
  • indicator_heat_score.
1
SELECT * FROM indicator_data LIMIT 10 \G

Table indicator_data is related to heat_scores:

  • indicator_data: indicator_pairs
  • heat_scores: SHARED_INDICATION_NAME

SELECT one SHARED_INDICATION_NAME in table heat_scores to compare with the countpart in table indicator_data:

1
2
SELECT * FROM heat_scores LIMIT 10 \G
SELECT * FROM indicator_data WHERE indicator_pairs = "App_1_Atomic_Email_PRE_2016_Q3" \G


What are the current alerts and indicators?
We have only three categories of alerts: SELECT DISTINCT alert_category FROM indicator_data;

  • Atomic
  • Heat
  • Other

And alertes have been classified as time range: SELECT DISTINCT alert_type FROM indicator_data;

  • Atomic
  • Daily_Heat
  • Weekly_Heat
  • Monthly_Heat
  • NA

We have three indicator types:

  • Atomic
  • Heat
  • Threshold


As

1.Briefly assess data files in terms of #items, sizes and variable entities.

Table indicator_data is related to heat_scores:

  • indicator_data: indicator_pairs
  • heat_scores: SHARED_INDICATION_NAME

SELECT one SHARED_INDICATION_NAME in table heat_scores to compare with the countpart in table indicator_data:

1
2
SELECT * FROM heat_scores LIMIT 10 \G
SELECT * FROM indicator_data WHERE indicator_pairs = "App_1_Atomic_Email_PRE_2016_Q3" \G

Table indicator_data has data that records each case. Each row has a score and average to measure how serious is the case.


2.Integrate data files to support the analytics – describe the resultant tables and how they are needed for analysis.

Work flow:

  • Uses a static table with indicator and value.

  • As the indicator fires, the heat score value is added to the employee’s
    accumulated heat score.

  • Once the accumulated heat score reaches a threshold, an alert is raised.

    Therefore, classification is the final measurement for each case. There are other factors will influence classification.


3.Perform Exploratory Data Analysis (EDA) to better understand the business motivation: Heat score distributions per indicator types, thresholds for ”triggering” alerts, false positives and true positives etc.

Here, the cases has been classified by Alert Type and Classification. The thresholds is extremely low, but the heat score is extremely high, which means “Environment changes rapidly but scores do not reflect changes (need dynamic heat scores)”.


4.Provide a summary of steps that potential analytics may be done to answer project questions (predictive models for heat scores and/or thresholds for significant events e.g.,TP/Low and TP/High)

  1. Data understand
  2. Data preparing, including data cleasning
  3. Modling. Probably checking independence, calculating p-value, and building decision tree.
  4. Through decision tree we may make a conclusion.

5.Characterize the model DV’s and IV’s
DV’s: Score, Ave Score, Risk Factor
IV’s: employee, HRU, Career Band, Job Title, Industry Focus, Business Segment, Time, City, State, Country


W2

Dis

Introduction
Basically, there are four scenarios:

  • TP/DE: No risk but the system raise an alert, which means fake alert.
  • FP: No risk and the system does not raise any alert.
  • TP/LOW: True risk and the analysts classify them as low risk.
  • TP/HIGH: True risk and the analysts classify them as high risk.


What kind of unsupervised learning would you propose and what would that help your overall analytics?

I prefer the k-means clustering. Because datasets from GE Aviation are really complex. In the initial phase, I am not sure which variables are valuable. That’s why I need to do Machine Learning. I will try to analyze the GE Aviation datasets and find out the relationship in each variable and how does the classification system work. In k-means clustering, it is possible to set up the best number of clusters, which could help avoid errors. Also, it supports standardization and normalization, which can help me improve the accuracy of the model.


List 3 notable insights or observations you have obtained from your analytics that will help GE to better detect potential IP risks and/or save efforts of the analysts in investigating FP or TP/DE indicators for ATOMIC i.e., single indicator events.

Analyzing the distribution of FP or TP/DE indicators.

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Import modules
import pandas as pd
import numpy as np
import plotly.express as px

# Load data in data.frame style
ge_indicator_data = pd.read_csv('GE obfuscated_demo_data/indicator_pairs_data updated.csv')

# Modify the classification from ['FP', 'TP/DE', 'TP/LOW', 'TP/HIGH'] to [0, 1, 2, 3]
ge_indicator_data.replace(['FP', 'TP/DE', 'TP/LOW', 'TP/HIGH'], [0, 1, 2, 3], inplace = True)

# px.parallel_categories
fig = px.parallel_categories(ge_indicator_data, color = "classification", color_continuous_scale = px.colors.sequential.Inferno)
fig.show()

This chart is parallel category. “In a parallel categories (or parallel sets) plot, each row of data_frame is grouped with other rows that share the same values of dimensions and then plotted as a polyline mark through a set of parallel axes, one for each of the dimensions.” I use classification as the color to mark the data. For example, classification - 0, equaled to classification - FP, highlight every columns which belong to FP in a black color.

According to the classification - 0 (FP) chart, there are 59315 records. And most of them are related to hru - None, hru - 11 , and hru - 13. The most import thing is: almost all classification - 0 (FP) are related to Atomic alert.

According to the classification - 1 (TP/DE) chart, there are 65219 records. And most of them are related to hru - None, hru - 11 , and hru - 13. Half of the classification - 1 (TP/DE) are related to Atomic alert; the other part is related to Heat alter.

Python
1
2
3
4
5
6
7
# Filter person_type not null records
ge_i_o[ge_i_o['person_type'].notnull()]

# px.scatter
fig = px.scatter(ge_i_o[ge_i_o['person_type'].notnull()], x = "job_function", y = 'career_band' , size = "classification", color = "classification",
size_max = 15)
fig.show()

When I exploring the relationship between job_function and career_band, I notice that some combination provides more TP/HIGH and TP/LOW records but others are not.


Module III

Perform relevant analytics and show results (not more than 5 pages and not more than 5 charts and tables) (2/10-2/23)

Source Code and Full Report: https://github.com/ZacksAmber/Code/blob/master/Python/GE%20Aviation/GE%20Aviation.ipynb

1

Perform unsupervised learning on ATOMIC Alerts Only to see how many types of risk groups are there and does the clustering result help your understanding of the project?

The number of clusters should be 3, which means there three types of risk groups. The result help me understanding the project. Because classification - TP/DE is a wrong indicator. After filtering it and classification - FP, the relationships between each variables become clearly.

Python
1
2
3
4
# Number of clusters = 3
fig = px.line(x = ks, y = inertias, render_mode = 'svg',
title = 'Number of Clusters')
fig.show()


2 & 3

Perform predictive analytics to predict High risks (TP/HIGH), Notable Risks (TP/HIGH & LOW) for Atomic Indicators.

Show results and performance measures such as TP, FP, FN,TN, confusion table and lift charts and what threshold of probabilities would you use to trigger the alerts and optimize the results for different business considerations?

First, I join the table indicator_data and obfuscated_demo_data Then I extract the columns I need.

Python
1
2
3
4
5
6
# Join data.frame indicator_data and obfuscated_demo_data as ge_i_o
ge_i_o = ge_indicator_data.join(ge_obfuscated_demo_data.set_index('employee_id'), on = 'employee_id')

# Extract the columns I need and create a new data.frame.
ge_samples = ge_i_o[['score', 'risk_factor', 'avg_score', 'hru', 'alert_category', 'person_type', 'person_status', 'ge_hire_year',
'function_group', 'job_function', 'career_band', 'industry_focus_name', 'classification']]

After transferring the variables from categorical data to dummy data, the initial result is still not accurate due to the scale.

The I Try to standardize the variables, and figure out a very accurate k-means clustering model - 73.3%.


4

How do your results compare with the existing system?

I think my model is better because I drop several confused variables and keep a high accuracy.