My Data Science Blog

Featured

My First Blog Post

Be yourself; Everyone else is already taken.
— Oscar Wilde.

This is the first post on my new blog. I’m just getting this new blog going, so stay tuned for more. Subscribe below to get notified when I post new updates.

Identifying Hot Leads with Logistic Regression!

An education company, X Education, sells online courses to Industry Professionals. X Education gets a lot of leads, however, its conversion rate is very poor . The CEO of X Education wants to identify the most promising leads, the ones which are most likely to convert to paying customers. The target lead conversion rate is set to be around 80%

What we have is a leads dataset with about 9000 entries and 37 columns. Each row also informs if that particular lead was converted or not. The task is to come up with a model that identifies the most promising leads (Hot Leads!) so that the company could focus on those.

Dataset

Variables	Description
Prospect ID	A unique ID with which the customer is identified.
Lead Number	A lead number assigned to each lead procured.
Lead Origin	The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
Lead Source	The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
Do Not Email	An indicator variable selected by the customer wherein they select whether of not they want to be emailed about the course or not.
Do Not Call	An indicator variable selected by the customer wherein they select whether of not they want to be called about the course or not.
Converted	The target variable. Indicates whether a lead has been successfully converted or not.
TotalVisits	The total number of visits made by the customer on the website.
Total Time Spent on Website	The total time spent by the customer on the website.
Page Views Per Visit	Average number of pages on the website viewed during the visits.
Last Activity	Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
Country	The country of the customer.
Specialization	The industry domain in which the customer worked before. Includes the level ‘Select Specialization’ which means the customer had not selected this option while filling the form.
How did you hear about X Education	The source from which the customer heard about X Education.
What is your current occupation	Indicates whether the customer is a student, umemployed or employed.
What matters most to you in choosing this course	An option selected by the customer indicating what is their main motto behind doing this course.
Search	Indicating whether the customer had seen the ad in any of the listed items.
Magazine
Newspaper Article
X Education Forums
Newspaper
Digital Advertisement
Through Recommendations	Indicates whether the customer came in through recommendations.
Receive More Updates About Our Courses	Indicates whether the customer chose to receive more updates about the courses.
Tags	Tags assigned to customers indicating the current status of the lead.
Lead Quality	Indicates the quality of lead based on the data and intuition the the employee who has been assigned to the lead.
Update me on Supply Chain Content	Indicates whether the customer wants updates on the Supply Chain Content.
Get updates on DM Content	Indicates whether the customer wants updates on the DM Content.
Lead Profile	A lead level assigned to each customer based on their profile.
City	The city of the customer.
Asymmetrique Activity Index	An index and score assigned to each customer based on their activity and their profile
Asymmetrique Profile Index
Asymmetrique Activity Score
Asymmetrique Profile Score
I agree to pay the amount through cheque	Indicates whether the customer has agreed to pay the amount through cheque or not.
a free copy of Mastering The Interview	Indicates whether the customer wants a free copy of ‘Mastering the Interview’ or not.
Last Notable Activity	The last notable acitivity performed by the student.

Methodology

• Data Cleaning and Preparation – First, Replace the placeholder values with null. Then, identify and remove columns which are almost completely null. Impute the ones with lesser null percentage. Do the same for rows. Transform the data and split into train and test datasets for the next step
• Model building and Prediction – Create a Logistic Regression model based on train dataset. Fine-tune the model, using RFE, VIF and comparing the output accuracy and specificity.
• Calculate Lead Scores – Based on the model prediction, calculate the Lead score per lead

METHODOLOGY DETAILS

Data Cleaning and Preparation
Data Cleaning was the first step carried out. This process consisted of:
• Replacing ‘Select’ values with null
• Analyzing data spread and truncating columns with more than 50% empty data
• Deleting rows with empty data
• Variable Encoding (Binary and One-Hot Encoding)
• Test-Train Split
Feature Scaling
• As the scale of some variables (‘TotalVisits’, ‘Total Time Spend on Website’, ‘Page Views Per Visit’, ‘Asymmetrique Activity Score’, ‘Asymmetrique Profile Score’) is different from the rest of the features, there was a need to re-scale these variables.
Model Building and Feature Elimination
• After this, the first Logistic Regression Model was trained on the train dataset. However, as there were > 100 features, there was a need to eliminate the unnecessary ones and keep only the essential features
• For this purpose, RFE was used, with VIF complementing the model fine-tuning process

FINAL MODEL

The above model can be accepted as p values for all variables < 0.05 and VIFs <5

On using the model to do predictions on the training set with a random cutoff conversion probability of 0.5, we get the below

The model accuracy on the train set is 0.92

Plotting the ROC Curve

Area under the ROC curve =0.97

Finding Optimal Cutoff Point

From the above, the optimal cutoff = 0.4

Overall accuracy at 0.4 on the train set = 0.92

sensitivity=0.91

specificity = 0.92

Making predictions on the test set

We get the below on applying the model to the test set

Accuracy on the test set = 0.92

sensitivity = 0.91

specificity = 0.92

Assigning Lead Scores

Finally, we assign lead scores to identify hot leads

Finally, we have a model that assigns lead scores to all potential leads. Based on identified optimal cut off, all leads with lead scores > 40 should be called!

PCA and clustering

An international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities. It runs a lot of operational projects from time to time along with advocacy drives to raise awareness as well as for funding purposes.

After the recent funding programmes, they have been able to raise around $ 10 million. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. The significant issues that come while making this decision are mostly related to choosing the countries that are in the direst need of aid.

The task is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then suggest the countries which the CEO needs to focus on the most

Dataset

We first need to check the data for missing values and then standardize the data using StandardScaler().

Once the data is scaled, we can apply PCA on the same. Since this dataset has 9 columns other that the identifier column ,we get 9 principal components.

The explained variance ratios of the principal components are shown in the below image

The Scree plot is shown below

From the above, we can move ahead with 2 PCs as they explain most of the variance

The Hopkins statistic on the above PC dataset comes out to be 0.827726854010938 indicating good clustering tendencies

OUTLIER Treatment

We need to remove outliers as the clustering algorithms to be applied are sensitive to outliers. Below image shows PC1 and PC2 with outliers

PC1 and PC2 plotted after outlier removal (statistical)

Before applying K means and other clustering algorithms, we need to find the appropriate number of clusters, this is done using Silhouette score analysis and plotting the Elbow curve

In this case

•For n_clusters=2, the silhouette score is 0.5112110146750702

•For n_clusters=3, the silhouette score is 0.5401358971649539

•For n_clusters=4, the silhouette score is 0.4896931359873917

•For n_clusters=5, the silhouette score is 0.4838697032217766

•For n_clusters=6, the silhouette score is 0.46871302056666836

•For n_clusters=7, the silhouette score is 0.48031466804791656

•For n_clusters=8, the silhouette score is 0.46813909324910224

From the Silhouette score and the elbow curve,the appropriate number of clusters seems to be 3

Applying K means clustering with 3 clusters we get the following clusters

Analysing the three clusters across gdpp, income and child mortaility, we get the following

cluster_id=1 is the cluster with the lowest gdpp countries

cluster_id=1 is the cluster of countries with the highest child mortality rates

cluster_id=1 is the cluster of countries with the lowest income per person

Overall, countries in cluster one seem to be the ones most in need of aid

Introduce Yourself (Example Post)

This is an example post, originally published as part of Blogging University. Enroll in one of our ten programs, and start your blog right.

You’re going to publish a post today. Don’t worry about how your blog looks. Don’t worry if you haven’t given it a name yet, or you’re feeling overwhelmed. Just click the “New Post” button, and tell us why you’re here.

Why do this?

Because it gives new readers context. What are you about? Why should they read your blog?
Because it will help you focus you own ideas about your blog and what you’d like to do with it.

The post can be short or long, a personal intro to your life or a bloggy mission statement, a manifesto for the future or a simple outline of your the types of things you hope to publish.

To help you get started, here are a few questions:

Why are you blogging publicly, rather than keeping a personal journal?
What topics do you think you’ll write about?
Who would you love to connect with via your blog?
If you blog successfully throughout the next year, what would you hope to have accomplished?

You’re not locked into any of this; one of the wonderful things about blogs is how they constantly evolve as we learn, grow, and interact with one another — but it’s good to know where and why you started, and articulating your goals may just give you a few other post ideas.

Can’t think how to get started? Just write the first thing that pops into your head. Anne Lamott, author of a book on writing we love, says that you need to give yourself permission to write a “crappy first draft”. Anne makes a great point — just start writing, and worry about editing it later.

When you’re ready to publish, give your post three to five tags that describe your blog’s focus — writing, photography, fiction, parenting, food, cars, movies, sports, whatever. These tags will help others who care about your topics find you in the Reader. Make sure one of the tags is “zerotohero,” so other new bloggers can find you, too.