Estimating Medical Insurance Charges

7 min read · Feb 11, 2022

Introduction
Data Collection
Exploratory Data Analysis
Preprocessing
Experimentation and Results
Conclusion

1. Introduction

Health insurance or medical insurance is a type of insurance that covers the whole or a part of the risk of a person incurring medical expenses. By estimating the overall risk of health risk and health system expenses over the risk pool, an insurer can develop a routine finance structure, such as a monthly premium or payroll tax, to provide the money to pay for the health care benefits specified in the insurance agreement. — Wikipedia

This post predicts the cost that an insurance company bears for each customer it insures. Several ML regression algorithms are compared: Linear Regression, kNN Regression, SVM Regression, Random Forest Regression, and XGBoost Regression.

2. Data Collection

The data is taken from Kaggle. The columns and their descriptions are:

age: age of the primary beneficiary
sex: gender of the insurance contractor (female, male)
bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared; ideally 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes (yes, no)
region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
charges: individual medical costs billed by health insurance
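As a quick illustration of the bmi column, the index is weight in kilograms divided by the square of height in metres. The helper name `bmi` and the sample values below are made up for illustration and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg / m^2: weight divided by height squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg, 1.75 m tall
value = bmi(70.0, 1.75)
print(round(value, 2))

# Check against the "ideal" range quoted in the column description
in_ideal_range = 18.5 <= value <= 24.9
print(in_ideal_range)
```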

3. Exploratory Data Analysis

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("insurance.csv")
df.head()

df.info()

Output:

RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)

The dataset has 1338 rows and no missing values.

print(f"minimum age: {df['age'].min()}, maximum age: {df['age'].max()}")
sns.histplot(df, x = "age", bins = 23, kde = True)
plt.show()

Output:

minimum age: 18, maximum age: 64

The figure shows that the age distribution is almost uniform across bins, except for the first and last bins.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "age", hue = "sex", multiple = "dodge", shrink = .8)
plt.show()

The figure shows that males and females are almost equally represented in each age bin.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "age", hue = "smoker", multiple = "dodge", shrink = .8)
plt.show()

The figure shows that, in each age bin, the number of smokers is around one third of the number of non-smokers.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "age", hue = "region", multiple = "dodge", shrink = .8)
plt.show()

The figure shows that, in each age bin, the population is almost equal across regions.

# displot is figure-level and creates its own figure, so size it via height/aspect
sns.displot(df, x = "bmi", hue = "sex", kind = "kde", height = 5, aspect = 2)
plt.show()

The figure shows that the BMI distributions for males and females are similar.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "region", hue = "sex", multiple = "dodge", shrink = .8)
plt.show()

The figure shows that the male and female populations are almost equal in each region.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "region", hue = "smoker", multiple = "dodge", shrink = .8)
plt.show()

The figure shows that the ratio of smokers to non-smokers is almost the same in every region.

print(f"minimum charge: {df['charges'].min()}, maximum charge: {df['charges'].max()}")
sns.displot(df, x = "charges", kde = True)
plt.show()

Output:

The distribution of medical charges is right-skewed with a long tail and resembles a Pareto distribution. Because a squared-error metric would be dominated by the few very large charges, mean absolute percentage error (MAPE) is used as the metric instead of mean squared error.
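To make the choice of metric concrete, here is a minimal sketch of MAPE next to MSE on made-up numbers (the values and function names are illustrative, not from the dataset). MAPE is the mean of |actual - predicted| / |actual|, so errors on a small bill and on a large bill count equally when they are the same fraction of the true value.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, in squared units of the target."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up charges: three small bills and one very large one
y_true = [2000.0, 3000.0, 5000.0, 60000.0]
# Every prediction is off by 10% of the true value
y_pred = [t * 1.1 for t in y_true]

print(mape(y_true, y_pred))
print(mse(y_true, y_pred))

# Share of the total squared error that comes from the single large bill
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
large_share = squared_errors[-1] / sum(squared_errors)
print(large_share)
```

In this toy case almost all of the squared error comes from the one large bill, while each bill contributes the same 10% to MAPE.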

plt.figure(figsize = (10, 5))
plt.scatter(df["age"], df["charges"])
plt.xlabel("age")
plt.ylabel("charges")
plt.title("charges vs age scatter plot")
plt.show()

The figure shows an upward trend: charges increase with age. This is expected, since the chance of illness rises with age and medical charges rise with it.

# displot is figure-level and creates its own figure, so size it via height/aspect
sns.displot(df, x = "charges", hue = "smoker", kind = "kde", height = 5, aspect = 2)
plt.show()

The figure shows that smokers are billed much higher charges than non-smokers.

4. Preprocessing

Let us one-hot encode the categorical variables.

df_expanded = pd.get_dummies(df)
# sex and smoker are binary, so one indicator column each is enough
df_expanded.drop(columns = ["sex_male", "smoker_no"], inplace = True)
df_expanded.head()
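To see what this step does, here is a small sketch on a made-up three-row frame (the rows are illustrative, not from insurance.csv). `get_dummies` replaces each text column with one indicator column per category; for a two-valued column such as sex or smoker, one of the two indicators is redundant, which is why one is dropped.

```python
import pandas as pd

# Tiny illustrative frame with the same kinds of columns as the dataset
toy = pd.DataFrame({
    "age": [19, 45, 33],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

toy_expanded = pd.get_dummies(toy)
print(list(toy_expanded.columns))

# sex_male is always the opposite of sex_female, so it carries no extra information
toy_expanded = toy_expanded.drop(columns=["sex_male", "smoker_no"])
print(list(toy_expanded.columns))
```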

The encoded dataset is split into 75% train and 25% test sets.
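The splitting code is not shown in this post. One common way to get such a split is scikit-learn's `train_test_split`; the sketch below is an assumption about how it could be done, not the original code. The stand-in frame and the `random_state` value are illustrative; with the real data, `df_expanded` would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for df_expanded: 1338 rows of random numbers (illustrative only)
rng = np.random.default_rng(0)
df_standin = pd.DataFrame({
    "age": rng.integers(18, 65, size=1338),
    "bmi": rng.normal(30.0, 6.0, size=1338),
    "charges": rng.lognormal(9.0, 1.0, size=1338),
})

# test_size=0.25 gives the 75% / 25% split; random_state is an arbitrary seed
train, test = train_test_split(df_standin, test_size=0.25, random_state=42)
print(train.shape, test.shape)
```

With 1338 rows this yields 1003 train rows and 335 test rows.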

Let us standardize the features and the target, fitting the scalers on the train set only.

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = train.drop(columns = ["charges"])
x_test = test.drop(columns = ["charges"])
y_train = np.array(train["charges"]).reshape(-1, 1)
y_test = np.array(test["charges"]).reshape(-1, 1)

# fit the scalers on the train set only, then apply them to both sets
x_scaler = StandardScaler()
y_scaler = StandardScaler()
x_scaler.fit(x_train)
y_scaler.fit(y_train)

x_train_scaled = x_scaler.transform(x_train)
x_test_scaled = x_scaler.transform(x_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

x_train_scaled.shape, x_test_scaled.shape, y_train_scaled.shape, y_test_scaled.shape

Output:

(1003, 9), (335, 9), (1003, 1), (335, 1)
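StandardScaler subtracts each column's mean and divides by its standard deviation, using statistics learned in `fit`. The sketch below reproduces that arithmetic with NumPy on made-up numbers, to show what fitting on the train set only means and how the transformation is undone; the arrays are illustrative.

```python
import numpy as np

# Made-up charges for a train and a test set (illustrative values)
y_train = np.array([1000.0, 2000.0, 4000.0, 9000.0])
y_test = np.array([3000.0, 12000.0])

# "fit": learn the mean and (population) standard deviation from train only
mean = y_train.mean()
std = y_train.std()

# "transform": apply the train statistics to both sets
y_train_scaled = (y_train - mean) / std
y_test_scaled = (y_test - mean) / std

# "inverse_transform": map scaled values back to the original units
y_test_restored = y_test_scaled * std + mean

print(y_train_scaled.mean(), y_train_scaled.std())
print(y_test_restored)
```

The scaled train set has mean 0 and standard deviation 1 by construction; the test set generally does not, because it reuses the train statistics.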

5. Experimentation and Results

5.1 Linear Regression

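Let us train Linear Regression with L1 and L2 regularization (ElasticNet) and compute the MAPE.

from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_percentage_error

model = ElasticNet(l1_ratio = 0.1, alpha = 1)
model.fit(x_train_scaled, y_train_scaled)

y_pred_scaled = model.predict(x_test_scaled)
# map predictions back to the original charge scale before computing MAPE
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))

mean_absolute_percentage_error(y_test, y_pred)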

Output:

0.9429
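Every model in this section follows the same four steps: fit on the scaled train set, predict on the scaled test set, undo the target scaling, and compute MAPE on the original charges. A helper along the following lines could wrap that pattern; the name `evaluate` and the synthetic data are assumptions for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler


def evaluate(model, x_train, y_train, x_test, y_test):
    """Fit on standardized data and return MAPE on the original target scale."""
    x_scaler = StandardScaler().fit(x_train)
    y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

    y_train_scaled = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
    model.fit(x_scaler.transform(x_train), y_train_scaled)

    y_pred_scaled = model.predict(x_scaler.transform(x_test))
    # Undo the target scaling so the error is measured in real charges
    y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
    return mean_absolute_percentage_error(y_test, y_pred)


# Synthetic data: the target is an exact linear function of two features
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = 20000.0 + 3000.0 * x[:, 0] + 500.0 * x[:, 1]

score = evaluate(LinearRegression(), x[:150], y[:150], x[150:], y[150:])
print(score)
```

Because the synthetic target is exactly linear, an unregularized linear model recovers it and the MAPE is close to zero; any scikit-learn style regressor could be passed in its place.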

Let us look at the feature importance.

The feature importance shows that being a smoker has the highest influence on medical charges and raises them. Intuitively, a smoker is more likely to have health issues and go to hospital, so medical charges increase.
As age increases, the probability of health issues increases, which affects medical charges.
BMI also has an effect on charges, as seen in the feature importance.
It is important to note that our model assigns no importance to gender, number of children, or region, so it is not biased towards them.
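The feature-importance figure is not reproduced here. For a linear model trained on standardized features, one common way to get such a ranking is to compare the absolute values of the fitted coefficients (`coef_`). The sketch below assumes that approach and uses synthetic data with made-up feature names, so its numbers are not the results of this post.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative): the target depends strongly on "smoker_yes",
# weakly on "age", and not at all on "children"
rng = np.random.default_rng(0)
n = 500
smoker = rng.integers(0, 2, size=n).astype(float)
age = rng.uniform(18, 64, size=n)
children = rng.integers(0, 5, size=n).astype(float)
x = np.column_stack([age, children, smoker])
y = 20000.0 * smoker + 250.0 * age + rng.normal(0.0, 1000.0, size=n)

feature_names = ["age", "children", "smoker_yes"]
x_scaled = StandardScaler().fit_transform(x)
y_scaled = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()

model = ElasticNet(l1_ratio=0.1, alpha=1)
model.fit(x_scaled, y_scaled)

# Rank features by the magnitude of their coefficient
importance = dict(zip(feature_names, np.abs(model.coef_)))
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Comparing coefficient magnitudes is only meaningful here because the features were standardized first, which puts them on a common scale.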

5.2 kNN Regression

Let us train kNN Regression and compute the MAPE.

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors = 17, weights = "distance")
model.fit(x_train_scaled, y_train_scaled)

y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))

mean_absolute_percentage_error(y_test, y_pred)

Output:

0.3386

5.3 SVM Regression

Let us train SVM Regression and compute the MAPE.

from sklearn.svm import SVR

model = SVR(C = 1, kernel = "poly")
# SVR expects a 1-D target, so flatten the scaled column vector
model.fit(x_train_scaled, y_train_scaled.ravel())

y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))

mean_absolute_percentage_error(y_test, y_pred)

Output:

0.2112

5.4 Random Forest Regression

Let us train Random Forest Regression and compute the MAPE.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators = 75, max_features = "log2")
model.fit(x_train_scaled, y_train_scaled.ravel())

y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))

mean_absolute_percentage_error(y_test, y_pred)

Output:

0.3064

Let us look at the feature importance.

The feature importance again shows that being a smoker has the highest influence on medical charges and raises them. Intuitively, a smoker is more likely to have health issues and go to hospital, so medical charges increase.
As age increases, the probability of health issues increases, which affects medical charges.
BMI also has an effect on charges, as seen in the feature importance.
It is important to note that our model assigns only very small importance to gender and region compared with smoker.
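Tree ensembles in scikit-learn expose an impurity-based ranking through the `feature_importances_` attribute, whose values sum to 1 across features. The figure in this post may have been produced differently; the sketch below only demonstrates the attribute on synthetic data with made-up feature names, and `random_state` is an arbitrary seed added for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative): "smoker_yes" drives most of the target
rng = np.random.default_rng(0)
n = 500
smoker = rng.integers(0, 2, size=n).astype(float)
age = rng.uniform(18, 64, size=n)
children = rng.integers(0, 5, size=n).astype(float)
x = np.column_stack([age, children, smoker])
y = 20000.0 * smoker + 250.0 * age + rng.normal(0.0, 1000.0, size=n)

feature_names = ["age", "children", "smoker_yes"]
model = RandomForestRegressor(n_estimators=75, max_features="log2", random_state=0)
model.fit(x, y)

# Print features from most to least important
pairs = sorted(zip(feature_names, model.feature_importances_),
               key=lambda pair: pair[1], reverse=True)
for name, score in pairs:
    print(f"{name}: {score:.3f}")
```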

5.5 XGBoost Regression

Let us train XGBoost Regression and compute the MAPE.

import xgboost as xgb

model = xgb.XGBRegressor(n_estimators = 50, max_depth = 4)
model.fit(x_train_scaled, y_train_scaled.ravel())

y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))

mean_absolute_percentage_error(y_test, y_pred)

Output:

0.2663

Let us look at the feature importance.

Here too, the feature importance shows that being a smoker has the highest influence on medical charges and raises them. Intuitively, a smoker is more likely to have health issues and go to hospital, so medical charges increase.
As age increases, the probability of health issues increases, which affects medical charges.
BMI also has an effect on charges, as seen in the feature importance.
It is important to note that our model assigns only negligible importance to gender and region compared with smoker.

5.6 Results

+----------------------------+-----------+
|           Model            | test MAPE |
+----------------------------+-----------+
| Linear Regression with L1  |   0.9429  |
|   and L2 regularization    |           |
|       kNN Regressor        |   0.3386  |
|       SVM Regressor        |   0.2112  |
|  Random Forest Regressor   |   0.3064  |
|     XGBoost Regressor      |   0.2663  |
+----------------------------+-----------+

6. Conclusion

In this post, several models were tried for predicting medical charges. The SVM Regression model performed best, with a test MAPE of 0.2112. In the insurance industry, it is very important to know which features increase costs. So even though SVM gave the lowest MAPE, I would go with XGBoost (test MAPE of 0.2663), because it provides feature importance.

Code Reference

GitHub - nagi1995/Medical-Cost-Personal-Datasets

github.com

Contact Links

Email: binginagesh@gmail.com
LinkedIn: https://www.linkedin.com/in/bingi-nagesh-5b0412b7/

Written by Bingi Nagesh
