Estimating Medical Insurance Charges

Bingi Nagesh
Feb 11, 2022

Contents

  1. Introduction
  2. Data Collection
  3. Exploratory Data Analysis
  4. Preprocessing
  5. Experimentation and Results
  6. Conclusion

Code Reference

Contact Links

1. Introduction

Health insurance or medical insurance is a type of insurance that covers the whole or a part of the risk of a person incurring medical expenses. By estimating the overall risk of health risk and health system expenses over the risk pool, an insurer can develop a routine finance structure, such as a monthly premium or payroll tax, to provide the money to pay for the health care benefits specified in the insurance agreement. — Wikipedia

In this blog, we predict the medical costs that an insurance company bears for its customers. Several ML regression algorithms are compared: Linear Regression, kNN Regression, SVM Regression, Random Forest Regression, and XGBoost Regression.

2. Data Collection

The data is taken from Kaggle. The columns and their descriptions are as follows:

age: age of primary beneficiary

sex: gender of the insurance contractor (female, male)

bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9

children: number of children covered by health insurance / number of dependents

smoker: whether the beneficiary smokes (yes, no)

region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)

charges: individual medical costs billed by health insurance

3. Exploratory Data Analysis

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("insurance.csv")
df.head()
df.info()

Output:

RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)

The dataset has 1338 rows and 7 columns, and no column has missing values.

print(f"minimum age: {df['age'].min()}, maximum age: {df['age'].max()}")
sns.histplot(df, x = "age", bins = 23, kde = True)
plt.show()

Output:

minimum age: 18, maximum age: 64

From the above figure, we can see that the population is distributed almost uniformly across the age bins, except for the first and last bins.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "age", hue = "sex", multiple = "dodge", shrink = .8)
plt.show()

From the above figure, we can see that the numbers of males and females are almost equal in each age bin.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "age", hue = "smoker", multiple = "dodge", shrink = .8)
plt.show()

From the above figure, we can see that, in each age bin, there is roughly one smoker for every three non-smokers.
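
A ratio like this can also be checked numerically rather than read off the plot. The following is a minimal sketch on made-up rows (not the Kaggle data); the bin edges and the `smokers_per_non_smoker` column name are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows: four people per age bin, one of them a smoker
toy = pd.DataFrame({
    "age": [18, 19, 25, 27, 35, 38, 45, 47, 52, 58, 61, 64],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "no", "no", "no", "yes", "no", "no"],
})

# Cut age into three illustrative bins and count smokers and non-smokers in each
age_bin = pd.cut(toy["age"], bins=[17, 33, 49, 65])
counts = pd.crosstab(age_bin, toy["smoker"])
counts["smokers_per_non_smoker"] = counts["yes"] / counts["no"]
print(counts)
```

On the real data, the same two calls could be applied to `df["age"]` and `df["smoker"]`.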

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "age", hue = "region", multiple = "dodge", shrink = .8)
plt.show()

From the above figure, we can see that each age bin has an almost equal population from every region.

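# displot is a figure-level function and creates its own figure, so size it via height and aspect
sns.displot(df, x = "bmi", hue = "sex", kind = "kde", height = 5, aspect = 2)
plt.show()

From the above figure, we can see that the BMI distributions of males and females are similar.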

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "region", hue = "sex", multiple = "dodge", shrink = .8)
plt.show()

From the above figure, we can see that the male and female populations are almost equal in each region.

plt.figure(figsize = (10, 5))
sns.histplot(df, x = "region", hue = "smoker", multiple = "dodge", shrink = .8)
plt.show()

From the above figure, we can see that the ratio of smokers to non-smokers is almost the same in each region.

print(f"minimum charge: {df['charges'].min()}, maximum charge: {df['charges'].max()}")
sns.displot(df, x = "charges", kde = True)
plt.show()

Output:

The distribution of the medical charges is heavily right-skewed and looks like a Pareto distribution. With such a long tail, a few very large charges would dominate mean squared error, so mean absolute percentage error (MAPE) is used as the metric instead.
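
To make the choice of metric concrete, here is a minimal sketch that computes MAPE and mean squared error by hand. The numbers are made up (they are not from the dataset), and the helper names `mape` and `mse` are only for illustration.

```python
# Hypothetical true and predicted charges; the last one mimics a rare, very large bill
y_true = [2000.0, 4000.0, 10000.0, 50000.0]
y_pred = [2200.0, 3600.0, 11000.0, 45000.0]  # every prediction is off by 10%

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.10 means 10%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error, in squared units of the target."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Share of the total squared error that comes from the single largest charge
squared_errors = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
largest_share = squared_errors[-1] / sum(squared_errors)

print(f"MAPE: {mape(y_true, y_pred):.2f}")
print(f"MSE: {mse(y_true, y_pred):.0f}")
print(f"share of squared error from the largest charge: {largest_share:.2f}")
```

All four predictions are equally good in relative terms, yet the largest charge alone accounts for most of the squared error, while MAPE treats each row the same.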

plt.figure(figsize = (10, 5))
plt.scatter(df["age"], df["charges"])
plt.xlabel("age")
plt.ylabel("charges")
plt.title("charges vs age scatter plot")
plt.show()

From the above figure, we can see an upward trend: as age increases, the charges increase. This is expected, since the chance of illness, and with it the medical charges, grows with age.

# displot creates its own figure, so size it via height and aspect
sns.displot(df, x = "charges", hue = "smoker", kind = "kde", height = 5, aspect = 2)
plt.show()

From the above figure, we can see that smokers are billed higher charges than non-smokers.

4. Preprocessing

Let us one-hot encode the categorical variables.

df_expanded = pd.get_dummies(df)
# sex and smoker are binary, so one indicator column each is enough
df_expanded.drop(columns = ["sex_male", "smoker_no"], inplace = True)
df_expanded.head()
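
To see what this encoding produces, here is a small sketch on three made-up rows that share the categorical columns of the insurance data. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Three made-up rows with the same categorical columns as the insurance data
toy = pd.DataFrame({
    "age": [19, 33, 46],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Each category value becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(toy)
# Keep a single indicator for each binary variable
encoded = encoded.drop(columns=["sex_male", "smoker_no"])
print(list(encoded.columns))
```

The `region` column keeps one indicator per region that appears in the data, while `sex` and `smoker` are each reduced to a single column.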

The encoded dataset is then divided into 75% train and 25% test sets.
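
The splitting code is not shown here. One common way to do it is scikit-learn's `train_test_split`; the sketch below uses a stand-in frame with 1338 rows, and the `random_state` value is an assumption, so it will not reproduce the exact rows used for the results in this blog.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with as many rows as the insurance data (values are made up)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=1338),
    "charges": rng.uniform(1000, 60000, size=1338),
})

# 75% train, 25% test; random_state is an arbitrary choice for repeatability
train, test = train_test_split(df_demo, test_size=0.25, random_state=42)
print(train.shape, test.shape)
```

On the real data, `df_demo` would be replaced by `df_expanded`. With 1338 rows this gives 1003 train rows and 335 test rows.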

Let us standardize the features and the target, fitting the scalers on the train set only.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Separate the features from the target
x_train = train.drop(columns = ["charges"])
x_test = test.drop(columns = ["charges"])
y_train = np.array(train["charges"]).reshape(-1, 1)
y_test = np.array(test["charges"]).reshape(-1, 1)
# Fit the scalers on the train set only, then apply them to both sets
x_scaler = StandardScaler()
y_scaler = StandardScaler()
x_scaler.fit(x_train)
y_scaler.fit(y_train)
x_train_scaled = x_scaler.transform(x_train)
x_test_scaled = x_scaler.transform(x_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
# Shapes of the scaled arrays
x_train_scaled.shape, x_test_scaled.shape, y_train_scaled.shape, y_test_scaled.shape

Output:

(1003, 9), (335, 9), (1003, 1), (335, 1)
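
Because the target is standardized as well, predictions have to be mapped back to the original scale before computing MAPE. The sketch below shows the round trip on made-up values; the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values standing in for train and test charges
y_train_demo = np.array([1000.0, 3000.0, 5000.0, 7000.0]).reshape(-1, 1)
y_test_demo = np.array([2000.0, 9000.0]).reshape(-1, 1)

# The mean and standard deviation come from the train set only
scaler = StandardScaler().fit(y_train_demo)
y_test_demo_scaled = scaler.transform(y_test_demo)

# inverse_transform undoes the scaling and returns values in the original units
y_test_restored = scaler.inverse_transform(y_test_demo_scaled)
print(scaler.mean_, scaler.scale_)
print(np.allclose(y_test_restored, y_test_demo))
```

Fitting on the train set only keeps information about the test set out of the preprocessing step.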

5. Experimentation and Results

5.1 Linear Regression

Let us train Linear Regression with L1 and L2 Regularization (ElasticNet) and find out MAPE.

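from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_percentage_error

model = ElasticNet(l1_ratio = 0.1, alpha = 1)
model.fit(x_train_scaled, y_train_scaled)
y_pred_scaled = model.predict(x_test_scaled)
# Map predictions back to the original scale before computing MAPE
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
mean_absolute_percentage_error(y_test, y_pred)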

Output:

0.9429

Let us see the feature importance.

  • From the feature importance, we can see that being a smoker has the highest influence on medical charges and increases them. Intuitively, a smoker is more likely to have health issues and to go to hospital, so medical charges increase.
  • As age increases, the probability of health issues increases, and this affects medical charges.
  • BMI also has an effect on charges, as observed from the feature importance.
  • An important thing to note is that our model gives no weight to gender, number of children, or region.
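
For a linear model on standardized inputs, one way to read feature importance is the absolute value of each coefficient. The sketch below assumes that interpretation and uses synthetic stand-in data (not the Kaggle data), in which the target depends strongly on `smoker_yes`, weakly on `age`, and not at all on `children`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in features and target (made-up relationship)
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "children": rng.integers(0, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
y = 250 * X["age"] + 24000 * X["smoker_yes"] + rng.normal(0, 3000, size=n)

# Standardize features and target, as in the preprocessing step
X_scaled = StandardScaler().fit_transform(X)
y_scaled = StandardScaler().fit_transform(y.to_numpy().reshape(-1, 1)).ravel()

model = ElasticNet(l1_ratio=0.1, alpha=1).fit(X_scaled, y_scaled)
# Absolute coefficients, largest first
importance = pd.Series(np.abs(model.coef_), index=X.columns).sort_values(ascending=False)
print(importance)
```

The L1 part of the penalty can push the coefficients of weak features to exactly zero, which is how a model can end up giving no weight to some columns.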

5.2 kNN Regression

Let us train kNN Regression and find out MAPE.

from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors = 17, weights = "distance")
model.fit(x_train_scaled, y_train_scaled)
y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
mean_absolute_percentage_error(y_test, y_pred)

Output:

0.3386
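
The `weights = "distance"` option means that closer neighbors count more: each neighbor's target is weighted by the inverse of its distance to the query point. The sketch below checks this on three made-up points with two neighbors; the data and the query value are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Three made-up training points in one dimension
X = np.array([[0.0], [1.0], [5.0]])
y = np.array([10.0, 20.0, 40.0])
query = np.array([[2.0]])

model = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)
pred = model.predict(query)[0]

# Manual version: inverse-distance weighted mean of the two nearest targets
dist = np.abs(X.ravel() - query[0, 0])
nearest = np.argsort(dist)[:2]
w = 1.0 / dist[nearest]
manual = np.sum(w * y[nearest]) / np.sum(w)
print(pred, manual)
```

With uniform weights the prediction would be the plain mean of the two nearest targets (15.0); with distance weights it moves towards the closer neighbor.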

5.3 SVM Regression

Let us train SVM Regression and find out MAPE.

from sklearn.svm import SVR
model = SVR(C = 1, kernel = "poly")
# SVR expects a 1-D target, so flatten the column vector
model.fit(x_train_scaled, y_train_scaled.ravel())
y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
mean_absolute_percentage_error(y_test, y_pred)

Output:

0.2112

5.4 Random Forest Regression

Let us train Random Forest Regression and find out MAPE.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 75, max_features = "log2")
model.fit(x_train_scaled, y_train_scaled.ravel())
y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
mean_absolute_percentage_error(y_test, y_pred)

Output:

0.3064

Let us see the feature importance.

  • From the feature importance, we can see that being a smoker has the highest influence on medical charges and increases them. Intuitively, a smoker is more likely to have health issues and to go to hospital, so medical charges increase.
  • As age increases, the probability of health issues increases, and this affects medical charges.
  • BMI also has an effect on charges, as observed from the feature importance.
  • An important thing to note is that our model gives only a very small weight to gender and region compared to smoker.
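
For tree ensembles in scikit-learn, the fitted model exposes impurity-based importances through the `feature_importances_` attribute. The sketch below shows how such a ranking can be read; the data is synthetic (not the Kaggle data) and `random_state` is an assumption added for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in data: the target depends strongly on smoker_yes and weakly on age
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "children": rng.integers(0, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
y = 250 * X["age"] + 24000 * X["smoker_yes"] + rng.normal(0, 3000, size=n)

model = RandomForestRegressor(n_estimators=75, max_features="log2", random_state=0)
model.fit(X, y)

# Importances are non-negative and sum to 1; sort them from largest to smallest
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)
```

Unlike linear coefficients, these values carry no sign, so they say how much a feature matters but not in which direction it moves the charges.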

5.5 XGBoost Regression

Let us train XGBoost Regression and find out MAPE.

import xgboost as xgb
model = xgb.XGBRegressor(n_estimators = 50, max_depth = 4)
model.fit(x_train_scaled, y_train_scaled.ravel())
y_pred_scaled = model.predict(x_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))
mean_absolute_percentage_error(y_test, y_pred)

Output:

0.2663

Let us see the feature importance.

  • From the feature importance, we can see that being a smoker has the highest influence on medical charges and increases them. Intuitively, a smoker is more likely to have health issues and to go to hospital, so medical charges increase.
  • As age increases, the probability of health issues increases, and this affects medical charges.
  • BMI also has an effect on charges, as observed from the feature importance.
  • An important thing to note is that our model gives only a negligible weight to gender and region compared to smoker.

5.6 Results

+----------------------------+-----------+
| Model                      | test MAPE |
+----------------------------+-----------+
| Linear Regression with L1  | 0.9429    |
| and L2 regularization      |           |
| kNN Regressor              | 0.3386    |
| SVM Regressor              | 0.2112    |
| Random Forest Regressor    | 0.3064    |
| XGBoost Regressor          | 0.2663    |
+----------------------------+-----------+

6. Conclusion

In this blog, several models were tried for predicting medical charges. The SVM Regression model performed best, with a test MAPE of 0.2112. In the insurance industry, it is very important to know which features increase costs. Even though SVM gave the lowest MAPE, I would go with XGBoost (test MAPE of 0.2663), as the XGBoost model provides feature importance.
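
As a side note, models without built-in importances can still be inspected with model-agnostic methods. The sketch below uses scikit-learn's `permutation_importance` on an SVR; it was not part of the experiments above, the data is synthetic, and the feature labels are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 400
# Synthetic, already-standardized stand-in features; only the first two drive the target
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=n)
feature_names = ["smoker_yes", "age", "children"]  # hypothetical labels

model = SVR(C=1, kernel="poly").fit(X, y)

# Shuffle one column at a time and measure how much the model's score drops
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, score in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

Permutation importance costs extra model evaluations and should ideally be computed on held-out data, so built-in importances remain the simpler option.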

Code Reference

Contact Links

Email: binginagesh@gmail.com
LinkedIn: https://www.linkedin.com/in/bingi-nagesh-5b0412b7/
