Analyzing China's GDP growth
We will analyze China's GDP growth from 1960 to 2019 in the following steps:
- Plot the data
- Choose a model
- Build the model
- Plot the model
- Split the data into train and test sets
- Evaluate the model
In this blog post, we will analyze China's GDP growth from 1960 to 2019. If the data follows a curved trend, a straight line cannot track it, so linear regression will fit less accurately than a non-linear regression.
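To make the linear versus non-linear claim concrete, here is a minimal sketch on synthetic, noise-free S-shaped data (not the China GDP data). The helper name `logistic` and the values 1.5 and 0.5 are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, k, x0):
    # S-shaped curve: k sets the steepness, x0 the midpoint
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Synthetic data for illustration only (NOT the GDP data)
x_syn = np.linspace(-5.0, 5.0, 50)
y_syn = logistic(x_syn, 1.5, 0.5)

# Straight-line fit by ordinary least squares
slope, intercept = np.polyfit(x_syn, y_syn, 1)
mse_linear = np.mean((slope * x_syn + intercept - y_syn) ** 2)

# Logistic fit by non-linear least squares
params, _ = curve_fit(logistic, x_syn, y_syn, p0=[1.0, 0.0])
mse_logistic = np.mean((logistic(x_syn, *params) - y_syn) ** 2)

print(f"linear MSE:   {mse_linear:.6f}")
print(f"logistic MSE: {mse_logistic:.6f}")
```

On curved data like this, the straight line leaves a much larger error than the curve that matches the shape of the data.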
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('/content/china_gdp.csv')
df
plt.figure(figsize=(8,5))
x_data, y_data = (df['Year'].values, df['Value'].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
We can see that growth starts off slowly. From 2005 onwards it accelerates sharply, and then it decelerates slightly after the 2008 global recession. This slow-fast-slow pattern resembles an S-shaped logistic (sigmoid) curve, which looks like this:
X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))
plt.plot(X, Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
def sigmoid(x, Beta_1, Beta_2):
    # logistic (sigmoid) function
    y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
    return y
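Written out as a formula, the function above is the two-parameter logistic curve. The symbols follow the argument names in the code:

```latex
\hat{y} = \frac{1}{1 + e^{-\beta_1 (x - \beta_2)}}
```

Here $\beta_1$ controls how steep the curve is, and $\beta_2$ is the value of $x$ at the midpoint, where $\hat{y} = 0.5$.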
Let's look at a sample sigmoid curve that might fit the data.
beta_1 = 0.1
beta_2 = 1990
# logistic function
Y_pred = sigmoid(x_data, beta_1, beta_2)
# plot initial prediction against data points
# (the sigmoid output lies between 0 and 1, so scale it up to the magnitude of GDP)
plt.plot(x_data, Y_pred*15000000000000)
plt.plot(x_data, y_data, 'ro')
Our task is to find the best parameters for the model.
First, let's normalize our x and y.
xdata = x_data / max(x_data)
ydata = y_data / max(y_data)
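Because the fit runs on normalized values, the fitted parameters are in normalized units too. The sketch below shows how they could be mapped back to years. The numbers `b1_norm`, `b2_norm`, and `x_max` are hypothetical values for illustration, not fitted results.

```python
import numpy as np

def sigmoid(x, Beta_1, Beta_2):
    return 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))

# Hypothetical values for illustration only (not fitted results)
x_max = 2019.0                     # assumed maximum of the Year column
b1_norm, b2_norm = 500.0, 0.995    # parameters in normalized x units

# sigmoid(x / x_max, b1, b2) equals sigmoid(x, b1 / x_max, b2 * x_max)
b1_years = b1_norm / x_max         # steepness per year
b2_years = b2_norm * x_max         # midpoint expressed as a year

years = np.arange(1960, 2020)
same = np.allclose(sigmoid(years / x_max, b1_norm, b2_norm),
                   sigmoid(years, b1_years, b2_years))
print(same, b2_years)
```

Dividing y by its maximum only rescales the curve's height, so predictions are converted back to GDP by multiplying by `max(y_data)`.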
How can we find the best parameters for our fit line?
We can use curve_fit, which uses non-linear least squares to fit our sigmoid function to the data.
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
# print the final parameters
print('beta_1=%f, beta_2=%f' % (popt[0], popt[1]))
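curve_fit also returns the covariance matrix pcov, which we have not used so far. As a rough sketch, the square roots of its diagonal give one-standard-deviation error estimates for the parameters. The data here is synthetic, and the names `x_demo`, `y_demo`, and the starting guess `p0` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, Beta_1, Beta_2):
    return 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))

# Synthetic, normalized-looking data with a little noise (illustration only)
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 1.0, 60)
y_demo = sigmoid(x_demo, 20.0, 0.7) + rng.normal(0.0, 0.01, x_demo.size)

# p0 is an initial guess; without it, curve_fit starts every parameter at 1
popt_demo, pcov_demo = curve_fit(sigmoid, x_demo, y_demo, p0=[10.0, 0.5])

# One-standard-deviation uncertainty of each fitted parameter
perr_demo = np.sqrt(np.diag(pcov_demo))
print('beta_1=%f +/- %f' % (popt_demo[0], perr_demo[0]))
print('beta_2=%f +/- %f' % (popt_demo[1], perr_demo[1]))
```

If a fit fails to converge or lands on implausible values, passing a reasonable `p0` is usually the first thing to try.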
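# evaluate the fit over the same year range as the data,
# normalized the same way as xdata
x = np.linspace(min(x_data), max(x_data), 100)
x = x / max(x_data)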
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x, y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
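# uniform random mask: roughly 80% of rows go to the train set
msk = np.random.rand(len(df)) < 0.8
# use the normalized data for both train and test sets
train_x = xdata[msk]
test_x = xdata[~msk]
train_y = ydata[msk]
test_y = ydata[~msk]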
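A random mask gives a different split on every run, and only approximately 80/20. If we want a reproducible split with exact sizes, one option is a seeded permutation of the row indices. This is a sketch, not the only way to do it; the seed 42 and the row count of 60 (one row per year from 1960 to 2019) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed makes the split repeatable
n_rows = 60                       # assumed: one row per year, 1960 through 2019

order = rng.permutation(n_rows)   # shuffled row indices
n_train = int(0.8 * n_rows)       # 80% of the rows for training

train_idx = np.sort(order[:n_train])
test_idx = np.sort(order[n_train:])
print(len(train_idx), len(test_idx))
```

The index arrays can then be used in place of the mask, for example `xdata[train_idx]` and `xdata[test_idx]`.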
Build the model using the train set.
popt, pcov = curve_fit(sigmoid, train_x, train_y)
Predict GDP using the test set.
y_hat = sigmoid(test_x, *popt)
print('Mean absolute error: %.2f' % np.mean(np.absolute(y_hat - test_y)))
print('Mean squared error (MSE): %.2f' % np.mean((y_hat - test_y)**2))
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
print('R2-score: %.2f' % r2_score(test_y, y_hat))
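To see what these three metrics compute, here is a small sketch that works them out by hand on made-up numbers (the arrays `y_true` and `y_pred` are illustrative, not model output). It also shows that the R2 score is not symmetric, so the order of the arguments to r2_score matters.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([0.10, 0.20, 0.40, 0.80])
y_pred = np.array([0.15, 0.25, 0.35, 0.70])

mae = np.mean(np.abs(y_pred - y_true))       # mean absolute error
mse = np.mean((y_pred - y_true) ** 2)        # mean squared error

# R2 = 1 - (residual sum of squares) / (total sum of squares of the true values)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('MAE: %.4f, MSE: %.4f' % (mae, mse))
print('R2 (manual): %.4f' % r2_manual)
print('R2 (true, pred): %.4f' % r2_score(y_true, y_pred))
print('R2 (pred, true): %.4f' % r2_score(y_pred, y_true))
```

The total sum of squares is taken around the mean of the first argument, which is why swapping the arguments changes the score.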