In part one of this series, you learned how to perform exploratory analysis and visualization on sample data. Picking up where part one left off, part two shows how to plot the correlation between price and the other features and how to predict the price of a house using XGBoost.

Plotting the correlation between price and other features

Now that the initial exploratory analysis is done, you’ll have a better idea of how the different variables contribute to the price of each house. However, you do not yet know how important each variable is when it comes to predicting prices. Since you have 21 variables, it becomes difficult to build a single model that incorporates all of them. Therefore, some variables may need to be discarded if they are less significant than the others.

Correlation coefficients are used in statistics to measure how strong the linear relationship is between two variables. In particular, Pearson’s correlation coefficient is the one most commonly used when performing linear regression. It always takes a value between -1 and +1:

  • A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
  • A correlation coefficient of -1 means that for every increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with acceleration or the gear mechanism (more gas is used up by traveling for longer periods in first gear compared to fourth gear).
  • A correlation coefficient of zero means that an increase in one variable is not accompanied by any consistent increase or decrease in the other. The two have no linear relationship.
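The three cases above can be checked numerically. The following is a minimal sketch in plain Python, assuming the standard definition of Pearson’s coefficient (covariance divided by the product of the standard deviations); the helper name pearson_r and the sample values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of the deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * v + 1 for v in x])     # y rises in fixed proportion
r_down = pearson_r(x, [10 - 3 * v for v in x])  # y falls in fixed proportion
r_none = pearson_r(x, [2, 1, 0, 1, 2])          # V shape, no linear trend
print(r_up, r_down, r_none)
```

The last case is worth noting: the V-shaped values clearly depend on x, yet the coefficient is essentially zero, because Pearson’s coefficient only captures linear relationships.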
  1. Begin by dropping the id and date features from the dataset: the id values are all unique and add no value to your analysis, while the dates require a different function to handle them correctly:
x_df = dataframe.drop(['id', 'date'], axis=1)


  1. Copy the dependent variable (house prices, in this case) into a new dataframe:
y = dataframe[['price']].copy()
y_df = pd.DataFrame(y)


  1. The correlation between price and every other variable can be manually found using this script:
print('Price Vs Bedrooms: %s' % x_df['price'].corr(x_df['bedrooms']))
print('Price Vs Bathrooms: %s' % x_df['price'].corr(x_df['bathrooms']))
print('Price Vs Living Area: %s' % x_df['price'].corr(x_df['sqft_living']))
print('Price Vs Plot Area: %s' % x_df['price'].corr(x_df['sqft_lot']))
print('Price Vs No. of floors: %s' % x_df['price'].corr(x_df['floors']))
print('Price Vs Waterfront property: %s' % x_df['price'].corr(x_df['waterfront']))
print('Price Vs View: %s' % x_df['price'].corr(x_df['view']))
print('Price Vs Grade: %s' % x_df['price'].corr(x_df['grade']))
print('Price Vs Condition: %s' % x_df['price'].corr(x_df['condition']))
print('Price Vs Sqft Above: %s' % x_df['price'].corr(x_df['sqft_above']))
print('Price Vs Basement Area: %s' % x_df['price'].corr(x_df['sqft_basement']))
print('Price Vs Year Built: %s' % x_df['price'].corr(x_df['yr_built']))
print('Price Vs Year Renovated: %s' % x_df['price'].corr(x_df['yr_renovated']))
print('Price Vs Zipcode: %s' % x_df['price'].corr(x_df['zipcode']))
print('Price Vs Latitude: %s' % x_df['price'].corr(x_df['lat']))
print('Price Vs Longitude: %s' % x_df['price'].corr(x_df['long']))


  1. An easier way to find the correlation between one variable and all other variables (or columns) in a dataframe takes just one line: call corr() on the whole dataframe and select the column of the variable you are interested in (price, in this case).


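  2. The correlated variables may be plotted using the seaborn library and the following script:
import seaborn as sns
sns.pairplot(data=x_df, x_vars=['price'],
y_vars=['bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront','view',
'grade', 'condition', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long'],
height = 5)  # this argument was named size in seaborn releases before 0.9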
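The one-line approach mentioned in the steps above might look like the following sketch. It assumes pandas’ DataFrame.corr(), which computes pairwise Pearson coefficients by default; the small x_df built here is a toy stand-in with made-up values, not the real 19-column dataframe.

```python
import pandas as pd

# Toy stand-in for x_df with made-up values (the real dataframe has 19 columns).
x_df = pd.DataFrame({
    "price": [200, 350, 150, 500, 420],
    "sqft_living": [1000, 1800, 800, 2500, 2000],
    "bedrooms": [2, 3, 2, 4, 3],
})

# corr() builds the full pairwise correlation matrix; selecting the price
# column leaves the correlation of price with every column, itself included.
price_corr = x_df.corr()["price"].sort_values(ascending=False)
print(price_corr)
```

Sorting the result is optional, but it makes the most and least correlated features easy to spot at a glance.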


  1. After dropping the id and date variables, the new dataframe, which is named x_df, contains 19 variables or columns. For the purpose of this article, only the first ten entries are printed out:

First 10 entries of output

  1. On creating a new dataframe with only the dependent variable (price), you will see an output as follows. This new dataframe is named y_df. Again, only the first ten entries of the price column are printed for illustration purposes:

  1. Here is the correlation between price and other variables:

Notice that the sqft_living variable is most highly correlated with the price and has a correlation coefficient of 0.702035. The next most highly correlated variable is grade, with a correlation coefficient of 0.667434 followed by sqft_above, which has a correlation coefficient of 0.605567. Zipcode is the least correlated variable with price and has a correlation coefficient of -0.053202.

The correlation coefficients found using the simplified code are exactly the same, but the output also includes the correlation of price with itself, which turns out to be 1.0000, as expected. This is illustrated in the following screenshot:

Now let’s build a simple linear model to predict house prices using all the features in the current dataframe. You can then evaluate the model and try to improve the accuracy using a more complex model in the latter half of the section.

  1. To get started, drop the price column from the x_df dataframe and save the result into a new dataframe named x_df2 using the following script:
x_df2 = x_df.drop(['price'], axis = 1)


  1. Declare a variable named reg and assign it a LinearRegression() instance from the scikit-learn library using the following script:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()


  1. Split the dataset into test and train using the following script:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_df2,y_df,test_size=0.4,random_state=4)


  1. Fit the model over the training set using the following script:
reg.fit(x_train,y_train)


  1. Print the coefficients learned by fitting linear regression to the training set using the reg.coef_ attribute.
  2. Take a look at the column of predictions generated by the model by calling reg.predict on the feature rows you want prices for.
  1. Print the accuracy of the model, for example with reg.score(x_test, y_test), which returns the R² score on the test set.
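The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the features and target are synthetic stand-ins for x_df2 and y_df, and the calls used (LinearRegression, train_test_split, coef_, predict, score) are standard scikit-learn ones rather than a copy of the original script.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x_df2 / y_df: 200 rows, 3 features, mostly linear.
rng = np.random.default_rng(4)
features = rng.uniform(0, 10, size=(200, 3))
target = 50 * features[:, 0] - 20 * features[:, 1] + rng.normal(0, 5, size=200)

# Hold out 40% of the rows for testing, as in the article.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.4, random_state=4
)

reg = LinearRegression()
reg.fit(x_train, y_train)

print(reg.coef_)                    # one coefficient per feature
predictions = reg.predict(x_test)   # one prediction per test row
print(predictions[:5])
print(reg.score(x_test, y_test))    # R^2 on the held-out rows
```

Scoring on rows the model never saw during fitting is what makes the reported number a fair estimate of how it will behave on new houses.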


The output after fitting the regression model to the training sets must look as follows:

The reg.coef_ attribute holds 18 coefficients, one for each feature in x_df2:

A positive coefficient means that the predicted price rises as that feature increases, while a negative coefficient means that it falls. Because the features are measured on different scales, the magnitudes of the coefficients are only directly comparable as a measure of importance once the features have been standardized. On printing the predictions, you should see an output which is an array of values indexed from 1 to 21,612, one value for each row in the dataset, as shown in the following screenshot:

Finally, on printing the accuracy of the model, you can obtain an accuracy of 70.37%, which is not bad for a linear model:
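To make the meaning of this accuracy figure concrete, here is a minimal sketch of how such a score can be computed by hand. It assumes the figure is the coefficient of determination (R²), which is what scikit-learn’s LinearRegression.score returns; the function name r_squared and the prices below are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean price.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical prices (in thousands) and model predictions.
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [220.0, 280.0, 410.0, 490.0]
score = r_squared(actual, predicted)
print(f"R^2 = {score:.4f}")
```

Here the score works out to 0.98: the predictions leave only 2% of the variation in price unexplained. A score of 0.7037 would correspond to the 70.37% quoted above.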

The linear model does alright at its first attempt, but if you want your model to be more accurate, you’ll have to use a more complex model with some non-linearities in order to fit well to all the data points. XGBoost is the model used in this section in order to try and improve the accuracy obtained through linear regression:

  1. Import the XGBoost library using the import xgboost command.
  2. In case this produces an error, you will have to install the library with pip through the terminal. On macOS, this can be done by first installing Homebrew: open a new terminal window and issue the following command:

/usr/bin/ruby -e “$(curl -fsSL”

  1. At this stage, you must see an output as follows:

  1. During the installation, you will be prompted to enter your password. After Homebrew is installed, you will see an output as follows:

  1. Next, install Python using the command brew install python
  2. Check your installation using the brew doctor command and follow homebrew’s suggestions.
  3. Once Homebrew is installed, do a pip install of XGBoost using the command pip install xgboost
  4. Once it finishes installing, you should be able to import XGBoost into the IPython environment.

Once XGBoost is imported successfully into the Jupyter environment, you will be able to use the functions within the library to declare and store the model using the following steps:

  1. Declare a variable named new_model to store the model and declare all its hyperparameters using the following command:
new_model = xgboost.XGBRegressor(n_estimators=750, learning_rate=0.09, gamma=0, subsample=0.65, colsample_bytree=1, max_depth=7)
  1. The output of the preceding command must look as follows:

  1. Split the data into test and training sets and fit the new model to the split data using the following command:
from sklearn.model_selection import train_test_split
traindf, testdf = train_test_split(x_train, test_size = 0.2)
new_model.fit(x_train,y_train)


  1. At this point, you will see an output as follows:

  1. Finally, use the newly fitted model to predict the house prices and evaluate the new model using the following command:
from sklearn.metrics import explained_variance_score
predictions = new_model.predict(x_test)
print(explained_variance_score(y_test, predictions))


  1. On executing the preceding commands, you must see an output as follows:

Notice that the new model’s accuracy (its explained variance score) is now 87.79%, or approximately 88%, a clear improvement over the 70.37% obtained with the linear model.

In this case, the number of estimators is set to 750. After experimenting with values from 100 to 1,000, it was determined that 750 estimators gave the best accuracy. The learning rate is set to 0.09, the subsample rate to 65%, and max_depth to 7. max_depth did not seem to have much influence on the model’s accuracy. However, the accuracy did improve with lower learning rates. By experimenting with various hyperparameters, the accuracy improved to 89%.

If you found this article helpful, you can explore Apache Spark Deep Learning Cookbook to gain expertise in training and deploying efficient deep learning models on Apache Spark. With the help of this book, you’ll be able to work through specific recipes to generate outcomes for deep learning algorithms, without getting bogged down in theory.