GE2213 Understanding Uncertainty and Statistical Reasoning Regression Analysis Case Study: Residential Property Price Data Analysis 1 Outline • • • • • • • Overview of regression analysis Simple linear regression model Determination coefficient Correlation coefficient Multiple regression model 3 special regression models Case study: Residential property price data analysis 2 Overview of Regression Analysis • Input • Dependent (or called response) variable, π • The variable we wish to estimate or predict • Independent (or called explanatory) variable, π • The variable used to connect (sometimes, also explain) the dependent variable • Output • A linear function that allows us to • Provide estimation: Estimate the average value of the dependent variable based on the corresponding value of the independent variable, usually in the form of a confidence interval • Provide prediction: Predict the value of the dependent variable based on the corresponding value of the independent variable, usually in the form of a prediction interval (which must be wider than the corresponding confidence interval) 3 Overview of Regression Analysis • It is quite possible to develop a relation (e.g. linear, quadratic) between X and Y that there is no causality (i.e. no reason and effect) at all. The relationship exists owing to casualty (i.e. happening by chance). • e.g. A straight line may appear to provide a good model for relating monthly output Y of a steel mill to the weight X of computer printouts appearing on a manager's desk during the month, but their causal relationship is tenuous. A 3rd variable may have caused the change in both X and Y, producing the relationship that we have observed. • Causality can be inferred only when analysis uncovers some plausible reasons for its existence. For instance, workers work faster as the queue length increases, and so a linear relationship is at least plausible. 4 Overview of Regression Analysis Cont’d e.g. two quadratic below 5 Simple Linear Regression Model • A simple linear regression model consists of two components • Regression line: A straight line describes the dependence of a single variable π on one variable πΏ • Random error: The unavoidable deviation is between the actual value and the expected value Population Intercept Dependent Variable ππ = Population Slope Coefficient Independent Variable Random Error π·π + π·π πΏπ + πΊπ Regression Line 6 6 Simple Linear Regression Model Cont’d π ππ ππ ππ ππ = ππ + ππ πΏπ πΏπ • π0 represents the sample intercept • π1 represents the sample slope coefficient πΏ 7 Simple Linear Regression Model – Method of Least Squares Cont’d • π0 and π1 are estimated using the method of least squares, which minimizes the Sum of Squared Errors (SSE) π π ππ2 = πππΈ = π=1 π (ππ − ππ )2 = π=1 [ππ − π0 + π1 ππ ]2 π=1 8 Simple Linear Regression Model – Method of Least Squares Cont’d • The formulae of π0 and π1 can be obtained by first partially differentiating SSE with respect to π0 and π1 , and next solving π π 2 π π π=1 ππ0 π = −2 π=1 ππ − π0 + π1 ππ =0 and π π 2 π=1 ππ ππ1 π = −2 ππ ππ − π0 + π1 ππ =0 π=1 simultaneously 9 Simple Linear Regression Model – Method of Least Squares Cont’d • The formulae of π0 and π1 are π0 = π − π1 π and π1 = π π=1(ππ − π)(ππ − π 2 (π − π) π π=1 π) 10 Simple Linear Regression Model – Determination Coefficient Cont’d • Measures the proportion of variation in π that is removed by the independent variable π, using the simple linear regression model • Sample determination coefficient π2 = removed variation total variation = π 2 π=1(ππ −π) π 2 π=1(ππ −π) • A unitless number between 0 and +1, inclusive • “Magnitude” measures how good the regression model fits • Closer to +1, the better the regression model fits • Also known as goodness-of-fit measure 11 Simple Linear Regression Model – Correlation Coefficient Cont’d • Describe the trend and strength of a linear relationship between two numerical variables • Sample correlation coefficient either π = − π 2 or π = + π 2 • A unitless number between -1 and +1, inclusive • “Sign” indicates the trend (down = negative r values / up = positive r values) of a linear relationship • “Magnitude” measures the strength of a linear relationship • Closer to -1, the stronger negative linear relationship • Closer to +1, the stronger positive linear relationship • Closer to 0, no linear relationship 12 Scatter Plots against Correlation Coefficients π π π r = -1 πΏ πΏ r = -0.6 π Cont’d r=0 πΏ π r = +0.6 πΏ r = +1 πΏ 13 Simple Linear Regression Model + Correlation Coefficient in Calculator • Data set X 5 5 5 10 10 10 15 15 15 20 20 20 Y 1.6 2.2 1.4 1.9 2.4 2.6 2.3 2.7 2.8 2.6 2.9 3.1 14 Determination or Correlation Coefficient – Example • Data set: The sales of vitamin supplement and the corresponding advertising expense Month 1 2 3 4 5 6 7 8 9 10 Sales (HK$’000) 963.500 893.000 1057.250 1183.250 1419.500 1547.750 1580.000 1071.500 1078.250 1122.500 Advertising Expense (HK$’000) 37.427 40.850 41.431 44.842 51.788 63.760 63.572 44.686 48.959 50.056 15 Determination or Correlation Coefficient – Example Cont’d • Scatter diagram 16 Determination or Correlation Coefficient – Example Cont’d • Sample determination coefficient between sales of vitamin supplement and advertising expense is π 2 = 0.8617 (you should use a hand calculator to check) which indicates that 86.17% variation in π removed by the independent variable π, using the regression model • Sample correlation coefficient between sales of vitamin supplement and advertising expense is π = +0.9283 (you should use a hand calculator to check) • There is a very strong positive linear relationship between sales of vitamin supplement and advertising expense 17 Simple Linear Regression Model – Example Cont’d • We wish to examine the linear dependency of the sales of vitamin supplement on advertising expenses • 10 monthly records were randomly selected Month 1 2 3 4 5 6 7 8 9 10 Sales (HK$’000) 963.500 893.000 1057.250 1183.250 1419.500 1547.750 1580.000 1071.500 1078.250 1122.500 Advertising Expense (HK$’000) 37.427 40.850 41.431 44.842 51.788 63.760 63.572 44.686 48.959 50.056 • Find a straight line function that best fits the data 18 Simple Linear Regression Model – Example Cont’d • The sample simple linear regression model π = −15.339 + 24.765π (you should use a hand calculator to check) where π = sales (in HK$’000) π = advertising expense (in HK$’000) • For each HK$1000 (= 1 × 1000) increase in advertising expense, the model estimates or predicts that the sales of vitamin supplement increases by HK$24,765 (= 24.765 × 1000) 19 Simple Linear Regression Model – Example Cont’d • What is the estimated or predicted amount of sales if the advertising expense is HK$42,000? π = −15.339 + 24.765π = −15.339 + 24.765 42 = 1024.791 (in HK$’000, i.e. HK$1,024,791) 20 Simple Linear Regression Model in Microsoft Excel • Step 1: Add the “Analysis ToolPak” • File ο Options ο Add-Ins ο Click “Go” at the bottom ο Check “Analysis ToolPak” and click “OK” • “Data Analysis” button can be found in the “Data” menu bar 21 Simple Linear Regression Model in Microsoft Excel Cont’d • Step 2: Develop regression model • Invoke the Data Analysis function, select “Regression” and click “OK” 22 Simple Linear Regression Model in Microsoft Excel Cont’d • Step 3: Input the necessary information 23 Simple Linear Regression Model in Microsoft Excel Cont’d • Step 4: Obtain the output |π| π2 86.17% (about 86.2%) of the variation in sales of vitamin supplement can be removed by the variability in advertising expense 24 π0 and π1 Multiple Regression Model • A multiple regression model is to connect one dependent variable with two or more independent variables in a linear function Population Intercept Population Slope Coefficients ππ = π·π + π·π πΏππ + π·π πΏππ + β― + π·π² πΏπ²π + πΊπ Dependent Variable Independent Variables Random Error • π0 represents the sample intercept • π1 , π2 … , ππΎ represent the sample slope coefficients 25 Multiple Regression Model Cont’d π ππ ππ ππ ππ ππ = ππ + ππ πΏππ + ππ πΏππ (π1π , π2π ) π2 π1 26 Multiple Regression Model – Example Cont’d • In addition to advertising expense, it is believed that bonus paid to salespersons affects the sales of vitamin supplement • Fit a multiple regression model for estimating or predicting the sales of vitamin supplement using both plausible independent variables Month 1 2 3 4 5 6 7 8 9 10 Sales (HK$’000) 963.500 893.000 1057.250 1183.250 1419.500 1547.750 1580.000 1071.500 1078.250 1122.500 Advertising Expense (HK$’000) 37.427 40.850 41.431 44.842 51.788 63.760 63.572 44.686 48.959 50.056 Bonus Paid (HK$’000) 23.098 23.628 27.157 29.12 28.217 32.116 29.432 30.569 23.841 27.138 27 Multiple Regression Model – Example Cont’d • Excel output (see the Excel file) 88.6% of the variation in sales of vitamin supplement can be π 2 removed by the two independent variables included in the regression model 28 πππππ = −279.603 + 21.216 π΄ππ£ + 15.940(π΅πππ’π ) Multiple Regression Model – Example Cont’d πππππ = −279.603 + 21.216 π΄ππ£ + 15.940(π΅πππ’π ) For each HK$1000 increase in advertising expense, the estimated or predicted sales of vitamin supplement increased by HK$21,216; holding bonus paid to salespersons unchanged For each HK$1000 increase in bonus paid to salespersons, the estimated or predicted sales of vitamin supplement increased by HK$15,940; holding advertising expense unchanged 29 Multiple Regression Model – Quick Exercises Cont’d • What is the estimated or predicted sales of vitamin supplement if the advertising expense is HK$60,000 and bonus to salespersons is HK$30,000? πππππ = −279.603 + 21.216 π΄ππ£ + 15.940(π΅πππ’π ) 60 30 • In this month, the company has an extra HK$100,000 budget, how should the money be used? Advertising or paying bonus to salespersons or both? 30 Special Regression Model – Quadratic Model • A regression model connecting one dependent variable with two or more independent variables in a quadratic function ππ = π½0 + π½1 π1π + π½2 π1π 2 + ππ • The second independent variable is the square of the first independent variable • It is suitable when a scatter diagram indicates a non-linear relationship, with one of the following 4 shapes π π π π 31 ο’2 > 0 πΏπ ο’2 > 0 πΏπ ο’2 < 0 πΏπ ο’2 < 0 πΏπ Special Regression Model – Dummy-Variable Model Cont’d • It is useful when categorical variable is used as independent variable HK π Island ππ = π½0 + π½1 π1π + π½2 π2π + ππ π0 + π2 • For example • π = Sales of vitamin supplement (HK$’000) • π1 = Advertising expense (HK$’000) • π2 = Location of shop = Kowloon π0 π1 0 ππ πΎππ€ππππ 1 ππ π»πππ πΎπππ πΌπ ππππ • For π2 = 0 (i.e. Kowloon) ππ = π0 + π1 π1π + π2 0 = π0 + π1 π1π • For π2 = 1 (i.e. Hong Kong Island) ππ = π0 + π1 π1π + π2 1 = (π0 +π2 ) + π1 π1π Same slopes 32 Special Regression Model – Interaction-Variable Model Cont’d • Interaction variable is created by multiplying two or more independent variables together • Usually, one of those independent variables is categorical • It allows regression models for different categories to have different intercepts and different slopes ππ = π½0 + π½1 π1π + π½2 π2π + π½3 π1π π2π + ππ • Without interaction term (i.e. π½3 = 0), effect of π1 (or π2 ) on π is measured by π½1 (or π½2 ) • With interaction term(i.e. π½3 ≠ 0), effect of π1 (or π2 ) on π is measured by (π½1 +π½3 π2 ) [or (π½2 +π½3 π1 )] π π0 + π2 33 π0 π1 Application of Regression Analysis – Centa-City Index • Why Property Price Indices? • Investors and potential home-buyers require indicators to study the current movement of property prices in Hong Kong • The creation of the “Centa-City Index” aims to provide such information to the general public as a source of reference on trends in Hong Kong’s property market • How are the Index constructed? • Multiple regression analysis is used to determine the effect of various attributes on property price • Attributes are considered such as floor area, years of occupancy, location, direction, view, floor level, and so on. • Who originally developed the “Centa-City Index”? • Jointly developed by Centaline Property Agency Limited and Department of Management Sciences in City University of Hong Kong 34 Application of Regression Analysis – Centa-City Index 35 Application of Regression Analysis – Centa-City Index 36 Application of Regression Analysis – Centa-City Index 37 Application of Regression Analysis – Centa-City Index • More information • http://www.cb.cityu.edu.hk/ms/work/hkcci/ • http://hk.centadata.com/cci/cci_e.htm 38 Case Study: Residential Property Price Data Analysis 39 Case Background • A new ocean side complex consists of two adjacent and connected 8-floor buildings • The complex contains 200 flats of equal size (where 96 flats are in Building 1 and 104 flats are in Building 2) • 96 full ocean-view flats are those facing the ocean (e.g. 101, 103, 105, 107, 109 in Building 1; 113, 115, 117, 119, 121, 123, 125 in Building 2). 40 full ocean-view flats in Building 1 also have a good view of the pool (e.g. 101, 103, 105, 107, 109). • 8 partial ocean-view flats (i.e. 111, 211, 311, 411, 511, 611, 711, 811 in Building 1) are those having part of their ocean view blocked by Building 2 • 96 bay-view flats are those facing the parking lot (e.g. 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124). • The only elevator is located near the office and the game room, all at the east end of Building 1 40 Layout of the Complex 41 Case Background Cont’d • The complex was completed during an economic recession; thus, sales were slow • The developer furnished many unsold flats and rented them out • Eventually, the developer stopped all leases and sold out remaining flats by auction 42 Study Objective • To investigate the relationship between auction price and • Height of a flat (expressed by the floor number) • Distance of a flat from the elevator (expressed by the number of flats) • Presence or absence of full ocean view • Presence or absence of partial ocean view • Presence or absence of furniture in an unsold flat 43 The Data • 106 auctioned flats (see the Excel file) are examined using a multiple regression model as follows: • Auction price (π, called ππ«π’ππ) • in US $’00 • Height of a flat (π1 , called Floor) • Takes the floor number: 1 or 2 or … or 8 • Distance of a flat from the elevator (π2 , called Distance) • The distance is expressed by the number of flats • An additional distance of two flats is added to a flat in Building 2 • Takes the number of flats: 1 or 2 or … or 6 in Building 1; 9 or 10 or … or 15 in Building 2 44 The Data Cont’d • Full ocean view (π3 , called ππ’ππ°) • The presence or absence of full ocean view • A dummy variable is used • 1 if the flat has full ocean view; 0 if not • Partial ocean view (π4 , called End) • The presence or absence of partial ocean view • A dummy variable is used • 1 if the flat is 111 or 211 or 311 or 411 or 511 or 611 or 711 or 811; 0 if not • Furniture (π5 , called Furnish) • The presence or absence of furniture • A dummy variable is used • 1 if the flat has furniture; 0 if not 45 The Data π : : πΏπ Cont’d πΏπ 46 Dummy Variables πΏπ , πΏπ , πΏπ Data Analysis Correlation coefficient between Auction price and Floor height, π = −0.320 47 Data Analysis Cont’d Correlation coefficient between Auction price and Distance from elevator, π = −0.089 48 Data Analysis Cont’d • Effect of full ocean view View of Ocean Sample Size Mean St. Dev. 0 36 173.944 8.802 1 70 201.000 13.847 • Effect of partial ocean view End Unit Sample Size Mean St. Dev. 0 99 192.141 18.221 1 7 187.143 10.351 Furniture Sample Size Mean St. Dev. 0 61 189.508 14.880 1 45 194.933 20.942 • Effect of furniture 49 Regression Analysis – Model 1 • All five plausible independent variables are considered π2 π = 177.703 − 0.715π1 − 0.873π2 +31.273π3 −17.808π4 + 9.984π5 50 Regression Analysis – Model 2 Cont’d • Effect of Height & Full Ocean View interaction (π1 π3 ) on auction price is added π2 π = 152.212 + 3.225π1 − 0.828π2 + 60.054π3 −18.651π4 + 11.481π5 − 4.724 π1 π3 51 Regression Analysis – Model 3 Cont’d • Effect of Height might not be linearly related to Auction price; hence, interaction variables (π1 2 and π1 2 π3 ) are included π2 π = 132.378 + 10.211π1 − 0.820π2 + 86.733π3 −19.178π4 + 11.633π5 − 15.303 π1 π3 −0.585π1 2 + 0.958π1 2 π3 52 Conclusions and Recommendations • Which model will you choose? • How the results can help in prediction? • What additional factor(s) will you consider? 53