Saturday, October 11, 2014

3.7 Exercises

1) The p-values correspond to the null hypotheses that each coefficient (intercept, TV, radio, newspaper) is zero while the other predictors stay in the model. They indicate that all the predictors except newspaper are significant.
Since the intercept is non-zero, there are positive sales even when there is no advertising on any channel.
Since the radio coefficient is larger than the TV coefficient, an extra unit of radio advertising increases sales more than an extra unit of TV advertising.

2) The KNN classifier assigns a point to the class held by the majority of its K nearest neighbors, while KNN regression estimates the response at that point as the average of the responses of its K nearest neighbors. A sketch of both is below.
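
A minimal base-R sketch of the difference, on made-up 1-D data with a hand-rolled neighbor search (all names here are just for this illustration):

set.seed(1)
x.train <- runif(20)
y.num   <- 3*x.train + rnorm(20, sd=0.2)            # numeric response, for regression
y.class <- factor(ifelse(x.train > 0.5, "A", "B"))  # class labels, for classification
x0 <- 0.42                                          # test point
k  <- 5
nn <- order(abs(x.train - x0))[1:k]                 # indices of the k nearest neighbors of x0
mean(y.num[nn])                                     # KNN regression: average of the neighbors' responses
names(which.max(table(y.class[nn])))                # KNN classification: majority class among the neighbors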

3) a)

so the model is salary = 50 + 20*GPA + 0.07*IQ + 35*Gender + 0.01*GPA*IQ - 10*GPA*Gender

if IQ and GPA are fixed, a female earns 35 - 10*GPA more than a male,
so females earn more only when GPA < 3.5, interesting!
since GPA is on a 4.0 scale, males earn more on average once GPA is high enough (above 3.5),
so iii) is true and i), ii), iv) are false

b) salary = 50 + 4*20 + 110*0.07 + 1*35 + 4*110*0.01 + 4*1*(-10) = 137.1 thousand dollars, i.e. $137,100
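
A one-line sanity check of this arithmetic in R (coefficients taken from the model above):

sum(50, 4*20, 110*0.07, 1*35, 4*110*0.01, 4*1*(-10))   # 137.1, i.e. $137,100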

c) False. The size of the interaction coefficient says nothing about the evidence for an interaction effect; that depends on the p-value (or t-statistic) of the GPA:IQ term, not on how small the coefficient is.

4)
a) training RSS should be lower for the cubic: the extra flexibility lets it fit the training data at least as well as the linear model
b) testing RSS should be lower for the linear model, since the true relationship is linear and the cubic overfits
c) training RSS is again lower for the cubic, for the same reason as in a): a more flexible model always achieves lower (or equal) training RSS
d) for testing RSS there is not enough information to tell: the cubic wins if the non-linearity is strong, the linear model may still win if it is mild
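
A small simulation sketch (entirely made-up data with a truly linear signal, arbitrary 100/100 train/test split) illustrating a) and b): the cubic always wins on training RSS, while the linear fit tends to win on test RSS.

set.seed(1)
x <- rnorm(200)
y <- 1 + 2*x + rnorm(200)                # the true relationship is linear
train <- 1:100                           # first half for training, second half for testing
lin <- lm(y ~ x, subset=train)
cub <- lm(y ~ poly(x,3), subset=train)
rss <- function(fit, idx) sum((y[idx] - predict(fit, data.frame(x=x[idx])))^2)
c(train.linear=rss(lin, train),   train.cubic=rss(cub, train))     # cubic is lower
c(test.linear=rss(lin, 101:200),  test.cubic=rss(cub, 101:200))    # linear is usually lower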

5)
$$ \hat{y}_i = x_i\hat{\beta} = x_i\,\frac{\sum_{i'=1}^{n} x_{i'}y_{i'}}{\sum_{i''=1}^{n} x_{i''}^2} = \sum_{i'=1}^{n} \frac{x_i x_{i'}}{\sum_{i''=1}^{n} x_{i''}^2}\, y_{i'} = \sum_{i'=1}^{n} a_{i'}y_{i'} $$

where $$ a_{i'} = \frac{x_i x_{i'}}{\sum_{i''=1}^{n} x_{i''}^2} $$
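A quick numerical check of this identity on simulated data (regression through the origin, slope sum(x*y)/sum(x^2); the names are just for this sketch):

set.seed(1)
x <- rnorm(50)
y <- 2*x + rnorm(50)
beta.hat <- sum(x*y)/sum(x^2)        # least squares slope for regression through the origin
i <- 7                               # check the identity for the 7th fitted value
a <- x[i]*x/sum(x^2)                 # the weights a_i'
c(x[i]*beta.hat, sum(a*y))           # the two expressions give the same number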

6) By (3.4) we have $$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \quad\Rightarrow\quad \hat{\beta}_0 + \hat{\beta}_1\bar{x} = \bar{y} $$
Therefore the least squares line passes through the point $$ (\bar{x},\bar{y}) $$
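A quick numerical check on simulated data (values here are arbitrary): predicting at x = mean(x) returns mean(y).

set.seed(1)
x <- rnorm(50)
y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)
c(unname(predict(fit, data.frame(x=mean(x)))), mean(y))   # the two values coincide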

7) Here is the proof
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Pearson.27s_correlation_and_least_squares_regression_analysis

and see this
http://en.wikipedia.org/wiki/Coefficient_of_determination#As_explained_variance
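
A quick numerical illustration of the claim that R^2 equals the squared correlation in simple linear regression (simulated data, names are just for this sketch):

set.seed(1)
x <- rnorm(100)
y <- 3 + 0.5*x + rnorm(100)
fit <- lm(y ~ x)
c(summary(fit)$r.squared, cor(x, y)^2)   # identical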

8)

auto<-read.csv("Auto.csv",na.strings="?")   # "?" marks missing horsepower values
attach(auto)
fit<-lm(mpg~horsepower)
summary(fit)

a)

i) yes, there is a relationship between horsepower and mpg: the F-statistic's p-value is essentially zero
ii) the relationship is fairly strong: the p-value is tiny and R^2 is around 0.6
iii) it is negative: mpg decreases as horsepower increases
iv) predict(fit,data.frame(horsepower=98)) gives 24.46708;
predict(fit,data.frame(horsepower=98), interval="confidence")
predict(fit,data.frame(horsepower=98), interval="prediction")
give the confidence and prediction intervals around that value

b)

plot(horsepower,mpg)
abline(fit)

c)

plot(fit)
the residuals vs fitted plot shows a clear U-shape, which suggests the relationship is non-linear


9)
a) pairs(auto)
b) cor(auto[,-9], use="complete.obs")   # exclude name and drop rows with missing horsepower
c)

fit<-lm(mpg~.-name,data=auto)
summary(fit)
i) yes, there is a relationship between the predictors and mpg: the F-statistic is large with an essentially zero p-value
ii) weight, year and origin are strongly significant, displacement less so
iii) the coefficient on year is 0.750773: holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg

d)
plot(fit)
observation 14 stands out as a high leverage point,
and there are a few candidate outliers such as 323, 327 and 394

e)
all pairwise interactions except those involving name:
fit<-lm(mpg~(.-name)*(.-name),data=auto)
the significant interactions include:
acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 *
acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
displacement:year          5.934e-03  2.391e-03   2.482  0.01352 *


f)
plot(log(auto$horsepower),auto$mpg)
plot(auto$horsepower,auto$mpg)
the log-transformed plot looks more linear than the raw one; a rough comparison is below
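
One way to make the comparison more concrete is to compare R^2 across a few candidate transformations (a sketch, using the auto data frame loaded above; the set of transformations is just an arbitrary choice):

sapply(list(linear = mpg ~ horsepower,
            log    = mpg ~ log(horsepower),
            sqrt   = mpg ~ sqrt(horsepower),
            square = mpg ~ I(horsepower^2)),
       function(f) summary(lm(f, data=auto))$r.squared)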

10)
a)
install.packages("ISLR")
library(ISLR)
data(Carseats)
fit<-lm(Sales~Price+Urban+US,data=Carseats)
summary(fit)

b)
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
  Price       -0.054459   0.005242 -10.389  < 2e-16 ***
  UrbanYes    -0.021916   0.271650  -0.081    0.936  
USYes        1.200573   0.259042   4.635 4.86e-06 ***

A $1 increase in Price decreases Sales by about 0.0545 thousand units (~54 units).
If the store is in the US, Sales increase by about 1.2 thousand units (~1200 units).
If the store is in an urban area, Sales decrease by about 22 units, but this coefficient is not significant.

c)
Sales = 13.043469 - 0.054459*Price - 0.021916*UrbanYes + 1.200573*USYes, where UrbanYes = 1 if the store is urban and USYes = 1 if it is in the US (0 otherwise)

d)
For Price and US we can reject the null hypothesis that their coefficients are zero; for Urban we cannot

e)
fit2<-lm(Sales~Price+US,data=Carseats)

f)
Both models have essentially the same R^2 (about 0.24), so neither fits the data especially well; the RSE of the larger model is slightly higher, which is expected after adding a predictor (Urban) that carries no information.

g)
confint(fit2)
                2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

plot(fit2)
the leverage plot indicates that observation 368 may be a high leverage point,
and the studentized residuals stay roughly within (-3, 3), so there are no clear outliers (checked numerically below)
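
To back this up with numbers rather than just eyeballing the plots, the hat values and studentized residuals can be inspected directly (a small sketch using base R diagnostics; the 2x-mean cutoff is just a common rule of thumb):

lev <- hatvalues(fit2)
which(lev > 2*mean(lev))          # candidate high-leverage points (mean leverage is (p+1)/n)
which(abs(rstudent(fit2)) > 3)    # candidate outliers; likely none here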

11)
set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)

a)
fit<-lm(y~x+0)
Coefficients:
    Estimate Std. Error t value Pr(>|t|)  
x   1.9939     0.1065   18.73   <2e-16 ***

So the estimated slope (1.9939) is close to the true value of 2, as expected

b)
fit<-lm(x~y+0)
summary(fit)

Coefficients:
  Estimate Std. Error t value Pr(>|t|)  
y  0.39111    0.02089   18.73   <2e-16 ***

0.39111 looks interesting at first, shouldn't it be closer to 0.5 (the inverse of 2)? It shouldn't: the least squares estimate here is $$ \sum x_iy_i / \sum y_i^2 $$, and with y = 2x + eps this is roughly 2 Var(x) / (4 Var(x) + Var(eps)) = 2/5 = 0.4, so 0.39 is exactly what we should expect. A quick check is below.
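
Checking the closed-form estimate against the lm output, using the x and y simulated above:

sum(x*y)/sum(y^2)   # ~0.39, matching the coefficient reported by lm(x~y+0)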



12)
a)
the coefficient of the regression of Y onto X equals that of X onto Y exactly when
$$ \sum x_i^2= \sum y_i^2 $$

b)
x<- 1:100
y<-2*x+rnorm(100,sd=0.0001)    # sum(y^2) is ~4x sum(x^2), so the two slopes differ (~2 vs ~0.5)
lm.Y<- lm(y~x+0)
summary(lm.Y)
lm.X<- lm(x~y+0)
summary(lm.X)

c)
x<- 1:100
y<-x+rnorm(100,sd=0.0001)      # here sum(x^2) ~ sum(y^2), so both slopes are ~1
lm.Y<- lm(y~x+0)
summary(lm.Y)
lm.X<- lm(x~y+0)
summary(lm.X)

13)

set.seed(1)

a)
x=rnorm(100)

b)
eps=rnorm(100,0,0.25)

c)
y=-1+0.5*x+eps
length(y) is 100
beta0=-1
beta1=0.5

d)
plot(x,y)
nicely linear

e)
fit<-lm(y~x)
summary(fit)

Coefficients:
  Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
  x            0.49973    0.02693   18.56   <2e-16 ***

the estimated coefficients are very close to the true values (-1 and 0.5)

f)
abline(fit, col="black")

abline(-1, 0.5, col = "red")
legend("bottomright",c("least sq", "pop"), title="legend",col=c("black","red"),lty = c(1,1))

g)
fit2<-lm(y~x+I(x^2))
or
fit3<-lm(y~poly(x,2))
the coefficient on the squared term is not significant, so the quadratic term does not improve the model

h)
set.seed(1)
x=rnorm(100)
eps=rnorm(100,0,0.1)
y=-1+0.5*x+eps
fit<-lm(y~x)
summary(fit)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.003769   0.009699  -103.5   <2e-16 ***
  x            0.499894   0.010773    46.4   <2e-16 ***


plot(x,y)
abline(fit, col="black")

abline(-1, 0.5, col = "red")
legend("bottomright",c("least sq", "pop"), title="legend",col=c("black","red"),lty = c(1,1))

i)

set.seed(1)
x=rnorm(100)
eps=rnorm(100,0,0.01)
y=-1+0.5*x+eps
fit<-lm(y~x)
summary(fit)
plot(x,y)
abline(fit)
abline(-1, 0.5, col = "red")

still the lines match closely

confint(fit)
the confidence intervals get tighter with less noisy data as expected
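
A compact way to see this is to refit the model at the three noise levels and collect the slope intervals (a sketch that re-simulates with the same seed as above):

sapply(c(0.25, 0.1, 0.01), function(s) {
  set.seed(1)
  x   <- rnorm(100)
  eps <- rnorm(100, 0, s)
  y   <- -1 + 0.5*x + eps
  confint(lm(y ~ x))["x", ]   # 95% CI for the slope; it narrows as the noise shrinks
})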

14)
a)

set.seed(1)
x1=runif(100)
x2=0.5*x1+rnorm(100)/10
y=2+2*x1+0.3*x2+rnorm(100)


b)
plot(x1,x2)
cor(x1,x2)

c)
fit<-lm(y~x1+x2)
summary(fit)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
  x1            1.4396     0.7212   1.996   0.0487 *
  x2            1.0097     1.1337   0.891   0.3754  

at the 5% level we can (just barely) reject beta1=0 and we fail to reject beta2=0

d)
fit<-lm(y~x1)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
  x1            1.9759     0.3963   4.986 2.66e-06 ***

we do reject beta1=0

e)
fit<-lm(y~x2)
summary(fit)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
  x2            2.8996     0.6330    4.58 1.37e-05 ***

we do reject beta2=0

f) The results do not contradict each other: x1 and x2 are strongly correlated (collinear), so in the multiple regression each one adds little once the other is included, the standard errors are inflated and the individual effects are masked. On its own, each is a useful predictor of y.

g)
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)

mlr<-lm(y~x1+x2)
l1<-lm(y~x1)
l2<-lm(y~x2)

for the multiple regression (mlr), observation 101 is a high-leverage point but not an outlier
for the regression on x1 alone (l1), it is an outlier but not a particularly high-leverage point
for the regression on x2 alone (l2), it is a high-leverage point but not an outlier
(checked numerically below)
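
The same conclusions can be checked numerically for all three fits (a sketch; observation 101 is the added point, and the list names are just labels for this example):

sapply(list(mlr=mlr, x1.only=l1, x2.only=l2), function(f)
  c(leverage.101  = unname(hatvalues(f)[101]),
    mean.leverage = mean(hatvalues(f)),
    rstudent.101  = unname(rstudent(f)[101])))
# leverage well above the mean flags a high-leverage point; |rstudent| > 3 flags an outlier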


15)

library(MASS)
data(Boston)
a)

The following fits a simple regression of crim on each predictor in turn and returns the ones whose slope is not significant at the 5% level:

unlist(lapply(names(Boston)[-1], FUN=function(x) {
  # p-value of the slope in crim ~ x: keep x only when it is NOT significant at 5%
  if (summary(lm(as.formula(paste0("crim~",x)),data=Boston))$coefficients[2,4]>0.05) {
    return (x)
  }
}))

[1] "chas"



b)
fit<-lm(crim~.,data=Boston)
we can reject the null hypothesis that the coefficient is zero for zn, dis, rad, black and medv
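
This can also be read off programmatically from the coefficient table of the full fit above (a small sketch):

coefs <- summary(fit)$coefficients[-1, ]        # drop the intercept row
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]     # predictors significant at the 5% level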

c)

univ<-unlist(lapply(names(Boston)[2:length(names(Boston))], FUN=function(x) {
  return (summary(lm(as.formula(paste0("crim~",x)),data=Boston))$coefficients[2,1])
}))

coef<-unname(fit$coefficients)
coef<-coef[2:length(coef)]

plot(univ,coef)
the plot shows most predictors clustered together, with nox far away from the rest: its coefficient changes dramatically (from strongly positive in the univariate fit to negative in the multiple regression), interesting

d)

lapply(names(Boston)[-1], FUN=function(x) {
  try({
    # compare R^2 of a cubic fit against the simple linear fit for each predictor
    nlrsq<-summary(lm(as.formula(paste0("crim~","poly(",x,",3)")),data=Boston))$r.squared
    lrsq<-summary(lm(as.formula(paste0("crim~",x)),data=Boston))$r.squared
    c(x,(nlrsq-lrsq)/lrsq)   # relative improvement from the non-linear terms
  }, silent=T)   # try() skips chas, where poly() fails (too few unique values)
})

we see the biggest improvement in medv, dis, nox, check the output!



Monday, October 6, 2014

Proof of the bias-variance tradeoff decomposition (Eq. 2.7):
http://en.wikipedia.org/wiki/Bias–variance_tradeoff#Derivation

Nice explanation of it
http://scott.fortmann-roe.com/docs/BiasVariance.html

2.4 Exercises

1.

a) if n is large and p is small, a flexible method should generally perform better: with plenty of data it can capture the signal without overfitting.

b) if p is large and n is small (the reverse of a), a flexible method should perform worse, since it will overfit the few observations available.

c) if the relationship is highly non-linear, flexible is better: an inflexible method cannot capture the non-linearity.

d) if the variance of the error terms is extremely high, flexible is worse: it will fit the noise.


2.

a) regression, inference, n=500 firms, p=3 (profit, number of employees, industry); CEO salary is the response

b) classification, prediction, n=20, p=13

c) regression, prediction, n=52 (weeks in 2012), p=3 (% change in the US, British and German markets). The % change in the USD/Euro rate is the response, not one of the predictors, so the two are not the same thing.

3.

will draw later with R


4,5,6 skipped

7. a)

1=>3
2=>2
3=>sqrt(10)~3.16
4=>sqrt(5)~2.24
5=>sqrt(2)~1.41
6=>sqrt(3)~1.73
(distances checked in R below, after part d)


b) when K=1 the nearest point is observation 5, which is Green, so the prediction is Green
c) when K=3 the three nearest points are observations 5, 2 and 6: one Green and two Reds, so the prediction is Red
d) if the Bayes decision boundary is highly non-linear we would expect the best K to be small: a small K gives a flexible, highly non-linear decision boundary, while a large K averages over many points and pushes the boundary toward something closer to linear
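
A quick R check of the distances in a) (the six observations typed in by hand from the exercise table; the test point is the origin):

X <- rbind(c(0,3,0), c(2,0,0), c(0,1,3), c(0,1,2), c(-1,0,1), c(1,1,1))   # the six observations
sqrt(rowSums(X^2))   # Euclidean distance of each observation from the test point (0,0,0)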

8.

setwd("~/Downloads/islr")

college<-read.csv("College.csv")
rownames(college)=college[,1]
college=college[,-1]

#i)
summary(college)

#ii)
pairs(college[,1:10])

#iii)
plot(college$Private,college$Outstate)

#iv)
Elite=rep("No",nrow(college))
Elite[college$Top10perc >50]="Yes"
Elite=as.factor(Elite)
college=data.frame(college ,Elite)
summary(Elite)
plot(college$Elite,college$Outstate)

#v)
par(mfrow=c(2,2))
hist(college$Outstate,breaks=10)
hist(college$Outstate,breaks=20)
hist(college$Outstate,breaks=30)
hist(college$Outstate,breaks=40)

hist(college$Apps,breaks=10)
hist(college$Apps,breaks=20)
hist(college$Apps,breaks=30)
hist(college$Apps,breaks=40)

#vi) Skipped


#9

auto<-read.csv("Auto.csv",na.strings="?")   # "?" marks missing values
auto<-na.omit(auto)                         # drop the 5 rows with missing horsepower

#a) all are quantitative except name and origin (origin is a categorical code for the car's region); year could be treated either way

#b)
# range of each quantitative (non-factor) column
sapply(names(auto),FUN=function(x) {
  if (class(auto[,x])!="factor") {
    range(auto[,x])
    }
  })

#c)
# mean and standard deviation of each quantitative column
sapply(names(auto),FUN=function(x) {
  if (class(auto[,x])!="factor") {
    c(mean(auto[,x]),sd(auto[,x]))
  }
})

#d)
auto<-auto[-(10:85),]   # drop observations 10 through 85, then recompute mean and sd
sapply(names(auto),FUN=function(x) {
  if (class(auto[,x])!="factor") {
    c(mean(auto[,x]),sd(auto[,x]))
  }
})

#e)
auto<-read.csv("Auto.csv",na.strings="?")   # re-read to restore the full data set
pairs(auto)

#f)
displacement and weight seem to have a clear (negative, roughly linear) relationship with mpg, so they could be useful for predicting it