Question about the NYSE dataset for Large Assignment - heavily skewed dependent variable
I was looking at the NYSE dataset for the Large Assignment and tried to find out what percentage of the data would have a response of 1 according to the given criteria.
I hope I'm not giving away too much information here, but I got the result below by running this code:
sum(ifelse(datafile$close.of.the.day. / datafile$open.of.the.day. > 1.1, 1, 0))/nrow(datafile)
Here, "datafile" contains all the data from the given CSV files.
With only 0.3% of responses being "1" and 99.7% being "0", it would seem that a constant prediction of "0" would give 99.7% accuracy.
You mentioned in class that a 90% accurate prediction is considered very good, with much lower expectations for stock market prediction, so if you could give some hints on what I'm missing here, that would be great.
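To make the concern concrete, here is a minimal sketch of the constant-prediction baseline. It assumes simulated prices, not the assignment data: the column names mirror the ones used above, but the values, the seed, and the 3% daily volatility are made up for illustration.

```r
# Simulated stand-in for the real data (hypothetical values, not NYSE prices).
set.seed(1)
n <- 10000
open_price <- runif(n, min = 10, max = 100)
daily_return <- rnorm(n, mean = 0, sd = 0.03)  # assumed volatility
datafile <- data.frame(
  open.of.the.day. = open_price,
  close.of.the.day. = open_price * (1 + daily_return)
)

# Response: 1 when the close exceeds the open by more than 10%.
y <- as.integer(datafile$close.of.the.day. / datafile$open.of.the.day. > 1.1)

# Share of "1" responses, same quantity as the sum(ifelse(...))/nrow(...) line.
share_of_ones <- mean(y)

# Accuracy of a "model" that always predicts 0.
baseline_accuracy <- mean(y == 0)

print(c(share_of_ones = share_of_ones, baseline_accuracy = baseline_accuracy))
```

The two numbers always add up to 1, so with a response this rare, any accuracy figure only means something when compared against this always-0 baseline.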
Answers and follow-up questions
Answer or follow-up question 1
It makes a lot more sense to go with 1.01 instead of 1.1. What is the distribution for that threshold?
Michel Ballings
Answer or follow-up question 2
For a threshold of 1.01 we have 23.6% "1"s
sum(ifelse(datafile$close.of.the.day. / datafile$open.of.the.day. > 1.01, 1, 0))/nrow(datafile)
Should we use 1.01 then?
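For comparing several thresholds in one go, a small sketch along these lines might help. It assumes simulated close/open ratios rather than the real dataset (the `ratio` vector, the seed, and the extra 1.05 threshold are hypothetical), so the printed shares will not match the 0.3% and 23.6% reported in this thread.

```r
# Simulated close/open ratios (hypothetical; replace with
# datafile$close.of.the.day. / datafile$open.of.the.day. on the real data).
set.seed(1)
n <- 10000
ratio <- 1 + rnorm(n, mean = 0, sd = 0.03)

# Candidate thresholds, from strictest to loosest.
thresholds <- c(1.1, 1.05, 1.01)

# Share of rows that would be labelled "1" under each threshold.
share <- sapply(thresholds, function(t) mean(ratio > t))
names(share) <- thresholds

print(round(share, 4))
```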
Tapajit
Answer or follow-up question 3
Michel Ballings