Question about the NYSE dataset for Large Assignment - heavily skewed dependent variable

I was looking at the NYSE dataset for the Large Assignment and tried to find out what percentage of the observations would have a response of 1 according to the given criteria.
I hope I'm not giving away too much information here, but I got the following output from running this code:

sum(ifelse(datafile$ / datafile$ > 1.1, 1, 0))/nrow(datafile)
[1] 0.003644621

Here, "datafile" contains the combined data from all the CSV files provided.

With only about 0.36% of responses being "1" and about 99.6% being "0", it would seem that a constant prediction of "0" would give roughly 99.6% accuracy ...
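To make the concern concrete, here is a minimal sketch on simulated labels (not the assignment data), using the positive rate from the output above. It assumes the labels are drawn independently, which is only an illustration. It suggests that an always-"0" prediction reaches an accuracy of one minus the share of "1"s while identifying none of the "1" cases:

```r
set.seed(1)

# Simulated 0/1 response with the positive rate reported above.
# Assumption: independent draws; this is NOT the real NYSE data.
p_pos <- 0.003644621
y <- rbinom(100000, size = 1, prob = p_pos)

# Constant prediction of "0" for every row.
pred <- rep(0, length(y))

# Accuracy of the constant prediction equals the share of zeros, 1 - mean(y).
accuracy <- mean(pred == y)

# Sensitivity: share of actual "1"s that were predicted as "1".
sensitivity <- sum(pred == 1 & y == 1) / sum(y == 1)

print(c(accuracy = accuracy, sensitivity = sensitivity))
```

So the high accuracy here reflects the class imbalance rather than any predictive skill, which is why accuracy alone seems like a questionable yardstick for this response.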

You mentioned in class that a prediction accuracy of 90% is considered very good, and that expectations are much lower for stock market prediction.
If you could give some hints on what I'm missing here, that would be great.


Answers and follow-up questions

Answer or follow-up question 1

Dear Tapajit,

It makes a lot more sense to go with 1.01 instead of 1.1. What is the distribution for that threshold?

Michel Ballings

Answer or follow-up question 2

Dr. Ballings,

For a threshold of 1.01, 23.6% of the responses are "1":

sum(ifelse(datafile$ / datafile$ > 1.01, 1, 0))/nrow(datafile)
[1] 0.2360869
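In case it is useful, a small helper makes it easy to compare the share of "1"s across several candidate thresholds at once. This is only a sketch on simulated ratios: `positive_share` is a hypothetical helper name, and the simulated values are an assumption that is not meant to reproduce the figures above.

```r
set.seed(1)

# Simulated price ratios centered near 1 (assumption, not the real data).
ratio <- exp(rnorm(10000, mean = 0, sd = 0.02))

# Hypothetical helper: share of rows whose ratio exceeds the threshold.
# The mean of a logical vector is the proportion of TRUE values.
positive_share <- function(ratio, threshold) {
  mean(ratio > threshold, na.rm = TRUE)
}

# Evaluate the share of "1"s for each candidate threshold.
thresholds <- c(1.01, 1.05, 1.1)
shares <- sapply(thresholds, function(t) positive_share(ratio, t))
names(shares) <- thresholds

print(shares)
```

With the real data, `ratio` would be replaced by the same ratio of columns used in the one-liner above.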

Should we use 1.01 then?


Answer or follow-up question 3



Michel Ballings
