Defining the classes of columns while reading in data

In the reading in data video, you set nrows = 5 to get an estimate of the column classes before specifying them in your code to speed up
run time. How and why did you choose 5? Would we as students always be safe using 5 on exams/gradeables?


Answers and follow-up questions

Answer or follow-up question 1

Dear Savannah,


The documentation of read.table says:

"Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer,
numeric, complex or (depending on as.is) factor as appropriate."

This means that if we do not specify colClasses we are giving R extra work: it first reads every column in as character and then converts
each one to a more appropriate class.

To avoid this we want to specify colClasses. If we don't know what they should be we can use the trick in the book: first read in a small
number of rows, extract the column classes, and use them when reading in all the data in a subsequent step.
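
As a minimal sketch of this trick (the temporary file, the column names and the data below are made up for illustration and are not the
dataset from the video):

```r
# Hypothetical example data, written to a temporary CSV file
path <- tempfile(fileext = ".csv")
df <- data.frame(id    = 1:1000,
                 score = runif(1000),
                 group = sample(c("a", "b"), 1000, replace = TRUE))
write.csv(df, path, row.names = FALSE)

# Step 1: read only a few rows and let R infer the column classes
sample_rows <- read.csv(path, nrows = 5)
classes <- sapply(sample_rows, class)

# Step 2: read all the data, passing the classes so R can skip the inference
full <- read.csv(path, colClasses = classes)
str(full)
```

Whether this saves a noticeable amount of time depends on the size of the file; for small files the difference is likely negligible.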

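The fewer rows you base the column classes on, the less reliable they become: a column can look like integer in the first few rows and
still contain decimals or text further down.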
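
To illustrate the risk, here is a sketch with a made-up one-column file in which the first five values are whole numbers and the sixth is
not. The classes inferred from five rows say integer, and reading the full file with those classes is expected to stop with an error,
which tryCatch captures here:

```r
# Hypothetical file: five whole numbers followed by a decimal
path <- tempfile(fileext = ".csv")
writeLines(c("x", "1", "2", "3", "4", "5", "6.5"), path)

# Classes inferred from the first five rows only
first5 <- read.csv(path, nrows = 5)
classes <- sapply(first5, class)
print(classes)

# Reading the whole file with those classes; capture the error message
result <- tryCatch(
  read.csv(path, colClasses = classes),
  error = function(e) conditionMessage(e)
)
print(result)
```

Basing the classes on more rows lowers the chance of this, but only reading the whole column rules it out.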

"How and why did you choose to use 5?"

I chose it mostly arbitrarily. The dataset for the example was small, so I chose a small number. If I were to use this trick (meaning the
dataset is large enough to make it worthwhile), I would never use more than 100 rows.

"Would we as students always be safe just to use 5 on exams/gradeables?"


Michel Ballings
