Question



Text mining: how to aggregate unstructured data in Large Assignment News Articles

Dr. Ballings,

After successfully merging the Stories and the Companies data by STORY_ID (into a data frame I called MERGED), I got about 6.5 million rows of observations.
Then I tried to merge the tickers into MERGED using Mapping.csv; however, I got an error that says:

"Error: cannot allocate vector of size 105.8 Mb
In addition: Warning messages:
1: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
2: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
3: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
4: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)"

How do I fix this? Or was there something wrong with the way I merged the two data frames?

Here is the str(MERGED):

'data.frame': 6508892 obs. of 4 variables:
$ STORY_ID : Factor w/ 1921383 levels "00006308DD1AC6AF7BD7732336F69864",..: 1010221 1010221 848368 159617 159617 159617 159617 159617
159617 159617 ...
$ TIMESTAMP_UTC: Factor w/ 1264758 levels "2001-01-01 05:00:00.000",..: 788500 788500 613491 225622 225622 225622 225622 225622 225622
225622 ...
$ HEADLINE : Factor w/ 1525700 levels "'Bye, Old Bill ---- By Alan Abelson",..: 819430 819430 768477 228344 228344 228344 228344
228344 228344 228344 ...
$ COMPANY_ID : Factor w/ 19788 levels "00067A","0012D6",..: 2824 7991 8323 6076 8639 1507 5430 5857 8233 3325 ...






Answers and follow-up questions





Answer or follow-up question 1

Dear student,

R is saying you don't have enough RAM.

One thing to check: if you have a 64-bit operating system (OS X is always 64-bit; Windows is sometimes 32-bit), make
sure you installed the 64-bit version of R. That allows R to use all the RAM you have; otherwise it will use at most
4 GB even if you have more.
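A quick way to check which build of R you are running (a minimal sketch; the printed values will differ per machine):

```r
# On a 64-bit build of R, pointers are 8 bytes; on a 32-bit build, 4 bytes.
is_64bit <- .Machine$sizeof.pointer == 8

# R.version$arch reports the architecture string (e.g. "x86_64").
cat("64-bit build:", is_64bit, "\n")
cat("Architecture:", R.version$arch, "\n")
```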

(More involved solutions are installing a 64-bit OS or installing more RAM, but I don't recommend those unless you
know what you're doing, and they will incur some cost.)

You also want to make sure that you first aggregate the data and then merge, instead of first merging and
then aggregating. That reduces the amount of RAM you need.

For example, if you're eventually merging by ticker and date, aggregate the stories by ticker and date rather than
keeping them by ticker and time (down to the second). In other words, if you don't need the detail,
aggregate it away as early as possible in the process.
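To illustrate, here is a minimal base-R sketch of aggregate-then-merge. The toy data frames and the column names (N_STORIES, TICKER) are made up for illustration; substitute your own columns and aggregation function:

```r
# Toy versions of the student's tables (illustrative data only).
stories <- data.frame(
  STORY_ID      = c("S1", "S2", "S3"),
  COMPANY_ID    = c("A", "A", "B"),
  TIMESTAMP_UTC = c("2001-01-01 05:00:00",
                    "2001-01-01 09:30:00",
                    "2001-01-02 12:00:00")
)
mapping <- data.frame(COMPANY_ID = c("A", "B"),
                      TICKER     = c("AAA", "BBB"))

# Step 1: drop the time-of-day detail and aggregate to one row per
# company per day BEFORE merging (here: count the stories per day).
stories$DATE <- as.Date(stories$TIMESTAMP_UTC)
daily <- aggregate(STORY_ID ~ COMPANY_ID + DATE, data = stories, FUN = length)
names(daily)[names(daily) == "STORY_ID"] <- "N_STORIES"

# Step 2: merge the much smaller aggregated table with the mapping.
merged <- merge(daily, mapping, by = "COMPANY_ID")
print(merged)
```

Because the aggregation collapses many story rows into one row per company per day, the table entering the merge is far smaller than the 6.5-million-row MERGED, which is what keeps the memory usage down.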

Michel Ballings


