Recently, I got the books R Packages and Advanced R, so I have spent a lot of time reading them and improving my skills in coding and writing clean code.

I have finished the design of the classes and API for most of the important features.

For now, we have classes to store a single document, metadata, a corpus, parsed text, and a table of word counts.
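As a rough illustration of what such a class could look like, here is a minimal sketch using base R reference classes. The class and field names below are hypothetical and chosen for illustration; the actual classes in textmining may differ.

```r
# Hypothetical sketch of a document class with metadata and word counts;
# names are illustrative, not the actual textmining API.
Document <- setRefClass("Document",
  fields = list(
    id   = "character",
    text = "character",
    meta = "list"        # arbitrary metadata, e.g. author, date
  ),
  methods = list(
    word_counts = function() {
      # naive whitespace tokenisation -> table of word counts
      table(strsplit(tolower(text), "\\s+")[[1]])
    }
  )
)

doc <- Document$new(id = "doc1", text = "to be or not to be",
                    meta = list(author = "Shakespeare"))
doc$word_counts()
```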

I have also included some external packages to be imported into textmining. The two most important at the moment are stylo and tm.

The next step is to integrate train and predict functions for the mallet package. At the moment, using the package is not very convenient; after the changes, it will be easier to use.

An example of the analysis before and after the changes is summarised briefly here:

Example analysis now

mallet.instances <- mallet.import(as.character(data$id),
                                  as.character(data$text),
                                  "Documents/Projects/tm/stopwords.txt")

topic.model <- MalletLDA(num.topics = 10)
topic.model$loadDocuments(mallet.instances)
topic.model$setAlphaOptimization(20, 100)

topic.model$train(1000)

topic.model$maximize(10)

inferencer <- function(model) {
  model$model$getInferencer()
}

write_inferencer <- function(inf, out_file) {
  fos <- .jnew("java/io/FileOutputStream", out_file)
  oos <- .jnew("java/io/ObjectOutputStream",
               .jcast(fos, "java/io/OutputStream"))
  oos$writeObject(inf)
  oos$close()
}

(30 more lines of code)

compatible_instances <- function(ids, texts, instances) {
  mallet_pipe <- instances$getPipe()
  new_insts <- .jnew("cc/mallet/types/InstanceList",
                     .jcast(mallet_pipe, "cc/mallet/pipe/Pipe"))
  J("cc/mallet/topics/RTopicModel")$addInstances(new_insts, ids, texts)
  new_insts
}

instances_lengths <- function(instances) {
  iter <- instances$iterator()
  replicate(instances$size(),
            .jcall(iter, "Ljava/lang/Object;", "next")$getData()$size())
}

(Another 10 lines of code)

Example of the planned functions

train(data, stopwords = c("the", "and", "of", "in"), num.topics = 10,
      alpha = 20, burn.in = 100,
      train = 1000, maximize = 10)
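One way such a train() wrapper could fold the mallet calls from the "before" example into a single function is sketched below. This is only a sketch under the assumption that the mallet package API shown earlier is used internally; the argument names are illustrative and not the final API.

```r
# Hypothetical sketch of the planned train() wrapper; it folds the manual
# mallet calls into one function. Requires the mallet package at call time.
train <- function(data, stopwords, num.topics = 10,
                  alpha = 20, burn.in = 100,
                  train = 1000, maximize = 10) {
  # mallet.import expects a stopword file, so write the vector to a temp file
  stopword_file <- tempfile(fileext = ".txt")
  writeLines(stopwords, stopword_file)

  instances <- mallet::mallet.import(as.character(data$id),
                                     as.character(data$text),
                                     stopword_file)
  topic.model <- mallet::MalletLDA(num.topics = num.topics)
  topic.model$loadDocuments(instances)
  topic.model$setAlphaOptimization(alpha, burn.in)
  topic.model$train(train)
  topic.model$maximize(maximize)
  topic.model
}
```

The idea is that the defaults cover the common case, so a basic analysis shrinks from dozens of lines to a single call.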

Thanks to Andrew Goldstone, most of the hard work has already been done; it can be seen in his GitHub repository.

For next week, we have planned to write some integration tests in addition to the unit tests, as well as to finish the integration with the current API.