
Final Evaluation approaching

The results of GSOC 2016 can be seen at:

The final evaluation is approaching. Summary of GSOC at the moment:

– More than 250 commits
– Learned TDD
– Met with mentors ~10 times, in Wrocław, Kraków, and Leipzig
– Weekly Skype calls to discuss progress
– Extended the content of the package mentioned in the proposal
– Fixed lots of documentation issues

We also prepared a plan for further development of the package's functionality:

GSOC meetings and further progress

The last weeks were filled with lots of programming and design work.

R and rJava can sometimes cause trouble, especially when you need to train a topic model using the mallet or, now, the textmining package:

I recommend reading this post on configuring Java properly:

Also, on Windows, R checks the package on both the 32-bit and 64-bit architectures. This was hard to discover during the automatic build in RStudio. There were two solutions to this problem:
Set the no-multiarch argument in the check options:
devtools::check(document = FALSE, args = "--no-multiarch")
or install Java properly for both the 32-bit and 64-bit architectures.

I have spent lots of time on technical questions, such as how to export, as S3 methods for my new class, functions that are already implemented in other packages, and how to write a replacement (assignment) function properly.
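Both patterns can be sketched in a few lines. The class name tmExample and the field docs below are illustrative stand-ins, not the actual textmining API:

```r
# Registering an S3 method for a generic defined elsewhere (here: summary),
# so that objects of the new class get their own behaviour.
summary.tmExample <- function(object, ...) {
  cat("tmExample with", length(object$docs), "documents\n")
  invisible(object)
}

# A replacement (assignment) function: defining `docs<-` lets users write
# docs(x) <- value, which R rewrites as x <- `docs<-`(x, value).
`docs<-` <- function(x, value) {
  x$docs <- value
  x
}

x <- structure(list(docs = list("a", "b")), class = "tmExample")
docs(x) <- list("a", "b", "c")
length(x$docs)  # 3
```

In a real package the method and the replacement function would additionally be listed in the NAMESPACE so that dispatch works for users of the package.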

Another big step was the systematization of the different types of functions and of the output they produce. The most important part was the unification of the train and predict functions for topic models from different packages, and feeding their output into the topic_wordcloud and topic_network functions.
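A hedged sketch of that unification idea: one train()/predict() pair dispatching on class, so every backend yields the same output shape. The class name tmTopicModel matches the post, but the bodies below are mock stand-ins, not the real backend code:

```r
# A single generic; each backend (mallet, topicmodels, ...) would get its
# own method that normalises the result into one common structure.
train <- function(x, ...) UseMethod("train")

train.default <- function(x, num.topics = 10, ...) {
  structure(list(topics = num.topics), class = "tmTopicModel")
}

# Common predict format: one row per document, one column per topic.
predict.tmTopicModel <- function(object, newdata, ...) {
  matrix(1 / object$topics, nrow = length(newdata), ncol = object$topics)
}

m <- train(c("doc one", "doc two"), num.topics = 5)
dim(predict(m, c("new doc a", "new doc b")))  # 2 5
```

With a uniform matrix of topic proportions, downstream functions such as topic_wordcloud and topic_network no longer need to know which backend trained the model.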

I met Maciej Eder at the Digital Humanities Summer School in Leipzig. Thanks to his help, and to many insights gained from his work on the stylo R package, he saved me lots of time on code refactoring and on preparing for CRAN submission.

After that, we held another GSoC meeting with my mentors in Wrocław. We discussed progress in GSOC 2016 and decided what steps to take next.

For now I plan to finish the documentation of the existing objects, prepare for the CRAN check, and clean up the code. Because my approach for GSoC was test-driven development, I have almost all tests written already; thanks to that, most of the examples are prepared as well.

GSOC wrap up

A short summary of the last two weeks:
Almost all functionalities included in my proposal have already been implemented.
For the last month I have left mostly documentation work and some refactoring.
I will follow the good practices summarised in these posts:

Finally, I also plan to include short vignettes.

After the project I still plan to work on developing the package, adding functionalities such as:
– sentiment analysis
– dimension reduction
– word embedding
– more use-cases

At the moment there are 10 issues pointed out in the GitHub repository.
One of them concerns returning uniform output for topic models from different packages.

The most important functionalities implemented over the last weeks:

– Adjusted tm_map to work for more complex cases
– A c function for tmCorpus data
– terms for the mallet topic model
– Refactored train and predict functions for tmTopicModel
– Included TreeTagger in the package through koRpus (a check whether TreeTagger is installed on the computer still needs to be added)
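The missing installation check could look roughly like this. It assumes TreeTagger exposes a tree-tagger binary on the PATH; the real package might instead inspect a configured installation directory:

```r
# Returns TRUE if a TreeTagger executable is found on the PATH.
# Sys.which() returns "" when the command cannot be located.
treetagger_installed <- function(cmd = "tree-tagger") {
  nzchar(Sys.which(cmd))
}

if (!treetagger_installed()) {
  message("TreeTagger not found; tagging functions will be unavailable.")
}
```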

Also, I performed an initial sentiment analysis on the Twitter data and prepared some more short vignettes.
Twitter ISIS

How to tag text
This week I struggled a little with modifying interior elements of an object inside an lapply loop.
This article on the Stack Overflow forum helped.
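The pitfall in miniature: lapply works on copies, so "modifying" an element inside the callback does not change the original object. The fix is to return the modified element and reassign the whole result:

```r
docs <- list(list(text = "First Doc"), list(text = "Second Doc"))

docs <- lapply(docs, function(d) {
  d$text <- tolower(d$text)   # modify the local copy...
  d                           # ...and return it, so lapply collects it
})

docs[[1]]$text  # "first doc"
```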

Over the last week I focused on integrating some of the existing methods from packages such as NLP and tm with the tmCorpus object.

Among others, I integrated methods such as content(x), which for tmCorpus returns the list of documents.
I also prepared a version of the tm_map function that can be applied to a tmCorpus. Finally, it is now possible to create a Term-Document Matrix and a Document-Term Matrix directly from a tmCorpus object. The last commit fixed a bug that always required the use of English stop words; it is now more universal.

For this week I plan to:

July 4 – July 12

  • Start working on the "Integration of TreeTagger for topic modelling within a specific part of speech."
  • Testing the koRpus package for this purpose
  • Designing an API to include TreeTagger without using the koRpus package

Also, I plan to invest some time into sentiment analysis and other packages for LDA and visualizations.

Visible progress!

This week was full of work, and we finally got to the point where a simple analysis is possible. Over the last week I applied the changes mentioned in the proposal to enable easy use of the mallet package.

The train and predict functions for topic models have been implemented. The visualization techniques are also available in the repository. One can see the structure of words for different topics and the corresponding wordclouds.



The picture shows the wordcloud of one of the topics, most probably built from the romance books,


and the network of words within different topics.

If you want to try out how my package works at the moment, take a look at the first tests provided in:

This week I plan to implement some more basic transformation functions. The idea is to prepare an interface for tmCorpus similar to the one that exists for VCorpus. Functions such as tm_map, tm_filter, and tm_reduce are very well designed, and users are used to them; this is the main reason for not changing them.
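The shape being preserved can be sketched on a plain list of strings. The *_mock names below are illustrative; the real functions operate on tmCorpus/VCorpus objects:

```r
# tm_map applies a transformation to every document;
# tm_filter keeps only documents matching a predicate.
tm_map_mock    <- function(x, f, ...) lapply(x, f, ...)
tm_filter_mock <- function(x, pred)   Filter(pred, x)

corpus  <- list("The First Document", "Another DOC")
lowered <- tm_map_mock(corpus, tolower)
kept    <- tm_filter_mock(lowered, function(d) grepl("doc", d))
length(kept)  # 2
```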

Also, this week I plan to have a look at other packages that were not mentioned in the proposal. The plan is to extend the package's possibilities as much as possible during the project time.


At last textmining has already more than 100 commits 🙂

Interface for the Mallet package

In the last days I got the books R Packages and Advanced R, so I have spent a lot of time reading them and improving my skills in coding and writing clean code.

I have finished the design of the classes and of the API for most of the important features.

For now we have classes to store a single document, metadata, a corpus, parsed text, and a table of word counts.

Also, I included some external packages to be imported into textmining. The two most important at the moment are stylo and tm.

The next step is to integrate the train and predict functions for the mallet package. At the moment, usage of that package is not too convenient; after the changes, it will be easier to use.

An example of the analysis before and after the changes is briefly summarised below.


Example analysis now

mallet.instances <- mallet.import(as.character(data$id), as.character(data$text),
                                  "Documents/Projects/tm/stopwords.txt")

topic.model <- MalletLDA(num.topics = 10)
topic.model$loadDocuments(mallet.instances)
topic.model$setAlphaOptimization(20, 100)

inferencer <- function (model) {

write_inferencer <- function (inf, out_file) {
  fos <- .jnew("java/io/FileOutputStream", out_file)
  oos <- .jnew("java/io/ObjectOutputStream",
               .jcast(fos, "java/io/OutputStream"))

...30 more lines of code...

compatible_instances <- function (ids, texts, instances) {
  mallet_pipe <- instances$getPipe()
  new_insts <- .jnew("cc/mallet/types/InstanceList",
                     .jcast(mallet_pipe, "cc/mallet/pipe/Pipe"))
  J("cc/mallet/topics/RTopicModel")$addInstances(new_insts, ids, texts)

instances_lengths <- function (instances) {
  iter <- instances$iterator()
  .jcall(iter, "Ljava/lang/Object;", "next")$getData()$size()

...another 10 lines of code...

Example of the future designed functions

train(data, stopwords = c("the", "and", "of", "in"), num.topics = 10,
      alpha = 20, burn.in = 100, train = 1000, maximize = 10)

Thanks to Andrew Goldstone, most of the hard work has already been done. It can be seen in his GitHub repository.

For next week we planned to add integration tests on top of the unit tests, as well as to finish the integration with the current API.

Further work on API development

Last week I moved my work onto the official repository. The work focused on naming commits, developing a class from the earlier sketch, and a few simple functions. Also, instead of writing my own functions for testing, I used the testthat package for this purpose.

There are a few concerns still to be solved. The most important is to incorporate a
basic class containing a single document with its metadata.

The API is now planned to consist of 4 basic classes:

  • text corpora
  • parsed text corpora
  • table text corpora
  • tagged text corpora
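One possible shape for these four classes, as plain S3 constructors. The names below follow the list above but are illustrative sketches, not the final API:

```r
tmCorpus       <- function(docs)   structure(list(docs = docs),     class = "tmCorpus")
tmParsedCorpus <- function(tokens) structure(list(tokens = tokens), class = c("tmParsedCorpus", "tmCorpus"))
tmTableCorpus  <- function(counts) structure(list(counts = counts), class = "tmTableCorpus")
tmTaggedCorpus <- function(tagged) structure(list(tagged = tagged), class = c("tmTaggedCorpus", "tmCorpus"))

corp <- tmCorpus(list("first text", "second text"))
class(corp)  # "tmCorpus"
```

Making the parsed and tagged classes inherit from tmCorpus would let generic corpus methods keep working on them without duplication.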

June 6 – June 12

  • Examples of usage for the train and predict functions. An application of this to the mallet package already exists in the application repo
  • Start working on the "Integration of the existing packages to obtain a complete workflow for the text mining tasks" topic. This will involve transforming some existing classes into one of our basic classes

Test driven design – first steps

This week was my first approach to writing code using Test-Driven Development. This is a software development process built on very short development cycles: first you write tests for a new feature, and only then do you implement just enough code to fulfil the tests' requirements.
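The cycle in miniature. The project uses testthat for this; the sketch below uses a plain base-R assertion so it stays dependency-free, and the tmCorpus constructor stands in for whatever feature is being added:

```r
# Step 1: write the test first. Running it at this point would fail,
# because tmCorpus does not exist yet.
test_tmCorpus_stores_documents <- function() {
  corp <- tmCorpus(list("a", "b"))
  stopifnot(length(corp$docs) == 2)
}

# Step 2: write just enough code to make the test pass.
tmCorpus <- function(docs) structure(list(docs = docs), class = "tmCorpus")

# Step 3: run the test; it passes, so the next cycle can begin.
test_tmCorpus_stores_documents()
```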

I made the first attempt in a testing repository (Test GitHub repo) and will now move to the more official one (Official repo).

The first week of tests revealed a few things I plan to focus on over the working period:

    • Naming commits – very simple, but at the same time an important thing (I never thought about this before)
    • Code style compliance tests – so that the code looks nice; this will be performed using the lint and formatR packages
    • Adding issues to be solved to the GitHub repo
    • Adding a .gitignore file to GitHub
    • Testing using the test_that function from testthat

And finally, plans for the next week:

May 30 – June 5


  • Writing the core functions train() and predict() – a solution is already partially implemented in my GSoC application repo: Application repo
  • Documentation and final testing for bugs. Testing is what I am doing all the time in TDD
  • Continuing the implementation of the basic data structures

Coding time!!

Hi there!

This week I am starting my coding project with Google Summer of Code! Over the last weeks I got deeper into the visualization and analytical tools for text mining. This included sentiment analysis with syuzhet, visualizations of topic models using LDA and LDAvis, and the topic browser made by Andrew Goldstone, dfr-browser.
Also, I studied the methodology of Test-Driven Development. This software development process consists of short development cycles. The author of Test-Driven Development, Kent Beck, remarked that TDD encourages simple designs and inspires confidence.

Plans for this week:

May 23 – May 29

  • Start the work on the "Interface for the mallet package".
    A solution to this problem has been partially implemented by Andrew Goldstone. Some interface functions are also available in my GitHub repository.
  • Designing the API and preparing the skeleton for the most important functions.
    The layout of the basic data structures is already under discussion with my mentors.



Google Summer of Code 2016 – Integration of Text Mining and Topic Modeling Tools

On April 24th this year I was accepted into the Google Summer of Code program. I will be working on the integration of text mining and topic modeling tools for the R Project for Statistical Computing.

Google Summer of Code, often abbreviated to GSoC, is an international program, where students construct free and open-source software during the summer. The program is open to university students aged over 18.

Over the next couple of months I will be writing here about the progress of my work.

Here is an abstract of my coding project:
The goal of this project is to create a user-friendly API for an integrated workflow to perform typical text mining, natural language processing, and topic modelling tasks. This would cover the complete process of topic modelling:

  • Loading data, including loading text files from a local filesystem, as well as harvesting texts from the internet (via the package stylo/tm)
  • Data transformation, including calculating word frequencies (stylo/tm)
  • Text stemming and tagging (koRpus/snowballC)
  • Data subsampling (stylo)
  • Topic modelling (mallet/LDA/topicmodelling)
  • Visualizations (wordcloud/networkD3/ggplot2)

In the first stage, I plan to integrate the few packages mentioned above. Future development assumes the construction of a package integrating more tools, in a similar fashion to caret for predictive modelling. Google Summer of Code is planned to be just the outset of a bigger project.