On April 24th this year I was accepted into the Google Summer of Code program. I will be working on the integration of text mining and topic modelling tools for the R Project for Statistical Computing.

Google Summer of Code, often abbreviated to GSoC, is an international program in which students develop free and open-source software over the summer. The program is open to university students aged 18 or over.

Over the next couple of months I will be writing here about the progress of my work.

Here is an abstract of my coding project:
The goal of this project is to create a user-friendly API for an integrated workflow that performs typical text mining, natural language processing, and topic modelling tasks. It would cover the complete topic modelling process:

  • Loading data, including loading text files from a local filesystem, as well as harvesting texts from the internet (via the stylo/tm packages)
  • Data transformation, including calculating word frequencies (stylo/tm)
  • Text stemming and tagging (koRpus/SnowballC)
  • Data subsampling (stylo)
  • Topic modelling (mallet/LDA/topicmodels)
  • Visualizations (wordcloud/networkD3/ggplot2)
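
To give a feel for the first few steps in this list, here is a toy sketch written in plain Python (standard library only) rather than R. It is not the API of stylo, tm, or any other package named above: the in-memory corpus, the function names, and the crude suffix-stripping "stemmer" are all made up for illustration, and the topic modelling and visualization steps are left out.

```python
import random
import re
from collections import Counter

# Hypothetical in-memory corpus, standing in for texts loaded from disk or the web.
texts = {
    "doc1": "The cats are chasing the mice.",
    "doc2": "A cat chased a mouse and the dogs barked.",
    "doc3": "Dogs bark; cats sleep.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a trailing 's' from longer words (toy stand-in for a real stemmer)."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def word_frequencies(tokens):
    """Count how often each stemmed token occurs."""
    return Counter(crude_stem(token) for token in tokens)


def subsample(tokens, size, seed=0):
    """Draw a reproducible random sample of at most `size` tokens."""
    rng = random.Random(seed)
    return rng.sample(tokens, min(size, len(tokens)))


def document_term_matrix(freqs_by_doc):
    """Return a sorted vocabulary and one row of term counts per document."""
    vocab = sorted(set().union(*freqs_by_doc.values()))
    rows = {
        doc: [freqs[term] for term in vocab]
        for doc, freqs in freqs_by_doc.items()
    }
    return vocab, rows


freqs = {name: word_frequencies(tokenize(text)) for name, text in texts.items()}
vocab, dtm = document_term_matrix(freqs)
```

A document-term matrix like `dtm` is the usual input to a topic model, which is why the integrated workflow treats loading, transformation, and stemming as steps of one pipeline rather than as separate tools.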

In the first stage, I plan to integrate a few of the packages mentioned above. Later development would build a package that integrates more tools, in much the same way as caret does for predictive modelling. Google Summer of Code is meant to be just the start of a bigger project.