On April 24th this year, I was accepted into the Google Summer of Code program. I will be working on the integration of text mining and topic modelling tools for the R Project for Statistical Computing.
Google Summer of Code, often abbreviated to GSoC, is an international program in which students develop free and open-source software during the summer. The program is open to university students aged 18 and over.
Over the next couple of months, I will be writing here about the progress of my work.
Here is an abstract of my coding project:
The goal of this project is to create a user-friendly API for an integrated workflow that performs typical text mining, natural language processing, and topic modelling tasks. It would cover the complete topic modelling process:
- Loading data, including loading text files from a local filesystem, as well as harvesting texts from the internet (via the package stylo/tm)
- Data transformation, including calculating word frequencies (stylo/tm)
- Text stemming and tagging (koRpus/SnowballC)
- Data subsampling (stylo)
- Topic modelling (mallet/LDA/topicmodels)
- Visualizations (wordcloud/networkD3/ggplot2)
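To give a rough idea of what the data transformation step in the list above involves, here is a minimal sketch of counting word frequencies and arranging them into a document-term matrix. It is written in plain Python rather than R, it does not use stylo or tm, and the function names and sample texts are made up purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Return one Counter of word counts per document name."""
    return {name: Counter(tokenize(text)) for name, text in documents.items()}


def document_term_matrix(frequencies):
    """Lay the counts out as rows (documents) over a shared, sorted vocabulary."""
    vocabulary = sorted(set().union(*frequencies.values()))
    rows = {
        name: [counts[word] for word in vocabulary]
        for name, counts in frequencies.items()
    }
    return vocabulary, rows


# Hypothetical sample corpus, standing in for texts loaded from disk or the web.
documents = {
    "doc1": "Topic models find topics in text.",
    "doc2": "Text mining starts with loading text.",
}

frequencies = word_frequencies(documents)
vocabulary, matrix = document_term_matrix(frequencies)
```

A matrix of this shape is the kind of input that topic modelling tools typically expect, which is why the transformation step sits between loading the data and fitting a model.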
In the first stage, I plan to integrate a few of the packages mentioned above. Future development would involve building a package that integrates more tools, in a fashion similar to what caret does for predictive modelling. Google Summer of Code is intended to be just the start of a bigger project.