This week I am starting my coding project for Google Summer of Code! Over the last few weeks I have been digging deeper into visualization and analytical tools for text mining. This included sentiment analysis with the syuzhet package, visualization of topic models using LDA and LDAvis, and dfr-browser, the topic browser made by Andrew Goldstone.
I also studied the methodology of Test-Driven Development (TDD). This software development process relies on repeating a very short development cycle: write a failing test, write just enough code to make it pass, then refactor. Kent Beck, who is credited with developing the technique, remarked that TDD encourages simple designs and inspires confidence.
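To make the cycle concrete, here is a minimal sketch of one TDD iteration. It is written in Python purely for illustration (the project itself targets R), and the function name `word_frequencies` is a hypothetical helper, not part of any package mentioned here.

```python
from collections import Counter


# Step 1 (red): the test is written first and describes the desired behaviour.
# Running it before word_frequencies exists would fail with a NameError.
def test_word_frequencies():
    assert word_frequencies("the cat and the hat") == {
        "the": 2, "cat": 1, "and": 1, "hat": 1,
    }
    assert word_frequencies("") == {}


# Step 2 (green): the simplest implementation that makes the test pass.
def word_frequencies(text):
    """Count lowercased, whitespace-separated tokens (hypothetical helper)."""
    return dict(Counter(text.lower().split()))


# Step 3 (refactor): tidy the code, re-running the test after every change.
test_word_frequencies()
```

The point is the order of the steps: the test exists before the code, so every later change can be checked against it immediately.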
Plans for this week:
May 23 – May 29
- Start the work on the “Interface for the mallet package”.
A solution to this problem has been partially implemented by Andrew Goldstone. Some interface functions are also available in my GitHub repository.
- Design the API and prepare the skeleton for the most important functions.
The design of the basic data structures is already under discussion with my mentors.
Google Summer of Code 2016 – Integration of Text Mining and Topic Modeling Tools
On April 24th this year I was accepted into the Google Summer of Code program. I will be working on the integration of text mining and topic modeling tools for the R Project for Statistical Computing.
Google Summer of Code, often abbreviated to GSoC, is an international program in which students develop free and open-source software during the summer. The program is open to university students aged 18 or over.
Over the next couple of months I will be writing here about the progress of my work.
Here is an abstract of my coding project:
The goal of this project is to create a user-friendly API for an integrated workflow that performs typical text mining, natural language processing, and topic modelling tasks. This would cover the complete topic modelling process:
- Loading data, including loading text files from a local filesystem, as well as harvesting texts from the internet (via the package stylo/tm)
- Data transformation, including calculating word frequencies (stylo/tm)
- Text stemming and tagging (koRpus/SnowballC)
- Data subsampling (stylo)
- Topic modelling (mallet/LDA/topicmodelling)
- Visualizations (wordcloud/networkD3/ggplot2)
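As a rough illustration of what chaining these stages behind one entry point could look like, here is a small sketch. It is written in Python with the standard library only and is not the planned R API: every function name is hypothetical, and it covers just the transformation and subsampling steps, leaving out stemming, topic modelling, and visualization.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase a text, strip surrounding punctuation, keep alphabetic tokens."""
    stripped = (token.strip(".,;:!?\"'()") for token in text.lower().split())
    return [word for word in stripped if word.isalpha()]


def subsample(tokens, size, seed=0):
    """Draw up to `size` tokens at random without replacement (reproducible)."""
    rng = random.Random(seed)
    return rng.sample(tokens, min(size, len(tokens)))


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def run_pipeline(corpus, sample_size):
    """Apply tokenization, subsampling, and counting to every document."""
    results = {}
    for name, text in corpus.items():
        tokens = tokenize(text)
        sample = subsample(tokens, sample_size)
        results[name] = word_frequencies(sample)
    return results


# A toy in-memory corpus stands in for files loaded from disk or the web.
corpus = {
    "doc1": "The cat sat on the mat.",
    "doc2": "Topic models find topics in texts.",
}
frequencies = run_pipeline(corpus, sample_size=4)
for name, counts in frequencies.items():
    print(name, counts.most_common(3))
```

Each stage takes and returns plain data, so a stage can be swapped for a different backend without touching the others, which is the kind of integration the list above aims for.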
In the first stage, I plan to integrate a few of the packages mentioned above. Future development envisions a package that integrates more tools, in a similar fashion to what caret does for predictive modelling. Google Summer of Code is meant to be just the outset of a bigger project.