
How to tag text

http://stackoverflow.com/questions/21740244/is-it-possible-to-modify-list-elements
This week I struggled a little with modifying interior elements of an object inside an lapply loop. The article linked above, from the Stack Overflow forum, was a great help.

Over the last week I focused on integrating some of the existing methods from packages such as NLP and tm into the tmCorpus object.

Among others, I integrated methods such as content(x), which for a tmCorpus returns the list of documents.
I also prepared a version of the tm_map function that can be applied to a tmCorpus. Finally, it is already possible to create a Term-Document Matrix and a Document-Term Matrix directly from a tmCorpus object. The last commit fixed a bug that forced the use of English stop-words; the implementation is now more universal.
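
To illustrate, here is a minimal sketch of the integrated interface (assuming a tmCorpus() constructor; the exact constructor name and call forms may differ from the repository):

library(textmining)

corpus <- tmCorpus(c("First sample document.", "Second sample document."))
content(corpus)                    # the list of documents stored in the corpus
corpus <- tm_map(corpus, tolower)  # tm-style transformation applied to a tmCorpus
tdm <- TermDocumentMatrix(corpus)  # term-document matrix straight from a tmCorpus
dtm <- DocumentTermMatrix(corpus)  # document-term matrix; stop-words no longer hard-coded to English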

For this week I plan to:

July 4 – July 12

  • Start working on the “Integration of TreeTagger for topic modelling within a specific part of speech” topic.
  • Testing the koRpus package for this purpose.
  • Designing an API to include TreeTagger without using the koRpus package.

Also, I plan to invest some time into sentiment analysis and other packages for LDA and visualizations.

Visible progress!

This week was full of work, and we finally got to the point where a simple analysis is possible. Last week I applied the changes mentioned in the proposal to enable easy use of the mallet package.

The train and predict functions for the topic model have been implemented. The visualization techniques are also available in the repository: one can see the structure of words for different topics and the corresponding word clouds.

[Figure: word cloud]

The picture shows the word cloud of one of the topics, most probably built from the romance books,

[Figure: network of words]

and the network of words across different topics.
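
As a taste of how such a word cloud can be produced from a fitted mallet model, here is a sketch based on the standard mallet and wordcloud APIs (not necessarily the exact code behind the pictures):

library(mallet)
library(wordcloud)

# assuming topic.model is a trained MalletLDA model, as in the earlier posts
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
top <- mallet.top.words(topic.model, topic.words[1, ], 50)  # 50 top words of topic 1
wordcloud(top$words, top$weights, random.order = FALSE)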

If you want to try out how my package works at the moment, take a look at the first tests provided in:

This week I plan to implement some more basic transformation functions. The idea is to give tmCorpus an interface similar to the one that exists for VCorpus. Functions such as tm_map, tm_filter, and tm_reduce are very well designed and users are used to them, which is the main reason for not changing them.
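
For reference, this is how those functions already work on a tm VCorpus; the plan is for tmCorpus to accept essentially the same calls:

library(tm)

corpus <- VCorpus(VectorSource(c("One document.", "Another, longer document.")))
corpus <- tm_map(corpus, content_transformer(tolower))                # transform every document
longer <- tm_filter(corpus, function(d) sum(nchar(content(d))) > 15)  # keep only matching documents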

Also this week, I plan to have a look at other packages that were not mentioned in the proposal. The plan is to extend the package's capabilities as much as possible during the project.

 

At last, textmining already has more than 100 commits 🙂

Interface for the Mallet package

In the last few days I got the books R Packages and Advanced R, so I have spent a lot of time reading them and improving my skills in coding and writing clean code.

I have finished the design of the classes and the API for most of the important features.

For now, we have classes to store a single document, metadata, a corpus, parsed text, and a table of word counts.
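
A minimal S4 sketch of the single-document class, just to illustrate the idea (all class and slot names here are hypothetical):

setClass("TextDocument", slots = c(content = "character", meta = "list"))

doc <- new("TextDocument",
           content = "A sample text.",
           meta = list(author = "unknown", language = "en"))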

Also, I included some external packages to be imported into textmining. The two most important at the moment are stylo and tm.

The next step is to integrate the train and predict functions for the mallet package. At the moment, usage of the package is not too convenient; after the changes it will be easier to use.

The example of the analysis before and after the changes is shortly summarised here:

Example analysis now (before the changes)

mallet.instances <- mallet.import(as.character(data$id), as.character(data$text),
                                  "Documents/Projects/tm/stopwords.txt")

topic.model <- MalletLDA(num.topics = 10)    # create the model
topic.model$loadDocuments(mallet.instances)  # feed in the documents
topic.model$setAlphaOptimization(20, 100)    # optimize hyperparameters every 20 iterations after 100 burn-in iterations

topic.model$train(1000)     # run 1000 sampling iterations

topic.model$maximize(10)    # 10 final iterations picking the single best topic per token

inferencer <- function (model) {
  model$model$getInferencer()
}

write_inferencer <- function (inf, out_file) {
  # serialize the inferencer to a file through Java output streams
  fos <- .jnew("java/io/FileOutputStream", out_file)
  oos <- .jnew("java/io/ObjectOutputStream",
               .jcast(fos, "java/io/OutputStream"))
  oos$writeObject(inf)
  oos$close()
}

…and 30 more lines of code:

compatible_instances <- function (ids, texts, instances) {
  # build a new InstanceList that reuses the pipe of an existing one,
  # so that new documents are processed compatibly with the trained model
  mallet_pipe <- instances$getPipe()
  new_insts <- .jnew("cc/mallet/types/InstanceList",
                     .jcast(mallet_pipe, "cc/mallet/pipe/Pipe"))
  J("cc/mallet/topics/RTopicModel")$addInstances(new_insts, ids, texts)
  new_insts
}

instances_lengths <- function (instances) {
  # token count of each instance in the list
  iter <- instances$iterator()
  replicate(instances$size(),
            .jcall(iter, "Ljava/lang/Object;", "next")$getData()$size()
  )
}

…and another 10 lines of code.

Example of the planned functions

train(data, stopwords = c("the", "and", "of", "in"), num.topics = 10,
      alpha = 20, burn.in = 100,
      train = 1000, maximize = 10)
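
The matching predict call could then look something like this (the predict signature is not final; the newdata argument and new_documents object are my assumptions for illustration):

model <- train(data, stopwords = c("the", "and", "of", "in"), num.topics = 10,
               alpha = 20, burn.in = 100, train = 1000, maximize = 10)
topics <- predict(model, newdata = new_documents)  # topic proportions for unseen texts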

Thanks to Andrew Goldstone, most of the hard work has already been done. It can be seen in his GitHub repository.

For next week, though, we planned to add some integration tests on top of the unit tests, as well as to finish the integration with the current API.

Further work on API development

Last week I moved my work onto the official repository. My work focused on naming new commits properly, developing classes from the earlier scratch, and a few simple functions. Also, instead of writing my own testing functions, I used the testthat package for this purpose.

There are still a few concerns to be resolved. The most important is to incorporate a basic class containing a single document with its metadata.

The API is now planned to consist of 4 basic classes (a rough sketch follows the list):

  • text corpora
  • parsed text corpora
  • table text corpora
  • tagged text corpora
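
A rough S4 sketch of how these four classes could relate (class names and slots are illustrative only, not the final API):

setClass("TextCorpus",       slots = c(docs = "list"))
setClass("ParsedTextCorpus", contains = "TextCorpus")        # tokenized / parsed texts
setClass("TableTextCorpus",  contains = "TextCorpus")        # word-count tables
setClass("TaggedTextCorpus", contains = "ParsedTextCorpus")  # POS-tagged texts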

June 6 – June 12

  • Examples of usage for the train and predict functions. An application of this for the mallet package already exists at the application repo.
  • Start working on the “Integration of the existing packages to obtain complete workflow for the text mining tasks” topic. This will involve transforming some existing classes into one of our basic classes.

Test driven design – first steps

This week was my first approach to writing code using Test-Driven Development. This is a software development process built on a very short development cycle: first you write tests for a new feature, and only then do you implement just enough code to fulfil the tests' requirements.
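
For example, a single TDD cycle with testthat could look like this (tmCorpus() and content() stand in for the functions under development):

library(testthat)

# 1. Red: write the test before the feature exists (it fails at first)
test_that("a corpus stores all of its documents", {
  corpus <- tmCorpus(c("first text", "second text"))
  expect_equal(length(content(corpus)), 2)
})
# 2. Green: implement just enough of tmCorpus() and content() to make it pass
# 3. Refactor, keeping the test green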

The approach was first tried out at the testing repository: Test GitHub repo. Now I am moving to the more official one: Official repo.

The first week of tests revealed a few things on which I plan to focus over the working period:

    • Naming commits – a very simple but at the same time important thing (I never thought about this before)
    • Code style compliance tests – so that the code looks nice. This will be performed using the lint and formatR packages (a sketch follows below)
    • Adding issues to be solved to the GitHub repo
    • Adding a .gitignore file to GitHub
    • Testing using the testthat package
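
A sketch of such a check run, assuming the packages named above (the file paths are hypothetical):

formatR::tidy_source("R/corpus.R")                   # reprint the code in a consistent style
testthat::test_file("tests/testthat/test-corpus.R")  # run the unit tests for one file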

And finally, plans for the next week:

May 30 – June 5

 

  • Writing the core functions train() and predict(). A solution is already partially implemented at my GSoC application repo: Application repo
  • Documentation and final testing for bugs. Testing is what I am doing all the time in TDD anyway.
  • Also, I will continue implementing the basic data structures.

Coding time!!

Hi there!

This week I am starting my coding project with Google Summer of Code! Over the last weeks I got deeper into the visualization and analytical tools for text mining. This included sentiment analysis with syuzhet, visualizations of topic models using LDA and LDAvis, and dfr-browser, the topic browser made by Andrew Goldstone.
Also, I studied the methodology of Test-Driven Development. This software development process consists of short development cycles. The author of Test-Driven Development, Kent Beck, remarked that TDD encourages simple designs and inspires confidence.
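
As a small taste of the sentiment analysis side, syuzhet scores text sentence by sentence:

library(syuzhet)

sentences <- get_sentences("I loved the first chapter. The ending was a disaster.")
get_sentiment(sentences, method = "syuzhet")  # one sentiment score per sentence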

Plans for this week:

May 23 – May 29

  • Start the work on the “Interface for the mallet package”.
    A solution to this problem has been partially implemented by Andrew Goldstone. Some interface functions are also available at my GitHub repository.
  • Designing the API and preparing the skeleton for the most important functions.
    The structure of the basic data structures is already under discussion with the mentors.

References:

  • https://github.com/tedunderwood/pmla-scripts/blob/master/ConvertPMLAtoNetwork.R
  • https://github.com/agoldst/dfr-analysis
  • http://andrewgoldstone.com/
  • https://begriffs.com/posts/2015-02-25-text-mining-in-r.html
  • http://www.r-bloggers.com/a-link-between-topicmodels-lda-and-ldavis/

Google Summer of Code 2016 – Integration of Text Mining and Topic Modeling Tools

On April 24th this year I was accepted into the Google Summer of Code program. I will be working on the integration of text mining and topic modelling tools for the R Project for Statistical Computing.

Google Summer of Code, often abbreviated as GSoC, is an international program in which students build free and open-source software during the summer. The program is open to university students aged 18 and over.

Over the next couple of months I will be writing here about the progress of my work.

Here is an abstract of my coding project:
The goal of this project is to create a user-friendly API for an integrated workflow to perform typical text mining, natural language processing, and topic modelling tasks. This would cover the complete process of topic modelling (a rough sketch follows the list):

  • Loading data, including loading text files from a local filesystem as well as harvesting texts from the internet (via the stylo/tm packages)
  • Data transformation, including calculating word frequencies (stylo/tm)
  • Text stemming and tagging (koRpus/SnowballC)
  • Data subsampling (stylo)
  • Topic modelling (mallet/lda/topicmodels)
  • Visualizations (wordcloud/networkD3/ggplot2)
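
Put together, the intended workflow might read something like this (every function name below is illustrative, not a committed API):

corpus <- load_corpus("texts/")           # load text files from a local directory
corpus <- transform(corpus, tolower)      # transformations, word frequencies, ...
corpus <- stem_and_tag(corpus)            # stemming and POS tagging
model  <- train(corpus, num.topics = 10)  # topic modelling via mallet
visualize(model, type = "wordcloud")      # word clouds, networks, ...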

In the first stage, I plan to integrate the few packages mentioned above. Future development assumes the construction of a package integrating more tools, in a similar fashion to caret for predictive modelling. Google Summer of Code is planned to be just the outset of a bigger project.