Final Evaluation approaching

The results of GSOC 2016 can be seen at:
– https://github.com/jandziak/textmining
– https://github.com/jandziak/test_textmining

The final evaluation is approaching. A summary of GSOC so far:

– More than 250 commits.
– Learned TDD
– Met with mentors about 10 times, in Wrocław, Kraków, and Leipzig
– Weekly Skype calls to discuss progress
– Extended the content of the package mentioned in the proposal.
– Fixed lots of documentation issues

We also prepared a plan for the further development of the package's functionality.

GSOC meetings and further progress

The last weeks were filled with lots of programming and design work.

With R and rJava there is sometimes trouble, especially when you need to train a topic model using mallet or, now, the textmining package.

I recommend reading this post to configure Java properly:
https://www.r-statistics.com/2012/08/how-to-load-the-rjava-package-after-the-error-java_home-cannot-be-determined-from-the-registry/
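One common workaround (a sketch only; the path below is just an example and depends on your own Java installation) is to point JAVA_HOME at a Java version matching your R architecture before loading rJava:

# Example only: adjust the path to your local JRE/JDK installation
if (Sys.info()[["sysname"]] == "Windows") {
  Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jre1.8.0_101")
}
library(rJava)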

Also, on Windows R CMD check tests the package for both architectures, 32-bit and 64-bit. This was hard to discover during the automatic build in RStudio. There were two solutions to this problem:
– set the no-multiarch argument in the check options:
devtools::check(document = FALSE, args = "--no-multiarch")
– or install Java properly for both architectures, 32-bit and 64-bit.

I have spent a lot of time on technical cases, such as how to export functions that are already implemented in other packages as S3 methods when I only add a new class, and how to write assignment (replacement) functions appropriately.
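As a rough illustration (a sketch, not the package's actual code; tm_map comes from the tm package and tmCorpus is a class of this package, but the internal structure shown here is an assumption):

# With roxygen2, tagging the method with @export is enough to register it
# in the NAMESPACE as an S3 method for the tm::tm_map generic.
#' @export
tm_map.tmCorpus <- function(x, FUN, ...) {
  x[["content"]] <- lapply(x[["content"]], FUN, ...)  # assumed internal slot
  x
}

# A replacement (assignment) function, enabling content(x) <- value
`content<-` <- function(x, value) {
  x[["content"]] <- value
  x
}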

Another big step was the systematization of the different types of functions and the output they produce. The most important part was unifying the train and predict functions for topic models from different packages, so that their output can be passed to the topic_wordcloud and topic_network functions.
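In rough outline the unified interface looks like this (a sketch under assumptions: train, predict, tmTopicModel, topic_wordcloud and topic_network are the package's names, but the helper functions and arguments below are illustrative only):

train <- function(x, ...) UseMethod("train")

train.tmCorpus <- function(x, method = c("LDA", "mallet"), k = 10, ...) {
  method <- match.arg(method)
  model <- switch(method,
    LDA    = topicmodels::LDA(to_dtm(x), k = k, ...),  # to_dtm(): assumed helper
    mallet = train_mallet(x, k = k, ...)               # assumed mallet wrapper
  )
  structure(list(model = model, method = method), class = "tmTopicModel")
}

Because every backend ends up wrapped in the same tmTopicModel structure, topic_wordcloud() and topic_network() only need to handle one kind of input.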

I met Maciej Eder at the Digital Humanities Summer School in Leipzig. Thanks to his help and the many insights gained during his work on the stylo R package, he saved me a lot of time on code refactoring and on preparing for the CRAN submission.

After that we held another GSOC meeting with my mentors in Wrocław. We discussed the progress of GSOC 2016 and decided what steps we should take next.

For now I plan to finish the work on documenting the existing objects, preparing for the CRAN check, and cleaning up the code. Because my approach for GSOC was test-driven development, almost all tests are already written, and thanks to that most of the examples are prepared as well.
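A typical test in this workflow looks roughly like this (illustrative only; the constructor call and the expectation are assumptions, not copied from the package's test suite):

library(testthat)

test_that("c() keeps the tmCorpus class", {
  corp <- tmCorpus(c("first document", "second document"))  # assumed constructor
  combined <- c(corp, corp)
  expect_true(inherits(combined, "tmCorpus"))
})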

GSOC wrap up

A short summary of the last two weeks:
Almost all of the functionality included in my proposal has already been implemented.
For the last month I have mostly left the work on documentation and some refactoring.
I will follow the good practices summarised in these posts:
https://bradfults.com/the-best-api-documentation-b9e46400379a
http://blog.parse.com/learn/engineering/designing-great-api-docs/
http://www.programmableweb.com/news/web-api-documentation-best-practices/2010/08/12

Finally, I also plan to include short vignettes.

After the project I still plan to work on the development of the package, adding functionality such as:
– sentiment analysis
– dimension reduction
– word embedding
– more use-cases

At the moment we have 10 issues open in the GitHub repository.
One of them concerns returning uniform output for topic models from different packages.

The most important functionality implemented over the last weeks:

– Adjusted tm_map to work for more complex cases
– a c() method for tmCorpus data
– terms() for the mallet topic model
– refactored train and predict functions for tmTopicModel
– included TreeTagger in the package through koRpus (a check whether TreeTagger is installed on the computer still needs to be added; see the sketch below)
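A possible shape for that check (only a sketch; TreeTagger is an external program, so the environment variable and executable name below are assumptions about a typical setup):

treetagger_available <- function(path = Sys.getenv("TREETAGGER_HOME")) {
  # Look for an install directory or the tree-tagger executable on the PATH
  (nzchar(path) && dir.exists(path)) || nzchar(Sys.which("tree-tagger"))
}

if (!treetagger_available()) {
  warning("TreeTagger not found; POS tagging via koRpus will not work.")
}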

Also, I performed an initial sentiment analysis on the Twitter data (the Twitter ISIS dataset) and prepared some more short vignettes.
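That first pass looked roughly like this (a sketch; the syuzhet package, the input file and the column name are assumptions about the setup, not the exact code used):

library(syuzhet)

tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)  # assumed input file
scores <- get_sentiment(tweets$text, method = "bing")       # lexicon-based scoring
summary(scores)
hist(scores, main = "Tweet sentiment", xlab = "Sentiment score")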