R: how to visualize categorical data?

How do Data Scientists visualize two labeled classes of text datasets on a graph?

  • Assume I have a thousand sentences that have been hand-labeled "Group A" or "Group B". How could I plot this dataset on a graph to better visualize the features that distinguish the two classes, and hopefully predict which class a new sentence belongs to? For example, given this dataset (using fake character strings from "Lorem ipsum"):

    A | "Lorem ipsum dolor sit amet."
    B | "Duis molestie, lectus ut placerat mattis."
    A | "Aliquam erat volutpat."
    A | "Quisque lacus mauris, semper auctor tellus et."
    B | "Donec a nibh elit."

    How could I generate a scatter plot of these sentences? In other words, how do Data Scientists convert text into coordinates?

  • Answer:

    This is an important problem, and a lot of work has already been done on it. t-SNE (t-distributed Stochastic Neighbor Embedding, http://homepage.tudelft.nl/19j49/t-SNE.html) is an algorithm for representing a clustered high-dimensional set of points as a 2-dimensional set that tends to respect the clustering, and it works well in many situations. There are open-source implementations in R, Matlab/Octave, and CUDA, among others. After you encode your data as a set of vectors, it is relatively easy to run it through a t-SNE module and then color the points according to whether they are in Group A or B. In one well-known example, t-SNE was run on a set of images: most images of faces clustered at the bottom, and most images of airplanes clustered on the right. These clusters came out of the t-SNE algorithm even without labeling the images. You are unlikely to do better if you only spend a little time working from first principles.
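    A minimal sketch of this pipeline, assuming scikit-learn's TfidfVectorizer and TSNE (the answer itself mentions R, Matlab/Octave, and CUDA implementations instead). The sentences are the toy examples from the question; perplexity must be smaller than the number of samples:

    ```python
    # Sketch: encode sentences as TF-IDF vectors, embed with t-SNE,
    # then the 2-D coordinates can be scatter-plotted colored by group.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.manifold import TSNE

    sentences = [
        "Lorem ipsum dolor sit amet.",
        "Duis molestie, lectus ut placerat mattis.",
        "Aliquam erat volutpat.",
        "Quisque lacus mauris, semper auctor tellus et.",
        "Donec a nibh elit.",
    ]
    labels = ["A", "B", "A", "A", "B"]

    # Encode each sentence as a TF-IDF vector (a simple bag-of-words weighting).
    X = TfidfVectorizer().fit_transform(sentences).toarray()

    # t-SNE maps the high-dimensional vectors to 2-D coordinates.
    # perplexity must be below the number of samples (5 here).
    coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

    for (x, y), lab in zip(coords, labels):
        print(f"{lab}: ({x:.2f}, {y:.2f})")
    ```

    With real data, the `coords` array would be passed to a plotting library, coloring each point by its label.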

Douglas Zare at Quora


Other answers

You could do a Pointwise Mutual Information (PMI) calculation to determine the affinity of each term to Group A and Group B, and then make a scatter plot of the terms. A fun example of this technique is a chart from a study by Burr Settles that looked at how hashtags aligned with Twitter posts mentioning the terms "nerd" and "geek": http://slackprop.wordpress.com/2013/06/03/on-geek-versus-nerd/
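A rough sketch of the PMI calculation, using only the standard library; the toy corpus below is made up for illustration. For each (term, class) pair, PMI compares the observed co-occurrence probability to what independence would predict:

```python
# Sketch: compute PMI(term, class) from co-occurrence counts.
# PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )
import math
from collections import Counter

docs = [("A", "lorem ipsum dolor"), ("A", "lorem aliquam erat"),
        ("B", "duis lectus mattis"), ("B", "duis nibh lectus")]

term_class = Counter()  # (term, class) co-occurrence counts
term = Counter()        # term counts
cls = Counter()         # class counts
total = 0
for label, text in docs:
    for w in text.split():
        term_class[(w, label)] += 1
        term[w] += 1
        cls[label] += 1
        total += 1

def pmi(w, c):
    joint = term_class[(w, c)] / total
    if joint == 0:
        return float("-inf")  # term never appears in this class
    return math.log2(joint / ((term[w] / total) * (cls[c] / total)))

print(f"PMI(lorem, A) = {pmi('lorem', 'A'):.2f}")
print(f"PMI(duis, B) = {pmi('duis', 'B'):.2f}")
```

Scatter-plotting each term at (PMI with Group A, PMI with Group B) then gives a chart in the spirit of the "geek versus nerd" example.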

Rob Weir

I would start by parsing each sentence into words and placing them into a term-document matrix (after eliminating stopwords, etc.). Then I would rank each word by relevance using a formula such as TF/IDF, and place the top N scoring words of each sentence into an N × (number of sentences) matrix. Next, compute a discriminant function that places each sentence into Group A or Group B, and plot the first two discriminant functions on the x and y axes of your graph. This will show the clustering of sentences between the two groups. Here is an example of how I would start to do this in R: http://www.statmethods.net/advstats/discriminant.html
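A hypothetical Python sketch of this pipeline, assuming scikit-learn (the answer itself links to an R example). One caveat: with only two groups, discriminant analysis yields a single discriminant function rather than two, so each sentence gets one score along that axis. The sentences are the toy data from the question:

```python
# Sketch: TF-IDF features followed by linear discriminant analysis.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Lorem ipsum dolor sit amet.",
    "Duis molestie, lectus ut placerat mattis.",
    "Aliquam erat volutpat.",
    "Quisque lacus mauris, semper auctor tellus et.",
    "Donec a nibh elit.",
]
labels = ["A", "B", "A", "A", "B"]

# TF-IDF weighting stands in for the TF/IDF ranking step above.
X = TfidfVectorizer().fit_transform(sentences).toarray()

# Two classes allow at most one discriminant component.
lda = LinearDiscriminantAnalysis(n_components=1)
scores = lda.fit_transform(X, labels)  # one discriminant score per sentence

for lab, score in zip(labels, scores[:, 0]):
    print(f"{lab}: {score:+.2f}")
```

Sentences from the two groups should separate along this single axis; a second plotting axis could come from, say, a principal component.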

Ralph Winters

For examples of text features on graphs, I would look at Ruslan Salakhutdinov's thesis, which is available as a PDF on his University of Toronto website. Many people use bag-of-words (BoW) models, which represent the articles (or groups) as large sparse vectors. The first answer points out the t-SNE algorithm, which can map these vectors and manifold spaces to a lower-dimensional embedding. In summary, a family of algorithms like this exists under the umbrella of non-linear dimensionality reduction. Another excellent example is Locally Linear Embedding (LLE), famously developed by Lawrence Saul and the late Sam Roweis.
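A rough sketch of LLE as an alternative embedding, assuming scikit-learn's LocallyLinearEmbedding; the random count matrix below is a stand-in for real bag-of-words vectors:

```python
# Sketch: Locally Linear Embedding on a synthetic bag-of-words matrix.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
# Toy BoW counts: 20 "documents" over a 50-term vocabulary.
X = rng.poisson(0.3, size=(20, 50)).astype(float)

# LLE preserves each point's local neighborhood geometry
# while flattening the data down to 2-D.
emb = LocallyLinearEmbedding(n_neighbors=5, n_components=2).fit_transform(X)
print(emb.shape)  # prints (20, 2)
```

As with t-SNE, the resulting 2-D coordinates can be scatter-plotted and colored by group label.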

Sam Walters
