What algorithms/programming tools are used for textual analysis-based webapps like 'I Write Like' or '750Words'?

  • The website "I Write Like" (http://iwl.me/) asks you to paste in a chunk of text, and it will analyze which writer you write like. I was wondering about the accuracy of the algorithm and what they use to determine this information. Do they have a database of all the writers' texts and do a cross-analysis based on the linguistic structure of your writing? If so, how is something like this implemented from a programmer's standpoint? Another example of a webapp that uses textual analysis is 750Words (http://750words.com), which uses the Regressive Imagery Dictionary and LIWC. But how do all these textual analysis-based tools work at a fundamental level, in terms of computer science/programming/algorithms in general?

  • Answer:

    I agree that the first step in comparing two texts is typically generating a word-frequency table (easy) or perhaps a bi-gram or n-gram frequency table (harder), as described in http://en.wikipedia.org/wiki/Language_model. You precompute these frequencies for several authors (let's say Stephen King and Shakespeare) and get vectors of probabilities, which the language model article describes as P(w_1, \ldots, w_n). You should imagine and implement these effectively as hashes/collections/associations, something like \vec{P}_{king} = ["cujo"=0.25, "shining"=0.25] and \vec{P}_{shakespeare} = ["yonder"=0.50, "thy"=0.25], but populated with many more words or n-grams and their occurrence frequencies. Then, when somebody submits a writing sample, you compute a vector/hash for them too: \vec{P}_{user} = ["the"=0.25, "omg"=0.50, "gaga"=0.20].

    What you then do with these vectors, the comparison of a user's many-dimensional vector with the precomputed vectors of authors... ah, well, it turns out that is pretty much what personalization and recommendation on the internet is all about: comparing vectors of information about people, clustering or grouping people so that you can recommend things to one person based on what other people in their cluster have done, bought, read, seen, shopped for, heard, played, or commented on, and finding likely connections between people based on their vectors.

    Although there are deep algorithms in machine learning and elsewhere in computer science to really go bananas with these "vector" comparisons, a very typical quickie first "matching" algorithm in situations like 'I Write Like' is simply a dot-product comparison, a.k.a. a \Theta-angle comparison: compute the angle between the user's vector and each author's vector using the dot-product identity

    \vec{P}_{user} \cdot \vec{P}_{author} = \parallel \vec{P}_{user} \parallel \, \parallel \vec{P}_{author} \parallel \cos\Theta

    so

    \cos\Theta = {\vec{P}_{user} \cdot \vec{P}_{author} \over \parallel \vec{P}_{user} \parallel \, \parallel \vec{P}_{author} \parallel}

    and match the user to the author with the smallest angle \Theta.

    (If it's not clear how the words are "dimensions" and the probabilities are "magnitudes" in these "vectors", and how you would compute their values, I can add some additional pointers. The first time you manipulate data in this way is always the biggest mental leap -- then you start finding linear algebra under every rock in computer science.)
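    Here is a minimal sketch of that hash-plus-dot-product idea in Python, under simple assumptions: whitespace tokenization, single-word frequencies, and toy placeholder corpora, not whatever 'I Write Like' actually ships.

        import math
        from collections import Counter

        def frequency_vector(text):
            """Map each word to its relative frequency in the text."""
            words = text.lower().split()
            total = len(words)
            return {word: count / total for word, count in Counter(words).items()}

        def cos_theta(p, q):
            """cos(Theta) = (p . q) / (||p|| ||q||); only shared words contribute to the dot product."""
            dot = sum(freq * q[word] for word, freq in p.items() if word in q)
            norm_p = math.sqrt(sum(v * v for v in p.values()))
            norm_q = math.sqrt(sum(v * v for v in q.values()))
            return dot / (norm_p * norm_q)

        # Toy placeholder corpora; a real system precomputes these from full texts.
        authors = {
            "king": frequency_vector("cujo shining misery carrie the the it"),
            "shakespeare": frequency_vector("yonder thy thou hark yonder thy the"),
        }

        def best_match(sample):
            user = frequency_vector(sample)
            # The largest cos(Theta) corresponds to the smallest angle Theta.
            return max(authors, key=lambda name: cos_theta(user, authors[name]))

        print(best_match("but soft what light through yonder window breaks"))

    Swapping in bi-grams or n-grams only changes frequency_vector; the comparison code stays the same.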

Nat Brown at Quora

Other answers

I looked at both websites and couldn't tell what advanced stuff (if any) 750Words does, but for 'I Write Like', one popular way to do this sort of thing is to first generate a language model (http://en.wikipedia.org/wiki/Language_model) for each author from pre-existing sets of writing. You can think of this as generating a table of probabilities for every word sequence (usually of length 3 or 4) possibly written by that author. Then, for every new piece of writing (in the case of 'I Write Like', that would be the text you paste in), a match is done against the most likely model, and that model's author is returned. From a high level, the algorithm is something like this:

1) Generate one language model per author from pre-existing texts.
2) Given new text t:
       for each model m:
           let p = probability of t being generated by m
   output the author whose model gives the highest p

I've tried to keep things simple, but let me know if I've been unclear.
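A minimal sketch of those two steps in Python, assuming trigram counts with add-one smoothing (the author names and corpus strings below are placeholders, not anything 'I Write Like' actually uses):

    import math
    from collections import Counter

    def trigrams(text):
        """All overlapping three-word sequences in the text."""
        words = text.lower().split()
        return list(zip(words, words[1:], words[2:]))

    class AuthorModel:
        """Trigram language model with add-one (Laplace) smoothing."""
        def __init__(self, corpus):
            self.counts = Counter(trigrams(corpus))
            self.total = sum(self.counts.values())

        def log_prob(self, text):
            # Sum log-probabilities to avoid floating-point underflow;
            # unseen trigrams get a small smoothed probability instead of zero.
            denom = self.total + len(self.counts) + 1
            return sum(math.log((self.counts[g] + 1) / denom) for g in trigrams(text))

    # 1) Generate one language model per author from pre-existing texts.
    models = {
        "author_a": AuthorModel("placeholder corpus standing in for one author's collected writing"),
        "author_b": AuthorModel("a second placeholder corpus standing in for another author"),
    }

    # 2) Given new text t, output the author whose model gives the highest p.
    def most_likely_author(t):
        return max(models, key=lambda name: models[name].log_prob(t))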

Ghalib Suleiman

The Regressive Imagery Dictionary maps words to behavioral categories and gives you a score. That score can be used to predict the sentiment of an article. I have built a similar app: http://pnned.com . The profile statistics section in my app will give you all the analysis done through the Regressive Imagery Dictionary. Check it out.
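To make the dictionary idea concrete, here is a toy Python sketch; the two category word lists are invented stand-ins, since the actual Regressive Imagery Dictionary defines many more categories with its own word stems:

    # Hypothetical category word lists, not the real RID categories.
    CATEGORIES = {
        "primary_water": {"sea", "river", "wave", "drink", "ocean"},
        "emotion_anger": {"rage", "hate", "furious", "angry"},
    }

    def category_scores(text):
        """Score each category by the fraction of the text's words it matches."""
        words = text.lower().split()
        total = len(words) or 1
        return {
            name: sum(1 for w in words if w in wordlist) / total
            for name, wordlist in CATEGORIES.items()
        }

    print(category_scores("the angry sea hurled a furious wave at the shore"))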

Sourabh Agrawal

You can see the source code for iwritelike at https://github.com/dchest/iwl

You may find Paul Graham's essay "A Plan for Spam" (http://www.paulgraham.com/spam.html) interesting. I was curious about text analysis recently, and this essay gave me a good taste for how it works, at least in one context.

It seems to me that a good I-write-like sort of analysis would have to dig deeper than word frequency, bi-grams, and n-grams. It would also look at rhythm and structure (e.g., average word, sentence, and paragraph length; relative length of consecutive words, sentences, and paragraphs; frequency of different parts of speech and punctuation types by paragraph location; frequency of word repetition within a sentence or across consecutive sentences; etc.).
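As a rough illustration of what those rhythm-and-structure measures might look like in code, here is a small Python sketch (my own guess at a feature set, not what iwritelike computes):

    import re
    import statistics

    def style_features(text):
        """A few of the structural measures mentioned above."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        words_per_sentence = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
        return {
            "avg_word_length": statistics.mean(len(w) for w in words),
            "avg_sentence_length": statistics.mean(words_per_sentence),
            "commas_per_sentence": text.count(",") / len(sentences),
            # Crude repetition measure: distinct words over total words.
            "type_token_ratio": len({w.lower() for w in words}) / len(words),
        }

    print(style_features(
        "Call me Ishmael. Some years ago, never mind how long precisely, "
        "I thought I would sail about a little."
    ))

Feature vectors like these could then go through the same kind of vector comparison (or any standard classifier) described in the answers above.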

Ben Davidow
